What a difference a day makes: Was optimistic about Foundation Models yesterday, and today I think I know why they didn't ship the improved Siri. The local model really is pretty thick. I thought it would be capable of stringing together tool calls in a logical way, and sometimes it is, but other times it fails to understand. Exactly the same prompt will work one time, and fail the next. Sounds like what Apple was saying about the new Siri.

@drewmccormack Honestly, I’m not surprised: I’ve been playing around with various local models on my Mac and even 16–24B parameter versions of well-respected models are hugely inconsistent and, frankly, a general disappointment compared to Big AI.

I appreciate that their local models are highly specialised, but 3B parameters always looked like too little for doing versatile things.

@drewmccormack Interesting. Maybe it helps to get a better idea of the limits: someone made an example app to chat with the Foundation model.

https://github.com/PallavAg/Apple-Intelligence-Chat

@obrhoff I am chatting with it via my app. It surprises me that sometimes it will work fine, using multiple tools to get an answer, and the next time it will say “I have no access to your data” or “I have no tool”. I wonder if you could trick it, by asking it to make a plan with the tools available, and in the next step, tell it to execute the plan. Like the “thinking” online LLMs.
@drewmccormack @obrhoff Before the advent of “reasoning models,” chain-of-thought was absolutely the best way to get the best and most consistent performance out of LLMs. I have many such scripts still lying around. The caveat was that it took a bit of regex and really stern wording to get consistent thinking-plus-results in a way that you could parse, but the Generable models should solve that…
@kyle @obrhoff Can you get the LLM to do the chain-of-thought itself? Eg. give it the prompt from the user, and then an instruction like “Make a plan”, and then another “Execute the plan”? Or do you have to basically generate a plan yourself somehow for the model to follow? (Seems to defeat the purpose.)

@drewmccormack @obrhoff Yes, you have it do it itself. It's all the same magic trick: telling it to "think" just helps to… influence the conditional output distribution toward tokens that reflect a good answer. LLMs really are just "predicting the next token," so by stuffing the context with relevant information–or having it stuff the context itself–we increase the chances of those next tokens being on-rails.

Caveat: may be less effective with the small context window of Foundation Models.
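A minimal sketch of that two-step “make a plan, then execute it” idea with a multi-turn LanguageModelSession. The function name, instructions, and prompt wording here are illustrative, not anything from the thread; only the session and respond(to:) API are assumed from the framework.

```swift
import FoundationModels

// Hypothetical two-step flow: the first turn only produces a plan, the second
// turn executes it. Both turns live in the same session, so the plan is part
// of the transcript (context) by the time the tools are actually used.
func planThenExecute(_ userRequest: String, tools: [any Tool]) async throws -> String {
    let session = LanguageModelSession(
        tools: tools,
        instructions: "You can call tools to answer the user's request."
    )

    // Step 1: ask for a plan only.
    _ = try await session.respond(
        to: "The user asked: \(userRequest). List the steps and the tools you would use. Do not call any tools yet."
    )

    // Step 2: the plan is now in the transcript; ask the model to carry it out.
    let answer = try await session.respond(
        to: "Now execute that plan with the available tools and give the final answer."
    )
    return answer.content
}
```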

@drewmccormack @obrhoff Here is an example chain-of-thought structure I have in one of my scripts:
@kyle Very useful. Thanks! I'm going to try this.
@kyle @obrhoff So I set this up using a @Generable type, giving it instructions to do the “thinking” first. It actually seems to improve things. What is weird, though, is that I see calls to tools made before the thinking begins. Something I don’t understand is how the LLM knows what results the tools will return, since they are generated dynamically. Surely it would be better if tool results were static, so it could know what a tool will give it. Or do you simply have to add those details in the description of the tool?
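For reference, a sketch of what a “thinking first” @Generable type like that might look like. The type, property names, and guide text are illustrative; the idea is that the plan property is declared (and thus generated) before the answer, so the reasoning is already in context when the reply is produced.

```swift
import FoundationModels

@Generable
struct ThoughtfulReply {
    // Generated first: a short plan of which tools to use and why.
    @Guide(description: "Think step by step: which tools to call and in what order.")
    var thinking: String

    // Generated after the plan, with the reasoning already in context.
    @Guide(description: "The final answer to the user.")
    var answer: String
}

// Usage with an existing session (illustrative):
// let reply = try await session.respond(to: userPrompt, generating: ThoughtfulReply.self)
// print(reply.content.answer)
```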
@drewmccormack @obrhoff Hmm, I’m going to guess whatever behavior that is relates to the hidden system prompt behind the models. But I haven’t actually experimented with Foundation Models at all, I’m not a beta boy. You are in uncharted waters :)
@drewmccormack have you tried adjusting the temperature? Maybe that would make its behavior more consistent?
@calicoding Yes, I will certainly play with that, but I’m not sure it will help, in the sense that I will get more consistent answers, but they may be consistently the wrong answers. Sometimes it is right, sometimes wrong, and a lower temperature could presumably lock in the wrong answer every time.
@drewmccormack @calicoding Is there a way you could break down your call into a sequence of simpler calls?
@stevex @calicoding I did try using a “thinking” property, telling it to make a plan and put it in that property before answering. I don’t know if it really helped. That is similar to the idea of breaking a request down into multiple steps.
greedy | Apple Developer Documentation: A sampling mode that always chooses the most likely token.
@alpennec I’ll look into it. What is the difference between this and a low temperature?
@drewmccormack I haven’t touched it enough to know, unfortunately.
@alpennec @drewmccormack AFAIK it’s exactly the same as 0 temperature. It does make things deterministic (within any given model) but it will make certain use cases really boring.
@brandonhorst @alpennec And I assume it doesn’t guarantee that it will be any more “right”, correct? Or is the zero temp solution the most likely to be right? I guess it is.
@drewmccormack @alpennec It’ll be more “probable” haha. “Right” is in the eye of the beholder with these sorts of things
@brandonhorst @alpennec In my case, it should just do the logical thing. Given my use case is mostly search and automation, a low temp probably makes sense. Personality is not very important here.
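For anyone following along, a sketch of the two knobs being compared. GenerationOptions and the .greedy sampling mode are from the documentation linked above; the function name and prompts are illustrative.

```swift
import FoundationModels

// Greedy sampling always picks the most likely token, so it is deterministic
// within a given model version. A low temperature still samples, just from a
// sharper distribution, so repeats are likely but not guaranteed.
func compareSampling() async throws {
    let session = LanguageModelSession(instructions: "Answer briefly.")

    let greedy = try await session.respond(
        to: "Find my next appointment.",
        options: GenerationOptions(sampling: .greedy)
    )

    let lowTemp = try await session.respond(
        to: "Find my next appointment.",
        options: GenerationOptions(temperature: 0.3)
    )

    print(greedy.content, lowTemp.content)
}
```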
@drewmccormack @brandonhorst The question is: can we always get the same result between sessions? Is it stable?
@alpennec @drewmccormack Yes, that is definitely true, until they update the model itself https://developer.apple.com/videos/play/wwdc2025/301?time=380
Deep dive into the Foundation Models framework - WWDC25 - Videos - Apple Developer
@brandonhorst @drewmccormack so if I provide the same input between multiple app launches, the result provided will be the same?
@brandonhorst @drewmccormack What does “IIUS” stand for, please?
@alpennec @drewmccormack Lol I meant to say “IIUC”, if I understand correctly, sorry
@brandonhorst @drewmccormack It seems the session needs to be in the same state. Does that mean a fresh new session with only the same instructions can provide the same output?
@alpennec @brandonhorst My understanding is that it will be deterministic based on the whole conversation. So a particular set of prompts will lead to the same response.
@drewmccormack @brandonhorst even if the responses to the same prompts differ? This would mean the session transcript could contain the same instructions and the same prompts but different responses.
@alpennec @brandonhorst The responses are also part of the history. If you start a chain with a specific prompt, and continue exactly as last time, you should get the same responses each time in the .greedy case. (Although, I have a feeling temperature would have to be 0, because that might relate to a different part of the LLM.)
@drewmccormack Yes, as responses are part of the transcript, they would all need to be .greedy to get a stable session state. If one is not .greedy, then it breaks the whole session, I guess.
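A sketch of what that replay idea looks like in practice (the function and parameter names are illustrative): a fresh session with the same instructions, the same prompts in the same order, and .greedy sampling should reproduce the same transcript, at least until Apple updates the model itself.

```swift
import FoundationModels

// Replays a fixed list of prompts in a fresh session with greedy sampling.
// Because every turn (prompt plus greedy response) is reproduced in order,
// two runs with identical inputs should yield identical transcripts on the
// same model version.
func replay(prompts: [String], instructions: String) async throws -> [String] {
    let session = LanguageModelSession(instructions: instructions)
    var responses: [String] = []
    for prompt in prompts {
        let response = try await session.respond(
            to: prompt,
            options: GenerationOptions(sampling: .greedy)
        )
        responses.append(response.content)
    }
    return responses
}
```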