What a difference a day makes: Was optimistic about Foundation Models yesterday, and today I think I know why they didn't ship the improved Siri. The local model really is pretty thick. I thought it would be capable of stringing together tool calls in a logical way, and sometimes it is, but other times it fails to understand. Exactly the same prompt will work one time, and fail the next. Sounds like what Apple was saying about the new Siri.

@drewmccormack Honestly, I’m not surprised: I’ve been playing around with various local models on my Mac and even 16–24B parameter versions of well-respected models are hugely inconsistent and, frankly, a general disappointment compared to Big AI.

I appreciate that their local models are highly specialised, but 3B parameters always looked like too little for doing versatile things.

@drewmccormack Interesting. Maybe it helps to get a better idea of the limits: someone made an example app to chat with the Foundation model.

https://github.com/PallavAg/Apple-Intelligence-Chat

@obrhoff I am chatting with it via my app. It surprises me that sometimes it will work fine, using multiple tools to get an answer, and the next time it will say “I have no access to your data” or “I have no tool”. I wonder if you could trick it, by asking it to make a plan with the tools available, and in the next step, tell it to execute the plan. Like the “thinking” online LLMs.
@drewmccormack @obrhoff Before the advent of “reasoning models,” chain-of-thought was absolutely the best way to get the best and most consistent performance out of LLMs. I have many such scripts still lying around. The caveat was that it took a bit of regex and really stern wording to get consistent thinking-plus-results in a way that you could parse, but the Generable models should solve that…
@kyle @obrhoff Can you get the LLM to do the chain-of-thought itself? Eg. give it the prompt from the user, and then an instruction like “Make a plan”, and then another “Execute the plan”? Or do you have to basically generate a plan yourself somehow for the model to follow? (Seems to defeat the purpose.)

@drewmccormack @obrhoff Yes, you have it do it itself. It's all the same magic trick: telling it to "think" just helps to… influence the conditional output distribution toward tokens that reflect a good answer. LLMs really are just "predicting the next token," so by stuffing the context with relevant information–or having it stuff the context itself–we increase the chances of those next tokens being on-rails.

Caveat: may be less effective with the small context window of Foundation Models.
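A minimal sketch of that two-step “make a plan, then execute it” idea with a multi-turn LanguageModelSession. The function name, instructions, and prompt wording here are illustrative, not anything from the thread; only the session and respond(to:) API are assumed from the framework.

```swift
import FoundationModels

// Hypothetical two-step flow: the first turn only produces a plan, the second
// turn executes it. Both turns live in the same session, so the plan is part
// of the transcript (context) by the time the tools are actually used.
func planThenExecute(_ userRequest: String, tools: [any Tool]) async throws -> String {
    let session = LanguageModelSession(
        tools: tools,
        instructions: "You can call tools to answer the user's request."
    )

    // Step 1: ask for a plan only.
    _ = try await session.respond(
        to: "The user asked: \(userRequest). List the steps and the tools you would use. Do not call any tools yet."
    )

    // Step 2: the plan is now in the transcript; ask the model to carry it out.
    let answer = try await session.respond(
        to: "Now execute that plan with the available tools and give the final answer."
    )
    return answer.content
}
```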

@drewmccormack @obrhoff Here is an example chain-of-thought structure I have in one of my scripts:
@kyle Very useful. Thanks! I'm going to try this.
@kyle @obrhoff So I set this up using a @Generable type, giving it instructions to do the “thinking” first. It actually seems to improve things. What is weird, though, is that I see calls to tools made before the thinking begins. Something I don’t understand is how the LLM knows what results the tools will return, since they are generated dynamically. Surely it would be better if tool results were static, so it could know what a tool will give it. Or do you simply have to add those details in the description of the tool?
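For reference, a sketch of what a “thinking first” @Generable type like that might look like. The type, property names, and guide text are illustrative; the idea is that the plan property is declared (and thus generated) before the answer, so the reasoning is already in context when the reply is produced.

```swift
import FoundationModels

@Generable
struct ThoughtfulReply {
    // Generated first: a short plan of which tools to use and why.
    @Guide(description: "Think step by step: which tools to call and in what order.")
    var thinking: String

    // Generated after the plan, with the reasoning already in context.
    @Guide(description: "The final answer to the user.")
    var answer: String
}

// Usage with an existing session (illustrative):
// let reply = try await session.respond(to: userPrompt, generating: ThoughtfulReply.self)
// print(reply.content.answer)
```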
@drewmccormack @obrhoff Hmm, I’m going to guess whatever behavior that is relates to the hidden system prompt behind the models. But I haven’t actually experimented with Foundation Models at all, I’m not a beta boy. You are in uncharted waters :)
@drewmccormack have you tried adjusting the temperature? Maybe that would make its behavior more consistent?
@calicoding Yes, I will certainly play with that, but I’m not sure it will help, in the sense that I will get more consistent answers, but they may be consistently the wrong answers. Sometimes it is right, sometimes wrong, and a lower temperature could presumably lock in the wrong answer every time.
@drewmccormack @calicoding Is there a way you could break down your call into a sequence of simpler calls?
@stevex @calicoding I did try using a “thinking” property, telling it to make a plan and put it in that property before answering. I don’t know if it really helped. That is similar to the idea of breaking a request down into multiple steps.
greedy | Apple Developer Documentation: A sampling mode that always chooses the most likely token.
@alpennec I’ll look into it. What is the difference between this and a low temperature?
@drewmccormack I haven’t touched it enough to know, unfortunately.
@alpennec @drewmccormack AFAIK it’s exactly the same as 0 temperature. It does make things deterministic (within any given model) but it will make certain use cases really boring.
@brandonhorst @alpennec And I assume it doesn’t guarantee that it will be any more “right”, correct? Or is the zero temp solution the most likely to be right? I guess it is.
@drewmccormack @alpennec It’ll be more “probable” haha. “Right” is in the eye of the beholder with these sorts of things
@brandonhorst @alpennec In my case, it should just do the logical thing. Given my use case is mostly search and automation, a low temp probably makes sense. Personality is not very important here.
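For anyone following along, a sketch of the two knobs being compared. GenerationOptions and the .greedy sampling mode are from the documentation linked above; the function name and prompts are illustrative.

```swift
import FoundationModels

// Greedy sampling always picks the most likely token, so it is deterministic
// within a given model version. A low temperature still samples, just from a
// sharper distribution, so repeats are likely but not guaranteed.
func compareSampling() async throws {
    let session = LanguageModelSession(instructions: "Answer briefly.")

    let greedy = try await session.respond(
        to: "Find my next appointment.",
        options: GenerationOptions(sampling: .greedy)
    )

    let lowTemp = try await session.respond(
        to: "Find my next appointment.",
        options: GenerationOptions(temperature: 0.3)
    )

    print(greedy.content, lowTemp.content)
}
```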
@drewmccormack @brandonhorst The question is: can we always get the same result between sessions? Is it stable?
@alpennec @drewmccormack Yes, that is definitely true, until they update the model itself https://developer.apple.com/videos/play/wwdc2025/301?time=380
Deep dive into the Foundation Models framework - WWDC25 - Videos - Apple Developer
@brandonhorst @drewmccormack so if I provide the same input between multiple app launches, the result provided will be the same?
@brandonhorst @drewmccormack What does “IIUS” stand for, please?
@alpennec @drewmccormack Lol I meant to say “IIUC”, if I understand correctly, sorry
@brandonhorst @drewmccormack It seems the session needs to be in the same state. Does that mean a fresh new session with only the same instructions can provide the same output?
@alpennec @brandonhorst My understanding is that it will be deterministic based on the whole conversation. So a particular set of prompts will lead to the same response.
@drewmccormack @brandonhorst even if the responses to the same prompts differ? This would mean the session transcript could contain the same instructions and the same prompts but different responses.
@alpennec @brandonhorst The responses are also part of the history. If you start a chain with a specific prompt, and continue exactly as last time, you should get the same responses each time in the .greedy case. (Although, I have a feeling temperature would have to be 0, because that might relate to a different part of the LLM.)
@drewmccormack Yes, as responses are part of the transcript, they would all need to be .greedy to get a stable session state. If one is not .greedy, then it breaks the whole session, I guess.
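A sketch of what that replay idea looks like in practice (the function and parameter names are illustrative): a fresh session with the same instructions, the same prompts in the same order, and .greedy sampling should reproduce the same transcript, at least until Apple updates the model itself.

```swift
import FoundationModels

// Replays a fixed list of prompts in a fresh session with greedy sampling.
// Because every turn (prompt plus greedy response) is reproduced in order,
// two runs with identical inputs should yield identical transcripts on the
// same model version.
func replay(prompts: [String], instructions: String) async throws -> [String] {
    let session = LanguageModelSession(instructions: instructions)
    var responses: [String] = []
    for prompt in prompts {
        let response = try await session.respond(
            to: prompt,
            options: GenerationOptions(sampling: .greedy)
        )
        responses.append(response.content)
    }
    return responses
}
```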