Ollama is now powered by MLX on Apple Silicon in preview
Man, I really hope so. As much as I like Claude Code, I hate the company paying for it and tracking your usage, the bullshit management control, etc. I feel like I'm training my replacement. Things feel like they're tightening rather than giving us more power and freedom.
For on-device use I would gladly pay for good hardware - it's my machine and I'm using it as I see fit, like an IDE.
True, but I'm already producing code/features faster than the company knows what to do with (even though every company says "omg we need this yesterday", etc.). Even before AI, coding was basically the same.
Coding tools that free up my time are very nice.
It isn't going to replace cloud LLMs, since cloud LLMs will always have higher throughput and be smarter. Cloud and local LLMs will grow together, not replace each other.
I'm not convinced that local LLMs use less electricity either. Per token, at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If they don't, what are we paying hundreds of billions of dollars for?
I think local LLMs will continue to grow, and there will be a "ChatGPT moment" for them when good-enough models meet good-enough hardware. We're not there yet though.
Note: this is why I'm big on investing in chip manufacturing companies. Not only are they completely maxed out due to cloud LLMs, but soon they will be doubly maxed out, having to replace local computer chips with ones suited for AI inference. This is a massive transition and will fuel another chip manufacturing boom.
We are 100% there already. In browser.
The WebGPU model in my browser on my M4 Pro MacBook was as good as ChatGPT 3.5 and was doing 80+ tokens/s.
Local is here.
I have journaled digitally for the last 5 years with this expectation.
Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I'm asking, or for the entity extraction itself, since graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.
It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.
I used MLX on my 64GB M1 machine. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
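If anyone wants to see roughly what that extraction step looks like, here's a minimal sketch using the mlx-lm Python package. The checkpoint name is a placeholder (substitute whatever MLX-converted model you have locally), and the prompt and JSON handling are simplified compared to what I actually run:

    # Minimal triplet-extraction sketch with mlx-lm (pip install mlx-lm).
    import json
    from mlx_lm import load, generate

    # Placeholder checkpoint: any MLX-converted instruct model works here.
    model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

    TEMPLATE = (
        "Extract knowledge-graph triplets from the text below. "
        "Reply with only a JSON list of [entity1, relationship_to, entity2].\n\n"
        "Text: {text}"
    )

    def extract_triplets(text):
        # Wrap the instruction in the model's chat template before generating.
        prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": TEMPLATE.format(text=text)}],
            add_generation_prompt=True,
        )
        out = generate(model, tokenizer, prompt=prompt, max_tokens=256)
        try:
            return json.loads(out)
        except json.JSONDecodeError:
            return []  # small models occasionally emit malformed JSON

    print(extract_triplets("Ollama is now powered by MLX on Apple Silicon."))

The sketch processes one chunk at a time; in the actual app I batch these calls, which is where MLX pulled ahead.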
I agree. I can totally see open source LLMs in the future turning into paying a lump sum for the model. Many will shut down. Some will turn into closed source labs.
When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMs will cease to be free.
Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.
Before people tell me Chinese AI labs do use Nvidia chips: there is a huge difference between using gimped export chips (the H20, a cut-down H100), or sneaking around Southeast Asia for Blackwell chips, and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.