Lemonade by AMD: a fast and open source local LLM server using GPU and NPU

https://lemonade-server.ai

Lemonade: Local AI for Text, Images, and Speech

Note that the NPU models/kernels this uses are proprietary and not available as open source. It would be nice to develop more open support for this hardware.
Are they? The docs say "You can also register any Hugging Face model into your Lemonade Server with the advanced pull command options"
That won't give you NPU support, which relies on https://github.com/FastFlowLM/FastFlowLM . And that says "NPU-accelerated kernels are proprietary binaries", not open source.
GitHub - FastFlowLM/FastFlowLM: Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but purpose-built and deeply optimized for the AMD NPUs.

Run LLMs on AMD Ryzen™ AI NPUs in minutes. Just like Ollama - but purpose-built and deeply optimized for the AMD NPUs. - FastFlowLM/FastFlowLM

GitHub
I bought one of their machines to play around with under the expectation that I may never be able to use the NPU for models. But I am still angry to read this anyway.
AMD/Xilinx's software support for the NPU is fully open, it's only FFLM's models that are proprietary. See https://github.com/amd/iron https://github.com/Xilinx/mlir-aie https://github.com/amd/RyzenAI-SW/ . It would be nice to explore whether one can simply develop kernels for these NPU's using Vulkan Compute and drive them that way; that would provide the closest unification with the existing cross-platform support for GPU's.
Is... is this named because they have a lemon they're trying to make the most of?
If life keeps giving it them, they should instead invent a combustible lemon.
Do they know who you are? They're the guys who are going to blow your house up ... with the lemons.
On an unrelated note, do you think this software supports running models from a CD?...
I think saying "L-L-M" sounds kind of like "lemon," so this is an LLM-aid (sounds like lemonade).
so obvious and yet I didn't connect the dots. thank you
Lemonsqueeze was considered too violent
If you run it in a cluster, does it become a Lemon Party?

Feels like this is sitting somewhere between Ollama and something like LM Studio, but with a stronger focus on being a unified “runtime” rather than just model serving.

The interesting part to me isn’t just local inference, but how much orchestration it’s trying to handle (text, image, audio, etc). That’s usually where things get messy when running models locally.

Curious how much of this is actually abstraction vs just bundling multiple tools together. Also wondering if the AMD/NPU optimizations end up making it less portable compared to something like Ollama in practice.

It bundles tools, model selection, and overall management.

It’s portable in the sense it will install on any of the supported OS using CPU or vulkan backends. But it only supports out of the box ROCM builds and AMD NPUs. There is a way to override which llama.cpp version it uses if you want to run it on CUDA, but that adds more overhead to manage.

If you have an AMD machine and want to run local models with minimal headache…it’s really the easiest method.

This runs on my NAS, handles my home assistant setup.

I have a strix halo and another server running various CUDA cards I manage manually by updating to bleeding edge versions of llama.cpp or vllm.

I have been using lemonade for nearly a year already. On Strix Halo I am using nothing else - although kyuz0's toolboxes are also nice (https://kyuz0.github.io/amd-strix-halo-toolboxes/)

Nowadays you get TTS, STT, text & image generation and image editing should also be possible. Besides being able to run via rocm, vulkan or on CPU, GPU and NPU. Quite a lot of options. They have a quite good and pragmatic pace in development. Really recommend this for AMD hardware!

Edit: OpenAI and i think nowaday ollama compatible endpoints allow me to use it in VSCode Copilot as well as i.e. Open Web UI. More options are shown in their docs.

AMD Strix Halo — Backend Benchmarks (Grid View)

Have you used it with any agents or claw? If so, which model do you run?

I have two Strix Halo devices at hand. Privately a framework desktop with 128gb and at work 64GB HP notebook. The 64GB machine can load Qwen3.5 30B-A3B, with VSCode it needs a bit of initial prompt processing to initialize all those tools I guess. But the model is fighting with the other resources that I need. So I am not really using it anymore these days, but I want to experiment on my home machine with it. I just dont work on it much right now.

Lemonade has a Web UI to set the context size and llama.cpp args, you need to set context to proper number or just to 0 so that it uses the default. If its too low, it wont work with agentic coding.

I will try some Claw app, but first need to research the field a bit. But I am using different models on Open Web UI. GPT 120B is fast, but also Qwen3.5 27B is fine.

Qwen3-Coder-Next works well on my 128GB Framework Desktop. It seems better at coding Python than Qwen3.5 35B-A3B, and it's not too much slower (43 tg/s compared to 55 tg/s at Q4).

27B is supposed to be really good but it's so slow I gave up on it (11-12 tg/s at Q4).

Been running local LLMs on my 7900 XTX for months and the ROCm experience has been... rough. The fact that AMD is backing an official inference server that handles the driver/dependency maze is huge. My biggest question is NPU support - has anyone actually gotten meaningful throughput from the Ryzen AI NPU vs just using the dGPU? In my testing the NPU was mostly a bottleneck for anything beyond tiny models.
the npu is more for power efficiency when on battery. I don't think it's a replacement for gpu.