TIL: There is an open source "Alexa replacement" project

https://libretechni.ca/post/426637

>As Snowden told us, video and audio recording capabilities of your devices are NSA spying vectors. OSS/Linux is a safeguard against such capabilities. The massive datacenter investments in US will be used to classify us all into a patriotic (for Israel)/Oligarchist social credit score, and every mega tech company can increase profits through NSA cooperation, and are legally obligated to cooperate with all government orders.
>
>Speech to text and speech automation are useful tech, though always listening state sponsored terrorists is a non-NSA targeted path for sweeping future social credit classifications of your past life.
>
>Some small LLMs that can be used for speech to text: https://modal.com/blog/open-source-stt

I mean, there are many. STT, TTS and self-hosted automation are huge in the local LLM scene.
I do wish there was a smaller LongCat model available. My current AI node has a hard 16GB VRAM limit (yay AMD UMA limitations), so 27B can't really fit. An 8B dynamically loaded model would fit, and run much better.
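For anyone wanting to sanity-check the "27B can't really fit in 16GB" point, here is a rough back-of-the-envelope sketch. The 4.5 bits per weight (roughly a 4-bit quant plus scales) and the 2 GB allowance for KV cache and runtime buffers are assumptions for illustration, not measured values, and the function names are made up for this example.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float, overhead_gb: float) -> bool:
    """True if weights plus an assumed overhead (KV cache, buffers) fit in VRAM."""
    return weight_footprint_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Assumed values: 4.5 bits/weight, 2 GB overhead, 16 GB hard limit.
    for size in (27, 8):
        gb = weight_footprint_gb(size, 4.5)
        print(f"{size}B: ~{gb:.1f} GB weights, fits in 16 GB: {fits(size, 4.5, 16, 2)}")
```

Under those assumptions the 27B weights alone land around 15 GB, which leaves almost nothing for context, while an 8B model leaves plenty of headroom.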

You can do hybrid inference of Qwen 30B omni for sure. Or Vibevoice Large (9B). Or really a huge array of models.

…The limiting factor is free time, TBH. Just sifting through the sea of models, seeing if quantization works and such is a huge timesink, especially if you are trying to load stuff with ROCm.

And I am on ROCm, specifically on an 8945HS, which is advertised as a Ryzen AI APU yet is completely unsupported as a target, with major issues around queuing and more complex models. (The new 7.0 betas have been promising, but TheRock's flip-flopping with their Docker images has been making me go crazy...)

Ah. On an 8000 APU, to be blunt, you’re likely better off with Vulkan + whatever GGML supports. Last I checked, token generation (TG) is faster and prompt processing is close to ROCm.

…And yeah, that was total misadvertisement on AMD’s part. They’ve completely diluted the term, kinda like TV makers did with ‘HDR’.

The thing is, if only AMD actually added proper support for it, given it has a somewhat powerful NPU as well... For the total TDP of the package it's still one of the best perf-per-watt APUs; just the damn software support isn't there.

Feckin AMD.

The IGP is more powerful than the NPU on these things anyway. The NPU is more for ‘background’ tasks, like Teams audio processing or whatever it's used for on Windows.

Yeah, in hindsight, AMD should have assigned (and still should assign) a few devs to popular projects (and pushed NPU support harder), but GGML support is good these days. It’s gonna be pretty close to RAM speed-bound for text generation.
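To put a number on "RAM speed-bound": for a dense model, each generated token has to read roughly all the weights once, so memory bandwidth divided by model size gives a ceiling on tokens per second. A minimal sketch, where the 80 GB/s bandwidth and 4.5 GB model size are purely illustrative assumptions rather than measurements of any specific APU:

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_read_per_token_gb: float) -> float:
    """Upper bound on decode speed when every token must stream the weights from RAM.

    Ignores KV cache reads, compute time and cache effects, so real numbers
    will be lower than this ceiling.
    """
    if bandwidth_gb_s <= 0 or weights_read_per_token_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / weights_read_per_token_gb


if __name__ == "__main__":
    # Assumed, illustrative numbers: 80 GB/s memory bandwidth, 4.5 GB of weights.
    print(f"ceiling: {max_tokens_per_second(80, 4.5):.1f} tok/s")
```

The same arithmetic is why a sparse (MoE) model can generate faster than its total size suggests: only the active experts' weights are read per token.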

Aye, I was actually hoping to use the NPU for TTS/STT while keeping the LLM systems GPU-bound.

It still uses memory bandwidth, unfortunately. There’s no way around that, though NPU STT/TTS would still be neat.

…Also, generally, STT responses can’t be streamed, so you might as well use the iGPU anyway. TTS can be chunked I guess, but do the major implementations do that?

Piper does chunking for TTS, and could utilise the NPU with the right drivers.
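For context, chunked TTS generally means splitting the text at sentence boundaries and synthesizing each piece as it becomes available, so audio can start before the full reply is done. Below is a generic sketch of that splitting step only; it is not Piper's actual implementation, and `chunk_sentences` and its `max_chars` parameter are hypothetical names for illustration.

```python
import re
from typing import Iterator


def chunk_sentences(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield chunks of whole sentences, packing sentences up to max_chars per chunk.

    A single sentence longer than max_chars is yielded on its own, unsplit.
    """
    # Split after ., ! or ? followed by whitespace (a naive sentence boundary).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunk = ""
    for sentence in sentences:
        if not sentence:
            continue
        if chunk and len(chunk) + 1 + len(sentence) > max_chars:
            yield chunk
            chunk = sentence
        else:
            chunk = f"{chunk} {sentence}" if chunk else sentence
    if chunk:
        yield chunk


if __name__ == "__main__":
    for piece in chunk_sentences("Hello there. How are you? Fine!", max_chars=15):
        # In a real pipeline each piece would be handed to the TTS engine here.
        print(piece)
```

Each chunk could then go to whatever TTS backend is in use while the LLM keeps generating the rest of the response.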

And the idea of running them on the NPU is not about memory usage but hardware capacity/parallelism. Although I guess it would have some benefits when I don't have to constantly load/unload GPU models.

Yeah… Even if the LLM is RAM speed constrained, simply using another device so as not to interrupt it would be good.

Honestly AMD’s software dev efforts are baffling. They’ve sicced a few devs on libraries precisely no one uses, like this: github.com/amd/Quark

While ignoring issues holding back entire sectors (like broken flash-attention) with devs screaming about it at the top of their lungs.

Oh, I forgot!

You should check out Lemonade:

github.com/lemonade-sdk/lemonade

It supports Ryzen NPUs via 2 different runtimes… though apparently not the 8000 series yet?

I've actually been eyeing Lemonade, but the lack of Dockerisation is still an issue... guess I'll just DIY it at some point.

It’s all C++ now, so it doesn’t really need Docker!

You might consider Arch (dockerless) ROCm as well; it looks like 7.1 is in the staging repo right now.

Since I am running Unraid on the node in question, I kinda do need Docker. I want to avoid messing with the core OS as much as possible, plus a Dockerised app is always easier to restore.