Testing the #Pitxu with different hardware combinations has led me to discover that I like the #RaspberryPiZero2 + UPS + #WhispayHat format, which gives me the bare minimum for an autonomous mini-computer with sound and a screen (and a button). It's very compact and portable, you can print it a little case with a 3D printer (easy) and it can end up really cute.

Problem is, it can't keep up. The STT and TTS models stall, and the #Gemini chatbot fries it completely. Neither overclocking nor swap have helped much.

Today, talking with @miguelflorido, the idea came up of standing up some endpoints on the Pitxu so it runs the #speechtotext transcription and the #chatbot response. That way, the heavy lifting is done by the RPi5 with the #AIHat+2, and the RPiZ2W acts as a simple client: recording audio, playing back the voice, and showing things on screen. Everything else goes over HTTP.
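The satellite side of that split can be sketched with nothing but the standard library — a minimal sketch, where the host name, port, endpoint path and content type are assumptions for illustration, not the actual Pitxu API:

```python
import urllib.request

# Hypothetical Pitxu server endpoint (host, port and path are assumptions).
PITXU_URL = "http://pitxu.local:5000/transcribe"

def send_audio(wav_bytes: bytes, url: str = PITXU_URL) -> bytes:
    """POST a recorded WAV clip to the server and return the reply body.

    The heavy STT/chatbot work happens server-side on the RPi5; the
    Zero 2 only records, sends, and plays back.
    """
    req = urllib.request.Request(
        url,
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

The reply bytes would then go straight to the soundcard for playback, keeping the Zero 2 as a thin client.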

In a couple of hours I had #Flask listening for requests in a thread, and the tests with #Postman look very good on my home WiFi.
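The Flask-in-a-thread setup can look something like this — a minimal sketch, where the endpoint name, port and reply shape are assumptions, and the placeholder body stands in for the real STT engine call:

```python
import threading

from flask import Flask, request

app = Flask(__name__)

@app.route("/transcribe", methods=["POST"])  # endpoint name is an assumption
def transcribe():
    wav_bytes = request.get_data()
    # Here the RPi5 would run the real STT engine (Vosk/Whisper);
    # a placeholder reply keeps the sketch self-contained.
    return {"text": f"({len(wav_bytes)} bytes received)"}

def start_server() -> threading.Thread:
    """Run Flask in a daemon thread so the main loop stays free."""
    t = threading.Thread(
        target=lambda: app.run(host="0.0.0.0", port=5000),
        daemon=True,
    )
    t.start()
    return t

if __name__ == "__main__":
    start_server()
```

Running it as a daemon thread means the process can keep doing other work (or just wait) while Flask serves requests in the background.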

I would really love to have a #miniPitxu in my pocket.

Voice technology is such a game-changer! Xavi, are you using any specific tools?

@techsimplified it is, completely! I find that having my hands free to do actions (and queries) is indeed a game changer. I'm just banging my head trying to get the STT to work smoothly.

This project in the pic is a satellite device of my ongoing main #Pitxu build, chaining STT > Chatbot > TTS. As a satellite, it just captures sound, sends it to the "server" and plays the answer. It is a #RaspberryPiZero2, so it can't really hold all the engines needed.

As per tooling, the whole pack uses:
- #Vosk (now tinkering with #Whisper)
- #Gemini (now tinkering with #Ollama offline)
- #Piper
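That STT > Chatbot > TTS chain can be glued together as three pluggable stages, so engines (Vosk/Whisper, Gemini/Ollama, Piper) can be swapped without touching the glue — a minimal sketch where the lambda "engines" are stand-ins for illustration, not the real library calls:

```python
from typing import Callable

def make_pipeline(
    stt: Callable[[bytes], str],     # audio in, text out (e.g. Vosk/Whisper)
    chatbot: Callable[[str], str],   # prompt in, reply out (e.g. Gemini/Ollama)
    tts: Callable[[str], bytes],     # text in, audio out (e.g. Piper)
) -> Callable[[bytes], bytes]:
    """Chain the three engines: audio question in, audio answer out."""
    def run(audio_in: bytes) -> bytes:
        question = stt(audio_in)
        answer = chatbot(question)
        return tts(answer)
    return run

# Stand-in engines for illustration only.
pipeline = make_pipeline(
    stt=lambda audio: "hello pitxu",
    chatbot=lambda text: text.upper(),
    tts=lambda text: text.encode("utf-8"),
)
```

Keeping each stage behind a plain callable is what makes the current tinkering (Vosk → Whisper, Gemini → Ollama) a one-line swap.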

But a big chunk of my brain goes to the UX hardware:
- screen for a more human interaction
- soundcard I/O (gosh RPi is not yet polished here)
- GPIO buttons, UPS, PWM fan cases,...

@xavi Right? Once you go hands-free, typing feels like going back to dial-up. What are you bumping into — accuracy issues, or more about integration with specific tools?

@techsimplified yes, accuracy issues indeed. The current state is good enough for the initial development tests, but STT accuracy mistakes make the rest of the pipeline mediocre, no matter how good it is. The input is the key.

#Vosk has been great, but I feel I'm bumping into its limits. I'm testing #Whisper, which should deliver punctuation and better accuracy; that translates into better interaction with the chatbot, which brings an improved user experience.
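Since input accuracy is the whole game here, one quick way to compare Vosk vs Whisper output on the same clips is word error rate (WER) — a minimal stdlib sketch using word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Running both engines over a handful of reference sentences and comparing WER gives a number to back up the "good enough vs. near-perfect" feeling.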

I will take a look at your suggestion, but I'm focusing on voice-to-text rather than voice-to-action, as I aim for a conversational experience more than simply executing tasks.

Thanx!

@xavi Totally agree — input accuracy is everything. If the STT fumbles, the whole pipeline suffers. Vosk is solid but has its ceiling. Have you tried Whisper-based engines? We hit 99.5% accuracy with Genie 007 across 140+ languages. The jump from "good enough" to near-perfect changes everything downstream. 🎯

@techsimplified Whisper is exactly the next thing to pick up. Hopefully I'll dive into it this weekend.

So you contribute to Genie 007? I took a look at it when you suggested it, and it looked promising 🙂 Do you have a first-hand link to a #python example?

My requirements are offline + multilanguage + small enough to run inside a #RaspberryPi5 😉

@xavi Exactly! 🎯 Whisper should definitely boost your accuracy - that foundation makes everything else work. Your conversational focus is smart. We've found voice-to-action can actually enhance conversational flows too - quick formatting, navigation, contextual responses while staying in conversation. Input quality is everything! #VoiceAI #Whisper