Build custom voice assistants
An overview of the current technologies and how to leverage Platypush to build your customized assistant.
I wrote an article a while ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and a microphone.
It also showed how to make your own custom hotword model that triggers the assistant if you don’t want to say “Ok Google”, or if you want distinct hotwords to trigger different assistants in different languages. It also showed how to hook your own custom logic and scripts when certain phrases are recognized, without writing any code.
Since I wrote that article, a few things have changed:
-
When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I’ve worked on supporting Alexa as well. Feel free to use the assistant.echo integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the existing Google Assistant based options because of limitations in the AVS (Alexa Voice Service). For example, it won’t provide the transcript of the detected text or of the rendered response, because the AVS mostly works with audio files as input and provides audio as output. That means it’s not possible to insert custom hooks on recognized phrases. It could also experience some minor audio glitches, at least on RaspberryPi.
-
Although deprecated, a new release of the Google Assistant Library has been made available to fix the segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other SDK I’ve tried (Snowboy, DeepSpeech, or PicoVoice) comes close to the native “Ok Google” hotword detection accuracy and performance. The news isn’t all good, however: the library is still deprecated, with no alternative currently on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But at least one of the best options out there to build a voice assistant will still work for a while. Those interested in building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
-
In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more state-of-the-art alternatives. I’ve been a long-time fan of Snowboy, which has a well-supported Platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory. I’ve also experimented with Mozilla DeepSpeech and PicoVoice products for voice detection, and built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive overview of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.
-
EDIT January 2021: Unfortunately, as of Dec 31st, 2020 Snowboy has been officially shut down. The GitHub repository is still there: you can still clone it and either use the example models provided under resources/models, train a model using the Python API or use any of your previously trained models. However, the repo is no longer maintained, and the website that could be used to browse and generate user models is no longer available. It's really a shame - the user models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the time invested by the developers in them. Anyway, most of the Snowboy examples reported in this article will still work if you download and install the code from the repo.
The Case for DIY Voice Assistants
Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:
-
Privacy. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice interactions over a privately-owned channel through a privately-owned box.
-
Compatibility. A Google Assistant device will only work with devices that support Google Assistant. The same goes for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily, depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to interact with, and does not depend on business decisions.
-
Flexibility. Even when a device works with your assistant, you’re still bound to the features that have been agreed and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky. In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services (Google, Alexa, Siri etc.) in multiple languages on the same device, simply by using different hotwords.
-
Hardware constraints. I’ve never understood the case for selling plastic boxes that embed a microphone and a speaker in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of experiments, it’s probably time to expect the industry to provide a voice assistant experience that can run on any device, as long as it has a microphone and a controller unit that can process code. As for compatibility, there should be no case for Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as long as that device has a way to communicate with the outside world. The logic to control that device should be able to run on the same network that the device belongs to.
-
Cloud vs. local processing. Most of the commercial voice assistants operate by regularly capturing streams of audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but, regardless of the technology, we should always be provided with a choice between decentralized and centralized computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or on-cloud, depending on the use case and depending on the user’s preference.
-
Scalability. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done, without having to buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi, without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the possibility of becoming a smart device.
Overview of the voice assistant integrations
A voice assistant usually consists of two components:
- An audio recorder that captures frames from an audio input device
- A speech engine that keeps track of the current context.
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a far higher overhead than just running hotword detection, which only has to compare the captured speech against the (usually short) list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if you say “Can I have a small double-shot espresso with a lot of sugar and some milk” they may return something like {"type":"espresso", "size":"small", "numberOfShots":2, "sugar":"a lot", "milk":"some"}.
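To make the difference between these three kinds of output more concrete, here is a minimal Python sketch. The class names and fields are hypothetical and only meant for illustration; they are not the types used by Platypush or by any of the engines mentioned here.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass
class Hotword:
    """Output of a hotword detector: only which hotword was heard."""
    name: str


@dataclass
class Transcript:
    """Output of a speech detector: free-form text."""
    text: str


@dataclass
class Intent:
    """Output of a speech-to-intent engine: structured slots."""
    slots: dict


EngineOutput = Union[Hotword, Transcript, Intent]


def describe(output: EngineOutput) -> str:
    """Show how much interpretation is left to the caller in each case."""
    if isinstance(output, Hotword):
        return f"wake up: {output.name}"
    if isinstance(output, Transcript):
        # The caller still has to interpret the text (e.g. regex matching)
        return f"heard: {output.text}"
    # The slots can be used directly, with no text parsing involved
    slots = output.slots
    return f"order: {slots['size']} {slots['type']} x{slots['numberOfShots']}"


# The structured example from the paragraph above
intent = Intent(json.loads(
    '{"type": "espresso", "size": "small", "numberOfShots": 2, '
    '"sugar": "a lot", "milk": "some"}'
))
```

With a transcript you would still need to extract the size and the number of shots from the text yourself, while an intent hands them over as ready-to-use fields.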
In Platypush, I’ve built integrations to provide users with a wide choice when it comes to speech-to-text processors and engines. Let’s go through some of the available integrations, and evaluate their pros and cons.
Native Google Assistant library
Integrations
assistant.google plugin (to programmatically start/stop conversations) and assistant.google backend (for continuous hotword detection).
Configuration
- Create a Google project and download the credentials.json file from the Google developers console.
- Install the google-oauthlib-tool:
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
- Authenticate to use the assistant-sdk-prototype scope:
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
--scope https://www.googleapis.com/auth/gcm \
--save --headless --client-secrets $CREDENTIALS_FILE
- Install Platypush with the HTTP backend and Google Assistant library support:
[sudo] pip install 'platypush[http,google-assistant-legacy]'
- Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:
backend.http:
    enabled: True
backend.assistant.google:
    enabled: True
assistant.google:
    enabled: True
- Start Platypush, say “Ok Google” and enjoy your assistant. On the web panel at http://your-rpi:8008 you should be able to see your voice interactions in real time.
Features
- Hotword detection: YES (“Ok Google” or “Hey Google”).
- Speech detection: YES (once the hotword is detected).
- Detection runs locally: NO (hotword detection [seems to] run locally, but once it's detected a channel is opened with Google servers for the interaction).
Pros
-
It implements most of the features that you’d find in any Google Assistant products. That includes native support for timers, calendars, customized responses on the basis of your profile and location, native integration with the devices configured in your Google Home, and so on. For more complex features, you’ll have to write your custom platypush hooks on e.g. speech detected or conversation start/end events.
-
Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.
-
Good performance even on older RaspberryPi models (the library isn’t available for the Zero model or other arm6-based devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes around 2–3% of the CPU on a RaspberryPi 4.
Cons
-
The Google Assistant library used as a backend by the integration has been deprecated by Google. It still works on most of the devices I’ve tried, as long as the latest version is used, but keep in mind that it’s no longer maintained by Google and it could break in the future. Unfortunately, I’m still waiting for an official alternative.
-
If your main goal is to operate voice-enabled services within a secure environment with no processing happening on someone else’s cloud, then this is not your best option. The assistant library makes your computer behave more or less like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and, potentially, review.
Google Assistant Push-To-Talk Integration
Integrations
assistant.google.pushtotalk plugin.
Configuration
- Create a Google project and download the credentials.json file from the Google developers console.
- Install the google-oauthlib-tool:
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
- Authenticate to use the assistant-sdk-prototype scope:
export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
--scope https://www.googleapis.com/auth/gcm \
--save --headless --client-secrets $CREDENTIALS_FILE
- Install Platypush with the HTTP backend and Google Assistant SDK support:
[sudo] pip install 'platypush[http,google-assistant]'
- Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:
backend.http:
    enabled: True
assistant.google.pushtotalk:
    language: en-US
- Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:
curl -XPOST \
-H "Authorization: Bearer $PP_TOKEN" \
-H 'Content-Type: application/json' -d '
{
"type":"request",
"action":"assistant.google.pushtotalk.start_conversation"
}' http://your-rpi:8008/execute
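The same call can be made from a script. The following is a minimal Python sketch that uses only the standard library and assumes the same /execute endpoint, bearer token and JSON payload shown in the curl command above. build_request and execute are hypothetical helper names introduced for illustration; they are not part of Platypush.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, token: str, action: str,
                  args: Optional[dict] = None) -> urllib.request.Request:
    """Build the POST request for the /execute endpoint (nothing is sent yet)."""
    payload = {"type": "request", "action": action}
    if args:
        payload["args"] = args

    return urllib.request.Request(
        f"{base_url}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def execute(base_url: str, token: str, action: str,
            args: Optional[dict] = None, timeout: float = 10.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(base_url, token, action, args)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, execute("http://your-rpi:8008", token, "assistant.google.pushtotalk.start_conversation") would start a conversation, where token holds the same value as $PP_TOKEN.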
Features
- Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).
- Speech detection: YES.
- Detection runs locally: NO (you can customize the hotword engine and how to trigger the assistant, but once a conversation is started a channel is opened with Google servers).
Pros
-
It implements many of the features you’d find in any Google Assistant product out there, even though hotword detection isn’t available and some of the features currently available on the assistant library aren’t provided (like timers or alarms).
-
Rock-solid speech detection, using the same speech model used by Google Assistant products.
-
Relatively good performance even on older RaspberryPi models. It’s also available for arm6 architecture, which makes it suitable also for RaspberryPi Zero or other low-power devices. No hotword engine running means that it uses resources only when you call start_conversation.
-
It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between your mic and Google’s servers. The connection is only opened upon start_conversation. This makes it a good option if privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or assistants that aren’t triggered by a hotword at all. For example, you can call start_conversation upon button press, motion sensor event or web call.
Cons
-
I’ve built this integration after the deprecation of the Google Assistant library occurred with no official alternatives being provided. I’ve built it by refactoring the poorly refined code provided by Google in its samples (pushtotalk.py) and making a proper plugin out of it. It works, but keep in mind that it’s based on some ugly code that’s waiting to be replaced by Google.
-
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
Alexa Integration
Integrations
assistant.echo plugin.
Configuration
- Install Platypush with the HTTP backend and Alexa support:
[sudo] pip install 'platypush[http,alexa]'
- Run alexa-auth. It will start a local web server on your machine on http://your-rpi:3000. Open it in your browser and authenticate with your Amazon account. A credentials file should be generated under ~/.avs.json.
- Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:
backend.http:
    enabled: True
assistant.echo:
    enabled: True
- Start Platypush. The Alexa integration doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:
curl -XPOST \
-H "Authorization: Bearer $PP_TOKEN" \
-H 'Content-Type: application/json' -d '
{
"type":"request",
"action":"assistant.echo.start_conversation"
}' http://your-rpi:8008/execute
Features
- Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).
- Speech detection: YES (although limited: transcription of the processed audio won’t be provided).
- Detection runs locally: NO.
Pros
-
It implements many of the features that you’d find in any Alexa product out there, even though hotword detection isn’t available. Also, the support for skills or media control may be limited.
-
Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.
-
Good performance even on low-power devices. No hotword engine running means it uses resources only when you call start_conversation.
-
It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and Amazon’s servers. The connection is only opened upon start_conversation.
Cons
-
The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio responses. It means that text transcription, either for the request or the response, won’t be available. That limits what you can build with it. For example, you won’t be able to capture custom requests through event hooks.
-
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
Snowboy Integration
Integrations
assistant.snowboy backend.
Configuration
- Install Platypush with the HTTP backend and Snowboy support:
[sudo] pip install 'platypush[http,snowboy]'
- Choose your hotword model(s). Some are available under SNOWBOY_INSTALL_DIR/resources/models. Otherwise, you can train or download models from the Snowboy website.
- Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:
backend.http:
    enabled: True
backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4
        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4
        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5
- Start Platypush. Say the hotword associated with one of your models, check in the logs that the HotwordDetectedEvent is triggered and, if there’s an assistant plugin associated with the hotword, that the corresponding assistant is correctly started.
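The mapping expressed by the models section can be pictured as a simple routing table. The sketch below is only an illustration of the idea, not how the Snowboy backend is implemented: MODELS, Route and route_hotword are hypothetical names, and the way the language is eventually handed to the assistant plugin is deliberately left out.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    """What should happen when a given hotword is detected."""
    action: str
    language: str


# Hotword name -> settings, mirroring the models section of the YAML above
MODELS = {
    "computer": {
        "assistant_plugin": "assistant.google.pushtotalk",
        "assistant_language": "it-IT",
    },
    "ok_google": {
        "assistant_plugin": "assistant.google.pushtotalk",
        "assistant_language": "en-US",
    },
    "alexa": {
        "assistant_plugin": "assistant.echo",
        "assistant_language": "en-US",
    },
}


def route_hotword(hotword: str) -> Optional[Route]:
    """Return the action to run for a hotword, or None if it isn't configured."""
    model = MODELS.get(hotword)
    if model is None:
        return None

    return Route(
        action=f"{model['assistant_plugin']}.start_conversation",
        language=model["assistant_language"],
    )
```

Saying “computer” resolves to the Google push-to-talk assistant in Italian, while “Alexa” resolves to the Alexa integration, all on the same device.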
Features
- Hotword detection: YES.
- Speech detection: NO.
- Detection runs locally: YES.
Pros
-
I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning. You can download any hotword models for free from their website, provided that you record three audio samples of you saying that word in order to help improve the model. You can also create your custom hotword model, and if enough people are interested in using it then they’ll contribute with their samples, and the model will become more robust over time. I believe that more machine learning projects out there could really benefit from this “use it for free as long as you help improve the model” paradigm.
-
Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to make a multi-language and multi-hotword voice assistant.
-
Good performance, even on low-power devices. I’ve used Snowboy in combination with the Google Assistant push-to-talk integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never exceeded 20–25%.
-
The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection to run and no data exchanged with any cloud.
Cons
- Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up, the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform quite poorly with speech recorded from individuals who don’t fit within that category (and also with people who aren’t native English speakers).
Mozilla DeepSpeech
Integrations
stt.deepspeech plugin and stt.deepspeech backend (for continuous detection).
Configuration
- Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that gets installed:
[sudo] pip install 'platypush[http,deepspeech]'
- Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while depending on your connection:
export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1
wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
# x deepspeech-0.6.1-models/
# x deepspeech-0.6.1-models/lm.binary
# x deepspeech-0.6.1-models/output_graph.pbmm
# x deepspeech-0.6.1-models/output_graph.pb
# x deepspeech-0.6.1-models/trie
# x deepspeech-0.6.1-models/output_graph.tflite
mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR
- Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the DeepSpeech integration:
backend.http:
    enabled: True
stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello
    conversation_timeout: 5
backend.stt.deepspeech:
    enabled: True
- Start Platypush. Speech detection will start running on startup. SpeechDetectedEvents will be triggered when you talk. HotwordDetectedEvents will be triggered when you say one of the configured hotwords. ConversationDetectedEvents will be triggered when you say something after a hotword, with speech provided as an argument. You can also disable the continuous detection and only start it programmatically by calling stt.deepspeech.start_detection and stt.deepspeech.stop_detection. You can also use it to perform offline speech transcription from audio files:
curl -XPOST \
-H "Authorization: Bearer $PP_TOKEN" \
-H 'Content-Type: application/json' -d '
{
"type":"request",
"action":"stt.deepspeech.detect",
"args": {
"audio_file": "~/audio.wav"
}
}' http://your-rpi:8008/execute
{
"type":"response",
"target":"http",
"response": {
"errors":[],
"output": {
"speech": "This is a test"
}
}
}
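If you call the API from a script, the transcript can be pulled out of a response shaped like the one above. This is a minimal Python sketch: extract_speech is a hypothetical helper, and it only assumes the response, errors, output and speech fields visible in that sample.

```python
import json


def extract_speech(response: dict) -> str:
    """Return the transcribed text, or raise if the response reports errors."""
    body = response.get("response", {})
    errors = body.get("errors") or []
    if errors:
        raise RuntimeError("; ".join(str(error) for error in errors))

    return body.get("output", {}).get("speech", "")


# The sample response shown above
sample = json.loads("""
{
    "type": "response",
    "target": "http",
    "response": {
        "errors": [],
        "output": {"speech": "This is a test"}
    }
}
""")
```

Raising on a non-empty errors list keeps a failed transcription from being silently mistaken for an empty one.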
Features
- Hotword detection: YES.
- Speech detection: YES.
- Detection runs locally: YES.
Pros
-
I’ve been honestly impressed by the features of DeepSpeech and the progress they’ve made starting from the version 0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party services or network connection. The full codebase is open-source and the Tensorflow voice and language models are also very good. It’s amazing that they’ve released the whole thing for free to the community. It also means that you can easily extend the Tensorflow model by training it with your own samples.
-
Speech-to-text transcription of audio files can be a very useful feature.
Cons
-
DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4 (but in my tests it took 100% of a core on a RaspberryPi 4 for speech detection). It may be too resource-intensive to run on less powerful machines.
-
DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.
-
DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that’s something where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.” “This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text transcription purposes but, in such ambiguous cases, it lacks some semantic context.
-
Even though it’s possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it’s not how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only intended to detect one or a few words, not against a full-featured language model. The best usage of DeepSpeech is probably either for offline text transcription, or alongside another hotword integration, leveraging DeepSpeech only for the speech detection part.
PicoVoice
PicoVoice is a very promising company that has released several products for performing voice detection on-device. Among them:
- Porcupine, a hotword engine.
- Leopard, a speech-to-text offline transcription engine.
- Cheetah, a speech-to-text engine for real-time applications.
- Rhino, a speech-to-intent engine.
So far, Platypush provides integrations with Porcupine and Cheetah.
Integrations
- Hotword engine: stt.picovoice.hotword plugin and stt.picovoice.hotword backend (for continuous detection).
- Speech engine: stt.picovoice.speech plugin and stt.picovoice.speech backend (for continuous detection).
Configuration
- Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:
[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'
- Create or add the lines to your ~/.config/platypush/config.yaml to enable the PicoVoice integration:
stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello
# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True
# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True
# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
              seconds: 10
              name: StopSpeechDetection
              actions:
                  - action: stt.picovoice.speech.stop_detection
        - action: stt.picovoice.speech.start_detection
- Start Platypush and enjoy your on-device voice assistant.
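The hook in the configuration above opens a time-limited listening window: the hotword starts speech detection, and a timer stops it ten seconds later. The following Python sketch shows the same idea outside of Platypush. SpeechWindow is a hypothetical class, the start and stop callables stand in for stt.picovoice.speech.start_detection and stop_detection, and restarting the window when the hotword is repeated is a choice of this sketch rather than documented behaviour of the hook.

```python
import threading
from typing import Callable, Optional


class SpeechWindow:
    """Start speech detection on a hotword and stop it after a timeout."""

    def __init__(self, start: Callable[[], None], stop: Callable[[], None],
                 seconds: float = 10.0):
        self._start = start
        self._stop = stop
        self._seconds = seconds
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def on_hotword(self) -> None:
        """Open the listening window (or restart it if one is already open)."""
        with self._lock:
            if self._timer is not None:
                # A repeated hotword cancels the pending stop
                self._timer.cancel()

            self._start()
            self._timer = threading.Timer(self._seconds, self._stop)
            self._timer.daemon = True
            self._timer.start()

    def wait(self) -> None:
        """Block until the current window has been closed."""
        with self._lock:
            timer = self._timer
        if timer is not None:
            timer.join()
```

Keeping the window short limits how long the more expensive speech engine runs, while the cheap hotword engine stays on all the time.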
Features
- Hotword detection: YES.
- Speech detection: YES.
- Detection runs locally: YES.
Pros
- When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much less delay than DeepSpeech and it’s also much less power-hungry — it will still run well and with low latency even on older models of RaspberryPi.
Cons
-
While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn’t dig much into how they’ve solved the problem.
-
Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you want to expand the set of keywords provided by default, or add more samples to train the existing models, then you’ll have to go for a commercial license. Their speech engine (Cheetah) instead can only be installed and run free of charge for personal use on Linux on x86_64 architecture. Any other architecture or operating system, as well as any chance to extend the model or use a different model, is only possible through a commercial license. While I understand their point and their business model, I’d have been super-happy to just pay for a license through a more friendly process, instead of relying on the old-fashioned “contact us for a commercial license/we’ll reach back to you” paradigm.
-
Cheetah’s speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent detection. The “this/these” ambiguity also happens here. These problems can be partially solved by using Rhino, PicoVoice’s speech-to-intent engine, which provides a structured representation of the speech intent instead of a letter-by-letter transcription, but I haven’t yet worked on integrating Rhino into Platypush.
Conclusions
The democratization of voice technology has long been dreamed about, and it’s finally (slowly) coming. The situation out there is still quite fragmented though and some commercial SDKs may still get deprecated with short notice or no notice at all. But at least some solutions are emerging to bring speech detection to all devices.
I’ve built integrations in Platypush for all of these services because I believe that it’s up to users, not to businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice integrations in the same product — and especially having voice integrations that expose all the same API and generate the same events — makes it very easy to write assistant-agnostic logic, and really decouple the tasks of speech recognition from the business logic that can be run by voice commands.
Check out my previous article to learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop events.
To summarize my findings so far:
-
Use the native Google Assistant integration if you want to have a full Google experience, and if you’re ok with Google servers processing your audio and with the possibility that at some point in the future the deprecated Google Assistant library won’t work anymore.
-
Use the Google push-to-talk integration if you only want to have the assistant, without hotword detection, or you want your assistant to be triggered by alternative hotwords.
-
Use the Alexa integration if you already have an Amazon-powered ecosystem and you’re ok with having less flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.
-
Use Snowboy if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not be that accurate.
-
Use Mozilla DeepSpeech if you want a fully on-device open-source engine powered by a robust Tensorflow model, even if it takes more CPU load and a bit more latency.
-
Use PicoVoice solutions if you want a full voice solution that runs on-device and is both accurate and performant, even though you’ll need a commercial license to use it on some devices or to extend/change the model.
Let me know your thoughts on these solutions and your experience with these integrations!
Those who have followed me for a while know of my personal obsession with self-built voice assistants.
My experiments over the years can be summarized as follows:
-
2007: Voxifera, my very first attempt at building a primitive voice assistant using Hidden Markov models. Definitely not good for general-purpose usage, but good enough in 2007 to distinguish between a dozen simple voice commands.
-
2019: First voice assistant built on top of Platypush. It used the now deprecated Google Assistant Library on top of a Raspberry Pi with a microphone and a speaker, and it could hook any automation routines and custom commands to it through event hooks.
-
2020: Second iteration on Platypush, this time supporting other assistant plugins too - Alexa (integration now removed), Snowboy (also removed, since the project is dead), Mozilla DeepSpeech (also removed now, since Mozilla discontinued it), PicoVoice, and mimic3 (the text-to-speech engine built on top of Mycroft, now bankrupt).
-
2024: Third iteration on Platypush, this time with an enhanced PicoVoice integration and new speech-to-text and text-to-speech plugins based on the OpenAI APIs.
But it's now 2026, and perhaps both the hardware and the software are now mature enough for fully on-device voice assistants based on fully open solutions likely to stick around for a while.
In this article we'll close that gap with Platypush:
- assistant.openwakeword listens for the wake word locally.
- assistant.vosk transcribes the command locally.
- tts.piper speaks the answer locally.
- openai is used only where a language model is useful: turning messy speech into intent, or answering general questions.
- Existing home automation plugins such as light.hue, music.mpd or weather.openweathermap provide the actions.
The result is not another cloud assistant with a different coat of paint. The hotword engine, speech recognition, command dispatch and speech synthesis can all run on-device. If the openai step points to a local OpenAI-compatible server, then the whole pipeline can stay on your LAN too.
The pipeline
The architecture can be summarized as follows:
Hotword detection ("OK Google", "Alexa" etc.) is a continuous, low-latency workload, and it should not need the network.
Speech-to-text is also a good fit for local inference: Vosk models are small enough to run on modest hardware, including Raspberry Pis, and they are perfectly adequate for short home automation commands.
Text-to-speech is another place where local models are good enough nowadays: Piper voices are fast, small and much nicer than the old robotic espeak-style fallback.
The only optional network-shaped piece is the language model.
But that is a policy choice, not a requirement of the voice stack.
Setup
Clone the assistant sample repository:
git clone https://git.platypush.tech/platypush/assistant-sample
cd assistant-sample
Models
The next step is to download the voice models used by the voice stack.
Hotword Detection
When the service starts for the first time, it will automatically download all the available models.
You can then use the following command to list the available models once the service is running:
curl -s -XPOST \
-H 'Content-type: application/json' \
-H "Authorization: Bearer $PLATYPUSH_TOKEN" \
-d '{"type":"request", "action":"assistant.openwakeword.list_models"}' \
http://localhost:8008/execute
Where $PLATYPUSH_TOKEN is the token of the user that is running the service.
You can retrieve it by connecting to http://localhost:8008 when the service starts for the first time. Create your credentials, then select Settings -> Tokens -> Generate API Token.
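The same call can also be made from Python with only the standard library. This is a minimal sketch: the endpoint, headers and payload are copied from the curl example above, while the helper names build_execute_request and list_wakeword_models are hypothetical and not part of Platypush.

```python
import json
import urllib.request

# Same endpoint as the curl example above
PLATYPUSH_URL = "http://localhost:8008/execute"


def build_execute_request(action: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /execute endpoint."""
    payload = {"type": "request", "action": action}
    return urllib.request.Request(
        PLATYPUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def list_wakeword_models(token: str):
    """Send the request and return the decoded JSON response as-is."""
    request = build_execute_request("assistant.openwakeword.list_models", token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the service running, list_wakeword_models(os.environ["PLATYPUSH_TOKEN"]) should return the same JSON document that curl prints.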
Speech-to-text
A full list of the Vosk voice models is available here.
Some feedback about the quality of the English models:
| Model | Size | Notes |
|---|---|---|
| vosk-model-small-en-us-0.15 | 40 MB | Very fast and lightweight model that can also run on an old Raspberry Pi, but accuracy can be low. |
| vosk-model-en-us-0.22-lgraph | 128 MB | Reasonably accurate on clear speech and with native speakers, but still small enough to run fine even on a Raspberry Pi. |
| vosk-model-en-us-0.22 | 1.8 GB | Accurate generic US English model. Fast on a laptop or x86 processor, but it may be a bit heavy on a Raspberry Pi. |
Download the selected model to the Docker volume working directory:
mkdir -p ./workdir/assistant.vosk/models
cd ./workdir/assistant.vosk/models
wget "https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-lgraph.zip"
unzip "vosk-model-en-us-0.22-lgraph.zip"
rm "vosk-model-en-us-0.22-lgraph.zip"
Text-to-speech
Download a speech synthesis model from here.
Audio samples are also available to get an idea of the type of voice before downloading.
The model usually consists of a *.onnx and a *.onnx.json file. Download both of them to the Docker volume working directory:
mkdir -p ./workdir/piper_tts
cd ./workdir/piper_tts
wget "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx"
wget "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx.json"
Configuration
Copy and edit the example configuration file.
cp config/config.example.yaml config/config.yaml
Home automation plugins
The assistant becomes useful once recognized speech can reach the rest of the house.
For example, Hue lights:
light.hue:
  bridge: hue
  groups:
    - Living Room
And MPD/Mopidy for music:
music.mopidy:
  host: localhost
music.mpd:
  host: localhost
  poll_interval: null
Those are just regular Platypush plugins.
The assistant does not need special knowledge about Hue, MPD, Chromecast, Zigbee, MQTT or anything else.
It only needs to emit events; your hooks decide what to do with them.
Build
Build the container image for the assistant service:
docker build -t platypush-voice .
Run
The assistant needs access to the host microphone and speakers. The container routes ALSA through PulseAudio, so the examples below connect it to a PulseAudio server running on the host.
Linux
With PulseAudio or pipewire-pulseaudio installed:
docker run --rm \
-e PULSE_SERVER=unix:/run/pulse/native \
-v /run/user/$(id -u)/pulse/native:/run/pulse/native \
--name voice-assistant \
-p 8008:8008 \
-v ./config:/etc/platypush \
-v ./workdir:/var/lib/platypush \
platypush-voice
macOS
Install and start PulseAudio on the host:
brew install pulseaudio
pulseaudio --daemonize=yes --exit-idle-time=-1
pactl load-module module-native-protocol-tcp \
auth-anonymous=1 \
listen=0.0.0.0 \
port=4713
Then start the container:
docker run --rm \
-e PULSE_SERVER=tcp:host.docker.internal:4713 \
--name voice-assistant \
-p 8008:8008 \
-v "$(pwd)/config:/etc/platypush" \
-v "$(pwd)/workdir:/var/lib/platypush" \
platypush-voice
If pactl load-module reports that the module is already loaded, you can keep using the existing PulseAudio daemon.
Windows
Install PulseAudio for Windows, then create a default.pa file in the same directory as pulseaudio.exe:
load-module module-waveout sink_name=output source_name=input record=1
load-module module-native-protocol-tcp auth-anonymous=1 listen=0.0.0.0 port=4713
set-default-sink output
set-default-source input
Start PulseAudio from PowerShell:
.\pulseaudio.exe -F .\default.pa --exit-idle-time=-1
Then start the container from the repository directory:
docker run --rm `
-e PULSE_SERVER=tcp:host.docker.internal:4713 `
--name voice-assistant `
-p 8008:8008 `
-v "${PWD}/config:/etc/platypush" `
-v "${PWD}/workdir:/var/lib/platypush" `
platypush-voice
Make sure microphone access is enabled for desktop applications under Windows privacy settings, and allow PulseAudio through the firewall if prompted.
Usage
Once the service is running, you can start interacting with it through voice commands (the default activation word is "Alexa").
Any questions about the weather will be resolved by the weather plugin if it's been enabled.
If the music or lights plugins are enabled, they can be controlled with voice commands ("stop the music", "turn on the lights", etc.).
Otherwise, the assistant will use the openai plugin to respond to your questions, with follow-up turns when the response from OpenAI is also a question.
Extending the Assistant
The assistant logic is modeled through simple Platypush hooks under config/scripts.
You can extend it as you like by defining your own hooks or modifying the existing ones.
Starting a conversation
Conversations are started by hooking to the HotwordDetectedEvent.
import logging

from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent

logger = logging.getLogger(__name__)
ai_plugin = "openai"
assistant_plugin = "assistant.vosk"


@when(HotwordDetectedEvent)
def on_hotword_detected(event: HotwordDetectedEvent):
    """
    When the hotword is detected, start a conversation.
    """
    logger.info(f"Hotword {event.hotword} detected")
    run(f"{assistant_plugin}.start_conversation")
Deterministic commands
For common home automation commands, regular event hooks are still the best tool. They are fast, inspectable, and they do not hallucinate.
from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def turn_on_lights():
    """
    Hook run when the user says "turn on the lights" (regex)
    """
    run("light.hue.on")


@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music():
    """
    Hook run when the user says "play the music" (regex)
    """
    run("music.mpd.play")


@when(SpeechRecognizedEvent, phrase="set the music volume (to|on|at) ${volume}")
def set_volume(volume: int):
    """
    Hook run when the user says "set the music volume to ${volume}"
    (regex with parameter).
    """
    run("music.mpd.set_volume", volume=volume)
AI Commands
If the openai plugin is enabled, you can use it to help you answer questions.
There are two generic use-cases for voice assistants where an AI plugin is beneficial:
- Speech to Intent
- Response fallback
Speech to Intent
You may want this for general questions, for commands that do not fit a neat regular expression, or for transforming a raw sentence such as:
make it a bit darker and reduce the music volume
into a structured action plan like:
[
  {
    "action": "light.hue.set_lights",
    "args": {
      "bri": 50
    }
  },
  {
    "action": "music.mpd.set_volume",
    "args": {
      "volume": 20
    }
  }
]
An example provided in the assistant sample is that of weather forecasting.
Note in particular the use of openai.get_response with a well-crafted system prompt that turns a natural language request like:
What's the weather tomorrow in San Francisco?
into:
{
  "type": "weather",
  "delta_days": 1,
  "location": "San Francisco"
}
def parse_weather_request(request: str) -> WeatherRequest | None:
    # Ask the model to turn the free-text request into a small JSON object
    request_dict = (
        run(
            "openai.get_response",
            context=[
                {
                    "role": "system",
                    "content": (
                        "You are a voice assistant provided with weather requests as free text.\n"
                        "Given the prompt, return a structured JSON representation of the request in the following format: "
                        '{ "type": "weather", "delta_days": 1, "location": "San Francisco" }, '
                        'where both delta_days and location are optional (e.g. if the user simply asks "How\'s the weather?").\n'
                        'If the prompt doesn\'t seem to contain a weather request, return { "type": null }'
                    ),
                }
            ],
            prompt=request,
        )
        or {}
    )

    # Anything that is not explicitly tagged as a weather request is ignored
    if request_dict.get("type") != "weather":
        return None

    weather_request = WeatherRequest(
        location=request_dict.get("location", default_location),
        delta_days=request_dict.get("delta_days", 0),
    )

    return weather_request
You can also use the model for intermediate transformation instead of direct answers. For example, ask it to return a tiny JSON object with action and args, then dispatch only actions you explicitly allow:
ALLOWED_ACTIONS = {
    "lights.on": "light.hue.on",
    "lights.off": "light.hue.off",
    "music.play": "music.mpd.play",
    "music.stop": "music.mpd.stop",
}


@when(SpeechRecognizedEvent)
def on_fuzzy_command(event):
    plan = run(
        "openai.get_response",
        prompt=event.phrase,
        context=[
            {
                "role": "system",
                "content": (
                    "Map the user command to JSON only: "
                    '{"action": "...", "args": {...}}. '
                    f"Allowed actions: {', '.join(ALLOWED_ACTIONS)}. "
                    "If none match, return {\"action\": null, \"args\": {}}."
                ),
            }
        ],
    )
    # Parse `plan` as JSON here, validate it, then run only an allow-listed action.
That last validation step matters. A model may be useful for interpretation, but it should not get arbitrary access to run().
Response fallback
If a request doesn't match any of the commands you have defined, you can use a generic SpeechRecognizedEvent hook to forward the request to an AI plugin, and render the response as speech through the text-to-speech plugin.
import logging

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

logger = logging.getLogger(__name__)
ai_plugin = "openai"
assistant_plugin = "assistant.vosk"


@when(SpeechRecognizedEvent, plugin=assistant_plugin)
def on_speech_recognized(event: SpeechRecognizedEvent):
    """
    Generic handler for speech recognition events received
    by the configured assistant plugin.
    """
    logger.info("Recognized speech: %s", event.phrase)

    # Forward the request to OpenAI and render the response as speech
    response = run(
        f"{ai_plugin}.get_response",
        prompt=event.phrase,
        context=[
            {
                "role": "system",
                "content": (
                    "You are a voice assistant that can answer questions and perform actions. "
                    "Keep in mind that prompts are transcriptions of user speech and they may "
                    "contain misspellings or errors. Try and interpret them as best as possible. "
                    "When possible, keep your answers short and concise."
                ),
            }
        ],
    )

    # If the response is not empty, render it using the TTS plugin
    if response:
        event.assistant.render_response(response)
When a response from the LLM ends with a question mark, the assistant will automatically listen for a follow-up command and fire a new SpeechRecognizedEvent.
Pausing music while listening
One nice touch is to pause the music when a conversation starts and resume it after the assistant is done.
import logging

from platypush import run, when
from platypush.events.assistant import (
    ConversationEndEvent,
    ConversationStartEvent,
)

logger = logging.getLogger(__name__)


@when(ConversationStartEvent)
def on_conversation_start():
    # Cancel any pending resume scheduled by a previous conversation
    try:
        run("utils.clear_timeout", name="ConversationEndTimeout")
    except Exception as e:
        logger.error("Error clearing conversation end timeout: %s", e)

    run("music.mpd.pause_if_playing")


@when(ConversationEndEvent)
def on_conversation_end():
    # Resume the music a few seconds after the conversation ends
    run(
        "utils.set_timeout",
        name="ConversationEndTimeout",
        seconds=5,
        actions=[{"action": "music.mpd.play_if_paused"}],
    )
That makes the interaction feel much less clumsy: wake word, music ducks or pauses, command is recognized, answer is spoken, music resumes a few seconds later.
Going fully local
With the configuration above, hotword detection, speech-to-text, automation and text-to-speech are already local. The only non-local component is the openai plugin, if it points to OpenAI's servers.
To make the last step local too, run a model server that exposes an OpenAI-compatible API. Ollama, llama.cpp server, vLLM and LocalAI can all expose some version of /v1/chat/completions.
For example, with Ollama:
ollama pull llama3.1:8b
ollama serve
The OpenAI-compatible endpoint is then usually available at:
http://127.0.0.1:11434/v1/chat/completions
If your Platypush openai plugin version supports a custom API base URL, the configuration is the whole change:
openai:
  model: llama3.1:8b
  base_url: http://127.0.0.1:11434/v1
If it does not, keep the rest of the assistant exactly the same and replace only the fallback action with a tiny local request:
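One way to write such a request with only the standard library is sketched below. It assumes the Ollama endpoint and model shown above and the usual OpenAI chat-completions response shape (choices[0].message.content). The function names are hypothetical and not part of Platypush.

```python
import json
import urllib.request

# Assumed values, taken from the Ollama example above
LOCAL_CHAT_URL = "http://127.0.0.1:11434/v1/chat/completions"
LOCAL_MODEL = "llama3.1:8b"


def build_chat_payload(prompt: str, system_prompt: str, model: str = LOCAL_MODEL) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    }


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    choices = response_body.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return (message.get("content") or "").strip()


def ask_local_model(prompt: str, system_prompt: str, url: str = LOCAL_CHAT_URL) -> str:
    """Send the prompt to the local server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt, system_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(json.load(response))
```

In the fallback hook, the call to the openai plugin would then be replaced by ask_local_model(event.phrase, system_prompt), with the returned text rendered through the assistant as before.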
That is enough to turn the assistant into a fully local stack.
On a Raspberry Pi, I would still keep expectations realistic. Hotword detection, Vosk and Piper are fine on small machines. Local LLMs are the heavy piece. A Pi 5 with enough RAM can run small quantized models, but latency will not feel like a cloud model or a GPU-backed workstation. For many home automation workflows, that is acceptable because the LLM is only the fallback; the frequent commands stay deterministic.
Why this architecture ages well
Voice assistants have been a graveyard of abandoned SDKs and cloud products. Snowboy is gone. Mycroft is gone. The old Google Assistant SDK is deprecated. Vendor assistants are increasingly shaped around vendor ecosystems rather than user-controlled automation.
The safer long-term bet is not one monolithic assistant. It is a pipeline of small replaceable parts:
- Swap the hotword model without touching the automation logic.
- Swap Vosk for another STT engine without touching Hue or MPD.
- Swap OpenAI for a local OpenAI-compatible model without touching the wake word, TTS or command hooks.
- Swap Piper voices without touching the assistant flow.
Platypush is a good fit for this because its event system is already the boundary between perception and action. Speech recognition emits an event. Hooks decide what to do. Plugins execute the actions.
That separation is what makes the assistant inspectable. It is also what makes it possible to keep most of it on a Raspberry Pi in your house, instead of outsourcing the entire audio loop to a cloud service that may disappear, get worse, or decide one day that your use case is no longer part of the roadmap.
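The idea of small replaceable parts can be illustrated outside Platypush with a few lines of plain Python. This is a conceptual sketch only, not how Platypush is implemented: VoicePipeline and its fields are hypothetical names, and each stage is just a callable that can be swapped without touching the others.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Three swappable stages; each one is just a callable."""

    transcribe: Callable[[bytes], str]  # speech-to-text stage
    handle: Callable[[str], str]  # hooks or model fallback, returns reply text
    speak: Callable[[str], None]  # text-to-speech stage

    def on_hotword(self, audio: bytes) -> str:
        """Run one conversation turn and return the reply text."""
        phrase = self.transcribe(audio)
        reply = self.handle(phrase)
        if reply:
            self.speak(reply)
        return reply
```

Replacing the speech-to-text engine then means passing a different transcribe callable, while handle and speak stay exactly as they are.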
Final notes
The minimal version of this setup is small:
- assistant.openwakeword for the always-on wake word.
- assistant.vosk for local command transcription.
- A few @when(SpeechRecognizedEvent, phrase=...) hooks for deterministic commands.
- light.hue, music.mpd or any other Platypush plugin for actions.
- tts.piper for local spoken responses.
- openai.get_response only where language understanding is worth the cost.
Start with the deterministic commands. Add the model fallback later. That way the assistant stays fast for the commands you use every day, while still being flexible enough to answer questions or interpret messy speech when you need it.