Build custom voice assistants

An overview of the current technologies and how to leverage Platypush to build your customized assistant.

Mar 08, 2020

I wrote an article a while ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and a microphone.

It also showed how to make your own custom hotword model that triggers the assistant if you don’t want to say “Ok Google”, or if you want distinct hotwords to trigger different assistants in different languages. It also showed how to hook your own custom logic and scripts when certain phrases are recognized, without writing any code.

Since I wrote that article, a few things have changed:

When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I’ve worked on supporting Alexa as well. Feel free to use the assistant.echo integration in Platypush if you’re an Alexa fan, but bear in mind that it’s more limited than the existing Google Assistant based options because of limitations in the AVS (Alexa Voice Service). For example, it won’t provide the transcript of the detected text or of the rendered response, because the AVS mostly works with audio files as input and provides audio as output. That means it’s not possible to attach custom hooks to recognized phrases. It could also experience some minor audio glitches, at least on RaspberryPi.
Although deprecated, a new release of the Google Assistant Library has been made available to fix the segmentation fault issue on RaspberryPi 4. I’ve buzzed the developers often over the past year and I’m glad that it’s been done! It’s good news because the Assistant library has the best engine for hotword detection I’ve seen. No other SDK I’ve tried (Snowboy, DeepSpeech, or PicoVoice) comes close to the native “Ok Google” hotword detection accuracy and performance. The news isn’t all good, however: the library is still deprecated, with no alternative currently on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But at least one of the best options out there to build a voice assistant will still work for a while. Those interested in building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more state-of-the-art alternatives. I’ve been a long-time fan of Snowboy, which has a well-supported Platypush integration, and I’ve used it as a hotword engine to trigger other assistant integrations for a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren’t that satisfactory. I’ve also experimented with Mozilla DeepSpeech and PicoVoice products for voice detection, and built integrations for them in Platypush. In this article, I’ll try to provide a comprehensive overview of what’s currently possible with DIY voice assistants and a comparison of the integrations I’ve built.
EDIT January 2021: Unfortunately, as of Dec 31st, 2020 Snowboy has been officially shut down. The GitHub repository is still there: you can still clone it and either use the example models provided under resources/models, train a model using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the website that could be used to browse and generate user models is no longer available. It's really a shame: the user models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the time invested by the developers in them. Anyway, most of the Snowboy examples reported in this article will still work if you download and install the code from the repo.

The Case for DIY Voice Assistants

Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:

Privacy. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice interactions over a privately-owned channel through a privately-owned box.
Compatibility. A Google Assistant device will only work with devices that support Google Assistant. The same goes for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily, depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to interact with, and does not depend on business decisions.
Flexibility. Even when a device works with your assistant, you’re still bound to the features that have been agreed upon and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky. In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services (Google, Alexa, Siri etc.) in multiple languages on the same device, simply by using different hotwords.
Hardware constraints. I’ve never understood the case for selling plastic boxes that embed a microphone and a speaker in order to enter the world of voice services. That was a good way to showcase the idea. After a couple of years of experiments, it’s probably time to expect the industry to provide a voice assistant experience that can run on any device, as long as it has a microphone and a controller unit that can process code. As for compatibility, there should be no case for Google-compatible or Alexa-compatible devices. Any device should be compatible with any assistant, as long as that device has a way to communicate with the outside world. The logic to control that device should be able to run on the same network that the device belongs to.
Cloud vs. local processing. Most of the commercial voice assistants operate by regularly capturing streams of audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load as possible on your device. I understand the case for a cloud-oriented approach when it comes to voice assistants but, regardless of the technology, we should always be provided with a choice between decentralized and centralized computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or on-cloud, depending on the use case and depending on the user’s preference.
Scalability. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done, without having to buy a new plastic box. If I need a voice-powered music speaker, I just take an existing speaker and plug it into a RaspberryPi. If I need a voice-powered display, I just take an existing display and plug it into a RaspberryPi. If I need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi, without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the possibility of becoming a smart device.

Overview of the voice assistant integrations

A voice assistant usually consists of two components:

An audio recorder that captures frames from an audio input device
A speech engine that keeps track of the current context.
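
As a rough illustration of how these two components fit together, here is a minimal sketch in Python. The names FakeRecorder, KeywordEngine and run_pipeline are made up for this example and are not part of Platypush, and the "frames" are plain strings standing in for audio chunks.

```python
from typing import Iterable, Iterator, List, Optional


class FakeRecorder:
    """Stand-in for an audio recorder: yields pre-baked 'frames' (plain strings)."""

    def __init__(self, frames: Iterable[str]):
        self._frames = list(frames)

    def frames(self) -> Iterator[str]:
        yield from self._frames


class KeywordEngine:
    """Stand-in for a speech engine that keeps track of the current context."""

    def __init__(self, hotword: str):
        self.hotword = hotword
        self.in_conversation = False  # the context tracked by the engine

    def process(self, frame: str) -> Optional[str]:
        if not self.in_conversation:
            if frame == self.hotword:
                self.in_conversation = True
                return "hotword"
            return None  # background noise outside of a conversation
        # Inside a conversation: treat the next frame as the user request
        self.in_conversation = False
        return f"speech:{frame}"


def run_pipeline(recorder: FakeRecorder, engine: KeywordEngine) -> List[str]:
    """Feed every recorded frame to the engine and collect the emitted events."""
    events = []
    for frame in recorder.frames():
        event = engine.process(frame)
        if event is not None:
            events.append(event)
    return events


events = run_pipeline(
    FakeRecorder(["noise", "computer", "turn on the lights", "noise"]),
    KeywordEngine(hotword="computer"),
)
print(events)
```

A real engine would of course work on raw audio frames and a trained model, but the split of responsibilities is the same: the recorder only produces frames, while the engine owns the conversation state.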

There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a far higher overhead than just running hotword detection, which only has to compare the captured speech against the usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino. Instead of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if you say “Can I have a small double-shot espresso with a lot of sugar and some milk” they may return something like {"type":"espresso", "size":"small", "numberOfShots":2, "sugar":"a lot", "milk":"some"}.
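
To show why a structured intent is easier to act upon than a raw transcript, here is a small sketch that consumes the espresso example above. The describe_order function is hypothetical, and the field names are simply the ones from the example, not a documented Rhino schema.

```python
import json

# The structured output from the espresso example, as a speech-to-intent
# engine might return it (field names taken from the example in the text).
intent_json = """
{"type": "espresso", "size": "small", "numberOfShots": 2,
 "sugar": "a lot", "milk": "some"}
"""


def describe_order(intent: dict) -> str:
    """Turn a parsed intent into an order line, with no text parsing involved."""
    shots = intent.get("numberOfShots", 1)
    extras = [f"{key}: {intent[key]}" for key in ("sugar", "milk") if key in intent]
    line = f"{shots}-shot {intent['size']} {intent['type']}"
    if extras:
        line += " (" + ", ".join(extras) + ")"
    return line


order = describe_order(json.loads(intent_json))
print(order)
```

With a plain transcript, the same logic would need to extract the size, the number of shots and the extras from free-form text first.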

In Platypush, I’ve built integrations to provide users with a wide choice when it comes to speech-to-text processors and engines. Let’s go through some of the available integrations, and evaluate their pros and cons.

Native Google Assistant library

Integrations

assistant.google plugin (to programmatically start/stop conversations) and assistant.google backend (for continuous hotword detection).

Configuration

Create a Google project and download the credentials.json file from the Google developers console.
Install the google-oauthlib-tool:

[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'

Authenticate to use the assistant-sdk-prototype scope:

export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
      --scope https://www.googleapis.com/auth/gcm \
      --save --headless --client-secrets $CREDENTIALS_FILE

Install Platypush with the HTTP backend and Google Assistant library support:

[sudo] pip install 'platypush[http,google-assistant-legacy]'

Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

backend.assistant.google:
    enabled: True

assistant.google:
    enabled: True

Start Platypush, say “Ok Google” and enjoy your assistant. On the web panel on http://your-rpi:8008 you should be able to see your voice interactions in real-time.

Features

Hotword detection: YES (“Ok Google” or “Hey Google”).
Speech detection: YES (once the hotword is detected).
Detection runs locally: NO (hotword detection [seems to] run locally, but once the hotword is detected a channel is opened with Google servers for the interaction).

Pros

It implements most of the features that you’d find in any Google Assistant products. That includes native support for timers, calendars, customized responses on the basis of your profile and location, native integration with the devices configured in your Google Home, and so on. For more complex features, you’ll have to write your custom platypush hooks on e.g. speech detected or conversation start/end events.
Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.
Good performance even on older RaspberryPi models (the library isn’t available for the Zero model or other arm6-based devices though), because most of the processing duties actually happen in the cloud. The audio processing thread takes around 2–3% of the CPU on a RaspberryPi 4.
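
Custom hooks on speech-detected events boil down to matching the recognized phrase against a template and extracting its variable parts. The sketch below illustrates that idea with the standard library only. The ${name} placeholder syntax and the function names are assumptions made for this example; this is not Platypush's actual matching code.

```python
import re
from typing import Dict, Optional


def template_to_regex(template: str) -> re.Pattern:
    """Convert 'play ${title} by ${artist}' into a regex with named groups."""
    # re.split with one capture group alternates: literal, name, literal, name...
    parts = re.split(r"\$\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)  # literal text of the template
        else:
            pattern += f"(?P<{part}>.+?)"  # placeholder becomes a named group
    return re.compile(rf"^{pattern}$", re.IGNORECASE)


def match_phrase(template: str, speech: str) -> Optional[Dict[str, str]]:
    """Return the extracted variables if the speech matches, otherwise None."""
    match = template_to_regex(template).match(speech.strip())
    return match.groupdict() if match else None


print(match_phrase("play ${title} by ${artist}", "Play Karma Police by Radiohead"))
print(match_phrase("turn on the lights", "turn off the lights"))
```

A hook would then run its actions only when match_phrase returns a dictionary, passing the extracted values on as arguments.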

Cons

The Google Assistant library used as a backend by the integration has been deprecated by Google. It still works on most of the devices I’ve tried, as long as the latest version is used, but keep in mind that it’s no longer maintained by Google and it could break in the future. Unfortunately, I’m still waiting for an official alternative.
If your main goal is to operate voice-enabled services within a secure environment with no processing happening on someone else’s cloud, then this is not your best option. The assistant library makes your computer behave more or less like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and, potentially, review.

Google Assistant Push-To-Talk Integration

Integrations

assistant.google.pushtotalk plugin.

Configuration

Create a Google project and download the credentials.json file from the Google developers console.
Install the google-oauthlib-tool:

[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'

Authenticate to use the assistant-sdk-prototype scope:

export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json

google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
      --scope https://www.googleapis.com/auth/gcm \
      --save --headless --client-secrets $CREDENTIALS_FILE

Install Platypush with the HTTP backend and Google Assistant SDK support:

[sudo] pip install 'platypush[http,google-assistant]'

Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

assistant.google.pushtotalk:
    language: en-US

Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
    "type":"request",
    "action":"assistant.google.pushtotalk.start_conversation"
}' http://your-rpi:8008/execute
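
The same call can be made from Python. The sketch below builds the request with the standard library only and mirrors the curl example above (same endpoint, headers and payload). The build_execute_request helper and the "my-token" value are placeholders of mine; the request is built but not sent, since sending it requires a running Platypush instance.

```python
import json
import urllib.request


def build_execute_request(
    base_url: str, token: str, action: str, **args
) -> urllib.request.Request:
    """Build a POST request for the /execute endpoint used in the curl example."""
    payload = {"type": "request", "action": action}
    if args:
        payload["args"] = args  # optional action arguments
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_execute_request(
    "http://your-rpi:8008",
    "my-token",  # placeholder for the value of $PP_TOKEN
    "assistant.google.pushtotalk.start_conversation",
)
print(req.get_method(), req.full_url)

# To actually send it (requires a running Platypush instance):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Since only the action name changes between integrations, the same helper can be pointed at any other plugin action exposed over the HTTP API.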

Features

Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).
Speech detection: YES.
Detection runs locally: NO (you can customize the hotword engine and how to trigger the assistant, but once a conversation is started a channel is opened with Google servers).

Pros

It implements many of the features you’d find in any Google Assistant product out there, even though hotword detection isn’t available and some of the features currently available on the assistant library aren’t provided (like timers or alarms).
Rock-solid speech detection, using the same speech model used by Google Assistant products.
Relatively good performance even on older RaspberryPi models. It’s also available for arm6 architecture, which makes it suitable also for RaspberryPi Zero or other low-power devices. No hotword engine running means that it uses resources only when you call start_conversation.
It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between your mic and Google’s servers. The connection is only opened upon start_conversation. This makes it a good option if privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or assistants that aren’t triggered by a hotword at all — for example, you can call start_conversation upon button press, motion sensor event or web call.

Cons

I built this integration after the Google Assistant library was deprecated with no official alternative being provided. I built it by refactoring the poorly refined code provided by Google in its samples (pushtotalk.py) and making a proper plugin out of it. It works, but keep in mind that it’s based on some ugly code that’s waiting to be replaced by Google.
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

Alexa Integration

Integrations

assistant.echo plugin.

Configuration

Install Platypush with the HTTP backend and Alexa support:

[sudo] pip install 'platypush[http,alexa]'

Run alexa-auth. It will start a local web server on your machine on http://your-rpi:3000. Open it in your browser and authenticate with your Amazon account. A credentials file should be generated under ~/.avs.json.
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

assistant.echo:
    enabled: True

Start Platypush. The Alexa integration doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
    "type":"request",
    "action":"assistant.echo.start_conversation"
}' http://your-rpi:8008/execute

Features

Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).
Speech detection: YES (although limited: transcription of the processed audio won’t be provided).
Detection runs locally: NO.

Pros

It implements many of the features that you’d find in any Alexa product out there, even though hotword detection isn’t available. Also, the support for skills or media control may be limited.
Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.
Good performance even on low-power devices. No hotword engine running means it uses resources only when you call start_conversation.
It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and Amazon’s servers. The connection is only opened upon start_conversation.

Cons

The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google assistant products. The biggest limitation is the fact that the AVS works on raw audio input and spits back raw audio responses. It means that text transcription, either for the request or the response, won’t be available. That limits what you can build with it. For example, you won’t be able to capture custom requests through event hooks.
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.

Snowboy Integration

Integrations

assistant.snowboy backend.

Configuration

Install Platypush with the HTTP backend and Snowboy support:

[sudo] pip install 'platypush[http,snowboy]'

Choose your hotword model(s). Some are available under SNOWBOY_INSTALL_DIR/resources/models. Otherwise, you can train or download models from the Snowboy website.
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5

Start Platypush. Say the hotword associated with one of your models, check on the logs that the HotwordDetectedEvent is triggered and, if there’s an assistant plugin associated with the hotword, the corresponding assistant is correctly started.
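
Conceptually, the configuration above is a lookup table from hotword to assistant plugin and language. The following sketch shows that dispatch logic in isolation. HotwordModel, MODELS and on_hotword are hypothetical names for this illustration, and the start callback stands in for whatever actually starts the conversation on the chosen plugin.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class HotwordModel:
    assistant_plugin: str
    assistant_language: str
    sensitivity: float


# Mirrors the three models from the configuration above
MODELS: Dict[str, HotwordModel] = {
    "computer": HotwordModel("assistant.google.pushtotalk", "it-IT", 0.4),
    "ok_google": HotwordModel("assistant.google.pushtotalk", "en-US", 0.4),
    "alexa": HotwordModel("assistant.echo", "en-US", 0.5),
}


def on_hotword(hotword: str, start: Callable[[str, str], None]) -> bool:
    """Start the assistant associated with the hotword, if there is one."""
    model = MODELS.get(hotword)
    if model is None:
        return False  # unknown hotword: nothing to start
    start(model.assistant_plugin, model.assistant_language)
    return True


started: List[Tuple[str, str]] = []
on_hotword("computer", lambda plugin, lang: started.append((plugin, lang)))
on_hotword("alexa", lambda plugin, lang: started.append((plugin, lang)))
print(started)
```

This is what makes the multi-language setup cheap: the same push-to-talk plugin is reused with a different language depending on which hotword fired.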

Features

Hotword detection: YES.
Speech detection: NO.
Detection runs locally: YES.

Pros

I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning. You can download any hotword models for free from their website, provided that you record three audio samples of you saying that word in order to help improve the model. You can also create your custom hotword model, and if enough people are interested in using it then they’ll contribute with their samples, and the model will become more robust over time. I believe that more machine learning projects out there could really benefit from this “use it for free as long as you help improve the model” paradigm.
Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to make a multi-language and multi-hotword voice assistant.
Good performance, even on low-power devices. I’ve used Snowboy in combination with the Google Assistant push-to-talk integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never exceeded 20–25%.
The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection to run and no data exchanged with any cloud.

Cons

Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up, the most popular models on their website have been trained with at most 2000 samples. And (sadly as well as expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform quite poorly with speech recorded from any individuals that don’t fit within that category (and also with people who aren’t native English speakers).

Mozilla DeepSpeech

Integrations

stt.deepspeech plugin and stt.deepspeech backend (for continuous detection).

Configuration

Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that gets installed:

[sudo] pip install 'platypush[http,deepspeech]'

Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while depending on your connection:

export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1

wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz

tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
x deepspeech-0.6.1-models/
x deepspeech-0.6.1-models/lm.binary
x deepspeech-0.6.1-models/output_graph.pbmm
x deepspeech-0.6.1-models/output_graph.pb
x deepspeech-0.6.1-models/trie
x deepspeech-0.6.1-models/output_graph.tflite

mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR

Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the DeepSpeech integration:

backend.http:
    enabled: True

stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie

    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

    conversation_timeout: 5

backend.stt.deepspeech:
    enabled: True

Start Platypush. Speech detection will start running on startup. SpeechDetectedEvents will be triggered when you talk. HotwordDetectedEvents will be triggered when you say one of the configured hotwords. ConversationDetectedEvents will be triggered when you say something after a hotword, with speech provided as an argument. You can also disable the continuous detection and only start it programmatically by calling stt.deepspeech.start_detection and stt.deepspeech.stop_detection. You can also use it to perform offline speech transcription from audio files:

curl -XPOST \
  -H "Authorization: Bearer $PP_TOKEN" \
  -H 'Content-Type: application/json' -d '
{
    "type":"request",
    "action":"stt.deepspeech.detect",
    "args": {
        "audio_file": "~/audio.wav"
    }
}' http://your-rpi:8008/execute

{
    "type":"response",
    "target":"http",
    "response": {
        "errors":[],
        "output": {
            "speech": "This is a test"
        }
    }
}
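
If you consume that response from a script, a small helper like the following can extract the transcript and surface errors. It is a sketch that assumes the response always has the shape shown above; extract_speech is a name I made up for the example.

```python
import json
from typing import Any, Dict


def extract_speech(raw_response: str) -> str:
    """Return the transcript from a response shaped like the one above."""
    message: Dict[str, Any] = json.loads(raw_response)
    response = message.get("response", {})
    errors = response.get("errors") or []
    if errors:
        raise RuntimeError(f"stt.deepspeech.detect failed: {errors}")
    return response.get("output", {}).get("speech", "")


raw = (
    '{"type":"response","target":"http",'
    '"response":{"errors":[],"output":{"speech":"This is a test"}}}'
)
print(extract_speech(raw))
```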

Features

Hotword detection: YES.
Speech detection: YES.
Detection runs locally: YES.
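
The relationship between the three events described in the configuration section (speech, hotword, conversation) can be sketched as a small state machine. This is my reading of the behaviour, with exact-match hotwords and an explicit clock for simplicity; it is not the actual code of the integration, and ConversationTracker is a hypothetical name.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (event name, text)


class ConversationTracker:
    def __init__(self, hotwords: List[str], conversation_timeout: float = 5.0):
        self.hotwords = {h.lower() for h in hotwords}
        self.conversation_timeout = conversation_timeout
        self._hotword_time: Optional[float] = None  # when the last hotword fired

    def on_speech(self, speech: str, now: float) -> List[Event]:
        """Return the events triggered by a piece of detected speech."""
        events: List[Event] = [("SpeechDetectedEvent", speech)]
        text = speech.strip().lower()
        if text in self.hotwords:
            self._hotword_time = now
            events.append(("HotwordDetectedEvent", text))
        elif (
            self._hotword_time is not None
            and now - self._hotword_time <= self.conversation_timeout
        ):
            # Speech right after a hotword counts as a conversation
            self._hotword_time = None
            events.append(("ConversationDetectedEvent", speech))
        return events


tracker = ConversationTracker(["computer", "alexa", "hello"], conversation_timeout=5)
first = tracker.on_speech("computer", now=0.0)
second = tracker.on_speech("turn on the lights", now=2.0)
third = tracker.on_speech("what time is it", now=3.0)
print(first, second, third, sep="\n")
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real implementation would read the clock itself.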

Pros

I’ve been honestly impressed by the features of DeepSpeech and the progress they’ve made starting from the version 0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party services or network connection. The full codebase is open-source and the Tensorflow voice and language models are also very good. It’s amazing that they’ve released the whole thing for free to the community. It also means that you can easily extend the Tensorflow model by training it with your own samples.
Speech-to-text transcription of audio files can be a very useful feature.

Cons

DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4 (although in my tests it took 100% of a core on a RaspberryPi 4 for speech detection), but it may be too resource-intensive to run on less powerful machines.
DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In reality, all of my tests showed between 2 and 4 seconds of delay between speech capture and detection.
DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that’s something where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.” “This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text transcription purposes but, in such ambiguous cases, it lacks some semantic context.
Even though it’s possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that it’s not how the engine is intended to be used. Hotword engines usually run against smaller and more performant models only intended to detect one or few words, not against a full-featured language model. The best usage of DeepSpeech is probably either for offline text transcription, or with another hotword integration and leveraging DeepSpeech for the speech detection part.

PicoVoice

PicoVoice is a very promising company that has released several products for performing voice detection on-device. Among them:

Porcupine, a hotword engine.
Leopard, a speech-to-text offline transcription engine.
Cheetah, a speech-to-text engine for real-time applications.
Rhino, a speech-to-intent engine.

So far, Platypush provides integrations with Porcupine and Cheetah.

Integrations

Hotword engine: stt.picovoice.hotword plugin and stt.picovoice.hotword backend (for continuous detection).
Speech engine: stt.picovoice.speech plugin and stt.picovoice.speech backend (for continuous detection).

Configuration

Install Platypush with the HTTP backend and the PicoVoice hotword integration and/or speech integration:

[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'

Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the PicoVoice integration:

stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True

# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
              seconds: 10
              name: StopSpeechDetection
              actions:
                  - action: stt.picovoice.speech.stop_detection

        - action: stt.picovoice.speech.start_detection
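
The hook above implements a simple pattern: start speech detection when the hotword fires, and schedule a stop after a fixed timeout. Here is the same pattern in plain Python with threading.Timer. The SpeechDetector class is a hypothetical stand-in for the speech plugin, and the timeout is shortened so that the example finishes quickly.

```python
import threading


class SpeechDetector:
    """Hypothetical stand-in for the speech detection plugin."""

    def __init__(self) -> None:
        self.running = False

    def start_detection(self) -> None:
        self.running = True

    def stop_detection(self) -> None:
        self.running = False


def on_hotword_detected(detector: SpeechDetector, seconds: float) -> threading.Timer:
    """Start speech detection and schedule it to stop after `seconds`."""
    # In this sketch detection is started first, so that the scheduled stop
    # can never run before the start.
    detector.start_detection()
    timer = threading.Timer(seconds, detector.stop_detection)
    timer.daemon = True
    timer.start()
    return timer


detector = SpeechDetector()
timer = on_hotword_detected(detector, seconds=0.1)  # 10 seconds in the config
running_after_hotword = detector.running
timer.join()  # wait for the timeout to expire
running_after_timeout = detector.running
print(running_after_hotword, running_after_timeout)
```

The returned timer can also be cancelled with timer.cancel() if the conversation ends earlier, which would be the counterpart of clearing the named timeout.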

Start Platypush and enjoy your on-device voice assistant.

Features

Hotword detection: YES.
Speech detection: YES.
Detection runs locally: YES.

Pros

When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much less delay than DeepSpeech and it’s also much less power-hungry — it will still run well and with low latency even on older models of RaspberryPi.

Cons

While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn’t dig much into how they’ve solved the problem.
Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you want to expand the set of keywords provided by default, or add more samples to train the existing models, then you’ll have to go for a commercial license. Their speech engine (Cheetah) instead can only be installed and run free of charge for personal use on Linux on x86_64 architecture. Any other architecture or operating system, as well as any chance to extend the model or use a different model, is only possible through a commercial license. While I understand their point and their business model, I’d have been super-happy to just pay for a license through a more friendly process, instead of relying on the old-fashioned “contact us for a commercial license/we’ll reach back to you” paradigm.
Cheetah’s speech engine still suffers from some of the issues of DeepSpeech when it comes to semantic context/intent detection. The “this/these” ambiguity also happens here. These problems can be partially solved by using Rhino, PicoVoice’s speech-to-intent engine, which provides a structured representation of the speech intent instead of a letter-by-letter transcription, but I haven’t yet worked on integrating Rhino into Platypush.

Conclusions

The democratization of voice technology has long been dreamed about, and it’s finally (slowly) coming. The situation out there is still quite fragmented though and some commercial SDKs may still get deprecated with short notice or no notice at all. But at least some solutions are emerging to bring speech detection to all devices.

I’ve built integrations in Platypush for all of these services because I believe that it’s up to users, not to businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice integrations in the same product — and especially having voice integrations that expose all the same API and generate the same events — makes it very easy to write assistant-agnostic logic, and really decouple the tasks of speech recognition from the business logic that can be run by voice commands.

Check out my previous article to learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop events.

To summarize my findings so far:

Use the native Google Assistant integration if you want to have a full Google experience, and if you’re ok with Google servers processing your audio and with the possibility that at some point in the future the deprecated Google Assistant library will stop working.
Use the Google push-to-talk integration if you only want to have the assistant, without hotword detection, or you want your assistant to be triggered by alternative hotwords.
Use the Alexa integration if you already have an Amazon-powered ecosystem and you’re ok with having less flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.
Use Snowboy if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not be that accurate.
Use Mozilla DeepSpeech if you want a fully on-device open-source engine powered by a robust Tensorflow model, even if it comes with a higher CPU load and a bit more latency.
Use PicoVoice solutions if you want a full voice solution that runs on-device and is both accurate and performant, even though you’ll need a commercial license to use it on some devices or to extend/change the model.

Let me know your thoughts on these solutions and your experience with these integrations!

Reactions


Fabio Manganiello Jun 22, 2026 @ 20:19

blog.platypush.tech

Build a fully local voice assistant in 2026

Those who have followed me for a while know of my personal obsession with self-built voice assistants.

My experiments over the years can be summarized as follows:

2007: Voxifera, my very first attempt at building a primitive voice assistant using Hidden Markov models. Definitely not good for general-purpose usage, but good enough in 2007 to distinguish between a dozen simple voice commands.
2019: First voice assistant built on top of Platypush. It used the now deprecated Google Assistant Library on top of a Raspberry Pi with a microphone and a speaker, and it could hook any automation routines and custom commands to it through event hooks.
2020: Second iteration on #platypush, this time supporting other assistant plugins too - Alexa (integration now removed), Snowboy (also removed, since the project is dead), Mozilla DeepSpeech (also removed now, since Mozilla discontinued it), PicoVoice, and mimic3 (the text-to-speech engine built on top of Mycroft, now bankrupt).
2024: Third iteration on Platypush, this time with an enhanced PicoVoice integration and new speech-to-text and text-to-speech plugins based on the OpenAI APIs.

But it's now 2026, and perhaps both the hardware and the software are now mature enough for fully on-device voice assistants based on fully open solutions likely to stick around for a while.

In this article we'll close that gap with Platypush:

assistant.openwakeword listens for the wake word locally.
assistant.vosk transcribes the command locally.
tts.piper speaks the answer locally.
openai is used only where a language model is useful: turning messy speech into intent, or answering general questions.
Existing home automation plugins such as light.hue, music.mpd or weather.openweathermap provide the actions.

The result is not another cloud assistant with a different coat of paint. The hotword engine, speech recognition, command dispatch and speech synthesis can all run on-device. If the openai step points to a local OpenAI-compatible server, then the whole pipeline can stay on your LAN too.

The pipeline

The architecture can be summarized as follows:

Hotword detection ("OK Google", "Alexa" etc.) is a continuous, low-latency workload, and it should not need the network.

Speech-to-text is also a good fit for local inference: Vosk models are small enough to run on modest hardware, including Raspberry Pis, and they are perfectly adequate for short home automation commands.

Text-to-speech is another place where local models are good enough nowadays: Piper voices are fast, small and much nicer than the old robotic espeak-style fallback.

The only optional network-shaped piece is the language model.

But that is a policy choice, not a requirement of the voice stack.

Setup

Clone the assistant sample repository:

git clone https://git.platypush.tech/platypush/assistant-sample
cd assistant-sample

Models

The next step is to download the voice models used by the voice stack.

Hotword Detection

When the service starts the first time, it will automatically download all the available models.

You can then use the following command to list the available models once the service is running:

curl -s -XPOST \
     -H 'Content-type: application/json' \
     -H "Authorization: Bearer $PLATYPUSH_TOKEN" \
     -d '{"type":"request", "action":"assistant.openwakeword.list_models"}' \
     http://localhost:8008/execute

Where $PLATYPUSH_TOKEN is the token of the user that is running the service.

You can retrieve it by connecting to http://localhost:8008 when the service starts for the first time. Create your credentials, then select Settings -> Tokens -> Generate API Token.
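
The same call can also be made from Python. The sketch below uses only the standard library and assumes the service listens on http://localhost:8008 as in the curl example. The helper names `build_execute_request` and `execute` are illustrative and not part of Platypush.

```python
import json
import urllib.request

# Same endpoint as the curl example above
PLATYPUSH_URL = "http://localhost:8008/execute"


def build_execute_request(action, token, args=None, url=PLATYPUSH_URL):
    """Build (but do not send) a POST request for the /execute endpoint."""
    payload = {"type": "request", "action": action}
    if args:
        payload["args"] = args

    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def execute(action, token, args=None, url=PLATYPUSH_URL, timeout=10):
    """Send the request and return the decoded JSON reply."""
    request = build_execute_request(action, token, args, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `execute("assistant.openwakeword.list_models", token)` should be equivalent to the curl command, where `token` holds the value of $PLATYPUSH_TOKEN.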

Speech-to-text

A full list of the Vosk voice models is available here.

Some feedback about the quality of the English models:

Model	Size	Notes
`vosk-model-small-en-us-0.15`	40 MB	Very fast and lightweight model that can also run on an old Raspberry Pi, but accuracy can be low.
`vosk-model-en-us-0.22-lgraph`	128 MB	Reasonably accurate on clear speech and with native speakers, but still small enough to run fine even on a Raspberry Pi.
`vosk-model-en-us-0.22`	1.8 GB	Accurate generic US English model. Fast on a laptop or x86 processor, but it may be a bit heavy on a Raspberry Pi.

Download the selected model to the Docker volume working directory:

mkdir -p ./workdir/assistant.vosk/models
cd ./workdir/assistant.vosk/models
wget "https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-lgraph.zip"
unzip "vosk-model-en-us-0.22-lgraph.zip"
rm "vosk-model-en-us-0.22-lgraph.zip"

Text-to-speech

Download a speech synthesis model from here.

Audio samples are also available to get an idea of the type of voice before downloading.

The model usually consists of a *.onnx and a *.onnx.json file. Download both of them to the Docker volume working directory:

mkdir -p ./workdir/piper_tts
cd ./workdir/piper_tts
wget "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx"
wget "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx.json"
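
Since a voice needs both files, a quick check before starting the service can save some debugging. This is a small standalone sketch and not part of the sample repository; `missing_voice_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_voice_files(voice_dir, voice_name):
    """Return the expected Piper files that are absent from voice_dir."""
    expected = [f"{voice_name}.onnx", f"{voice_name}.onnx.json"]
    return [name for name in expected if not (Path(voice_dir) / name).is_file()]
```

For the voice downloaded above, `missing_voice_files("./workdir/piper_tts", "en_US-hfc_female-medium")` returns an empty list when both files are in place.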

Configuration

Copy and edit the example configuration file.

cp config/config.example.yaml config/config.yaml

Home automation plugins

The assistant becomes useful once recognized speech can reach the rest of the house.

For example, Hue lights:

light.hue:
  bridge: hue
  groups:
    - Living Room

And MPD/Mopidy for music:

music.mopidy:
  host: localhost

music.mpd:
  host: localhost
  poll_interval: null

Those are just regular Platypush plugins.

The assistant does not need special knowledge about Hue, MPD, Chromecast, Zigbee, MQTT or anything else.

It only needs to emit events; your hooks decide what to do with them.

Build

Build the container image for the assistant service:

docker build -t platypush-voice .

Run

The assistant needs access to the host microphone and speakers. The container routes ALSA through PulseAudio, so the examples below connect it to a PulseAudio server running on the host.

Linux

With PulseAudio or pipewire-pulseaudio installed:

docker run --rm \
  -e PULSE_SERVER=unix:/run/pulse/native \
  -v /run/user/$(id -u)/pulse/native:/run/pulse/native \
  --name voice-assistant \
  -p 8008:8008 \
  -v ./config:/etc/platypush \
  -v ./workdir:/var/lib/platypush \
  platypush-voice

macOS

Install and start PulseAudio on the host:

brew install pulseaudio
pulseaudio --daemonize=yes --exit-idle-time=-1
pactl load-module module-native-protocol-tcp \
  auth-anonymous=1 \
  listen=0.0.0.0 \
  port=4713

Then start the container:

docker run --rm \
  -e PULSE_SERVER=tcp:host.docker.internal:4713 \
  --name voice-assistant \
  -p 8008:8008 \
  -v "$(pwd)/config:/etc/platypush" \
  -v "$(pwd)/workdir:/var/lib/platypush" \
  platypush-voice

If pactl load-module reports that the module is already loaded, you can keep using the existing PulseAudio daemon.

Windows

Install PulseAudio for Windows, then create a default.pa file in the same directory as pulseaudio.exe:

load-module module-waveout sink_name=output source_name=input record=1
load-module module-native-protocol-tcp auth-anonymous=1 listen=0.0.0.0 port=4713
set-default-sink output
set-default-source input

Start PulseAudio from PowerShell:

.\pulseaudio.exe -F .\default.pa --exit-idle-time=-1

Then start the container from the repository directory:

docker run --rm `
  -e PULSE_SERVER=tcp:host.docker.internal:4713 `
  --name voice-assistant `
  -p 8008:8008 `
  -v "${PWD}/config:/etc/platypush" `
  -v "${PWD}/workdir:/var/lib/platypush" `
  platypush-voice

Make sure microphone access is enabled for desktop applications under Windows privacy settings, and allow PulseAudio through the firewall if prompted.

Usage

Once the service is running, you can start interacting with it through voice commands (the default activation word is "Alexa").

Any questions about the weather will be resolved by the weather plugin if it's been enabled.

If the music or lights plugins are enabled, they can be controlled with voice commands ("stop the music", "turn on the lights", etc.)

Otherwise, the assistant will use the openai plugin to respond to your questions, with follow-up turns when the response from OpenAI is also a question.

Extending the Assistant

The assistant logic is modeled through simple Platypush hooks under config/scripts.

You can extend it as you like by defining your own hooks or modifying the existing ones.

Starting a conversation

Conversations are started by hooking to the HotwordDetectedEvent.

import logging

from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent

logger = logging.getLogger(__name__)
ai_plugin = "openai"
assistant_plugin = "assistant.vosk"


@when(HotwordDetectedEvent)
def on_hotword_detected(event: HotwordDetectedEvent):
    """
    When the hotword is detected, start a conversation.
    """
    logger.info(f"Hotword {event.hotword} detected")
    run(f"{assistant_plugin}.start_conversation")

Deterministic commands

For common home automation commands, regular event hooks are still the best tool. They are fast, inspectable, and they do not hallucinate.

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def turn_on_lights():
    """
    Hook run when the user says "turn on the lights" (regex)
    """
    run("light.hue.on")


@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music():
    """
    Hook run when the user says "play the music" (regex)
    """
    run("music.mpd.play")


@when(SpeechRecognizedEvent, phrase="set the music volume (to|on|at) ${volume}")
def set_volume(volume: int):
    """
    Hook run when the user says "set the music volume to ${volume}"
    (regex with parameter).
    """
    run("music.mpd.set_volume", volume=volume)
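
To get a feel for how a phrase with a ${volume} placeholder can yield a hook argument, here is a standalone sketch that turns such a template into a regular expression with a named group. It only illustrates the idea and is not how Platypush implements phrase matching internally; `template_to_regex` and `extract_args` are hypothetical names, and captured values come back as strings.

```python
import re

# Matches placeholders such as ${volume}
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def template_to_regex(template):
    """Replace each ${name} placeholder with a named group matching one word."""
    parts = []
    pos = 0
    for match in _PLACEHOLDER.finditer(template):
        # Text between placeholders is kept as-is, so it may contain regex syntax
        parts.append(template[pos:match.start()])
        parts.append(f"(?P<{match.group(1)}>\\S+)")
        pos = match.end()

    parts.append(template[pos:])
    return re.compile("".join(parts), re.IGNORECASE)


def extract_args(template, speech):
    """Return the captured placeholders as a dict, or None if nothing matches."""
    match = template_to_regex(template).search(speech.strip())
    return match.groupdict() if match else None
```

For example, `extract_args("set the music volume (to|on|at) ${volume}", "set the music volume to 30")` yields a dictionary with a single `volume` entry, which a hook could convert to an integer before passing it on.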

AI Commands

If the openai plugin is enabled, you can use it to help you answer questions.

There are two generic use-cases for voice assistants where an AI plugin is beneficial:

Speech to Intent
Response fallback

Speech to Intent

You may want this for general questions, for commands that do not fit a neat regular expression, or for transforming a raw sentence such as:

make it a bit darker and reduce the music volume

into a structured action plan like:

[
  {
    "action": "light.hue.set_lights",
    "args": {
      "bri": 50
    }
  },
  {
    "action": "music.mpd.set_volume",
    "args": {
      "volume": 20
    }
  }
]

An example provided in the assistant sample is that of weather forecasting.

Note in particular the usage of openai.get_response with a well crafted system prompt that turns a natural language request like:

What's the weather tomorrow in San Francisco?

Into:

{
  "type": "weather",
  "delta_days": 1,
  "location": "San Francisco"
}

def parse_weather_request(request: str) -> WeatherRequest | None:
    request_dict = (
        run(
            "openai.get_response",
            context=[
                {
                    "role": "system",
                    "content": (
                        "You are a voice assistant provided with weather requests as free text.\n"
                        "Given the prompt, return a structured JSON representation of the request in the following format: "
                        '{ "type": "weather", "delta_days": 1, "location": "San Francisco" }, '
                        'where both delta_days and location are optional (e.g. if the user simply asks "How\'s the weather?").\n'
                        'If the prompt doesn\'t seem to contain a weather request, return { "type": null }'
                    ),
                }
            ],
            prompt=request,
        )
        or {}
    )

    if request_dict.get("type") != "weather":
        return None

    weather_request = WeatherRequest(
        location=request_dict.get("location", default_location),
        delta_days=request_dict.get("delta_days", 0),
    )

    return weather_request
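
WeatherRequest and default_location are not shown in the excerpt above. A minimal stand-in could look like the following sketch, which assumes only the two fields used above and adds some defensive handling of the model's reply. The `target_date` method and the `weather_request_from_reply` helper are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class WeatherRequest:
    location: str
    delta_days: int = 0

    def target_date(self, today=None):
        """Resolve the relative offset to a calendar date."""
        today = today or date.today()
        return today + timedelta(days=self.delta_days)


def weather_request_from_reply(reply, default_location):
    """Build a WeatherRequest from the parsed model reply, or return None."""
    if not isinstance(reply, dict) or reply.get("type") != "weather":
        return None

    # The model may omit delta_days, or return it as a string or null
    try:
        delta_days = int(reply.get("delta_days") or 0)
    except (TypeError, ValueError):
        delta_days = 0

    return WeatherRequest(
        location=reply.get("location") or default_location,
        delta_days=delta_days,
    )
```

Falling back to defaults rather than failing keeps a request like "How's the weather?" usable even when the model leaves both optional fields out.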

You can also use the model for intermediate transformation instead of direct answers. For example, ask it to return a tiny JSON object with action and args, then dispatch only actions you explicitly allow:

ALLOWED_ACTIONS = {
    "lights.on": "light.hue.on",
    "lights.off": "light.hue.off",
    "music.play": "music.mpd.play",
    "music.stop": "music.mpd.stop",
}


@when(SpeechRecognizedEvent)
def on_fuzzy_command(event):
    plan = run(
        "openai.get_response",
        prompt=event.phrase,
        context=[
            {
                "role": "system",
                "content": (
                    "Map the user command to JSON only: "
                    '{"action": "...", "args": {...}}. '
                    f"Allowed actions: {', '.join(ALLOWED_ACTIONS)}. "
                    "If none match, return {\"action\": null, \"args\": {}}."
                ),
            }
        ],
    )

    # Parse `plan` as JSON here, validate it, then run only an allow-listed action.

That last validation step matters. A model may be useful for interpretation, but it should not get arbitrary access to run().
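
A minimal sketch of that validation step follows. It assumes the model's reply arrives either as a JSON string or as an already parsed dictionary, since the exact return type of openai.get_response is not shown here. The `run` function is passed in as an argument so that the snippet can be tried without Platypush; `parse_plan` and `dispatch` are hypothetical names.

```python
import json

ALLOWED_ACTIONS = {
    "lights.on": "light.hue.on",
    "lights.off": "light.hue.off",
    "music.play": "music.mpd.play",
    "music.stop": "music.mpd.stop",
}


def parse_plan(reply):
    """Return (platypush_action, args) for an allow-listed plan, else None."""
    if isinstance(reply, str):
        try:
            reply = json.loads(reply)
        except json.JSONDecodeError:
            return None

    if not isinstance(reply, dict):
        return None

    name = reply.get("action")
    args = reply.get("args") or {}
    if not isinstance(name, str) or not isinstance(args, dict):
        return None

    # Anything outside the allow-list is dropped, whatever the model says
    action = ALLOWED_ACTIONS.get(name)
    return (action, args) if action else None


def dispatch(reply, run):
    """Run the plan through `run` if it is valid; report whether it ran."""
    plan = parse_plan(reply)
    if plan is None:
        return False

    action, args = plan
    run(action, **args)
    return True
```

Inside the hook above, this would be called as `dispatch(plan, run)`. The model only ever picks a key from ALLOWED_ACTIONS, and the mapping to a real Platypush action stays in your code.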

Response fallback

If a request doesn't match any of the commands you have defined, you can use a generic SpeechRecognizedEvent hook to forward the request to an AI plugin, and render the response as speech through the text-to-speech plugin.

import logging

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

logger = logging.getLogger(__name__)
ai_plugin = "openai"
assistant_plugin = "assistant.vosk"


@when(SpeechRecognizedEvent, plugin=assistant_plugin)
def on_speech_recognized(event: SpeechRecognizedEvent):
    """
    Generic handler for speech recognition events received
    by the configured assistant plugin.
    """
    logger.info("Recognized speech: %s", event.phrase)

    # Forward the request to OpenAI and render the response as speech
    response = run(
        f"{ai_plugin}.get_response",
        prompt=event.phrase,
        context=[
            {
                "role": "system",
                "content": (
                    "You are a voice assistant that can answer questions and perform actions. "
                    "Keep in mind that prompts are transcriptions of user speech and they may "
                    "contain misspellings or errors. Try and interpret them as best as possible. "
                    "When possible, keep your answers short and concise."
                ),
            }
        ],
    )

    # If the response is not empty, render it using the TTS plugin
    if response:
        event.assistant.render_response(response)

When a response from the LLM ends with a question mark, the assistant will automatically listen for a follow-up command and fire a new SpeechRecognizedEvent.

Pausing music while listening

One nice touch is to pause the music when a conversation starts and resume it after the assistant is done.

import logging

from platypush import run, when
from platypush.events.assistant import (
    ConversationEndEvent,
    ConversationStartEvent,
)

logger = logging.getLogger(__name__)


@when(ConversationStartEvent)
def on_conversation_start():
    try:
        run("utils.clear_timeout", name="ConversationEndTimeout")
    except Exception as e:
        logger.error("Error clearing conversation end timeout: %s", e)

    run("music.mpd.pause_if_playing")


@when(ConversationEndEvent)
def on_conversation_end():
    run(
        "utils.set_timeout",
        name="ConversationEndTimeout",
        seconds=5,
        actions=[{"action": "music.mpd.play_if_paused"}],
    )

That makes the interaction feel much less clumsy: wake word, music ducks or pauses, command is recognized, answer is spoken, music resumes a few seconds later.
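
The reason for clearing the timeout on every conversation start is that a follow-up turn would otherwise resume the music while you are still talking. The sketch below reproduces that named-timeout pattern with the standard library only. It is not the implementation behind utils.set_timeout, and `NamedTimeouts` is a hypothetical class.

```python
import threading


class NamedTimeouts:
    """Named one-shot timers: setting a name again replaces the pending timer."""

    def __init__(self):
        self._timers = {}
        self._lock = threading.Lock()

    def set(self, name, seconds, callback):
        with self._lock:
            self._cancel(name)
            timer = threading.Timer(seconds, callback)
            timer.daemon = True
            self._timers[name] = timer
            timer.start()

    def clear(self, name):
        with self._lock:
            self._cancel(name)

    def _cancel(self, name):
        # Cancelling a timer that has already fired is harmless
        timer = self._timers.pop(name, None)
        if timer is not None:
            timer.cancel()
```

A conversation end would call `set("ConversationEndTimeout", 5, resume_music)`, and a new conversation starting within those five seconds would call `clear("ConversationEndTimeout")` before the callback fires.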

Going fully local

With the configuration above, hotword detection, speech-to-text, automation and text-to-speech are already local. The only non-local component is the openai plugin, if it points to OpenAI's servers.

To make the last step local too, run a model server that exposes an OpenAI-compatible API. Ollama, llama.cpp server, vLLM and LocalAI can all expose some version of /v1/chat/completions.

For example, with Ollama:

ollama pull llama3.1:8b
ollama serve

The OpenAI-compatible endpoint is then usually available at:

http://127.0.0.1:11434/v1/chat/completions

If your Platypush openai plugin version supports a custom API base URL, the configuration is the whole change:

openai:
  model: llama3.1:8b
  base_url: http://127.0.0.1:11434/v1

If it does not, keep the rest of the assistant exactly the same and replace only the fallback action with a tiny local request:
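
A minimal sketch of such a request, using only the standard library and assuming the Ollama endpoint and model from the example above. The function names are illustrative. The request and response shapes follow the OpenAI chat completions format that these servers emulate.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11434/v1"  # Ollama's OpenAI-compatible API, as above
MODEL = "llama3.1:8b"


def build_chat_request(prompt, system_prompt=None, model=MODEL, base_url=BASE_URL):
    """Build (but do not send) a chat completion request."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})

    body = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion):
    """Return the assistant text of the first choice, or None."""
    choices = completion.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("message") or {}).get("content")


def ask_local_model(prompt, system_prompt=None, timeout=60):
    """Send the prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, system_prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.loads(response.read().decode("utf-8")))
```

In the fallback hook, the call to `run(f"{ai_plugin}.get_response", ...)` would then become `ask_local_model(event.phrase, system_prompt=...)`, with the result still rendered through `event.assistant.render_response`.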

That is enough to turn the assistant into a fully local stack.

On a Raspberry Pi, I would still keep expectations realistic. Hotword detection, Vosk and Piper are fine on small machines. Local LLMs are the heavy piece. A Pi 5 with enough RAM can run small quantized models, but latency will not feel like a cloud model or a GPU-backed workstation. For many home automation workflows, that is acceptable because the LLM is only the fallback; the frequent commands stay deterministic.

Why this architecture ages well

Voice assistants have been a graveyard of abandoned SDKs and cloud products. Snowboy is gone. Mycroft is gone. The old Google Assistant SDK is deprecated. Vendor assistants are increasingly shaped around vendor ecosystems rather than user-controlled automation.

The safer long-term bet is not one monolithic assistant. It is a pipeline of small replaceable parts:

Swap the hotword model without touching the automation logic.
Swap Vosk for another STT engine without touching Hue or MPD.
Swap OpenAI for a local OpenAI-compatible model without touching the wake word, TTS or command hooks.
Swap Piper voices without touching the assistant flow.

Platypush is a good fit for this because its event system is already the boundary between perception and action. Speech recognition emits an event. Hooks decide what to do. Plugins execute the actions.
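
The same idea can be expressed in a few lines of plain Python. This is a conceptual sketch with hypothetical names. Platypush wires these stages together through events and hooks rather than direct calls, but the point is the same: each stage depends only on a narrow interface, so any of them can be swapped.

```python
from typing import Callable, Optional, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def say(self, text: str) -> None: ...


# A responder maps a recognized phrase to an optional spoken answer
Responder = Callable[[str], Optional[str]]


def handle_utterance(audio, stt: SpeechToText, responder: Responder, tts: TextToSpeech):
    """Run one utterance through the pipeline and return the spoken answer."""
    phrase = stt.transcribe(audio)
    answer = responder(phrase)
    if answer:
        tts.say(answer)
    return answer
```

Replacing the speech-to-text engine, the responder or the voice only changes which object is passed in, and `handle_utterance` stays untouched.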

That separation is what makes the assistant inspectable. It is also what makes it possible to keep most of it on a Raspberry Pi in your house, instead of outsourcing the entire audio loop to a cloud service that may disappear, get worse, or decide one day that your use case is no longer part of the roadmap.

Final notes

The minimal version of this setup is small:

assistant.openwakeword for the always-on wake word.
assistant.vosk for local command transcription.
A few @when(SpeechRecognizedEvent, phrase=...) hooks for deterministic commands.
light.hue, music.mpd or any other Platypush plugin for actions.
tts.piper for local spoken responses.
openai.get_response only where language understanding is worth the cost.

Start with the deterministic commands. Add the model fallback later. That way the assistant stays fast for the commands you use every day, while still being flexible enough to answer questions or interpret messy speech when you need it.


Fabio Manganiello Jun 02, 2024 @ 00:00

blog.platypush.tech

The state of voice assistant integrations in 2024

Those who have been following my blog or used Platypush for a while probably know that I've put quite some effort into getting voice assistants right over the past few years.

I built my first (very primitive) voice assistant that used DCT+Markov models back in 2008, when the concept was still pretty much a science fiction novelty.

Then I wrote an article in 2019 and one in 2020 on how to use several voice integrations in Platypush to create custom voice assistants.

Everyone in those pictures is now dead

Quite a few things have changed in this industry niche since I wrote my previous article. Most of the solutions that I covered back in the day, unfortunately, are gone in one way or another:

The assistant.snowboy integration is gone because unfortunately Snowboy is gone. For a while you could still run the Snowboy code with models that either you had previously downloaded from their website or trained yourself, but my latest experience proved to be quite unfruitful - it's been more than 4 years since the last commit on Snowboy, and it's hard to get the code to even run.
The assistant.alexa integration is also gone, as Amazon has stopped maintaining the AVS SDK. And I have literally no clue of what Amazon's plans with the development of Alexa skills are (if there are any plans at all).
The stt.deepspeech integration is also gone: the project hasn't seen a commit in 3 years and I even struggled to get the latest code to run. Given the current financial situation at Mozilla, and the fact that they're trying to cut as much as possible on what they don't consider part of their core product, it's very unlikely that DeepSpeech will be revived any time soon.
The assistant.google integration is still there, but I can't make promises on how long it can be maintained. It uses the google-assistant-library, which was deprecated in 2019. Google replaced it with the conversational actions, which was also deprecated last year. Put here your joke about Google building products with the shelf life of a summer hit.
The tts.mimic3 integration, a text model based on mimic3, part of the Mycroft initiative, is still there, but only because it's still possible to spin up a Docker image that runs mimic3. The whole Mycroft project, however, is now defunct, and the story of how it went bankrupt is a very sad story about the power that patent trolls have on startups. The Mycroft initiative however seems to have been picked up by the community, and something seems to move in the space of fully open source and on-device voice models. I'll definitely be looking with interest at what happens in that space, but the project seems to be at a stage that is still a bit immature to justify an investment into a new Platypush integration.

But not all hope is lost

assistant.google

assistant.google may be relying on a dead library, but it's not dead (yet). The code still works, but you're a bit constrained on the hardware side - the assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3 and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other ARMv7-compatible devices has proved to be a challenge in some cases. Given the state of the library, it's safe to say that it'll never be supported on other platforms, but if you want to run your assistant on a device that is still supported then it should still work fine.

I had however to do a few dirty packaging tricks to ensure that the assistant library code doesn't break badly on newer versions of Python. That code hasn't been touched in 5 years and it's starting to rot. It depends on ancient and deprecated Python libraries like enum34 and it needs some hammering to work - without breaking the whole Python environment in the process.

For now, pip install 'platypush[assistant.google]' should do all the dirty work and get all of your assistant dependencies installed. But I can't promise I can maintain that code forever.
assistant.picovoice

Picovoice has been a nice surprise in an industry niche where all the products that were available just 4 years ago are now dead.

I described some of their products in my previous articles, and I even built a couple of stt.picovoice.* plugins for Platypush back in the day, but I didn't really put much effort in it.

Their business model seemed a bit weird - along the lines of "you can test our products on x86_64, if you need an ARM build you should contact us as a business partner". And the quality of their products was also a bit disappointing compared to other mainstream offerings.

I'm glad to see that the situation has changed quite a bit now. They still have a "sign up with a business email" model, but at least now you can just sign up on their website and start using their products rather than sending emails around. And I'm also quite impressed to see the progress on their website. You can now train hotword models, customize speech-to-text models and build your own intent rules directly from their website - a feature that was also available in the beloved Snowboy and that went missing from any major product offerings out there after Snowboy was gone. I feel like the quality of their models has also greatly improved compared to the last time I checked them - predictions are still slower than the Google Assistant, definitely less accurate with non-native accents, but the gap with the Google Assistant when it comes to native accents isn't very wide.

assistant.openai

OpenAI has filled many gaps left by all the casualties in the voice assistants market. Platypush now provides a new assistant.openai plugin that stitches together several of their APIs to provide a voice assistant experience that honestly feels much more natural than anything I've tried in all these years.

Let's explore how to use these integrations to build our on-device voice assistant with custom rules.
Feature comparison

As some of you may know, voice assistants often aren't monolithic products.

Unless explicitly designed as all-in-one packages (like the google-assistant-library), voice assistant integrations in Platypush are usually built on top of four distinct APIs:

Hotword detection: This is the component that continuously listens on your microphone until you speak "Ok Google", "Alexa" or any other wake-up word used to start a conversation. Since it's a continuously listening component that needs to take decisions fast, and it only has to recognize one word (or in a few cases 3-4 more at most), it usually doesn't need to run on a full language model. It needs small models, often a couple of MBs heavy at most.
Speech-to-text (STT): This is the component that will capture audio from the microphone and use some API to transcribe it to text.
Response engine: Once you have the transcription of what the user said, you need to feed it to some model that will generate some human-like response for the question.
Text-to-speech (TTS): Once you have your AI response rendered as a text string, you need a text-to-speech model to speak it out loud on your speakers or headphones.

On top of these basic building blocks for a voice assistant, some integrations may also provide two extra features.

Speech-to-intent

In this mode, the user's prompt, instead of being transcribed directly to text, is transcribed into a structured intent that can be more easily processed by a downstream integration with no need for extra text parsing, regular expressions etc.

For instance, a voice command like "turn off the bedroom lights" could be translated into an intent such as:

{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}

Offline speech-to-text

a.k.a. offline text transcriptions. Some assistant integrations may offer you the ability to pass some audio file and transcribe their content as text.
Features summary

This table summarizes how the assistant integrations available in Platypush compare when it comes to what I would call the foundational blocks:

Plugin	Hotword	STT	AI responses	TTS
assistant.google	✅	✅	✅	✅
assistant.openai	❌	✅	✅	✅
assistant.picovoice	✅	✅	❌	✅

And this is how they compare in terms of extra features:

Plugin	Intents	Offline STT
assistant.google	❌	❌
assistant.openai	❌	✅
assistant.picovoice	✅	✅

Let's see a few configuration examples to better understand the pros and cons of each of these integrations.

Configuration

Hardware requirements

A computer, a Raspberry Pi, an old tablet, or anything in between, as long as it can run Python. At least 1GB of RAM is advised for smooth audio processing experience.
A microphone.
Speaker/headphones.

Installation notes

Platypush 1.0.0 has recently been released, and new installation procedures with it.

There's now official support for several package managers, a better Docker installation process, and more powerful ways to install plugins - via pip extras, Web interface, Docker and virtual environments.

The optional dependencies for any Platypush plugins can be installed via pip extras in the simplest case:

$ pip install 'platypush[plugin1,plugin2,...]'

For example, if you want to install Platypush with the dependencies for assistant.openai and assistant.picovoice:

$ pip install 'platypush[assistant.openai,assistant.picovoice]'

Some plugins however may require extra system dependencies that are not available via pip - for instance, both the OpenAI and Picovoice integrations require the ffmpeg binary to be installed, as it is used for audio conversion and exporting purposes. You can check the plugins documentation for any system dependencies required by some integrations, or install them automatically through the Web interface or the platydock command for Docker containers.
A note on the hooks

All the custom actions in this article are built through event hooks triggered by SpeechRecognizedEvent (or IntentRecognizedEvent for intents). When an intent event is triggered, or a speech event with a condition on a phrase, the assistant integrations in Platypush will prevent the default assistant response. That's to avoid cases where e.g. you say "turn off the lights", your hook takes care of running the actual action, while your voice assistant fetches a response from Google or ChatGPT along the lines of "sorry, I can't control your lights".

If you want to render a custom response from an event hook, you can do so by calling event.assistant.render_response(text), and it will be spoken using the available text-to-speech integration.

If you want to disable this behaviour, and you want the default assistant response to always be rendered, even if it matches a hook with a phrase or an intent, you can do so by setting the stop_conversation_on_speech_match parameter to false in your assistant plugin configuration.

Text-to-speech

Each of the available assistant plugins has its own default tts plugin associated:

assistant.google: tts, but tts.google is also available. The difference is that tts uses the (unofficial) Google Translate frontend API - it requires no extra configuration, but besides setting the input language it isn't very configurable. tts.google on the other hand uses the Google Cloud Translation API. It is much more versatile, but it requires an extra API registered to your Google project and an extra credentials file.
assistant.openai: tts.openai, which leverages the OpenAI text-to-speech API.
assistant.picovoice: tts.picovoice, which uses the (still experimental, at the time of writing) Picovoice Orca engine.

Any text rendered via assistant*.render_response will be rendered using the associated TTS plugin. You can however customize it by setting tts_plugin on your assistant plugin configuration - e.g. you can render responses from the OpenAI assistant through the Google or Picovoice engine, or the other way around.

tts plugins also expose a say action that can be called outside of an assistant context to render custom text at runtime - for example, from other event hooks, procedures, cronjobs or API calls. For example:

$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
  }
}
' http://localhost:8008/execute

assistant.google

Plugin documentation
pip installation: pip install 'platypush[assistant.google]'

This is the oldest voice integration in Platypush - and one of the use-cases that actually motivated me into forking the previous project into what is now Platypush.

As mentioned in the previous section, this integration is built on top of a deprecated library (with no available alternatives) that just so happens to still work with a bit of hammering on x86_64 and Raspberry Pi 3/4.

Personally it's the voice assistant I still use on most of my devices, but it's definitely not guaranteed that it will keep working in the future.

Once you have installed Platypush with the dependencies for this integration, you can configure it through these steps:

Create a new project on the Google developers console and generate a new set of credentials for it. Download the credentials secrets as JSON.
Generate scoped credentials from your secrets.json.
Configure the integration in your config.yaml for Platypush (see the configuration page for more details):

assistant.google:
  # Default: ~/.config/google-oauthlib-tool/credentials.json
  # or /credentials/google/assistant.json
  credentials_file: /path/to/credentials.json
  # Default: no sound is played when "Ok Google" is detected
  conversation_start_sound: /path/to/sound.mp3

Restart the service, say "Ok Google" or "Hey Google" while the microphone is active, and everything should work out of the box.
You can now start creating event hooks to execute your custom voice commands. For example, if you configured a lights plugin (e.g. light.hue) and a music plugin (e.g. music.mopidy), you can start building voice commands like these:

    # Content of e.g. /path/to/config_yaml/scripts/assistant.py

    from platypush import run, when
    from platypush.events.assistant import (
        ConversationStartEvent, SpeechRecognizedEvent
    )

    light_plugin = "light.hue"
    music_plugin = "music.mopidy"

    @when(ConversationStartEvent)
    def pause_music_when_conversation_starts():
        run(f"{music_plugin}.pause_if_playing")

    # Note: (limited) support for regular expressions on `phrase`
    # This hook will match any phrase containing either "turn on the lights"
    # or "turn on lights"
    @when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
    def lights_on_command():
        run(f"{light_plugin}.on")
        # Or, with arguments:
        # run(f"{light_plugin}.on", groups=["Bedroom"])

    @when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
    def lights_off_command():
        run(f"{light_plugin}.off")

    @when(SpeechRecognizedEvent, phrase="play (the)? music")
    def play_music_command():
        run(f"{music_plugin}.play")

    @when(SpeechRecognizedEvent, phrase="stop (the)? music")
    def stop_music_command():
        run(f"{music_plugin}.stop")

Or, via YAML:

    # Add to your config.yaml, or to one of the files included in it

    event.hook.pause_music_when_conversation_starts:
      if:
        type: platypush.message.event.assistant.ConversationStartEvent
      then:
        - action: music.mopidy.pause_if_playing

    event.hook.lights_on_command:
      if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "turn on (the)? lights"
      then:
        - action: light.hue.on
          # args:
          #   groups:
          #     - Bedroom

    event.hook.lights_off_command:
      if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "turn off (the)? lights"
      then:
        - action: light.hue.off

    event.hook.play_music_command:
      if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "play (the)? music"
      then:
        - action: music.mopidy.play

    event.hook.stop_music_command:
      if:
        type: platypush.message.event.assistant.SpeechRecognizedEvent
        phrase: "stop (the)? music"
      then:
        - action: music.mopidy.stop

Parameters are also supported on the phrase event argument through the ${} template construct. For example:

    from platypush import when, run
    from platypush.events.assistant import SpeechRecognizedEvent

    @when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
    def on_play_track_command(
        event: SpeechRecognizedEvent, title: str, artist: str
    ):
        results = run(
            "music.mopidy.search",
            filter={"title": title, "artist": artist}
        )

        if not results:
            event.assistant.render_response(f"Couldn't find {title} by {artist}")
            return

        run("music.mopidy.play", resource=results[0]["uri"])

Pros

👍 Very fast and robust API.
👍 Easy to install and configure.
👍 It comes with almost all the features of a voice assistant installed on Google hardware - except some actions native to Android-based devices and video/display features. This means that features such as timers, alarms, weather forecast, setting the volume or controlling Chromecasts on the same network are all supported out of the box.
👍 It connects to your Google account (can be configured from your Google settings), so things like location-based suggestions and calendar events are available. Support for custom actions and devices configured in your Google Home app is also available out of the box, although I haven't tested it in a while.
👍 Good multi-language support. In most cases the assistant seems quite capable of understanding questions in multiple languages and responding in the input language without any further configuration.

Cons

👎 Based on a deprecated API that could break at any moment.
👎 Limited hardware support (only x86_64 and RPi 3/4).
👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
👎 Not possible to configure the output voice - it can only use the stock Google Assistant voice.
👎 No support for intents - something similar was available (albeit tricky to configure) through the Actions SDK, but that has also been abandoned by Google.
👎 Not very modular. Both assistant.picovoice and assistant.openai have been built by stitching together different independent APIs. Those plugins are therefore quite modular. You can choose for instance to run only the hotword engine of assistant.picovoice, which in turn will trigger the conversation engine of assistant.openai, and maybe use tts.google to render the responses. By contrast, given the relatively monolithic nature of google-assistant-library, which runs the whole service locally, if your instance runs assistant.google then it can't run other assistant plugins.

assistant.picovoice

Plugin documentation
pip installation: pip install 'platypush[assistant.picovoice]'

The assistant.picovoice integration is available from Platypush 1.0.0.

Previous versions had some outdated stt.picovoice.* plugins for the individual products, but they weren't properly tested and they weren't combined together into a single integration that implements the Platypush assistant API.

This integration is built on top of the voice products developed by Picovoice. These include:

- Porcupine: a fast and customizable engine for hotword/wake-word detection. It can be enabled by setting hotword_enabled to true in the assistant.picovoice plugin configuration.
- Cheetah: a speech-to-text engine optimized for real-time transcriptions. It can be enabled by setting stt_enabled to true in the assistant.picovoice plugin configuration.
- Leopard: a speech-to-text engine optimized for offline transcriptions of audio files.
- Rhino: a speech-to-intent engine.
- Orca: a text-to-speech engine.

You can get your personal access key by signing up at the Picovoice console. You may be asked to submit a reason for using the service (feel free to mention a personal Platypush integration), and you will receive your personal access key.
If prompted to select the products you want to use, make sure to select the ones from the Picovoice suite that you want to use with the assistant.picovoice plugin.

A basic plugin configuration would look like this:

    assistant.picovoice:
      access_key: YOUR_ACCESS_KEY

      # Keywords that the assistant should listen for
      keywords:
        - alexa
        - computer
        - ok google

      # Paths to custom keyword files
      # keyword_paths:
      #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

      # Enable/disable the hotword engine
      hotword_enabled: true
      # Enable the STT engine
      stt_enabled: true

      # conversation_start_sound: ...

      # Path to a custom model to be used for speech-to-text
      # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

      # Path to an intent model. At least one custom intent model is required if
      # you want to enable intent detection.
      # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn

Hotword detection

If enabled through the hotword_enabled parameter (default: True), the assistant will listen for a specific wake word before starting the speech-to-text or intent recognition engines. You can specify custom models for your hotword (e.g. on the same device you may use "Alexa" to trigger the speech-to-text engine in English, "Computer" to trigger the speech-to-text engine in Italian, and "Ok Google" to trigger the intent recognition engine).

You can also create your custom hotword models using the Porcupine console.

If hotword_enabled is set to True, you must also specify the keywords parameter with the list of keywords that you want to listen for, and optionally the keyword_paths parameter with the paths to any custom hotword models that you want to use. If hotword_enabled is set to False, then the assistant won't start listening for speech after the plugin is started, and you will need to programmatically start the conversation by calling the assistant.picovoice.start_conversation action.
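One way to start a conversation programmatically from outside Platypush is through the same HTTP /execute endpoint used by the curl examples in this article. The following is a minimal sketch under those assumptions: the http://localhost:8008/execute URL and the Bearer token header are taken from those curl examples, while build_request and start_conversation are hypothetical helper names for illustration, not Platypush APIs.

```python
import json
import urllib.request


def build_request(action, **args):
    """Build the JSON payload for a Platypush action request."""
    return {"type": "request", "action": action, "args": args}


def start_conversation(token, base_url="http://localhost:8008"):
    """POST an assistant.picovoice.start_conversation request to /execute.

    Requires a running Platypush instance and a valid API token.
    """
    payload = json.dumps(
        build_request("assistant.picovoice.start_conversation")
    ).encode()

    req = urllib.request.Request(
        f"{base_url}/execute",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

    with urllib.request.urlopen(req, timeout=10) as response:
        return json.loads(response.read().decode())
```

Calling start_conversation(token) against a running instance is meant to have the same effect as a curl request with the corresponding payload. From within a Platypush script or hook, running the action directly is simpler.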
When a wake-word is detected, the assistant will emit a HotwordDetectedEvent that you can use to build your custom logic.

By default, the assistant will start listening for speech after the hotword if either stt_enabled or intent_model_path is set. If you don't want the assistant to start listening for speech after the hotword is detected (for example because you want to build your custom response flows, or trigger the speech detection using different models depending on the hotword that is used, or because you just want to detect hotwords but not speech), then you can also set the start_conversation_on_hotword parameter to false. If that is the case, then you can programmatically start the conversation by calling the assistant.picovoice.start_conversation method in your event hooks:

    from platypush import when
    from platypush.message.event.assistant import HotwordDetectedEvent

    # Start a conversation using the Italian language model when the
    # "Buongiorno" hotword is detected
    @when(HotwordDetectedEvent, hotword='Buongiorno')
    def on_it_hotword_detected(event: HotwordDetectedEvent):
        event.assistant.start_conversation(model_file='path/to/it.pv')

Speech-to-text

If you want to build your custom STT hooks, the approach is the same as for the assistant.google plugin - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.

Speech-to-intent

Intents are structured actions parsed from unstructured human-readable text.

Unlike with hotword and speech-to-text detection, you need to provide a custom model for intent detection. You can create your custom model using the Rhino console.

When an intent is detected, the assistant will emit an IntentRecognizedEvent and you can build your custom hooks on it.

For example, you can build a model to control groups of smart lights by defining the following slots on the Rhino console:

- device_state: The new state of the device (e.g. with on or off as supported values)
- room: The name of the room associated to the group of lights to be controlled (e.g. living room, kitchen, bedroom)

You can then define a lights_ctrl intent with the following expressions:

- "turn $device_state:state the lights"
- "turn $device_state:state the $room:room lights"
- "turn the lights $device_state:state"
- "turn the $room:room lights $device_state:state"
- "turn $room:room lights $device_state:state"

This intent will match any of the following phrases:

- "turn on the lights"
- "turn off the lights"
- "turn the lights on"
- "turn the lights off"
- "turn on the living room lights"
- "turn off the living room lights"
- "turn the living room lights on"
- "turn the living room lights off"

And it will extract any slots that are matched in the phrases in the IntentRecognizedEvent.

Train the model, download the context file, and pass its path in the intent_model_path parameter.

You can then register a hook to listen to a specific intent:

    from platypush import when, run
    from platypush.events.assistant import IntentRecognizedEvent

    @when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
    def on_turn_on_lights(event: IntentRecognizedEvent):
        room = event.slots.get('room')
        if room:
            run("light.hue.on", groups=[room])
        else:
            run("light.hue.on")

Note that if both stt_enabled and intent_model_path are set, then both the speech-to-text and intent recognition engines will run in parallel when a conversation is started.

The intent engine is usually faster, as it has a smaller set of intents to match and doesn't have to run a full speech-to-text transcription. This means that, if an utterance matches both a speech-to-text phrase and an intent, the IntentRecognizedEvent event is emitted (and not SpeechRecognizedEvent).

This may not always be the case though.
So, if you want to use the intent detection engine together with the speech detection, it may be a good practice to also provide a fallback SpeechRecognizedEvent hook to catch the text if the speech is not recognized as an intent:

    from platypush import when, run
    from platypush.events.assistant import SpeechRecognizedEvent

    @when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
    def on_lights_command(event: SpeechRecognizedEvent, state, room, **context):
        # Only act on the two supported states, so that "turn off ..."
        # doesn't end up turning the lights on
        if state not in ('on', 'off'):
            return

        action = f"light.hue.{state}"
        if room:
            run(action, groups=[room])
        else:
            run(action)

Text-to-speech and response management

The text-to-speech engine, based on Orca, is provided by the tts.picovoice plugin.

However, the Picovoice integration won't provide you with automatic AI-generated responses for your queries. That's because Picovoice doesn't seem to offer (yet) any products for conversational assistants, either voice-based or text-based.

You can however leverage the render_response action to render some text as speech in response to a user command, and that in turn will leverage the Picovoice TTS plugin to render the response.

For example, the following snippet provides a hook that:

- Listens for SpeechRecognizedEvent.
- Matches the phrase against a list of predefined commands that shouldn't require an AI-generated response.
- Has a fallback logic that leverages openai.get_response to generate a response through a ChatGPT model and render it as audio.

Also, note that any text rendered over the render_response action that ends with a question mark will automatically trigger a follow-up - i.e. the assistant will wait for the user to answer its question.
    import re

    from platypush import run, when
    from platypush.message.event.assistant import SpeechRecognizedEvent

    # The command handlers accept (and ignore) the event and any extra
    # arguments, since they are all called with the same signature below
    def play_music(*_, **__):
        run("music.mopidy.play")

    def stop_music(*_, **__):
        run("music.mopidy.stop")

    def ai_assist(event: SpeechRecognizedEvent, **_):
        response = run("openai.get_response", prompt=event.phrase)
        if not response:
            return

        run("assistant.picovoice.render_response", text=response)

    # List of commands to match, as pairs of regex patterns and the
    # corresponding actions
    hooks = (
        (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
        (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
        # ...
        # Fallback to the AI assistant
        (re.compile(r".*"), ai_assist),
    )

    @when(SpeechRecognizedEvent)
    def on_speech_recognized(event, **kwargs):
        for pattern, command in hooks:
            if pattern.search(event.phrase):
                run("logger.info", msg=f"Running voice command: {command.__name__}")
                command(event, **kwargs)
                break

Offline speech-to-text

An assistant.picovoice.transcribe action is provided for offline transcriptions of audio files, using the Leopard models.

You can easily call it from your procedures, hooks or through the API:

    $ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
    {
      "type": "request",
      "action": "assistant.picovoice.transcribe",
      "args": {
        "audio_file": "/path/to/some/speech.mp3"
      }
    }' http://localhost:8008/execute

    {
      "transcription": "This is a test",
      "words": [
        {
          "word": "this",
          "start": 0.06400000303983688,
          "end": 0.19200000166893005,
          "confidence": 0.9626294374465942
        },
        {
          "word": "is",
          "start": 0.2879999876022339,
          "end": 0.35199999809265137,
          "confidence": 0.9781675934791565
        },
        {
          "word": "a",
          "start": 0.41600000858306885,
          "end": 0.41600000858306885,
          "confidence": 0.9764975309371948
        },
        {
          "word": "test",
          "start": 0.5120000243186951,
          "end": 0.8320000171661377,
          "confidence": 0.9511580467224121
        }
      ]
    }

Pros

👍 The Picovoice integration is extremely configurable.
assistant.picovoice stitches together five independent products developed by a small company specialized in voice products for developers. As such, Picovoice may be the best option if you have custom use-cases. You can pick which features you need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you have plenty of flexibility in building your integrations.
👍 Runs (or at least seems to run) mostly on device. This is something that we can't say about the other two integrations discussed in this article. If keeping your voice interactions 100% hidden from Google's or Microsoft's eyes is a priority, then Picovoice may be your best bet.
👍 Rich features. It uses different models for different purposes - for example, Cheetah models are optimized for real-time speech detection, while Leopard is optimized for offline transcription. Moreover, Picovoice is the only integration among those analyzed in this article to support speech-to-intent.
👍 It's very easy to build new models or customize existing ones. Picovoice has a powerful developers console that allows you to easily create hotword models, tweak the priority of some words in voice models, and create custom intent models.

Cons

👎 The business model is still a bit weird. It's better than the earlier "write us an email with your business case and we'll reach back to you", but it still requires you to sign up with a business email and write a couple of lines on what you want to build with their products. It feels like their focus is on a B2B approach rather than "open up and let the community build stuff", and that seems to create unnecessary friction.
👎 No native conversational features. At the time of writing, Picovoice doesn't offer products that generate AI responses given voice or text prompts. This means that, if you want AI-generated responses to your queries, you'll have to do requests to e.g. openai.get_response(prompt) directly in your hooks for SpeechRecognizedEvent, and render the responses through assistant.picovoice.render_response. This makes assistant.picovoice alone better suited to cases where you mostly want to create voice command hooks rather than have general-purpose conversations.
👎 Speech-to-text, at least on my machine, is slower than with the other two integrations, and the accuracy with non-native accents is also much lower.
👎 Limited support for any languages other than English. At the time of writing hotword detection with Porcupine seems to be in relatively good shape, with support for 16 languages. However, both speech-to-text and text-to-speech only support English at the moment.
👎 Some APIs are still quite unstable. The Orca text-to-speech API, for example, doesn't even support text that includes digits or some punctuation characters - at least not at the time of writing. The Platypush integration fills the gap with workarounds that e.g. replace digits with words and replace unsupported punctuation characters, but you definitely have a feeling that some parts of their products are still work in progress.

assistant.openai

Plugin documentation
pip installation: pip install 'platypush[assistant.openai]'

This integration has been released in Platypush 1.0.7.

It uses the following OpenAI APIs:

- /audio/transcriptions for speech-to-text. At the time of writing the default model is whisper-1. It can be configured through the model setting on the assistant.openai plugin configuration. See the OpenAI documentation for a list of available models.
- /chat/completions to get AI-generated responses using a GPT model. At the time of writing the default is gpt-3.5-turbo, but it can be configured through the model setting on the openai plugin configuration. See the OpenAI documentation for a list of supported models.
- /audio/speech for text-to-speech. At the time of writing the default model is tts-1 and the default voice is nova.
They can be configured through the model and voice settings respectively on the tts.openai plugin. See the OpenAI documentation for a list of available models and voices.

You will need an OpenAI API key associated to your account.

A basic configuration would look like this:

    openai:
      api_key: YOUR_OPENAI_API_KEY  # Required
      # conversation_start_sound: ...
      # model: ...
      # context: ...
      # context_expiry: ...
      # max_tokens: ...

    assistant.openai:
      # model: ...
      # tts_plugin: some.other.tts.plugin

    tts.openai:
      # model: ...
      # voice: ...

If you want to build your custom hooks on speech events, the approach is the same as for the other assistant plugins - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.

Hotword support

OpenAI doesn't provide an API for hotword detection, nor a small model for offline detection.

This means that, if no other assistant plugins with stand-alone hotword support are configured (only assistant.picovoice for now), a conversation can only be triggered by calling the assistant.openai.start_conversation action.

If you want hotword support, then the best bet is to add assistant.picovoice to your configuration too - but make sure to only enable hotword detection and not speech detection, which will be delegated to assistant.openai via event hook:

    assistant.picovoice:
      access_key: ...
      keywords:
        - computer

      hotword_enabled: true
      stt_enabled: false
      # conversation_start_sound: ...

Then create a hook that listens for HotwordDetectedEvent and calls assistant.openai.start_conversation:

    from platypush import run, when
    from platypush.events.assistant import HotwordDetectedEvent

    @when(HotwordDetectedEvent, hotword="computer")
    def on_hotword_detected():
        run("assistant.openai.start_conversation")

Conversation contexts

The most powerful feature offered by the OpenAI assistant is the fact that it leverages the conversation contexts provided by the OpenAI API.
This means two things:

- Your assistant can be initialized/tuned with a static context. It is possible to provide some initialization context that fine-tunes how the assistant will behave (e.g. what kind of tone/language/approach it will have when generating the responses), as well as initialize the assistant with some predefined knowledge in the form of hypothetical past conversations. Example:

    openai:
      # ...

      context:
        # `system` can be used to initialize the context for the expected tone
        # and language in the assistant responses
        - role: system
          content: >
            You are a voice assistant that responds to user queries using
            references to Lovecraftian lore.

        # `user`/`assistant` interactions can be used to initialize the
        # conversation context with previous knowledge. `user` is used to
        # emulate previous user questions, and `assistant` models the
        # expected response.
        - role: user
          content: What is a telephone?
        - role: assistant
          content: >
            A Cthulhuian device that allows you to communicate with
            otherworldly beings. It is said that the first telephone was
            created by the Great Old Ones themselves, and that it is a
            gateway to the void beyond the stars.

If you now start Platypush and ask a question like "how does it work?", the voice assistant may give a response along the lines of:

    The telephone functions by harnessing the eldritch energies of the cosmos to transmit vibrations through the ether, allowing communication across vast distances with entities from beyond the veil. Its operation is shrouded in mystery, for it relies on arcane principles incomprehensible to mortal minds.

Note that:

- The style of the response is consistent with that initialized in the context through system roles.
- Even though a question like "how does it work?" is not very specific, the assistant treats the user/assistant entries given in the context as if they were the latest conversation prompts. Thus it realizes that "it", in this context, probably means "the telephone".
- The assistant has a runtime context. It will remember the recent conversations for a given amount of time (configurable through the context_expiry setting on the openai plugin configuration). So, even without explicit context initialization in the openai plugin, the plugin will remember the last interactions for (by default) 10 minutes. So if you ask "who wrote the Divine Comedy?", and a few seconds later you ask "where was its writer from?", you may get a response like "Florence, Italy" - i.e. the assistant realizes that "the writer" in this context is likely to mean "the writer of the work that I was asked about in the previous interaction" and returns pertinent information.

Pros

👍 Speech detection quality. The OpenAI speech-to-text features are the best among the available assistant integrations. The transcribe API so far has transcribed my non-native English accent correctly nearly 100% of the time (Google comes close to 90%, while Picovoice trails well behind). And it even detects the speech of my young kid - something that the Google Assistant library has always failed to do right.
👍 Text-to-speech quality. The voice models used by OpenAI sound much more natural and human than those of both Google and Picovoice. Google's and Picovoice's TTS models are actually already quite solid, but OpenAI outclasses them when it comes to voice modulation, inflections and sentiment. The result sounds intimidatingly realistic.
👍 AI responses quality. While the scope of the Google Assistant is somewhat limited by what people expected from voice assistants until a few years ago (control some devices and gadgets, find my phone, tell me the news/weather, do basic Google searches...), usually without much room for follow-ups, assistant.openai will basically render voice responses as if you were typing them directly to ChatGPT.
While Google would often respond with a "sorry, I don't understand", or "sorry, I can't help with that", the OpenAI assistant is more likely to expose its reasoning, ask follow-up questions to refine its understanding, and in general create a much more realistic conversation.
👍 Contexts. They are an extremely powerful way to initialize your assistant and customize it to speak the way you want, and know the kind of things that you want it to know. Cross-conversation contexts with configurable expiry also make it more natural to ask something, get an answer, and then ask another question about the same topic a few seconds later, without having to reintroduce the assistant to the whole context.
👍 Offline transcriptions available through the openai.transcribe action.
👍 Multi-language support seems to work great out of the box. Ask the assistant something in any language, and it'll give you a response in that language.
👍 Configurable voices and models.

Cons

👎 The full pack of features is only available if you have an API key associated to a paid OpenAI account.
👎 No hotword support. It relies on assistant.picovoice for hotword detection.
👎 No intents support.
👎 No native support for weather forecast, alarms, timers, integrations with other services/devices or other features available out of the box with the Google Assistant. You can always create hooks for them though.

Weather forecast example

Both the OpenAI and Picovoice integrations lack some features available out of the box on the Google Assistant - weather forecast, news playback, timers etc. - as they rely on voice-only APIs that by default don't connect to other services.

However Platypush provides many plugins to fill those gaps, and those features can be implemented with custom event hooks.

Let's see for example how to build a simple hook that delivers the weather forecast for the next 24 hours whenever the assistant gets a phrase that contains the "weather today" string.
You'll need to enable a weather plugin in Platypush - weather.openweathermap will be used in this example. Configuration:

    weather.openweathermap:
      token: OPENWEATHERMAP_API_KEY
      location: London,GB

Then drop a script named e.g. weather.py in the Platypush scripts directory (default: /scripts) with the following content:

    from datetime import datetime
    from textwrap import dedent
    from time import time

    from platypush import run, when
    from platypush.events.assistant import SpeechRecognizedEvent

    @when(SpeechRecognizedEvent, phrase='weather today')
    def weather_forecast(event: SpeechRecognizedEvent):
        limit = time() + 24 * 60 * 60  # 24 hours from now
        forecast = [
            weather
            for weather in run("weather.openweathermap.get_forecast")
            if datetime.fromisoformat(weather["time"]).timestamp() < limit
        ]

        min_temp = round(
            min(weather["temperature"] for weather in forecast)
        )
        max_temp = round(
            max(weather["temperature"] for weather in forecast)
        )
        max_wind_gust = round(
            (max(weather["wind_gust"] for weather in forecast)) * 3.6
        )
        summaries = [weather["summary"] for weather in forecast]
        most_common_summary = max(summaries, key=summaries.count)
        avg_cloud_cover = round(
            sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
        )

        event.assistant.render_response(
            dedent(
                f"""
                The forecast for today is: {most_common_summary}, with
                a minimum of {min_temp} and a maximum of {max_temp}
                degrees, wind gust of {max_wind_gust} km/h, and an
                average cloud cover of {avg_cloud_cover}%.
                """
            )
        )

This script will work with any of the available voice assistants.

You can also implement something similar for news playback, for example using the rss plugin to get the latest items in your subscribed feeds. Or to create custom alarms using the alarm plugin, or a timer using the utils.set_timeout action.

Conclusions

The past few years have seen a lot of things happen in the voice industry. Many products have gone out of market, been deprecated or sunset, but not all hope is lost.
The OpenAI and Picovoice products, especially when combined together, can still provide a good out-of-the-box voice assistant experience. And the OpenAI products have also raised the bar on what to expect from an AI-based assistant.

I wish that there were still some fully open and on-device alternatives out there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google provide the best voice experience as of now, but of course they come with trade-offs - namely the great amount of data points you feed to these cloud-based services. Picovoice is something of a middle ground, as it runs at least partly on-device, but their business model is still a bit fuzzy and it's not clear whether they intend to have their products used by the wider public or if it's mostly B2B.

I'll keep an eye however on what is going to come from the ashes of Mycroft in the form of the OpenConversational project, and probably keep you up-to-date when there is a new integration to share.

Fabio Manganiello Apr 07, 2024 @ 00:00

blog.fabiomanganiello.com

Some progress on the state of speech detection in Platypush (powered by Picovoice)

I've picked up some development on Picovoice these days as I'm rewriting some Platypush integrations that haven't been touched in a long time (and Picovoice is among those).

I originally worked with their APIs about 4-5 years ago, when I did some research on STT engines for Platypush.

Back then I kind of overlooked Picovoice. It wasn't very well documented, the APIs were a bit clunky, and their business model was based on a weird "send us an email with your use-case and we'll reach back to you" approach (definitely not the kind of thing you'd want other users to reuse with their own accounts and keys).

Eventually I did just enough work to get the basics to work, and then both my article 1 and article 2 on voice assistants focused more on other solutions - namely Google Assistant, Alexa, Snowboy, Mozilla DeepSpeech and Mycroft's models.

A couple of years down the line:

- Snowboy is dead
- Mycroft is dead
- Mozilla DeepSpeech isn't officially dead, but it hasn't seen a commit in 3 years
- Amazon's AVS APIs have become clunky and it's basically impossible to run any logic outside of Amazon's cloud
- The Google Assistant library has been deprecated without a replacement. It still works on Platypush after I hammered it a lot (especially when it comes to its dependencies from 5-6 years ago), but it only works on x86_64 and Raspberry Pi 3/4 (not aarch64).

So I was like "ok, let's give Picovoice another try".

And I must say that I'm impressed by what I've seen. The documentation has improved a lot. The APIs are much more polished. They also have a Web console that you can use to train your hotword models and intents logic - no coding involved, similar to what Snowboy used to have.

The business model is still a bit weird, but at least now you can sign up from a Web form (and still explain what you want to use Picovoice products for), and you immediately get an access key to start playing on any platform. And the product isn't fully open-source either (only the API bindings are).
But at first glance it seems that most of the processing (if not all, with the exception of authentication) happens on-device - and that's a big selling point. Most of all, the hotword models are really good. After a bit of plumbing with sounddevice, I've managed to implement a real-time hotword detection on Platypush that works really well. The accuracy is comparable to that of Google Assistant's, while supporting many more hotwords and being completely offline. Latency is very low, and the CPU usage is minimal even on a Raspberry Pi 4. I also like the modular architecture of the project. You can use single components (Porcupine for hotword detection, Cheetah for speech detection from stream, Leopard for speech transcription, Rhino for intent parsing...) in order to customize your assistant with the features that you want. I'm now putting together a new Picovoice integration for Platypush that, rather than having separate integrations for hotword detection and STT, wires everything together, enables intent detection and provides TTS rendering too (it depends on what's the current state of the TTS products on Picovoice). I'll write a new blog article when ready. In the meantime, you can follow the progress on the Picovoice branch.

Fabio Manganiello Jul 07, 2020 @ 00:00

blog.platypush.tech

One web extension to rule them all

Once upon a time, there was a worldwide web where web extensions were still new toys to play with, and the major browsers that supported them (namely Firefox and Chrome) didn’t mind giving them very wide access to their internals and APIs to do (more or less) whatever they pleased. The idea was that these browser add-ons/apps/extensions (the lines between these were still quite blurry at the time) could become a powerful way to run any piece of software the user wanted within a browser, even locally and without connecting to another website.

It was an age when powerful extensions emerged that could deeply change many things in the browser (like the now-defunct Vimperator, which could completely redesign the UI of the browser to make it look and behave like vim), and user scripts were a powerful way for users to run anything they liked wherever they liked. I used Vimperator custom scripts a lot to map whichever sequence of keys I wanted to whichever custom action I wanted, all modeled as plain JavaScript. And I used user scripts a lot as well. Those still exist, but with many more limitations than before.

That Wild West age of web extensions and apps is largely gone by now. It didn’t take long before malicious actors realized that the freedom given to web extensions made them a perfect vector to run malware/spyware directly within the browser that, in many cases, could bypass several anti-malware layers. That generation of web extensions also had a fragmentation problem. Firefox and Chrome had developed their own APIs (like Mozilla’s XUL and Chrome Apps) that didn’t have much overlap. That made developing a web extension that targeted multiple browsers very expensive, and many extensions and apps were only available for a particular browser.

The case for greater security, separation of concerns, and less fragmentation drove the migration towards the modern WebExtension API. Around the end of 2017, both Mozilla and Google ended support for the previous APIs in their respective browsers. They also added more restrictions for add-ons and scripts not approved on their stores (recent versions of Firefox only allow you to permanently install extensions published on the store) and added more constraints and checks to their review processes.

The new API has made it harder for malicious actors to hack a user through the browser, and it has also greatly lowered the barriers to developing a cross-browser extension. On the other hand, it has also greatly reduced the degrees of freedom offered to extensions. Several extensions that required deep integration with the browser (like Vimperator and Postman) decided to either migrate to stand-alone apps or just abandon their efforts. And user scripts have become a more niche, geeky feature, offered with more limitations than before by third-party extensions like Greasemonkey/Tampermonkey. Firefox’s recent user-scripts API is a promising alternative for reviving the power of the past wave, but so far it’s only supported by Firefox.

As a power user, while I understand all the motivations that led browser developers to fence and sandbox extensions more tightly, I still miss those times when we could deeply customize our browser and what it could do however we liked. I built Platypush over the years to solve my need for endless extensibility and customization on the backend side, with everything provided by a uniform and coherent API and platform. Applying the same philosophy to my web browser felt like the natural next step. With the Platypush web extension, I’ve tried to build a solution for several needs faced by many power users.

First, we’ve got several backend solutions to run things around, and smart home devices to do things and pass information around. But the dear ol’ desktop web browser has often been left behind in this progress in automation, even if many people still spend a lot of time on the web through desktop devices. Most of the front-end solutions for cloud/home automation come through mobile apps. Some automation solutions provide a web app/panel (and Platypush does as well), but the web panel is receiving less and less attention in an increasingly mobile-centric world.

And even when your solution provides a web app, there’s another crucial factor to take into account: the time to action. How much time passes between you thinking “I’d like to run this action on that device” and the action actually being executed on that device? Remember that, especially when it comes to smart devices, the time-to-action in the “smart” way (like toggling a light bulb remotely) should never be longer than the time-to-action in the “dumb” way (like standing up and toggling a switch). That’s your baseline.

When I’m doing some work on my laptop I may sometimes want to run some action on another device: send a link to my phone, turn on the lights or the fan, play the video currently playing on the laptop on my media center, play the Spotify playlist playing in my bedroom in my living room (or the other way around), and so on. Sure, for some of these problems there’s a Platypush/HomeAssistant/OpenHAB/BigCorp Inc. front-end solution, but that usually involves either taking your hands off your laptop to grab your phone, or opening/switching to the tab with the web app provided by your platform, searching for the right menu/option, scrolling a bit, and then running the action. Voice assistants are another option (and Platypush provides integrations that give you access to many of the voice technologies around), but talking your way through the day to run anything isn’t yet the frictionless and fast process many want, nor should it be the only way.

Minimizing the time-to-action for me means being able to run that action on the fly (ideally within a maximum of three clicks or keystrokes) from any tab or from the toolbar itself, regardless of the action. Sure, there are some web extensions that solve some of those problems. But that usually involves:
Relying on someone else’s solution for your problem, and that solution isn’t necessarily the best one for your use case.
Polluting your browser with lots of extensions in order to execute different types of actions. Sending links to other devices may involve installing the Pushbullet/Join extension, playing media on Kodi another extension, playing media on the Chromecast another extension, saving links to Instapaper/Evernote/Pocket or other extensions, sharing on Twitter/Facebook yet more extensions, controlling your smart home hub yet another extension… and the list goes on, until your browser’s toolbar is packed with icons and you can’t even recall what some of them do, defeating the whole purpose of optimizing the time-to-action from the context of the web browser.

And, of course, installing too many extensions increases the attack surface of your browser, which is the problem that the WebExtensions API was supposed to solve in the first place.

I first started this journey by building a simple web extension that I could use to quickly debug Platypush commands executed on other RaspberryPis and smart devices around my house over web API/websocket/MQTT. Then I realized that I could use the same solution to solve my problem of optimizing the time-to-action, i.e. the problem of “I want to switch on the lights right now without grabbing my phone, switching tabs or standing up, while I’m working on my Medium article on the laptop.” And that means either from the toolbar itself (preferably with all the actions grouped under the same extension button and UI) or through the right-click context menu, like a native browser action. The ability to run any Platypush action from my browser on any remote device meant that I could control any device or remote API from the same interface, as long as there is a Platypush plugin to interact with that device/API.

But that target wasn’t enough for me yet. Not all the actions that I may want to run on the fly from any location in the browser could be translated to an atomic Platypush action. Platypush remote procedures can surely help with running more complex logic on the backend, but I wanted the extension to also cover my use cases that require interaction with the browser context: things like “play this video on my Chromecast (yes, even if I’m on Firefox)”, “translate this page and make sure that the result doesn’t look like a 1997 website (yes, even if I’m on Firefox)”, “download this Magnet link directly on my NAS”, and so on. All the way up to custom event hooks that could react to Platypush events triggered by other devices with custom logic running in the browser: things like “synchronize the clipboard on the laptop if another Platypush device sends a ClipboardEvent”, “send a notification to the browser with the spoken text when the Google Assistant plugin triggers a ResponseEvent”, or reacting when a sensor goes above a certain threshold, and so on.

I wanted the ability to define all of these actions through a native JavaScript API similar to that provided by Greasemonkey/Tampermonkey. But while most of the user scripts provided by those extensions only run within the context of a web page, I wanted to decouple my script snippets from the web page and build an API that provides access to the browser context and to the Platypush actions available on any other remote device, can run background code in response to custom events, and can synchronize the configuration easily across devices. So let’s briefly go through the extension to see what you can do with it.

Installation and usage

First, you need a Platypush service running somewhere. If you haven’t tried it before, refer to any of the links in the previous sections to get started (I’ve made sure that installing, configuring, and starting a base environment doesn’t take longer than five minutes, I promise :) ). Also, make sure that you enable the HTTP backend in the config.yaml, as the web server is the channel used by the extension to communicate with the server.

Once you have a Platypush instance running on e.g. a RaspberryPi, another server or your laptop, get the web extension, which is available for both Firefox and Chrome.

You can also build the extension from sources. First, make sure that you have npm installed, then clone the repo:

    git clone https://git.platypush.tech/platypush/platypush-webext

Install the dependencies and build the extension:

    cd platypush-webext
    npm install
    npm run build

At the end of the process, you should have a dist folder with a manifest.json.

In Chrome (or any Chromium-based browser), go to Extensions -> Load Unpacked and select the dist folder.
In Firefox, go to about:debugging -> This Firefox -> Load Temporary Add-on and select the manifest.json file.

Note that recent versions of Firefox only support unpacked extensions (i.e. any extension not loaded from the Firefox add-ons website) through about:debugging. This means that any temporary extension will be lost when the browser is restarted. However, restoring the configuration of the Platypush extension when it’s reinstalled is a very quick process.

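Since the extension reaches the server through the HTTP backend, it can help to see roughly what such a call looks like. The following is only a minimal sketch, not the extension's actual code: it assumes that the Platypush web server accepts a JSON request with `type`, `action` and `args` fields on an `/execute` endpoint, and that it listens on port 8008. The helper names `buildRequest` and `runAction`, the host name and the port are illustrative assumptions, and any authentication you have configured on the server is left out.

```javascript
// Build the JSON payload for a Platypush action request.
// The field names follow the request format assumed in the text above.
function buildRequest(action, args = {}) {
  return { type: 'request', action, args };
}

// Hypothetical helper: send an action request to a Platypush host over HTTP.
// `baseURL` would be something like 'http://raspberrypi.local:8008'
// (host and port are assumptions, adjust them to your setup).
async function runAction(baseURL, action, args = {}) {
  const response = await fetch(`${baseURL}/execute`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildRequest(action, args)),
  });

  if (!response.ok) {
    throw new Error(`Platypush request failed: HTTP ${response.status}`);
  }

  // The server is expected to answer with a JSON document
  return response.json();
}
```

With helpers like these, something along the lines of `await runAction(baseURL, 'media.chromecast.play', { resource: url })` would express a one-off action. In practice the extension does this plumbing for you, so you only deal with action names and arguments.
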
Once installed in the browser, the extension icon will appear in the toolbar.

[Web extension screenshot 1]

Click on the available link to open the extension configuration tab and add your Platypush device in the configuration.

[Web extension screenshot 2]

Once the device is added, click on its name from the menu and select Run Action.

[Web extension screenshot 3]

The run tab comes with two modes: request and script mode.

In request mode, you can run actions directly on a remote Platypush device through a dynamic interface. You’ve got a form with an autocomplete menu that displays all the actions available on your device and, upon selection, the form is pre-populated with all the arguments available for that action, their default values, and their descriptions. This interface is very similar to the execute tab provided by the Platypush web panel, and it makes it super easy to quickly test and run commands on another host.

You can use this interface to run any action on any remote device as long as there’s a plugin installed and configured for it: file system management, media center controls, voice assistants, cameras, switches, getting data from sensors, managing cloud services, you name it. You can also run procedures stored on the remote device (their action names start with procedure), and you can pass the URL of the active tab to the action as an argument by using the special variable $URL$ as an argument value. For instance, you can use it to create an action that sends the current URL to your mobile device through pushbullet.send_note, with both body and url set to $URL$.

Once you’re happy with your action, you can save it so it’s available both from the toolbar and the browser context menu.

You can also associate keybindings to your actions, so you can run them in your browser from any tab with just a few keystrokes.
The mappings are in the form , with n between 0 and 9. However, Chrome-based browsers limit the number of keybindings per extension to a maximum of 4, for some odd reason that escapes me.

If you only need a way to execute Platypush actions remotely from your browser, this is actually all you need. The action will now be available from the extension toolbar:

[Web extension screenshot 4]

And from the context menu:

[Web extension screenshot 5]

You can easily debug/edit stored actions from the Stored Action tab in the extension’s configuration page.

Script mode

The other (and most powerful) way to define custom actions is through scripts. Scripts can be used to glue together the Platypush API (or any other API) and the browser API. Select Script from the selector at the top of the Run Action tab. You will be presented with a JavaScript editor with a pre-loaded script template:

[Web extension screenshot 6]

The page also provides a link to a Gist showing examples for all the available pieces of the API. In a nutshell, these are the most important pieces you can use to build your user scripts:
args includes relevant context information for your scripts, such as the target Platypush host, the tabId, and the target element, if the action was called from a context menu on a page.
app exposes the API available to the script.

Among the methods exposed by app:
app.getURL returns the URL in the active tab.
app.setURL changes the URL rendered in the active tab, while app.openTab opens a URL in a new tab.
app.notify(message, title) displays a browser notification.
app.run executes actions on a remote Platypush device.

For example, this is a possible action to cast YouTube videos to the default Chromecast device:

    // Platypush user script to play the current URL
    // on the Chromecast if it is a YouTube URL.
    async (app, args) => {
      const url = await app.getURL();
      if (!url.startsWith('https://www.youtube.com/watch?v=')) {
        return;
      }

      const response = await app.run({
        action: 'media.chromecast.play',
        args: {
          resource: url,
        },
      }, args.host);

      if (response.success) {
        app.notify('YouTube video now playing on Chromecast');
      }
    }

app.axios.[get|post|put|delete|patch|head|options]: The API also exposes the Axios API to perform custom AJAX calls to remote endpoints. For example, if you want to save the current URL to your Instapaper account:

    // Sample Platypush user script to save the current URL to Instapaper
    async (app, args) => {
      const url = await app.getURL();
      const response = await app.axios.get('https://www.instapaper.com/api/add', {
        params: {
          url: url,
          username: '********@****.***',
          password: '******',
        },
      });

      const targetURL = `https://instapaper.com/read/${response.data.bookmark_id}`;
      app.openTab(targetURL);
    }

app.getDOM returns the DOM/content of the current page (as a Node element), while app.setDOM replaces the DOM/content of the page (given as a string). For example, you can combine the provided DOM API with the Platypush Translate plugin to translate a web page on the fly:

    // Platypush user script to translate a web page through the Google Translate API
    async (app, args) => {
      const dom = await app.getDOM();

      // Translate the page through the Platypush Google Translate plugin
      // (https://docs.platypush.tech/en/latest/platypush/plugins/google.translate.html).
      // The plugin also splits the HTML in multiple requests if too long
      // to circumvent Google's limit on maximum input text.
      const response = await app.run({
        action: 'google.translate.translate',
        args: {
          text: dom.body.innerHTML,
          format: 'html',
          target_language: 'en',
        }
      }, args.host);

      // The new body will contain a <div> with the translated HTML,
      // a hidden <div> with the original HTML and a top fixed button
      // to switch back to the original page.
      const translatedDiv = `
        ${response.translated_text}
        ${dom.body.innerHTML}
      `;

      const style = ` `;

      // Reconstruct the DOM and change it.
      dom.head.innerHTML += style;
      dom.body.innerHTML = translatedDiv;
      await app.setDOM(`${dom.getElementsByTagName('html')[0].innerHTML}`);
    }

The extension API also exposes the Mercury Reader API to simplify/distill the content of a web page. You can combine the elements seen so far into a script that simplifies the content of a web page for better readability or to make it more printer-friendly:

    // Platypush sample user script to simplify/distill the content of a web page
    async (app, args) => {
      const url = await app.getURL();

      // Get and parse the page body through the Mercury API
      const dom = await app.getDOM();
      const html = dom.body.innerHTML;
      const response = await app.mercury.parse(url, html);

      // Define a new DOM that contains the simplified body as well as
      // the original body as a hidden <div>, and provide a top fixed
      // button to switch back to the original content.
      const style = ` `;
      const simplifiedDiv = `
        ${response.title}
        ${response.content}
        ${dom.body.innerHTML}
      `;

      // Construct and replace the DOM
      dom.head.innerHTML += style;
      dom.body.innerHTML = simplifiedDiv;
      await app.setDOM(`${dom.getElementsByTagName('html')[0].innerHTML}`);
    }

Finally, you can access the target element if you run the action through a context menu (for example, by right-clicking on an item on the page). Because of WebExtensions API limitations (only JSON-serializable objects can be passed around), the target element is passed in args as a string, but you can easily convert it to a DOM object (and you can convert any HTML to DOM) through the app.HTML2DOM method. For example, you can extend the initial YouTube to Chromecast user script to cast any audio or video item present on a page:

    // Sample Platypush user script to cast the current tab or any media item selected
    // on the page to the default Chromecast device configured in Platypush.
    async (app, args) => {
      const baseURL = await app.getURL();
      // Default URL to cast: current page URL
      let url = baseURL;

      if (args.target) {
        // The user executed the action from a context menu
        const target = app.HTML2DOM(args.target);

        // If it's a

Build custom voice assistants https://blog.platypush.tech/article/Build-custom-voice-assistants

An overview of the current technologies and how to leverage Platypush to build your customized assistant.

The Case for DIY Voice Assistants

Overview of the voice assistant integrations

Native Google Assistant library

Integrations

Configuration

Features

Pros

Cons

Google Assistant Push-To-Talk Integration

Integrations

Configuration

Features

Pros

Cons

Alexa Integration

Integrations

Configuration

Features

Pros

Cons

Snowboy Integration

Integrations

Configuration

Features

Pros

Cons

Mozilla DeepSpeech

Integrations

Configuration

Features

Pros

Cons

PicoVoice

Integrations

Configuration

Features

Pros

Cons

Conclusions


The pipeline

Setup

Models

Hotword Detection

Speech-to-text

Text-to-speech

Configuration

Home automation plugins

Build

Run

Linux

macOS

Windows

Usage

Extending the Assistant

Starting a conversation

Deterministic commands

AI Commands

Speech to Intent

Response fallback

Pausing music while listening

Going fully local

Why this architecture ages well

Final notes
