Platypush

How to use Platypush to build your voice assistants. Featuring Google, OpenAI and Picovoice.

Author: Platypush

Those who have been following my blog or used Platypush for a while probably know that I've put quite some effort into getting voice assistants right over the past few years.

I built my first (very primitive) voice assistant that used DCT+Markov models back in 2008, when the concept was still pretty much a science fiction novelty.

Then I wrote an article in 2019 and one in 2020 on how to use several voice integrations in Platypush to create custom voice assistants.

Everyone in those pictures is now dead

Quite a few things have changed in this industry niche since I wrote my previous article. Most of the solutions that I covered back in the day, unfortunately, are gone in one way or another:

  • The assistant.snowboy integration is gone because unfortunately Snowboy is gone. For a while you could still run the Snowboy code with models that either you had previously downloaded from their website or trained yourself, but my latest experience proved to be quite unfruitful - it's been more than 4 years since the last commit on Snowboy, and it's hard to get the code to even run.

  • The assistant.alexa integration is also gone, as Amazon has stopped maintaining the AVS SDK. And I have literally no clue of what Amazon's plans with the development of Alexa skills are (if there are any plans at all).

  • The stt.deepspeech integration is also gone: the project hasn't seen a commit in 3 years and I even struggled to get the latest code to run. Given the current financial situation at Mozilla, and the fact that they're trying to cut as much as possible on what they don't consider part of their core product, it's very unlikely that DeepSpeech will be revived any time soon.

  • The assistant.google integration is still there, but I can't make promises on how long it can be maintained. It uses the google-assistant-library, which was deprecated in 2019. Google replaced it with the conversational actions, which were also deprecated last year. <rant>Put here your joke about Google building products with the shelf life of a summer hit.</rant>

  • The tts.mimic3 integration, a text-to-speech model based on mimic3, part of the Mycroft initiative, is still there, but only because it's still possible to spin up a Docker image that runs mimic3. The whole Mycroft project, however, is now defunct, and the story of how it went bankrupt is a very sad story about the power that patent trolls have over startups. The Mycroft initiative however seems to have been picked up by the community, and something seems to be moving in the space of fully open source and on-device voice models. I'll definitely be looking with interest at what happens in that space, but the project still seems a bit too immature to justify an investment into a new Platypush integration.

But not all hope is lost

assistant.google

assistant.google may be relying on a dead library, but it's not dead (yet). The code still works, but you're a bit constrained on the hardware side - the assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3 and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other ARMv7-compatible devices has proved to be a challenge in some cases. Given the state of the library, it's safe to say that it'll never be supported on other platforms, but if you want to run your assistant on a device that is still supported then it should still work fine.

I had however to do a few dirty packaging tricks to ensure that the assistant library code doesn't break badly on newer versions of Python. That code hasn't been touched in 5 years and it's starting to rot. It depends on ancient and deprecated Python libraries like enum34 and it needs some hammering to work - without breaking the whole Python environment in the process.

For now, pip install 'platypush[assistant.google]' should do all the dirty work and get all of your assistant dependencies installed. But I can't promise I can maintain that code forever.

assistant.picovoice

Picovoice has been a nice surprise in an industry niche where all the products that were available just 4 years ago are now dead.

I described some of their products in my previous articles, and I even built a couple of stt.picovoice.* plugins for Platypush back in the day, but I didn't really put much effort into it.

Their business model seemed a bit weird - along the lines of "you can test our products on x86_64, if you need an ARM build you should contact us as a business partner". And the quality of their products was also a bit disappointing compared to other mainstream offerings.

I'm glad to see that the situation has changed quite a bit now. They still have a "sign up with a business email" model, but at least now you can just sign up on their website and start using their products rather than sending emails around. And I'm also quite impressed to see the progress on their website. You can now train hotword models, customize speech-to-text models and build your own intent rules directly from their website - a feature that was also available in the beloved Snowboy and that went missing from any major product offerings out there after Snowboy was gone. I feel like the quality of their models has also greatly improved compared to the last time I checked them - predictions are still slower than the Google Assistant, definitely less accurate with non-native accents, but the gap with the Google Assistant when it comes to native accents isn't very wide.

assistant.openai

OpenAI has filled many gaps left by all the casualties in the voice assistants market. Platypush now provides a new assistant.openai plugin that stitches together several of their APIs to provide a voice assistant experience that honestly feels much more natural than anything I've tried in all these years.

Let's explore how to use these integrations to build our on-device voice assistant with custom rules.

Feature comparison

As some of you may know, voice assistants often aren't monolithic products. Unless explicitly designed as all-in-one packages (like the google-assistant-library), voice assistant integrations in Platypush are usually built on top of four distinct APIs:

  1. Hotword detection: This is the component that continuously listens on your microphone until you speak "Ok Google", "Alexa" or any other wake-up word used to start a conversation. Since it's a continuously listening component that needs to make decisions fast, and it only has to recognize one word (or in a few cases 3-4 more at most), it usually doesn't need to run on a full language model. It needs small models, often a couple of MB in size at most.

  2. Speech-to-text (STT): This is the component that will capture audio from the microphone and use some API to transcribe it to text.

  3. Response engine: Once you have the transcription of what the user said, you need to feed it to some model that will generate some human-like response for the question.

  4. Text-to-speech (TTS): Once you have your AI response rendered as a text string, you need a text-to-speech model to speak it out loud on your speakers or headphones.
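To make the separation between the four blocks more concrete, here is a minimal sketch of how they chain together. It is not how Platypush implements its assistants: the `VoiceAssistant` class and the stub stages are hypothetical names made up for illustration, and real engines would work on audio streams rather than on byte strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceAssistant:
    # Each stage is a plain callable, so any of them can be swapped out
    detect_hotword: Callable[[bytes], Optional[str]]  # audio -> hotword or None
    transcribe: Callable[[bytes], str]                # audio -> text (STT)
    respond: Callable[[str], str]                     # text -> reply text
    speak: Callable[[str], bytes]                     # reply text -> audio (TTS)

    def handle(self, hotword_audio: bytes, speech_audio: bytes) -> Optional[bytes]:
        # 1. Do nothing until the wake word is heard
        if self.detect_hotword(hotword_audio) is None:
            return None
        # 2-4. Transcribe the request, generate a reply, render it as audio
        text = self.transcribe(speech_audio)
        reply = self.respond(text)
        return self.speak(reply)


# Stub stages standing in for real engines (hypothetical, for illustration)
assistant = VoiceAssistant(
    detect_hotword=lambda audio: "computer" if audio == b"computer" else None,
    transcribe=lambda audio: audio.decode(),
    respond=lambda text: f"You said: {text}",
    speak=lambda text: text.encode(),
)
```

Since each stage is just a callable, swapping one engine for another (say, a different TTS) doesn't affect the rest of the chain, which is the same idea behind mixing plugins from different providers.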

On top of these basic building blocks for a voice assistant, some integrations may also provide two extra features.

Speech-to-intent

In this mode, the user's prompt, instead of being transcribed directly to text, is transcribed into a structured intent that can be more easily processed by a downstream integration with no need for extra text parsing, regular expressions etc.

For instance, a voice command like "turn off the bedroom lights" could be translated into an intent such as:

{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}
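As a rough sketch of why a structured intent is easier to consume than raw text, the snippet below dispatches the payload above to a handler. The `HANDLERS` registry and the `lights_ctrl` function are hypothetical names used only for illustration, not part of any Platypush API.

```python
import json

# The intent payload from the example above
payload = json.loads("""
{
  "intent": "lights_ctrl",
  "slots": {"state": "off", "lights": "bedroom"}
}
""")


def lights_ctrl(state: str, lights: str) -> str:
    # A real handler would call a lights integration here
    return f"{lights} lights -> {state}"


# Hypothetical registry mapping intent names to handlers
HANDLERS = {"lights_ctrl": lights_ctrl}


def dispatch(intent: dict) -> str:
    handler = HANDLERS[intent["intent"]]
    # Slots map directly onto keyword arguments: no text parsing needed
    return handler(**intent["slots"])


result = dispatch(payload)
```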

Offline speech-to-text

a.k.a. offline text transcriptions. Some assistant integrations may offer you the ability to pass an audio file and transcribe its content as text.

Features summary

This table summarizes how the assistant integrations available in Platypush compare when it comes to what I would call the foundational blocks:

Plugin               Hotword  STT  AI responses  TTS
assistant.google     ✅       ✅   ✅            ✅
assistant.openai     ❌       ✅   ✅            ✅
assistant.picovoice  ✅       ✅   ❌            ✅

And this is how they compare in terms of extra features:

Plugin Intents Offline STT
assistant.google
assistant.openai
assistant.picovoice

Let's see a few configuration examples to better understand the pros and cons of each of these integrations.

Configuration

Hardware requirements

  1. A computer, a Raspberry Pi, an old tablet, or anything in between, as long as it can run Python. At least 1GB of RAM is advised for a smooth audio processing experience.

  2. A microphone.

  3. Speaker/headphones.

Installation notes

Platypush 1.0.0 has recently been released, and it comes with new installation procedures.

There's now official support for several package managers, a better Docker installation process, and more powerful ways to install plugins - via pip extras, Web interface, Docker and virtual environments.

The optional dependencies for any Platypush plugins can be installed via pip extras in the simplest case:

$ pip install 'platypush[plugin1,plugin2,...]'

For example, if you want to install Platypush with the dependencies for assistant.openai and assistant.picovoice:

$ pip install 'platypush[assistant.openai,assistant.picovoice]'

Some plugins however may require extra system dependencies that are not available via pip - for instance, both the OpenAI and Picovoice integrations require the ffmpeg binary to be installed, as it is used for audio conversion and exporting purposes. You can check the plugins documentation for any system dependencies required by some integrations, or install them automatically through the Web interface or the platydock command for Docker containers.

A note on the hooks

All the custom actions in this article are built through event hooks triggered by SpeechRecognizedEvent (or IntentRecognizedEvent for intents). When an intent event is triggered, or a speech event with a condition on a phrase, the assistant integrations in Platypush will prevent the default assistant response. That's to avoid cases where e.g. you say "turn off the lights", your hook takes care of running the actual action, while your voice assistant fetches a response from Google or ChatGPT along the lines of "sorry, I can't control your lights".

If you want to render a custom response from an event hook, you can do so by calling event.assistant.render_response(text), and it will be spoken using the available text-to-speech integration.

If you want to disable this behaviour, and you want the default assistant response to always be rendered, even if it matches a hook with a phrase or an intent, you can do so by setting the stop_conversation_on_speech_match parameter to false in your assistant plugin configuration.

Text-to-speech

Each of the available assistant plugins has its own default tts plugin associated:

  • assistant.google: tts, but tts.google is also available. The difference is that tts uses the (unofficial) Google Translate frontend API - it requires no extra configuration, but besides setting the input language it isn't very configurable. tts.google on the other hand uses the Google Cloud Text-to-Speech API. It is much more versatile, but it requires an extra API registered to your Google project and an extra credentials file.

  • assistant.openai: tts.openai, which leverages the OpenAI text-to-speech API.

  • assistant.picovoice: tts.picovoice, which uses the (still experimental, at the time of writing) Picovoice Orca engine.

Any text rendered via assistant.*.render_response will be rendered using the associated TTS plugin. You can however customize it by setting tts_plugin on your assistant plugin configuration - e.g. you can render responses from the OpenAI assistant through the Google or Picovoice engine, or the other way around.

tts plugins also expose a say action that can be called outside of an assistant context to render custom text at runtime - for example, from other event hooks, procedures, cronjobs or API calls. For example:

$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
  }
}
' http://localhost:8008/execute
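The same request can be prepared from Python with just the standard library. This is a minimal sketch that assumes the same endpoint and payload as the curl example. The `build_execute_request` helper is a made-up name, and the explicit JSON Content-Type header is my own assumption rather than something the curl call sets.

```python
import json
import urllib.request


def build_execute_request(action: str, token: str, **args) -> urllib.request.Request:
    # Same payload shape as the curl example
    body = {"type": "request", "action": action, "args": args}
    return urllib.request.Request(
        "http://localhost:8008/execute",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the payload is declared explicitly as JSON
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_execute_request(
    "tts.openai.say", "YOUR_TOKEN", text="What a wonderful day!"
)

# To actually send it (requires a running Platypush instance):
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode())
```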

assistant.google

This is the oldest voice integration in Platypush - and one of the use-cases that actually motivated me into forking the previous project into what is now Platypush.

As mentioned in the previous section, this integration is built on top of a deprecated library (with no available alternatives) that just so happens to still work with a bit of hammering on x86_64 and Raspberry Pi 3/4.

Personally it's the voice assistant I still use on most of my devices, but it's definitely not guaranteed that it will keep working in the future.

Once you have installed Platypush with the dependencies for this integration, you can configure it through these steps:

  1. Create a new project on the Google developers console and generate a new set of credentials for it. Download the credentials secrets as JSON.
  2. Generate scoped credentials from your secrets.json.
  3. Configure the integration in your config.yaml for Platypush (see the configuration page for more details):
assistant.google:
  # Default: ~/.config/google-oauthlib-tool/credentials.json
  # or <PLATYPUSH_WORKDIR>/credentials/google/assistant.json
  credentials_file: /path/to/credentials.json
  # Default: no sound is played when "Ok Google" is detected
  conversation_start_sound: /path/to/sound.mp3

Restart the service, say "Ok Google" or "Hey Google" while the microphone is active, and everything should work out of the box.

You can now start creating event hooks to execute your custom voice commands. For example, if you configured a lights plugin (e.g. light.hue) and a music plugin (e.g. music.mopidy), you can start building voice commands like these:

from platypush import run, when
from platypush.events.assistant import (
  ConversationStartEvent, SpeechRecognizedEvent
)

light_plugin = "light.hue"
music_plugin = "music.mopidy"

@when(ConversationStartEvent)
def pause_music_when_conversation_starts():
  run(f"{music_plugin}.pause_if_playing")

# Note: (limited) support for regular expressions on `phrase`
# This hook will match any phrase containing either "turn on the lights"
# or "turn on lights"
@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def lights_on_command():
  run(f"{light_plugin}.on")
  # Or, with arguments:
  # run(f"{light_plugin}.on", groups=["Bedroom"])

@when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
def lights_off_command():
  run(f"{light_plugin}.off")

@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music_command():
  run(f"{music_plugin}.play")

@when(SpeechRecognizedEvent, phrase="stop (the)? music")
def stop_music_command():
  run(f"{music_plugin}.stop")

Or, via YAML:

# Add to your config.yaml, or to one of the files included in it

event.hook.pause_music_when_conversation_starts:
  if:
    type: platypush.message.event.assistant.ConversationStartEvent

  then:
    - action: music.mopidy.pause_if_playing

event.hook.lights_on_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn on (the)? lights"

  then:
    - action: light.hue.on
    # args:
    #   groups:
    #     - Bedroom

event.hook.lights_off_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn off (the)? lights"

  then:
    - action: light.hue.off

event.hook.play_music_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "play (the)? music"

  then:
    - action: music.mopidy.play

event.hook.stop_music_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "stop (the)? music"

  then:
    - action: music.mopidy.stop

Parameters are also supported on the phrase event argument through the ${} template construct. For example:

from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def on_play_track_command(
    event: SpeechRecognizedEvent, title: str, artist: str
):
    results = run(
        "music.mopidy.search",
        filter={"title": title, "artist": artist}
    )

    if not results:
        event.assistant.render_response(f"Couldn't find {title} by {artist}")
        return

    run("music.mopidy.play", resource=results[0]["uri"])
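If you are curious about how a template like the one above can end up as named arguments on a hook, here is a rough, self-contained sketch of the idea. It is not the Platypush implementation: `template_to_regex` and `extract` are hypothetical helpers, and for simplicity the literal parts of the template are escaped, so regex constructs like (the)? are not supported here.

```python
import re
from typing import Optional


def template_to_regex(template: str) -> re.Pattern:
    # Split on ${name} placeholders; odd indices hold the placeholder names
    parts = re.split(r"\$\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2:
            pattern += f"(?P<{part}>.+?)"  # named, non-greedy capture
        else:
            pattern += re.escape(part)     # literal text
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def extract(template: str, phrase: str) -> Optional[dict]:
    # Return the captured parameters, or None if the phrase doesn't match
    match = template_to_regex(template).match(phrase)
    return match.groupdict() if match else None


args = extract("play ${title} by ${artist}", "play Heroes by David Bowie")
```

The resulting dictionary can then be unpacked as keyword arguments, which mirrors the way title and artist reach the hook above.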

Pros

  • 👍 Very fast and robust API.
  • 👍 Easy to install and configure.
  • 👍 It comes with almost all the features of a voice assistant installed on Google hardware - except some actions native to Android-based devices and video/display features. This means that features such as timers, alarms, weather forecast, setting the volume or controlling Chromecasts on the same network are all supported out of the box.
  • 👍 It connects to your Google account (can be configured from your Google settings), so things like location-based suggestions and calendar events are available. Support for custom actions and devices configured in your Google Home app is also available out of the box, although I haven't tested it in a while.
  • 👍 Good multi-language support. In most of the cases the assistant seems quite capable of understanding questions in multiple languages and responding in the input language without any further configuration.

Cons

  • 👎 Based on a deprecated API that could break at any moment.
  • 👎 Limited hardware support (only x86_64 and RPi 3/4).
  • 👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
  • 👎 Not possible to configure the output voice - it can only use the stock Google Assistant voice.
  • 👎 No support for intents - something similar was available (albeit tricky to configure) through the Actions SDK, but that has also been abandoned by Google.
  • 👎 Not very modular. Both assistant.picovoice and assistant.openai have been built by stitching together different independent APIs. Those plugins are therefore quite modular. You can choose for instance to run only the hotword engine of assistant.picovoice, which in turn will trigger the conversation engine of assistant.openai, and maybe use tts.google to render the responses. By contrast, given the relatively monolithic nature of google-assistant-library, which runs the whole service locally, if your instance runs assistant.google then it can't run other assistant plugins.

assistant.picovoice

The assistant.picovoice integration is available from Platypush 1.0.0.

Previous versions had some outdated stt.picovoice.* plugins for the individual products, but they weren't properly tested and they weren't combined together into a single integration that implements the Platypush assistant API.

This integration is built on top of the voice products developed by Picovoice. These include:

  • Porcupine: a fast and customizable engine for hotword/wake-word detection. It can be enabled by setting hotword_enabled to true in the assistant.picovoice plugin configuration.

  • Cheetah: a speech-to-text engine optimized for real-time transcriptions. It can be enabled by setting stt_enabled to true in the assistant.picovoice plugin configuration.

  • Leopard: a speech-to-text engine optimized for offline transcriptions of audio files.

  • Rhino: a speech-to-intent engine.

  • Orca: a text-to-speech engine.

You can get your personal access key by signing up at the Picovoice console. You may be asked to submit a reason for using the service (feel free to mention a personal Platypush integration), and you will receive your personal access key.

If prompted to select the products you want to use, make sure to select the ones from the Picovoice suite that you want to use with the assistant.picovoice plugin.

A basic plugin configuration would look like this:

assistant.picovoice:
  access_key: YOUR_ACCESS_KEY

  # Keywords that the assistant should listen for
  keywords:
    - alexa
    - computer
    - ok google

  # Paths to custom keyword files
  # keyword_paths:
  #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

  # Enable/disable the hotword engine
  hotword_enabled: true
  # Enable the STT engine
  stt_enabled: true

  # conversation_start_sound: ...

  # Path to a custom model to be used for speech-to-text
  # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

  # Path to an intent model. At least one custom intent model is required if
  # you want to enable intent detection.
  # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn

Hotword detection

If enabled through the hotword_enabled parameter (default: True), the assistant will listen for a specific wake word before starting the speech-to-text or intent recognition engines. You can specify custom models for your hotword (e.g. on the same device you may use "Alexa" to trigger the speech-to-text engine in English, "Computer" to trigger the speech-to-text engine in Italian, and "Ok Google" to trigger the intent recognition engine).

You can also create your custom hotword models using the Porcupine console.

If hotword_enabled is set to True, you must also specify the keywords parameter with the list of keywords that you want to listen for, and optionally the keyword_paths parameter with the paths to any custom hotword models that you want to use. If hotword_enabled is set to False, then the assistant won't start listening for speech after the plugin is started, and you will need to programmatically start the conversation by calling the assistant.picovoice.start_conversation action.

When a wake-word is detected, the assistant will emit a HotwordDetectedEvent that you can use to build your custom logic.

By default, the assistant will start listening for speech after the hotword if either stt_enabled or intent_model_path are set. If you don't want the assistant to start listening for speech after the hotword is detected (for example because you want to build your custom response flows, or trigger the speech detection using different models depending on the hotword that is used, or because you just want to detect hotwords but not speech), then you can also set the start_conversation_on_hotword parameter to false. If that is the case, then you can programmatically start the conversation by calling the assistant.picovoice.start_conversation method in your event hooks:

from platypush import when, run
from platypush.message.event.assistant import HotwordDetectedEvent

# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='path/to/it.pv')

Speech-to-text

If you want to build your custom STT hooks, the approach is the same seen for the assistant.google plugin - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.

Speech-to-intent

Intents are structured actions parsed from unstructured human-readable text.

Unlike with hotword and speech-to-text detection, you need to provide a custom model for intent detection. You can create your custom model using the Rhino console.

When an intent is detected, the assistant will emit an IntentRecognizedEvent and you can build your custom hooks on it.

For example, you can build a model to control groups of smart lights by defining the following slots on the Rhino console:

  • device_state: The new state of the device (e.g. with on or off as supported values)

  • room: The name of the room associated to the group of lights to be controlled (e.g. living room, kitchen, bedroom)

You can then define a lights_ctrl intent with the following expressions:

  • "turn $device_state:state the lights"
  • "turn $device_state:state the $room:room lights"
  • "turn the lights $device_state:state"
  • "turn the $room:room lights $device_state:state"
  • "turn $room:room lights $device_state:state"

This intent will match any of the following phrases:

  • "turn on the lights"
  • "turn off the lights"
  • "turn the lights on"
  • "turn the lights off"
  • "turn on the living room lights"
  • "turn off the living room lights"
  • "turn the living room lights on"
  • "turn the living room lights off"

And it will extract any slots that are matched in the phrases in the IntentRecognizedEvent.
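To get a feeling for which phrases a set of expressions covers, and which slots come out of them, you can mimic the matching on plain text. Keep in mind that this is only a text-level approximation: Rhino works directly on audio and doesn't use regular expressions, and the `compile_expression` and `parse` helpers below are hypothetical names.

```python
import re
from typing import Optional

# Slot types and their values, as defined in the text above
SLOT_VALUES = {
    "device_state": ["on", "off"],
    "room": ["living room", "kitchen", "bedroom"],
}

EXPRESSIONS = [
    "turn $device_state:state the lights",
    "turn $device_state:state the $room:room lights",
    "turn the lights $device_state:state",
    "turn the $room:room lights $device_state:state",
    "turn $room:room lights $device_state:state",
]


def compile_expression(expression: str) -> re.Pattern:
    def slot(match: re.Match) -> str:
        # $slot_type:slot_name -> named group over the allowed values
        slot_type, name = match.group(1), match.group(2)
        values = "|".join(re.escape(v) for v in SLOT_VALUES[slot_type])
        return f"(?P<{name}>{values})"

    return re.compile("^" + re.sub(r"\$(\w+):(\w+)", slot, expression) + "$")


PATTERNS = [compile_expression(e) for e in EXPRESSIONS]


def parse(phrase: str) -> Optional[dict]:
    # Return the first matching expression as a structured intent
    for pattern in PATTERNS:
        match = pattern.match(phrase)
        if match:
            return {"intent": "lights_ctrl", "slots": match.groupdict()}
    return None
```

For example, parse("turn off the living room lights") yields the lights_ctrl intent with state set to off and room set to living room, while an unrelated phrase yields None.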

Train the model, download the context file, and pass the path on the intent_model_path parameter.

You can then register a hook to listen to a specific intent:

from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent

@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
    room = event.slots.get('room')
    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")

Note that if both stt_enabled and intent_model_path are set, then both the speech-to-text and intent recognition engines will run in parallel when a conversation is started.

The intent engine is usually faster, as it has a smaller set of intents to match and doesn't have to run a full speech-to-text transcription. This means that, if an utterance matches both a speech-to-text phrase and an intent, the IntentRecognizedEvent event is emitted (and not SpeechRecognizedEvent).

This may not always be the case though. So, if you want to use the intent detection engine together with the speech detection, it may be a good practice to also provide a fallback SpeechRecognizedEvent hook to catch the text if the speech is not recognized as an intent:

from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_lights_command(event: SpeechRecognizedEvent, state=None, room=None, **context):
    # The template matches both "on" and "off", so pick the action from `state`
    if state not in ("on", "off"):
        return

    action = f"light.hue.{state}"
    if room:
        run(action, groups=[room])
    else:
        run(action)

Text-to-speech and response management

The text-to-speech engine, based on Orca, is provided by the tts.picovoice plugin.

However, the Picovoice integration won't provide you with automatic AI-generated responses for your queries. That's because Picovoice doesn't seem to offer (yet) any products for conversational assistants, either voice-based or text-based.

You can however leverage the render_response action to render some text as speech in response to a user command, and that in turn will leverage the Picovoice TTS plugin to render the response.

For example, the following snippet provides a hook that:

  • Listens for SpeechRecognizedEvent.

  • Matches the phrase against a list of predefined commands that shouldn't require an AI-generated response.

  • Has a fallback logic that leverages openai.get_response to generate a response through a ChatGPT model and render it as audio.

Also, note that any text rendered over the render_response action that ends with a question mark will automatically trigger a follow-up - i.e. the assistant will wait for the user to answer its question.

import re

from platypush import run, when
from platypush.message.event.assistant import SpeechRecognizedEvent

# Every command accepts the event, so they can all be called the same way
def play_music(event: SpeechRecognizedEvent, **_):
    run("music.mopidy.play")

def stop_music(event: SpeechRecognizedEvent, **_):
    run("music.mopidy.stop")

def ai_assist(event: SpeechRecognizedEvent, **_):
    response = run("openai.get_response", prompt=event.phrase)
    if not response:
        return

    run("assistant.picovoice.render_response", text=response)

# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
    (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
    # ...
    # Fallback to the AI assistant
    (re.compile(r".*"), ai_assist),
)

@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
    # Run the first command whose pattern matches the recognized phrase
    for pattern, command in hooks:
        if pattern.search(event.phrase):
            run("logger.info", msg=f"Running voice command: {command.__name__}")
            command(event, **kwargs)
            break

Offline speech-to-text

An assistant.picovoice.transcribe action is provided for offline transcriptions of audio files, using the Leopard models.

You can easily call it from your procedures, hooks or through the API:

$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "assistant.picovoice.transcribe",
  "args": {
    "audio_file": "/path/to/some/speech.mp3"
  }
}' http://localhost:8008/execute

{
  "transcription": "This is a test",
  "words": [
    {
      "word": "this",
      "start": 0.06400000303983688,
      "end": 0.19200000166893005,
      "confidence": 0.9626294374465942
    },
    {
      "word": "is",
      "start": 0.2879999876022339,
      "end": 0.35199999809265137,
      "confidence": 0.9781675934791565
    },
    {
      "word": "a",
      "start": 0.41600000858306885,
      "end": 0.41600000858306885,
      "confidence": 0.9764975309371948
    },
    {
      "word": "test",
      "start": 0.5120000243186951,
      "end": 0.8320000171661377,
      "confidence": 0.9511580467224121
    }
  ]
}
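The per-word timestamps and confidence scores make the output easy to post-process. Here is a small sketch that works on a response shaped like the one above. The values are rounded for readability, and `low_confidence_words` and `speech_duration` are hypothetical helpers, not part of the plugin.

```python
# Abridged response from the example above (values rounded)
response = {
    "transcription": "This is a test",
    "words": [
        {"word": "this", "start": 0.064, "end": 0.192, "confidence": 0.9626},
        {"word": "is", "start": 0.288, "end": 0.352, "confidence": 0.9782},
        {"word": "a", "start": 0.416, "end": 0.416, "confidence": 0.9765},
        {"word": "test", "start": 0.512, "end": 0.832, "confidence": 0.9512},
    ],
}


def low_confidence_words(result: dict, threshold: float) -> list:
    # Words that may be worth double-checking in the transcription
    return [w["word"] for w in result["words"] if w["confidence"] < threshold]


def speech_duration(result: dict) -> float:
    # Seconds between the start of the first word and the end of the last one
    words = result["words"]
    return words[-1]["end"] - words[0]["start"] if words else 0.0
```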

Pros

  • 👍 The Picovoice integration is extremely configurable. assistant.picovoice stitches together five independent products developed by a small company specialized in voice products for developers. As such, Picovoice may be the best option if you have custom use-cases. You can pick which features you need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you have plenty of flexibility in building your integrations.

  • 👍 Runs (or seems to run) (mostly) on device. This is something that we can't say about the other two integrations discussed in this article. If keeping your voice interactions 100% hidden from Google's or Microsoft's eyes is a priority, then Picovoice may be your best bet.

  • 👍 Rich features. It uses different models for different purposes - for example, Cheetah models are optimized for real-time speech detection, while Leopard is optimized for offline transcription. Moreover, Picovoice is the only integration among those analyzed in this article to support speech-to-intent.

  • 👍 It's very easy to build new models or customize existing ones. Picovoice has a powerful developers console that allows you to easily create hotword models, tweak the priority of some words in voice models, and create custom intent models.

Cons

  • 👎 The business model is still a bit weird. It's better than the earlier "write us an email with your business case and we'll get back to you", but it still requires you to sign up with a business email and write a couple of lines on what you want to build with their products. It feels like their focus is on a B2B approach rather than "open up and let the community build stuff", and that seems to create unnecessary friction.

  • 👎 No native conversational features. At the time of writing, Picovoice doesn't offer products that generate AI responses given voice or text prompts. This means that, if you want AI-generated responses to your queries, you'll have to make requests to e.g. openai.get_response(prompt) directly in your hooks for SpeechRecognizedEvent, and render the responses through assistant.picovoice.render_response. This makes assistant.picovoice alone better suited to cases where you mostly want to create voice command hooks rather than have general-purpose conversations.

  • 👎 Speech-to-text, at least on my machine, is slower than the other two integrations, and the accuracy with non-native accents is also much lower.

  • 👎 Limited support for any languages other than English. At the time of writing hotword detection with Porcupine seems to be in relatively good shape, with support for 16 languages. However, both speech-to-text and text-to-speech only support English at the moment.

  • 👎 Some APIs are still quite unstable. The Orca text-to-speech API, for example, doesn't even support text that includes digits or some punctuation characters - at least not at the time of writing. The Platypush integration fills the gap with workarounds that e.g. replace numbers with words and replace punctuation characters, but you definitely get the feeling that some parts of their products are still a work in progress.
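
To give an idea of what such a workaround involves, here is a minimal sketch of a text normalizer that spells out integers and drops punctuation before the text reaches a TTS engine. This is not the code used by the Platypush integration: the function names are hypothetical, the set of "safe" punctuation characters is an assumption, and the number speller is only meant for values below one million.

```python
import re

ONES = (
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen"
).split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer (intended for values below 1,000,000)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("" if rest == 0 else " " + ONES[rest])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        suffix = "" if rest == 0 else " " + number_to_words(rest)
        return ONES[hundreds] + " hundred" + suffix
    thousands, rest = divmod(n, 1000)
    suffix = "" if rest == 0 else " " + number_to_words(rest)
    return number_to_words(thousands) + " thousand" + suffix


def normalize_for_tts(text: str) -> str:
    """Replace digit runs with words and strip characters outside a safe set."""
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # Assumption: letters, whitespace and . , ? ! ' are safe to pass through.
    text = re.sub(r"[^\w\s.,?!']", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, normalize_for_tts("It's 21 degrees (max 25%)") would produce "It's twenty one degrees max twenty five".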

assistant.openai

This integration has been released in Platypush 1.0.7.

It uses the following OpenAI APIs:

  • /audio/transcriptions for speech-to-text. At the time of writing the default model is whisper-1. It can be configured through the model setting on the assistant.openai plugin configuration. See the OpenAI documentation for a list of available models.
  • /chat/completions to get AI-generated responses using a GPT model. At the time of writing the default is gpt-3.5-turbo, but it can be configured through the model setting on the openai plugin configuration. See the OpenAI documentation for a list of supported models.
  • /audio/speech for text-to-speech. At the time of writing the default model is tts-1 and the default voice is nova. They can be configured through the model and voice settings respectively on the tts.openai plugin. See the OpenAI documentation for a list of available models and voices.

You will need an OpenAI API key associated with your account.

A basic configuration would look like this:

openai:
  api_key: YOUR_OPENAI_API_KEY  # Required
  # conversation_start_sound: ...
  # model: ...
  # context: ...
  # context_expiry: ...
  # max_tokens: ...

assistant.openai:
  # model: ...
  # tts_plugin: some.other.tts.plugin

tts.openai:
  # model: ...
  # voice: ...

If you want to build your custom hooks on speech events, the approach is the same seen for the other assistant plugins - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.

Hotword support

OpenAI doesn't provide an API for hotword detection, nor a small model for offline detection.

This means that, if no other assistant plugins with stand-alone hotword support are configured (only assistant.picovoice for now), a conversation can only be triggered by calling the assistant.openai.start_conversation action.

If you want hotword support, then the best bet is to add assistant.picovoice to your configuration too - but make sure to only enable hotword detection and not speech detection, which will be delegated to assistant.openai via event hook:

assistant.picovoice:
  access_key: ...
  keywords:
    - computer

  hotword_enabled: true
  stt_enabled: false
  # conversation_start_sound: ...

Then create a hook that listens for HotwordDetectedEvent and calls assistant.openai.start_conversation:

from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent

@when(HotwordDetectedEvent, hotword="computer")
def on_hotword_detected():
  run("assistant.openai.start_conversation")

Conversation contexts

The most powerful feature offered by the OpenAI assistant is the fact that it leverages the conversation contexts provided by the OpenAI API.

This means two things:

  1. Your assistant can be initialized/tuned with a static context. You can provide some initialization context that tunes how the assistant behaves (e.g. what kind of tone, language or approach it uses when generating responses), and you can seed the assistant with predefined knowledge in the form of hypothetical past conversations. Example:
openai:
   # ...

   context:
       # `system` can be used to initialize the context for the expected tone
       # and language in the assistant responses
       - role: system
         content: >
             You are a voice assistant that responds to user queries using
             references to Lovecraftian lore.

       # `user`/`assistant` interactions can be used to initialize the
       # conversation context with previous knowledge. `user` is used to
       # emulate previous user questions, and `assistant` models the
       # expected response.
       - role: user
         content: What is a telephone?
       - role: assistant
         content: >
             A Cthulhuian device that allows you to communicate with
             otherworldly beings. It is said that the first telephone was
             created by the Great Old Ones themselves, and that it is a
             gateway to the void beyond the stars.
 If you now start Platypush and ask a question like "*how does it work?*",
 the voice assistant may give a response along the lines of:

 ```
 The telephone functions by harnessing the eldritch energies of the cosmos to
 transmit vibrations through the ether, allowing communication across vast
 distances with entities from beyond the veil. Its operation is shrouded in
 mystery, for it relies on arcane principles incomprehensible to mortal
 minds.
 ```

 Note that:

 1. The style of the response is consistent with that initialized in the
    `context` through `system` roles.

 2. Even though a question like "*how does it work?*" is not very specific,
    the assistant treats the `user`/`assistant` entries given in the context
    as if they were the latest conversation prompts. Thus it realizes that
    "*it*", in this context, probably means "*the telephone*".
  2. The assistant has a runtime context. It remembers recent conversations for a given amount of time (configurable through the context_expiry setting on the openai plugin configuration). So, even without explicit context initialization in the openai plugin, the plugin will remember the last interactions for (by default) 10 minutes. If you ask "who wrote the Divine Comedy?", and a few seconds later you ask "where was its writer from?", you may get a response like "Florence, Italy" - i.e. the assistant realizes that "the writer" in this context is likely to mean "the writer of the work that I was asked about in the previous interaction", and returns pertinent information.
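
The two kinds of context can be pictured as one message list that is rebuilt on every request. The sketch below is only an illustration of that idea, not the plugin's actual implementation: ConversationContext and its methods are hypothetical names, and the 600-second default simply mirrors the 10 minutes mentioned above.

```python
import time
from collections import deque


class ConversationContext:
    """Static context plus recent exchanges that expire after `expiry` seconds."""

    def __init__(self, static_context=(), expiry=600.0, clock=time.monotonic):
        self.static_context = list(static_context)
        self.expiry = expiry
        self.clock = clock
        self._history = deque()  # (timestamp, message) pairs, oldest first

    def _prune(self):
        # Drop the exchanges that are older than the expiry window.
        cutoff = self.clock() - self.expiry
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()

    def messages(self, prompt):
        """Build the message list to send along with a new user prompt."""
        self._prune()
        recent = [message for _, message in self._history]
        return [*self.static_context, *recent, {"role": "user", "content": prompt}]

    def record(self, prompt, response):
        """Remember a completed exchange so that follow-ups can refer to it."""
        now = self.clock()
        self._history.append((now, {"role": "user", "content": prompt}))
        self._history.append((now, {"role": "assistant", "content": response}))
```

With this structure, a follow-up asked within the expiry window is sent together with the previous question and answer, while one asked later only carries the static context.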

Pros

  • 👍 Speech detection quality. The OpenAI speech-to-text features are the best among the available assistant integrations. The transcribe API so far has detected my non-native English accent correctly nearly 100% of the time (Google comes close to 90%, while Picovoice trails well behind). And it even detects the speech of my young kid - something that the Google Assistant library has always failed to do right.

  • 👍 Text-to-speech quality. The voice models used by OpenAI sound much more natural and human than those of both Google and Picovoice. Google's and Picovoice's TTS models are actually already quite solid, but OpenAI outclasses them when it comes to voice modulation, inflections and sentiment. The result sounds intimidatingly realistic.

  • 👍 AI responses quality. While the scope of the Google Assistant is somewhat limited by what people expected from voice assistants until a few years ago (control some devices and gadgets, find my phone, tell me the news/weather, do basic Google searches...), usually without much room for follow-ups, assistant.openai will basically render voice responses as if you were typing them directly to ChatGPT. While Google would often respond with a "sorry, I don't understand", or "sorry, I can't help with that", the OpenAI assistant is more likely to expose its reasoning, ask follow-up questions to refine its understanding, and in general create a much more realistic conversation.

  • 👍 Contexts. They are an extremely powerful way to initialize your assistant and customize it to speak the way you want, and know the kind of things that you want it to know. Cross-conversation contexts with configurable expiry also make it more natural to ask something, get an answer, and then ask another question about the same topic a few seconds later, without having to reintroduce the assistant to the whole context.

  • 👍 Offline transcriptions available through the openai.transcribe action.

  • 👍 Multi-language support seems to work great out of the box. Ask something to the assistant in any language, and it'll give you a response in that language.

  • 👍 Configurable voices and models.

Cons

  • 👎 The full pack of features is only available if you have an API key associated with a paid OpenAI account.

  • 👎 No hotword support. It relies on assistant.picovoice for hotword detection.

  • 👎 No intents support.

  • 👎 No native support for weather forecast, alarms, timers, integrations with other services/devices nor other features available out of the box with the Google Assistant. You can always create hooks for them though.

Weather forecast example

Both the OpenAI and Picovoice integrations lack some features available out of the box on the Google Assistant - weather forecast, news playback, timers etc. - as they rely on voice-only APIs that by default don't connect to other services.

However Platypush provides many plugins to fill those gaps, and those features can be implemented with custom event hooks.

Let's see for example how to build a simple hook that delivers the weather forecast for the next 24 hours whenever the assistant gets a phrase that contains the "weather today" string.

You'll need to enable a weather plugin in Platypush - weather.openweathermap will be used in this example. Configuration:

weather.openweathermap:
  token: OPENWEATHERMAP_API_KEY
  location: London,GB

Then drop a script named e.g. weather.py in the Platypush scripts directory (default: <CONFDIR>/scripts) with the following content:

from datetime import datetime
from textwrap import dedent
from time import time

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='weather today')
def weather_forecast(event: SpeechRecognizedEvent):
    limit = time() + 24 * 60 * 60  # 24 hours from now
    forecast = [
        weather
        for weather in run("weather.openweathermap.get_forecast")
        if datetime.fromisoformat(weather["time"]).timestamp() < limit
    ]

    min_temp = round(
        min(weather["temperature"] for weather in forecast)
    )
    max_temp = round(
        max(weather["temperature"] for weather in forecast)
    )
    max_wind_gust = round(
        (max(weather["wind_gust"] for weather in forecast)) * 3.6
    )
    summaries = [weather["summary"] for weather in forecast]
    most_common_summary = max(summaries, key=summaries.count)
    avg_cloud_cover = round(
        sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
    )

    event.assistant.render_response(
        dedent(
            f"""
            The forecast for today is: {most_common_summary}, with
            a minimum of {min_temp} and a maximum of {max_temp}
            degrees, wind gust of {max_wind_gust} km/h, and an
            average cloud cover of {avg_cloud_cover}%.
            """
        )
    )

This script will work with any of the available voice assistants.

You can also implement something similar for news playback, for example using the rss plugin to get the latest items in your subscribed feeds. You can likewise create custom alarms using the alarm plugin, or a timer using the utils.set_timeout action.

Conclusions

The past few years have seen a lot of things happen in the voice industry. Many products have gone out of market, been deprecated or sunset, but not all hope is lost. The OpenAI and Picovoice products, especially when combined together, can still provide a good out-of-the-box voice assistant experience. And the OpenAI products have also raised the bar on what to expect from an AI-based assistant.

I wish that there were still some fully open and on-device alternatives out there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google provide the best voice experience as of now, but of course they come with trade-offs - namely the great amount of data points you feed to these cloud-based services. Picovoice is somewhat a trade-off, as it runs at least partly on-device, but their business model is still a bit fuzzy and it's not clear whether they intend to have their products used by the wider public or if it's mostly B2B.

I'll keep an eye however on what is going to come from the ashes of Mycroft under the form of the OpenConversational project, and probably keep you up-to-date when there is a new integration to share.

Fabio Manganiello

Those who have followed me for a while know of my personal obsession with self-built voice assistants.

My experiments over the years can be summarized as follows:

  • 2007: Voxifera, my very first attempt at building a primitive voice assistant using Hidden Markov models. Definitely not good for general-purpose usage, but good enough in 2007 to distinguish between a dozen simple voice commands.

  • 2019: First voice assistant built on top of Platypush. It used the now deprecated Google Assistant Library on top of a Raspberry Pi with a microphone and a speaker, and it could hook any automation routines and custom commands to it through event hooks.

  • 2020: Second iteration on #platypush, this time supporting other assistant plugins too - Alexa (integration now removed), Snowboy (also removed, since the project is dead), Mozilla DeepSpeech (also removed now, since Mozilla discontinued it), PicoVoice, and mimic3 (the text-to-speech engine built on top of Mycroft, now bankrupt).

  • 2024: Third iteration on Platypush, this time with an enhanced PicoVoice integration and new speech-to-text and text-to-speech plugins based on the OpenAI APIs.

But it's now 2026, and perhaps both the hardware and the software are now mature enough for fully on-device voice assistants based on fully open solutions likely to stick around for a while.

In this article we'll close that gap with Platypush.

The result is not another cloud assistant with a different coat of paint. The hotword engine, speech recognition, command dispatch and speech synthesis can all run on-device. If the openai step points to a local OpenAI-compatible server, then the whole pipeline can stay on your LAN too.

The pipeline

The architecture can be summarized as follows:

[Pipeline diagram. Nodes: Microphone, assistant.openwakeword, HotwordDetectedEvent, assistant.vosk.start_conversation, ConversationStartEvent, SpeechRecognizedEvent, Local command hooks, openai.get_response, tts.piper, Speaker, ConversationEndEvent. Edge labels: listens, emits, hotword detected, speech recognized, phrase matches local command, generic response, text to speech, process intent, play speech response, follow up, conversation end.]

Hotword detection ("OK Google", "Alexa" etc.) is a continuous, low-latency workload, and it should not need the network.

Speech-to-text is also a good fit for local inference: Vosk models are small enough to run on modest hardware, including Raspberry Pis, and they are perfectly adequate for short home automation commands.

Text-to-speech is another place where local models are good enough nowadays: Piper voices are fast, small and much nicer than the old robotic espeak-style fallback.

The only optional network-shaped piece is the language model.

But that is a policy choice, not a requirement of the voice stack.

Setup

Clone the assistant sample repository:

git clone https://git.platypush.tech/platypush/assistant-sample
cd assistant-sample

Models

The next step is to download the voice models used by the voice stack.

Hotword Detection

When the service starts the first time, it will automatically download all the available models.

You can then use the following command to list the available models once the service is running:

curl -s -XPOST \
     -H 'Content-type: application/json' \
     -H "Authorization: Bearer $PLATYPUSH_TOKEN" \
     -d '{"type":"request", "action":"assistant.openwakeword.list_models"}' \
     http://localhost:8008/execute

Where $PLATYPUSH_TOKEN is the token of the user that is running the service.

You can retrieve it by connecting to http://localhost:8008 when the service starts for the first time. Create your credentials, then select Settings -> Tokens -> Generate API Token.
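
If you would rather call the API from a script, the same request can be sketched in Python with the standard library alone. The helper names below are hypothetical, and the optional "args" field is an assumption about how action arguments are passed in the request body (the curl example above doesn't need any).

```python
import json
import urllib.request


def build_execute_request(action, token, base_url="http://localhost:8008", **args):
    """Build the POST request for the /execute endpoint without sending it."""
    payload = {"type": "request", "action": action}
    if args:
        # Assumption: action arguments are passed under the "args" key.
        payload["args"] = args

    return urllib.request.Request(
        f"{base_url}/execute",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def execute(action, token, **args):
    """Send the request and return the decoded JSON response."""
    request = build_execute_request(action, token, **args)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling execute("assistant.openwakeword.list_models", token) with your API token should then be equivalent to the curl command above.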

Speech-to-text

A full list of the Vosk voice models is available here.

Some feedback about the quality of the English models:

Model | Size | Notes
vosk-model-small-en-us-0.15 | 40 MB | Very fast and lightweight model that can also run on an old Raspberry Pi, but accuracy can be low.
vosk-model-en-us-0.22-lgraph | 128 MB | Reasonably accurate on clear speech and with native speakers, but still small enough to run fine even on a Raspberry Pi.
vosk-model-en-us-0.22 | 1.8 GB | Accurate generic US English model. Fast on a laptop or x86 processor, but it may be a bit heavy on a Raspberry Pi.

Download the selected model to the Docker volume working directory:

mkdir -p ./workdir/assistant.vosk/models
cd ./workdir/assistant.vosk/models
wget "https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-lgraph.zip"
unzip "vosk-model-en-us-0.22-lgraph.zip"
rm "vosk-model-en-us-0.22-lgraph.zip"

Text-to-speech

Download a speech synthesis model from here.

Audio samples are also available to get an idea of the type of voice before downloading.

The model usually consists of a *.onnx and a *.onnx.json file. Download both of them to the Docker volume working directory:

mkdir -p ./workdir/piper_tts
cd ./workdir/piper_tts
wget "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx"
wget "https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/hfc_female/medium/en_US-hfc_female-medium.onnx.json"

Configuration

Copy and edit the example configuration file.

cp config/config.example.yaml config/config.yaml

Home automation plugins

The assistant becomes useful once recognized speech can reach the rest of the house.

For example, Hue lights:

light.hue:
  bridge: hue
  groups:
    - Living Room

And MPD/Mopidy for music:

music.mopidy:
  host: localhost

music.mpd:
  host: localhost
  poll_interval: null

Those are just regular Platypush plugins.

The assistant does not need special knowledge about Hue, MPD, Chromecast, Zigbee, MQTT or anything else.

It only needs to emit events; your hooks decide what to do with them.

Build

Build the container image for the assistant service:

docker build -t platypush-voice .

Run

The assistant needs access to the host microphone and speakers. The container routes ALSA through PulseAudio, so the examples below connect it to a PulseAudio server running on the host.

Linux

With PulseAudio or pipewire-pulseaudio installed:

docker run --rm \
  -e PULSE_SERVER=unix:/run/pulse/native \
  -v /run/user/$(id -u)/pulse/native:/run/pulse/native \
  --name voice-assistant \
  -p 8008:8008 \
  -v ./config:/etc/platypush \
  -v ./workdir:/var/lib/platypush \
  platypush-voice

macOS

Install and start PulseAudio on the host:

brew install pulseaudio
pulseaudio --daemonize=yes --exit-idle-time=-1
pactl load-module module-native-protocol-tcp \
  auth-anonymous=1 \
  listen=0.0.0.0 \
  port=4713

Then start the container:

docker run --rm \
  -e PULSE_SERVER=tcp:host.docker.internal:4713 \
  --name voice-assistant \
  -p 8008:8008 \
  -v "$(pwd)/config:/etc/platypush" \
  -v "$(pwd)/workdir:/var/lib/platypush" \
  platypush-voice

If pactl load-module reports that the module is already loaded, you can keep using the existing PulseAudio daemon.

Windows

Install PulseAudio for Windows, then create a default.pa file in the same directory as pulseaudio.exe:

load-module module-waveout sink_name=output source_name=input record=1
load-module module-native-protocol-tcp auth-anonymous=1 listen=0.0.0.0 port=4713
set-default-sink output
set-default-source input

Start PulseAudio from PowerShell:

.\pulseaudio.exe -F .\default.pa --exit-idle-time=-1

Then start the container from the repository directory:

docker run --rm `
  -e PULSE_SERVER=tcp:host.docker.internal:4713 `
  --name voice-assistant `
  -p 8008:8008 `
  -v "${PWD}/config:/etc/platypush" `
  -v "${PWD}/workdir:/var/lib/platypush" `
  platypush-voice

Make sure microphone access is enabled for desktop applications under Windows privacy settings, and allow PulseAudio through the firewall if prompted.

Usage

Once the service is running, you can start interacting with it through voice commands (the default activation word is "Alexa").

Any questions about the weather will be resolved by the weather plugin if it's been enabled.

If the music or lights plugins are enabled, they can be controlled with voice commands ("stop the music", "turn on the lights", etc.)

Otherwise, the assistant will use the openai plugin to respond to your questions, with follow-up turns when the response from OpenAI is also a question.

Extending the Assistant

The assistant logic is modeled through simple Platypush hooks under config/scripts.

You can extend it as you like by defining your own hooks or modifying the existing ones.

Starting a conversation

Conversations are started by hooking to the HotwordDetectedEvent.

import logging

from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent

logger = logging.getLogger(__name__)
ai_plugin = "openai"
assistant_plugin = "assistant.vosk"


@when(HotwordDetectedEvent)
def on_hotword_detected(event: HotwordDetectedEvent):
    """
    When the hotword is detected, start a conversation.
    """
    logger.info(f"Hotword {event.hotword} detected")
    run(f"{assistant_plugin}.start_conversation")

Deterministic commands

For common home automation commands, regular event hooks are still the best tool. They are fast, inspectable, and they do not hallucinate.

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent


@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def turn_on_lights():
    """
    Hook run when the user says "turn on the lights" (regex)
    """
    run("light.hue.on")


@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music():
    """
    Hook run when the user says "play the music" (regex)
    """
    run("music.mpd.play")


@when(SpeechRecognizedEvent, phrase="set the music volume (to|on|at) ${volume}")
def set_volume(volume: int):
    """
    Hook run when the user says "set the music volume to ${volume}"
    (regex with parameter).
    """
    run("music.mpd.set_volume", volume=volume)
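
To make the phrase syntax less magic, here is a rough sketch of how a phrase that mixes a regular expression with ${name} placeholders could be matched against a transcription. This is not how Platypush implements the matching internally; compile_phrase and match_phrase are hypothetical helpers that only illustrate the idea.

```python
import re
from typing import Dict, Optional


def compile_phrase(template: str) -> "re.Pattern[str]":
    """Turn every ${name} placeholder into a named regex group."""
    pattern = re.sub(r"\$\{(\w+)\}", r"(?P<\1>.+)", template)
    return re.compile(pattern, re.IGNORECASE)


def match_phrase(template: str, phrase: str) -> Optional[Dict[str, str]]:
    """Return the captured parameters if the phrase matches, None otherwise."""
    match = compile_phrase(template).search(phrase)
    return match.groupdict() if match else None
```

Note that the captured values are strings: in this sketch, "set the music volume to 30" yields {"volume": "30"}, so a handler that expects a number should convert the value explicitly.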

AI Commands

If the openai plugin is enabled, you can use it to help you answer questions.

There are two generic use-cases for voice assistants where an AI plugin is beneficial:

  • Speech to Intent
  • Response fallback

Speech to Intent

You may want this for general questions, for commands that do not fit a neat regular expression, or for transforming a raw sentence such as:

make it a bit darker and reduce the music volume

into a structured action plan like:

[
  {
    "action": "light.hue.set_lights",
    "args": {
      "bri": 50
    }
  },
  {
    "action": "music.mpd.set_volume",
    "args": {
      "volume": 20
    }
  }
]

An example provided in the assistant sample is that of weather forecasting.

Note in particular the usage of openai.get_response with a well crafted system prompt that turns a natural language request like:

What's the weather tomorrow in San Francisco?

into:

{
  "type": "weather",
  "delta_days": 1,
  "location": "San Francisco"
}

def parse_weather_request(request: str) -> WeatherRequest | None:
    request_dict = (
        run(
            "openai.get_response",
            context=[
                {
                    "role": "system",
                    "content": (
                        "You are a voice assistant provided with weather requests as free text.\n"
                        "Given the prompt, return a structured JSON representation of the request in the following format: "
                        '{ "type": "weather", "delta_days": 1, "location": "San Francisco" }, '
                        'where both delta_days and location are optional (e.g. if the user simply asks "How\'s the weather?").\n'
                        'If the prompt doesn\'t seem to contain a weather request, return { "type": null }'
                    ),
                }
            ],
            prompt=request,
        )
        or {}
    )

    if request_dict.get("type") != "weather":
        return None

    weather_request = WeatherRequest(
        location=request_dict.get("location", default_location),
        delta_days=request_dict.get("delta_days", 0),
    )

    return weather_request
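
The snippet above doesn't show WeatherRequest, nor what happens if the model wraps its JSON in extra text. As a sketch under stated assumptions, the following defines a hypothetical WeatherRequest dataclass (with the field names used above) and a defensive parser for a reply that arrives as raw text. Whether openai.get_response returns text or an already parsed object is not covered here, so treat the text input as an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class WeatherRequest:
    # Hypothetical container: the field names mirror the snippet above.
    location: Optional[str] = None
    delta_days: int = 0


def weather_request_from_reply(
    reply: str, default_location: Optional[str] = None
) -> Optional[WeatherRequest]:
    """Extract a weather request from a model reply, or None if there is none."""
    # Models sometimes surround the JSON with prose: keep the outermost braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start < 0 or end < start:
        return None

    try:
        data = json.loads(reply[start : end + 1])
    except ValueError:
        return None

    if not isinstance(data, dict) or data.get("type") != "weather":
        return None

    try:
        delta_days = int(data.get("delta_days") or 0)
    except (TypeError, ValueError):
        delta_days = 0  # Fall back to "today" on malformed values

    return WeatherRequest(
        location=data.get("location") or default_location,
        delta_days=delta_days,
    )
```

Anything that is not a well-formed weather object simply yields None, so the caller can fall through to the generic response hook.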

You can also use the model for intermediate transformation instead of direct answers. For example, ask it to return a tiny JSON object with action and args, then dispatch only actions you explicitly allow:

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

ALLOWED_ACTIONS = {
    "lights.on": "light.hue.on",
    "lights.off": "light.hue.off",
    "music.play": "music.mpd.play",
    "music.stop": "music.mpd.stop",
}


@when(SpeechRecognizedEvent)
def on_fuzzy_command(event):
    plan = run(
        "openai.get_response",
        prompt=event.phrase,
        context=[
            {
                "role": "system",
                "content": (
                    "Map the user command to JSON only: "
                    '{"action": "...", "args": {...}}. '
                    f"Allowed actions: {', '.join(ALLOWED_ACTIONS)}. "
                    "If none match, return {\"action\": null, \"args\": {}}."
                ),
            }
        ],
    )

    # Parse `plan` as JSON here, validate it, then run only an allow-listed action.

That last validation step matters. A model may be useful for interpretation, but it should not get arbitrary access to run().
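
A minimal sketch of that validation step follows. It assumes that the model's reply is available as raw JSON text; dispatch_plan is a hypothetical helper, and the runner is passed in as a parameter so that the logic can be tried without a running Platypush instance.

```python
import json
from typing import Any, Callable, Optional

ALLOWED_ACTIONS = {
    "lights.on": "light.hue.on",
    "lights.off": "light.hue.off",
    "music.play": "music.mpd.play",
    "music.stop": "music.mpd.stop",
}


def dispatch_plan(raw: str, runner: Callable[..., Any]) -> Optional[str]:
    """Run the allow-listed action named in `raw`; return its name, or None."""
    try:
        plan = json.loads(raw)
    except (TypeError, ValueError):
        return None  # Not valid JSON: ignore the reply

    if not isinstance(plan, dict):
        return None

    name = plan.get("action")
    action = ALLOWED_ACTIONS.get(name) if isinstance(name, str) else None
    if action is None:
        return None  # Unknown or missing action: never passed to the runner

    args = plan.get("args")
    # Only the action name is validated here; argument checks are left out.
    runner(action, **(args if isinstance(args, dict) else {}))
    return action
```

In the hook above this would be called as dispatch_plan(plan, run). The key property is that the model only ever chooses among the keys of ALLOWED_ACTIONS, and never names a Platypush action directly.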

Response fallback

If a request doesn't match any of the commands you have defined, you can use a generic SpeechRecognizedEvent hook to forward the request to an AI plugin, and render the response as speech through the text-to-speech plugin.

import logging

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

logger = logging.getLogger(__name__)
ai_plugin = "openai"
assistant_plugin = "assistant.vosk"


@when(SpeechRecognizedEvent, plugin=assistant_plugin)
def on_speech_recognized(event: SpeechRecognizedEvent):
    """
    Generic handler for speech recognition events received
    by the configured assistant plugin.
    """
    logger.info("Recognized speech: %s", event.phrase)

    # Forward the request to OpenAI and render the response as speech
    response = run(
        f"{ai_plugin}.get_response",
        prompt=event.phrase,
        context=[
            {
                "role": "system",
                "content": (
                    "You are a voice assistant that can answer questions and perform actions. "
                    "Keep in mind that prompts are transcriptions of user speech and they may "
                    "contain misspellings or errors. Try and interpret them as best as possible. "
                    "When possible, keep your answers short and concise."
                ),
            }
        ],
    )

    # If the response is not empty, render it using the TTS plugin
    if response:
        event.assistant.render_response(response)

When a response from the LLM ends with a question mark, the assistant will automatically listen for a follow-up command and fire a new SpeechRecognizedEvent.

Pausing music while listening

One nice touch is to pause the music when a conversation starts and resume it after the assistant is done.

import logging

from platypush import run, when
from platypush.events.assistant import (
    ConversationEndEvent,
    ConversationStartEvent,
)

logger = logging.getLogger(__name__)


@when(ConversationStartEvent)
def on_conversation_start():
    try:
        run("utils.clear_timeout", name="ConversationEndTimeout")
    except Exception as e:
        logger.error("Error clearing conversation end timeout: %s", e)

    run("music.mpd.pause_if_playing")


@when(ConversationEndEvent)
def on_conversation_end():
    run(
        "utils.set_timeout",
        name="ConversationEndTimeout",
        seconds=5,
        actions=[{"action": "music.mpd.play_if_paused"}],
    )

That makes the interaction feel much less clumsy: wake word, music ducks or pauses, command is recognized, answer is spoken, music resumes a few seconds later.

Going fully local

With the configuration above, hotword detection, speech-to-text, automation and text-to-speech are already local. The only non-local component is the openai plugin, if it points to OpenAI's servers.

To make the last step local too, run a model server that exposes an OpenAI-compatible API. Ollama, llama.cpp server, vLLM and LocalAI can all expose some version of /v1/chat/completions.

For example, with Ollama:

ollama pull llama3.1:8b
ollama serve

The OpenAI-compatible endpoint is then usually available at:

http://127.0.0.1:11434/v1/chat/completions

If your Platypush openai plugin version supports a custom API base URL, the configuration is the whole change:

openai:
  model: llama3.1:8b
  base_url: http://127.0.0.1:11434/v1

If it does not, keep the rest of the assistant exactly the same and replace only the fallback action with a tiny local request:
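
A minimal sketch of such a request, using only the standard library, could look like this. The helper names are hypothetical; the URL and model name are the Ollama defaults used above, and the response parsing assumes the usual OpenAI chat completions layout (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    model="llama3.1:8b",
    base_url="http://127.0.0.1:11434/v1",
    system=None,
):
    """Build a chat completion request for an OpenAI-compatible server."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps({"model": model, "messages": messages}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return payload["choices"][0]["message"]["content"].strip()


def local_response(prompt, timeout=60.0, **kwargs) -> str:
    """Send the prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

In the fallback hook, the call to the openai plugin would then be replaced by something like local_response(event.phrase, system="..."), with the result still rendered through event.assistant.render_response.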

That is enough to turn the assistant into a fully local stack:

[Diagram: OpenWakeWord, Vosk, Platypush Hooks, Local OpenAI-compatible model, Piper]

On a Raspberry Pi, I would still keep expectations realistic. Hotword detection, Vosk and Piper are fine on small machines. Local LLMs are the heavy piece. A Pi 5 with enough RAM can run small quantized models, but latency will not feel like a cloud model or a GPU-backed workstation. For many home automation workflows, that is acceptable because the LLM is only the fallback; the frequent commands stay deterministic.

Why this architecture ages well

Voice assistants have been a graveyard of abandoned SDKs and cloud products. Snowboy is gone. Mycroft is gone. The old Google Assistant SDK is deprecated. Vendor assistants are increasingly shaped around vendor ecosystems rather than user-controlled automation.

The safer long-term bet is not one monolithic assistant. It is a pipeline of small replaceable parts:

  • Swap the hotword model without touching the automation logic.
  • Swap Vosk for another STT engine without touching Hue or MPD.
  • Swap OpenAI for a local OpenAI-compatible model without touching the wake word, TTS or command hooks.
  • Swap Piper voices without touching the assistant flow.

Platypush is a good fit for this because its event system is already the boundary between perception and action. Speech recognition emits an event. Hooks decide what to do. Plugins execute the actions.

That separation is what makes the assistant inspectable. It is also what makes it possible to keep most of it on a Raspberry Pi in your house, instead of outsourcing the entire audio loop to a cloud service that may disappear, get worse, or decide one day that your use case is no longer part of the roadmap.

Final notes

The minimal version of this setup is small:

  • assistant.openwakeword for the always-on wake word.
  • assistant.vosk for local command transcription.
  • A few @when(SpeechRecognizedEvent, phrase=...) hooks for deterministic commands.
  • light.hue, music.mpd or any other Platypush plugin for actions.
  • tts.piper for local spoken responses.
  • openai.get_response only where language understanding is worth the cost.

Start with the deterministic commands. Add the model fallback later. That way the assistant stays fast for the commands you use every day, while still being flexible enough to answer questions or interpret messy speech when you need it.
