Text-to-Speech (TTS)
Edge TTS is included in the main Python dependencies. Cloud backends and HTTP-based self-hosted backends reuse the existing HTTP client stack (httpx); Kokoro-82M needs the extra packages listed in requirements-tts.txt.
Playback Controls
TTS settings live in the Settings panel:
- Audio/TTS enabled — global toggle to enable or disable all speech playback
- Auto-speak — automatically play speech for new assistant messages
- Volume — playback volume level
The speaker icon on each assistant message triggers speech for that message. Playback state is shown on the button itself.
Character Voice Settings
Each character keeps its own voice profile in the character editor Voice tab:
- Enabled/disabled
- Backend/API config
- Language and voice
- Speed and pitch
- Preview playback
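As a rough illustration, a per-character voice profile could be modeled as a small dataclass. The class name, field names, and defaults below are assumptions made for this sketch, not the project's actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class VoiceProfile:
    """Hypothetical per-character voice settings (illustrative field names)."""

    enabled: bool = True
    backend: str = "edge"              # which TTS backend this character uses
    api_base: str = ""                 # backend/API config; empty for Edge TTS
    language: str = "en-US"
    voice: str = "en-US-AriaNeural"    # an Edge TTS voice name, used as an example
    speed: float = 1.0                 # 1.0 means normal speaking rate
    pitch: float = 0.0                 # 0.0 means unchanged pitch

    def validate(self) -> None:
        """Reject values that no backend could honor."""
        if self.speed <= 0:
            raise ValueError("speed must be positive")


profile = VoiceProfile(speed=1.2)
profile.validate()
settings = asdict(profile)  # plain dict, convenient for JSON storage
```

Keeping the profile as plain data makes it easy to store alongside the character and to pass to whichever backend is selected.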
How It Works
Clicking the speaker icon on a character message, or enabling auto-speak for new assistant messages, triggers a three-step pipeline:
- Speech Extraction — a regex-based extractor pulls out spoken dialogue locally, strips inner monologue and scene text, and converts recognized action beats (*laughs*, *sighs*) into pauses or emotion tags for capable backends. It handles both straight ("...") and curly (“...”) double quotes.
- TTS Synthesis — the speakable text is sent to the configured backend (Edge TTS, OpenAI-compatible, Fish Speech, ElevenLabs, Kokoro-82M).
- Playback — the generated audio plays in-browser; results are cached on disk so repeated plays are instant.
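The extraction step can be sketched roughly as follows. This is a minimal approximation, not the project's extractor: the function name, the chunk format, and the beat-to-pause table are assumptions, and emotion tags are left out.

```python
import re

# One pattern with three alternatives; the re module resolves \u201c and \u201d itself.
TOKEN_RE = re.compile(
    r'"(?P<straight>[^"]+)"'               # straight double quotes
    r"|\u201c(?P<curly>[^\u201d]+)\u201d"  # curly double quotes
    r"|\*(?P<beat>[^*]+)\*"                # *action beat*
)

# Hypothetical mapping of recognized beats to pause lengths in milliseconds.
BEAT_PAUSES_MS = {"laughs": 300, "sighs": 500}


def extract_speakable(message: str) -> list[dict]:
    """Return dialogue and pause chunks in order; everything else is dropped."""
    chunks = []
    for match in TOKEN_RE.finditer(message):
        beat = match.group("beat")
        if beat is not None:
            pause = BEAT_PAUSES_MS.get(beat.strip().lower())
            if pause is not None:
                chunks.append({"type": "pause", "ms": pause})
            continue  # unrecognized beats are treated as scene text
        text = (match.group("straight") or match.group("curly")).strip()
        if text:
            chunks.append({"type": "speech", "text": text})
    return chunks
```

Narration outside quotes never matches the pattern, so it is skipped without a separate stripping pass.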
Clicking a quoted dialogue line in an assistant message speaks just that line (click-to-speak). Chunks are fetched on first click and cached in the DOM. The currently spoken line gets a subtle highlight during playback (karaoke mode).
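For the on-disk cache mentioned in the Playback step, one plausible approach is to key each audio file on a hash of the text plus the voice settings, so a change to any of them produces a fresh file. The function names, key fields, and the `.audio` extension below are assumptions for illustration only.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(text: str, backend: str, voice: str,
              speed: float = 1.0, pitch: float = 0.0) -> str:
    """Derive a stable key from everything that affects the generated audio."""
    payload = json.dumps(
        {"text": text, "backend": backend, "voice": voice,
         "speed": speed, "pitch": pitch},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_or_synthesize(cache_dir: Path, key: str,
                      synthesize: Callable[[], bytes]) -> bytes:
    """Return cached audio if present; otherwise synthesize and store it."""
    path = cache_dir / f"{key}.audio"
    if path.exists():
        return path.read_bytes()
    audio = synthesize()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

With this layout a repeated play only reads a file, which is consistent with repeated plays being instant.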
Available Backends
| Backend | Install | API Key | Voices | Models | Notes |
|---|---|---|---|---|---|
| Microsoft Edge TTS | Included in requirements.txt | None (free) | Fetched live, filterable by language | — | 400+ voices, 80+ languages |
| OpenAI-Compatible | None (httpx) | Required | 10 built-in voices (alloy, echo, nova, shimmer...) | Fetched live from /v1/models | Works with any provider implementing POST /v1/audio/speech |
| Kokoro-82M | See requirements-tts.txt | None | 54 voices, 9 languages | — | Self-hosted local model. hexgrad/kokoro |
| Fish Speech | None (httpx) | Optional | Fetched live from /v1/references/list | — | Self-hosted, supports voice cloning via references |
| ElevenLabs | None (httpx) | Required | Fetched live from ElevenLabs API | — | 300+ cloud voices, emotion tags, highest quality |
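To make the OpenAI-compatible row concrete, here is a hedged sketch of a request to POST /v1/audio/speech. The JSON fields (`model`, `input`, `voice`, `speed`) follow OpenAI's public speech API; the function names, the default model, and the assumption that the base URL does not already end in /v1 are illustrative, and other providers may accept different options.

```python
def build_speech_request(base_url: str, api_key: str, text: str,
                         voice: str = "alloy", model: str = "tts-1",
                         speed: float = 1.0) -> dict:
    """Assemble URL, headers, and JSON body for POST /v1/audio/speech."""
    return {
        "url": f"{base_url.rstrip('/')}/v1/audio/speech",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "json": {"model": model, "input": text, "voice": voice, "speed": speed},
    }


def synthesize_speech(base_url: str, api_key: str, text: str, **options) -> bytes:
    """Send the request with httpx and return the raw audio bytes."""
    import httpx  # imported lazily so the builder above works without it

    req = build_speech_request(base_url, api_key, text, **options)
    resp = httpx.post(req["url"], headers=req["headers"],
                      json=req["json"], timeout=60.0)
    resp.raise_for_status()
    return resp.content
```

Separating the request builder from the network call keeps the payload easy to inspect and test without contacting a provider.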
Adding New Backends
Each backend is a single file in backend/tts/ implementing the TTSAdapter base class. The router auto-registers adapters whose dependencies are installed (try/except import). See backend/tts/edge_adapter.py as a reference.
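The try/except registration pattern might look roughly like this. The candidate table and class names are hypothetical (only the backend/tts/edge_adapter.py path comes from this page), and the real router may be organized differently.

```python
import importlib

# Hypothetical table: backend name -> (module path, adapter class name).
CANDIDATE_ADAPTERS = {
    "edge": ("backend.tts.edge_adapter", "EdgeTTSAdapter"),
    "kokoro": ("backend.tts.kokoro_adapter", "KokoroAdapter"),
}


def discover_adapters(candidates: dict) -> dict:
    """Import each adapter module and skip those whose imports fail."""
    registry = {}
    for name, (module_path, class_name) in candidates.items():
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # optional dependency missing, so the backend stays hidden
        registry[name] = getattr(module, class_name)
    return registry
```

Because a missing optional package surfaces as an ImportError, a backend such as Kokoro-82M would simply not appear until its extra requirements are installed.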
Key methods to implement:
- list_voices() — return available voices (can be static or fetched from API)
- list_models() — optional, return available models (for backends with multiple models)
- synthesize() — convert speakable chunks into audio bytes
- backend_name, supports_streaming, supports_emotion_tags properties
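A minimal adapter built from the methods above could look like the following sketch. The TTSAdapter class here is a stand-in defined only for the example: the real base class in backend/tts/ may use different signatures (for instance async methods), and SilenceAdapter is a toy backend that returns silent WAV audio.

```python
import io
import wave
from abc import ABC, abstractmethod


class TTSAdapter(ABC):
    """Stand-in for the project's base class; real signatures may differ."""

    @property
    @abstractmethod
    def backend_name(self) -> str: ...

    @property
    def supports_streaming(self) -> bool:
        return False

    @property
    def supports_emotion_tags(self) -> bool:
        return False

    @abstractmethod
    def list_voices(self) -> list[dict]: ...

    def list_models(self) -> list[str]:
        return []  # optional: only backends with multiple models override this

    @abstractmethod
    def synthesize(self, chunks: list[str], voice: str) -> bytes: ...


class SilenceAdapter(TTSAdapter):
    """Toy backend: emits 0.1 s of 16 kHz mono silence per chunk."""

    SAMPLE_RATE = 16000
    FRAMES_PER_CHUNK = 1600

    @property
    def backend_name(self) -> str:
        return "silence"

    def list_voices(self) -> list[dict]:
        return [{"id": "silence", "language": "none"}]

    def synthesize(self, chunks: list[str], voice: str) -> bytes:
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit samples
            wav.setframerate(self.SAMPLE_RATE)
            wav.writeframes(b"\x00\x00" * self.FRAMES_PER_CHUNK * len(chunks))
        return buf.getvalue()
```

A real adapter would replace the body of synthesize() with a call to its TTS engine or HTTP API and return the resulting audio bytes.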