pi-bot-01 fef6a1b74c feat: add PCM streaming + Kokoro voice name support
- POST /audio/speech with response_format=pcm now streams raw 16-bit
  PCM (24kHz mono) via Flask generator — compatible with customtts
  extension streaming mode
- resolve_voice() handles:
    * Standard OpenAI names (alloy, echo, ...)
    * Kokoro blend syntax: 'af_bella+bf_emma+af_nicole' (picks first)
    * Kokoro prefix heuristic: af_/bf_/am_/bm_ → Ryan, zf_/zm_ → Vivian
    * Explicit Kokoro aliases for common voices (bella, emma, sky, etc.)
    * Graceful fallback to alloy for unknown voices
- app.run(threaded=True) to support concurrent streaming connections
2026-03-25 21:39:56 -07:00

qwen3-tts-ra

Qwen3-TTS with Read-Aloud browser extension integration.

Components

  • qwen3-proxy/ — OpenAI-compatible TTS proxy (POST /audio/speech)
  • Qwen3-TTS/ — Qwen3-TTS library (submodule / clone)
  • read-aloud/ — Read-Aloud browser extension (submodule / clone)
  • setup_qwen3_readaloud.sh — Initial environment setup script

Architecture

Read-Aloud extension
  → POST http://localhost:5000/audio/speech
    → qwen3-proxy/app.py (Flask, OpenAI-compatible API)
      → faster-qwen3-tts (HIP graph acceleration, AMD gfx1100)
        → GPU: LLM token generation at ~1.78x RTF
        → CPU: speech tokenizer decode (bypasses MIOpen)
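
The first hop in the diagram can be exercised without the browser extension. The sketch below is one possible client, assuming the proxy accepts the usual OpenAI JSON fields (model, input, voice, response_format). The helper names are made up for illustration, and the sample format (16-bit, 24 kHz, mono) is taken from the commit notes on PCM streaming.

```python
import json
import urllib.request
import wave

BASE_URL = "http://localhost:5000"  # address shown in the architecture diagram
SAMPLE_RATE = 24_000                # Hz, per the commit notes
SAMPLE_WIDTH = 2                    # bytes per sample (16-bit PCM)
CHANNELS = 1                        # mono


def build_speech_request(text, voice="alloy", base_url=BASE_URL):
    """Build a POST /audio/speech request asking for raw PCM (assumed JSON body)."""
    payload = {
        "model": "tts-1",
        "input": text,
        "voice": voice,
        "response_format": "pcm",
    }
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_pcm_as_wav(pcm_chunks, path):
    """Wrap an iterable of raw PCM byte chunks in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)


def stream_speech(text, path, voice="alloy", chunk_size=4096):
    """Read the streamed response chunk by chunk and store it as a WAV file."""
    request = build_speech_request(text, voice)
    with urllib.request.urlopen(request) as response:
        chunks = iter(lambda: response.read(chunk_size), b"")
        save_pcm_as_wav(chunks, path)
```

With the proxy running, calling stream_speech("Hello world.", "hello.wav") should leave a playable file. Reading in small chunks means the client starts receiving audio before generation has finished, which is the point of the pcm format.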

Performance (AMD Radeon RX 7900 XTX, gfx1100)

Input                Audio   Time   RTF
12c "Hello world."   ~2s     ~3s    ~0.9x
44c sentence         ~4s     ~3s    1.5x
115c paragraph       ~10s    ~7s    1.5x

RTF here is audio duration divided by generation time, so RTF > 1.0 means audio is generated faster than real time.
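
As a small worked example of that ratio, the sketch below derives the audio duration from the size of a raw PCM response and divides it by the wall-clock generation time. The function names are illustrative, the sample format (16-bit, 24 kHz, mono) comes from the commit notes, and the numbers are invented for the example rather than measured.

```python
def pcm_duration_seconds(num_bytes, sample_rate=24_000, sample_width=2, channels=1):
    """Playback length of raw PCM audio, given its size in bytes."""
    return num_bytes / (sample_rate * sample_width * channels)


def rtf(audio_seconds, generation_seconds):
    """RTF as used in this README: audio duration / generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# 144,000 bytes of 16-bit mono PCM at 24 kHz is 3.0 s of audio.
audio = pcm_duration_seconds(144_000)
print(f"{rtf(audio, 2.0):.2f}x")  # prints 1.50x
```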

Key optimisations

  1. HIP Graphs (faster-qwen3-tts) — captures autoregressive decode loop as a static GPU program, eliminating Python overhead per token
  2. CPU speech decoder — moves speech_tokenizer.model to CPU, bypassing MIOpen's slow ConvDirectNaiveConvFwd fallback entirely
  3. attn_implementation=sdpa — PyTorch native SDPA for transformer attention
  4. MIOPEN_USER_DB_PATH — persistent MIOpen find-DB for LLM-side convolutions
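
Item 4 only helps if the variable points at a directory that survives restarts. However the service actually sets it, the idea can be sketched in Python as below. The helper name and the example directory are hypothetical, and calling it before importing torch is a cautious assumption about when MIOpen reads its environment.

```python
import os
from pathlib import Path


def configure_miopen_cache(cache_dir, environ=os.environ):
    """Point MIOpen's user find-DB at a persistent directory.

    An existing MIOPEN_USER_DB_PATH is left alone, so a value set by the
    service manager still wins.
    """
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # make sure the directory exists
    environ.setdefault("MIOPEN_USER_DB_PATH", str(path))
    return environ["MIOPEN_USER_DB_PATH"]


# Example call (hypothetical location), made before any `import torch`:
# configure_miopen_cache("~/.cache/qwen3-tts/miopen")
```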

Setup

# Install Python venv + deps
./setup_qwen3_readaloud.sh

# Start the proxy service
systemctl --user start qwen3-tts-proxy.service

# Watch logs
journalctl --user -u qwen3-tts-proxy.service -f

Read-Aloud Extension Settings

In Read-Aloud → Settings → OpenAI:

Field        Value
URL          http://127.0.0.1:5000
API Key      (leave blank)
Voice list   see below

[
  {"voice": "alloy",   "lang": "en-US", "model": "tts-1"},
  {"voice": "echo",    "lang": "en-US", "model": "tts-1"},
  {"voice": "fable",   "lang": "en-US", "model": "tts-1"},
  {"voice": "onyx",    "lang": "en-US", "model": "tts-1"},
  {"voice": "nova",    "lang": "zh-CN", "model": "tts-1"},
  {"voice": "shimmer", "lang": "zh-CN", "model": "tts-1"}
]
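
Besides the six OpenAI names listed above, the commit notes describe resolve_voice() as also accepting Kokoro-style names. The following is a minimal sketch of those rules, not the code in app.py: the name resolve_voice_sketch, the order of the checks, and the aliases parameter are assumptions. The real alias table is not shown here, so the caller supplies one.

```python
OPENAI_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

# Prefix heuristic from the commit notes.
PREFIX_SPEAKERS = {
    ("af_", "bf_", "am_", "bm_"): "Ryan",
    ("zf_", "zm_"): "Vivian",
}

FALLBACK_VOICE = "alloy"


def resolve_voice_sketch(requested, aliases=None):
    """Map a requested voice name to a voice the proxy can use.

    `aliases` is a caller-supplied dict (hypothetical) standing in for the
    explicit Kokoro aliases mentioned in the commit notes.
    """
    aliases = aliases or {}
    if not requested:
        return FALLBACK_VOICE

    # Kokoro blend syntax such as 'af_bella+bf_emma': keep only the first voice.
    first = requested.split("+")[0].strip().lower()

    if first in OPENAI_VOICES:
        return first
    if first in aliases:
        return aliases[first]
    for prefixes, speaker in PREFIX_SPEAKERS.items():
        if first.startswith(prefixes):  # str.startswith accepts a tuple
            return speaker
    return FALLBACK_VOICE  # graceful fallback for unknown voices
```

Under these assumptions, 'af_bella+bf_emma+af_nicole' resolves through the prefix rule to Ryan, and an unrecognised name falls back to alloy.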

Env vars (systemd service)

Variable     Default                                Notes
QWEN_MODEL   Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice   HF model id or local path
DEVICE       cuda:0                                 GPU device
HIP_GRAPHS   1                                      Enable faster-qwen3-tts HIP graphs
AOTRITON     0                                      AOTriton flash attention: faster for long text (>80 chars), slower for short sentences
PROXY_PORT   5000                                   Listening port
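
How app.py parses these variables is not shown here. As a rough sketch, the table could be read into a typed config like the one below. The ProxyConfig class is hypothetical, and treating only the string "1" as true for the two flags is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    # Defaults mirror the table above.
    qwen_model: str = "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"
    device: str = "cuda:0"
    hip_graphs: bool = True
    aotriton: bool = False
    proxy_port: int = 5000

    @classmethod
    def from_env(cls, environ=None):
        """Build a config from environment variables, falling back to defaults."""
        env = os.environ if environ is None else environ

        def flag(name, default):
            # Assumption: "1" enables a flag, anything else disables it.
            return env.get(name, "1" if default else "0").strip() == "1"

        return cls(
            qwen_model=env.get("QWEN_MODEL", cls.qwen_model),
            device=env.get("DEVICE", cls.device),
            hip_graphs=flag("HIP_GRAPHS", cls.hip_graphs),
            aotriton=flag("AOTRITON", cls.aotriton),
            proxy_port=int(env.get("PROXY_PORT", cls.proxy_port)),
        )
```

Passing a plain dict, for example ProxyConfig.from_env({"PROXY_PORT": "8080"}), makes it easy to check overrides without touching the real environment.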