- OpenAI-compatible Flask proxy (POST /audio/speech, GET /models)
- faster-qwen3-tts HIP graph acceleration: GPU LLM at 1.78x RTF
- CPU speech tokenizer decoder: bypasses MIOpen ConvDirectNaiveConvFwd, eliminates 4-40s per-request decode overhead
- attn_implementation=sdpa for transformer attention
- AOTRITON env var toggle (off=short sentences, on=long-form/novel chapters)
- HIP_GRAPHS env var toggle (default on)
- Startup warmup with HIP graph capture (~5s)
- CORS support for browser extension requests
- RTF: 0.9-1.5x on AMD RX 7900 XTX (gfx1100, ROCm 6.3)

Performance vs baseline (CPU-only, ~3 min/sentence): 12c: 3.2s | 44c: 2.7s | 115c: 6.6s
qwen3-tts-ra
Qwen3-TTS with Read-Aloud browser extension integration.
Components
- `qwen3-proxy/`: OpenAI-compatible TTS proxy (POST /audio/speech)
- `Qwen3-TTS/`: Qwen3-TTS library (submodule / clone)
- `read-aloud/`: Read-Aloud browser extension (submodule / clone)
- `setup_qwen3_readaloud.sh`: Initial environment setup script
Architecture
Read-Aloud extension
→ POST http://localhost:5000/audio/speech
→ qwen3-proxy/app.py (Flask, OpenAI-compatible API)
→ faster-qwen3-tts (HIP graph acceleration, AMD gfx1100)
→ GPU: LLM token generation at ~1.78x RTF
→ CPU: speech tokenizer decode (bypasses MIOpen)
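The request path above can be exercised without the browser extension. The following is a minimal client sketch, assuming the proxy accepts the same JSON body as OpenAI's speech endpoint (`model`, `input`, `voice`). The helper names `build_speech_request` and `synthesize` are illustrative and not part of this repository.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:5000/audio/speech"  # endpoint from the diagram above


def build_speech_request(text, voice="alloy", model="tts-1", url=PROXY_URL):
    """Build (but do not send) a POST request in the OpenAI speech format."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the proxy running, `synthesize("Hello world.", "hello.audio")` should write the response body to disk. The audio container format depends on how the proxy is configured, so the file extension here is a placeholder.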
Performance (AMD Radeon RX 7900 XTX, gfx1100)
| Input | Audio length | Generation time | RTF |
|---|---|---|---|
| 12 chars ("Hello world.") | ~2s | ~3s | ~0.9x |
| 44-char sentence | ~4s | ~3s | 1.5x |
| 115-char paragraph | ~10s | ~7s | 1.5x |
RTF here is audio duration divided by generation time, so RTF > 1.0 means audio is generated faster than real time.
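As a worked example of that ratio, here is a small sketch; the function name `real_time_factor` is made up for illustration. Note that some tools define RTF the other way round (processing time over audio duration, lower is better), so check the convention before comparing numbers.

```python
def real_time_factor(audio_seconds, generation_seconds):
    """RTF as used in this README: audio duration / generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# 115-character paragraph: roughly 10 s of audio generated in 6.6 s
rtf = real_time_factor(10.0, 6.6)
print(f"RTF = {rtf:.2f}x")
```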
Key optimisations
- HIP Graphs (`faster-qwen3-tts`): captures the autoregressive decode loop as a static GPU program, eliminating per-token Python overhead
- CPU speech decoder: moves `speech_tokenizer.model` to CPU, bypassing MIOpen's slow `ConvDirectNaiveConvFwd` fallback entirely
- `attn_implementation=sdpa`: PyTorch native SDPA for transformer attention
- `MIOPEN_USER_DB_PATH`: persistent MIOpen find-DB for LLM-side convolutions
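For the last item, a minimal sketch of pointing MIOpen at a persistent find-DB directory from Python. The helper name `configure_miopen_cache` and any directory you pass to it are assumptions for illustration; the service may instead set the variable in its systemd unit.

```python
import os
from pathlib import Path


def configure_miopen_cache(cache_dir):
    """Point MIOpen's user find-DB at a persistent directory.

    Call this before importing torch so the setting is in place when
    MIOpen initialises.
    """
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    os.environ["MIOPEN_USER_DB_PATH"] = str(path)
    return path
```

A call such as `configure_miopen_cache("~/.cache/miopen-qwen3")` (a hypothetical location) would let tuning results survive restarts instead of being rediscovered on every launch.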
Setup
# Install Python venv + deps
./setup_qwen3_readaloud.sh
# Start the proxy service
systemctl --user start qwen3-tts-proxy.service
# Watch logs
journalctl --user -u qwen3-tts-proxy.service -f
Read-Aloud Extension Settings
In Read-Aloud → Settings → OpenAI:
| Field | Value |
|---|---|
| URL | http://127.0.0.1:5000 |
| API Key | (leave blank) |
| Voice list | see below |
[
{"voice": "alloy", "lang": "en-US", "model": "tts-1"},
{"voice": "echo", "lang": "en-US", "model": "tts-1"},
{"voice": "fable", "lang": "en-US", "model": "tts-1"},
{"voice": "onyx", "lang": "en-US", "model": "tts-1"},
{"voice": "nova", "lang": "zh-CN", "model": "tts-1"},
{"voice": "shimmer", "lang": "zh-CN", "model": "tts-1"}
]
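Before pasting a customised voice list into the extension, it can help to check that it is valid JSON and that every entry has the same keys as the entries above. This is a small sketch; `check_voice_list` is an illustrative helper, and the required-key set is inferred from the example rather than from the extension's documentation.

```python
import json

REQUIRED_KEYS = {"voice", "lang", "model"}  # keys used by the entries above


def check_voice_list(raw):
    """Parse a voice-list JSON string and return the voices grouped by language."""
    entries = json.loads(raw)
    by_lang = {}
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing {sorted(missing)}")
        by_lang.setdefault(entry["lang"], []).append(entry["voice"])
    return by_lang
```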
Env vars (systemd service)
| Variable | Default | Notes |
|---|---|---|
| `QWEN_MODEL` | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | HF model id or local path |
| `DEVICE` | `cuda:0` | GPU device |
| `HIP_GRAPHS` | `1` | Enable faster-qwen3-tts HIP graphs |
| `AOTRITON` | `0` | AOTriton flash attention: faster for long text (>80 chars), slower for short sentences |
| `PROXY_PORT` | `5000` | Listening port |
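A sketch of how these variables could be read with the defaults listed above. `ProxyConfig`, `load_config`, and the truthy-string parsing in `_flag` are assumptions for illustration; the actual parsing in `qwen3-proxy/app.py` may differ.

```python
import os
from dataclasses import dataclass


def _flag(value):
    """Interpret common truthy strings; the proxy's own parsing may differ."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ProxyConfig:
    model: str
    device: str
    hip_graphs: bool
    aotriton: bool
    port: int


def load_config(env=None):
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    return ProxyConfig(
        model=env.get("QWEN_MODEL", "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"),
        device=env.get("DEVICE", "cuda:0"),
        hip_graphs=_flag(env.get("HIP_GRAPHS", "1")),
        aotriton=_flag(env.get("AOTRITON", "0")),
        port=int(env.get("PROXY_PORT", "5000")),
    )
```

Passing a plain dict, for example `load_config({"AOTRITON": "1"})`, makes it easy to try the long-form setting without touching the service environment.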