# qwen3-tts-ra

Qwen3-TTS with Read-Aloud browser extension integration.

## Components

- `qwen3-proxy/`: OpenAI-compatible TTS proxy (`POST /audio/speech`)
- `Qwen3-TTS/`: Qwen3-TTS library (submodule / clone)
- `read-aloud/`: Read-Aloud browser extension (submodule / clone)
- `setup_qwen3_readaloud.sh`: Initial environment setup script

## Architecture

```
Read-Aloud extension
  → POST http://localhost:5000/audio/speech
  → qwen3-proxy/app.py (Flask, OpenAI-compatible API)
  → faster-qwen3-tts (HIP graph acceleration, AMD gfx1100)
      → GPU: LLM token generation at ~1.78x RTF
      → CPU: speech tokenizer decode (bypasses MIOpen)
```

## Performance (AMD Radeon RX 7900 XTX, gfx1100)

| Input | Audio length | Generation time | RTF |
|-------|--------------|-----------------|-----|
| 12 chars, "Hello world." | ~2s | ~3s | ~0.9x |
| 44 chars, sentence | ~4s | ~3s | **1.5x** |
| 115 chars, paragraph | ~10s | ~7s | **1.5x** |

RTF > 1.0 means audio is generated faster than real time.

## Key optimisations

1. **HIP Graphs** (`faster-qwen3-tts`): captures the autoregressive decode loop as a static GPU program, eliminating per-token Python overhead
2. **CPU speech decoder**: moves `speech_tokenizer.model` to CPU, bypassing MIOpen's slow `ConvDirectNaiveConvFwd` fallback entirely
3. **`attn_implementation=sdpa`**: PyTorch native SDPA for transformer attention
4. **`MIOPEN_USER_DB_PATH`**: persistent MIOpen find-DB for LLM-side convolutions

## Setup

```bash
# Install Python venv + deps
./setup_qwen3_readaloud.sh

# Start the proxy service
systemctl --user start qwen3-tts-proxy.service

# Watch logs
journalctl --user -u qwen3-tts-proxy.service -f
```

## Read-Aloud Extension Settings

In Read-Aloud → Settings → OpenAI:

| Field | Value |
|-------|-------|
| URL | `http://127.0.0.1:5000` |
| API Key | *(leave blank)* |
| Voice list | see below |

```json
[
  {"voice": "alloy", "lang": "en-US", "model": "tts-1"},
  {"voice": "echo", "lang": "en-US", "model": "tts-1"},
  {"voice": "fable", "lang": "en-US", "model": "tts-1"},
  {"voice": "onyx", "lang": "en-US", "model": "tts-1"},
  {"voice": "nova", "lang": "zh-CN", "model": "tts-1"},
  {"voice": "shimmer", "lang": "zh-CN", "model": "tts-1"}
]
```

## Env vars (systemd service)

| Variable | Default | Notes |
|----------|---------|-------|
| `QWEN_MODEL` | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | HF model id or local path |
| `DEVICE` | `cuda:0` | GPU device |
| `HIP_GRAPHS` | `1` | Enable faster-qwen3-tts HIP graphs |
| `AOTRITON` | `0` | AOTriton flash attention; faster for long text (>80 chars), slower for short sentences |
| `PROXY_PORT` | `5000` | Listening port |
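
As a rough illustration of how these variables and their defaults fit together, the sketch below reads them into a small settings object. The `ProxySettings` class and `load_settings` helper are hypothetical names and are not taken from `qwen3-proxy/app.py`; only the variable names and defaults come from the env vars table. How the real service parses the `HIP_GRAPHS` and `AOTRITON` flags is an assumption here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxySettings:
    """Hypothetical container for the documented environment variables."""
    model: str
    device: str
    hip_graphs: bool
    aotriton: bool
    port: int


def load_settings(env=None) -> ProxySettings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ProxySettings(
        model=env.get("QWEN_MODEL", "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"),
        device=env.get("DEVICE", "cuda:0"),
        # Assumption: "1" enables a flag and any other value disables it.
        hip_graphs=env.get("HIP_GRAPHS", "1") == "1",
        aotriton=env.get("AOTRITON", "0") == "1",
        port=int(env.get("PROXY_PORT", "5000")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration without touching the systemd unit.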
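
To check the proxy without the browser extension, a request can be sent to `POST /audio/speech` directly. This is a minimal sketch: the JSON fields (`model`, `voice`, `input`) are assumed to follow the OpenAI speech API that the proxy is described as compatible with, and the helper names `build_speech_request` and `synthesize` are made up for this example.

```python
import json
import urllib.request


def build_speech_request(text, voice="alloy", base_url="http://127.0.0.1:5000"):
    """Build a POST request for the proxy's /audio/speech endpoint."""
    # Assumption: the body follows the OpenAI speech API (model, voice, input).
    payload = {"model": "tts-1", "voice": voice, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)
```

With the service running, `synthesize("Hello world.", "hello.bin")` writes whatever audio bytes the proxy returns; the container format depends on the proxy and is not assumed here.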
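
For comparing your own timings against the performance table, the helper below computes RTF under the assumption that it is audio duration divided by wall-clock generation time, which matches the stated convention that values above 1.0 are faster than real time. The function name is hypothetical, and the table's figures are rounded, so small differences are expected.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return audio duration divided by generation time (assumed RTF definition)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Illustrative numbers only: 6 s of audio produced in 4 s of wall-clock time.
example_rtf = real_time_factor(6.0, 4.0)
```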