feat: Qwen3-TTS proxy with HIP graph + CPU decoder optimisations

- OpenAI-compatible Flask proxy (POST /audio/speech, GET /models)
- faster-qwen3-tts HIP graph acceleration: GPU LLM at 1.78x RTF
- CPU speech tokenizer decoder: bypasses MIOpen ConvDirectNaiveConvFwd, eliminates 4-40s per-request decode overhead
- attn_implementation=sdpa for transformer attention
- AOTRITON env var toggle (off=short sentences, on=long-form/novel chapters)
- HIP_GRAPHS env var toggle (default on)
- Startup warmup with HIP graph capture (~5s)
- CORS support for browser extension requests
- RTF: 0.9-1.5x on AMD RX 7900 XTX (gfx1100, ROCm 6.3)

Performance vs baseline (CPU-only, ~3 min/sentence):
12c: 3.2s | 44c: 2.7s | 115c: 6.6s
82
README.md
Normal file
@@ -0,0 +1,82 @@
# qwen3-tts-ra

Qwen3-TTS with Read-Aloud browser extension integration.

## Components

- `qwen3-proxy/`: OpenAI-compatible TTS proxy (`POST /audio/speech`)
- `Qwen3-TTS/`: Qwen3-TTS library (submodule / clone)
- `read-aloud/`: Read-Aloud browser extension (submodule / clone)
- `setup_qwen3_readaloud.sh`: initial environment setup script

## Architecture

```
Read-Aloud extension
  → POST http://localhost:5000/audio/speech
    → qwen3-proxy/app.py (Flask, OpenAI-compatible API)
      → faster-qwen3-tts (HIP graph acceleration, AMD gfx1100)
        → GPU: LLM token generation at ~1.78x RTF
        → CPU: speech tokenizer decode (bypasses MIOpen)
```

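To exercise the request path above without the browser extension, a minimal client sketch follows. It assumes the proxy accepts the same JSON fields as the OpenAI speech endpoint (`model`, `voice`, `input`), which is implied by "OpenAI-compatible" but not spelled out here. The helper names `build_speech_request` and `synthesize` are hypothetical.

```python
import json
import urllib.request


def build_speech_request(text, voice="alloy", model="tts-1",
                         base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a POST /audio/speech request."""
    # Assumption: field names follow the OpenAI speech API.
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Requires the proxy to be running locally. The output file name is a
    # placeholder; the audio format returned by the proxy is not stated here.
    print(synthesize("Hello world.", "hello.audio"))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running GPU service.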
## Performance (AMD Radeon RX 7900 XTX, gfx1100)

| Input | Audio length | Generation time | RTF |
|-------|--------------|-----------------|-----|
| 12 chars ("Hello world.") | ~2s | ~3s | ~0.9x |
| 44-char sentence | ~4s | ~3s | **1.5x** |
| 115-char paragraph | ~10s | ~7s | **1.5x** |

RTF here is audio duration divided by generation time, so RTF > 1.0 means audio is generated faster than real time.

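As a worked check of the RTF column, the sketch below computes audio duration over generation time. It uses the approximate audio lengths from the table together with the unrounded timings from the commit message (2.7 s and 6.6 s). The function name `rtf` is illustrative only.

```python
def rtf(audio_seconds: float, generation_seconds: float) -> float:
    """Real-time factor as used in this README: audio length / wall time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


if __name__ == "__main__":
    # Audio lengths are the approximate values from the table.
    for label, audio, elapsed in [("44c", 4.0, 2.7), ("115c", 10.0, 6.6)]:
        print(f"{label}: {rtf(audio, elapsed):.2f}x")
```

Both rows come out close to the 1.5x reported in the table.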
## Key optimisations

1. **HIP Graphs** (`faster-qwen3-tts`): captures the autoregressive decode loop as a static GPU program, eliminating per-token Python overhead
2. **CPU speech decoder**: moves `speech_tokenizer.model` to the CPU, bypassing MIOpen's slow `ConvDirectNaiveConvFwd` fallback entirely
3. **`attn_implementation=sdpa`**: PyTorch-native SDPA for transformer attention
4. **`MIOPEN_USER_DB_PATH`**: persistent MIOpen find-DB for LLM-side convolutions

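For item 4, one way to make the find-DB persistent is to point `MIOPEN_USER_DB_PATH` at a directory that survives restarts before the GPU stack is loaded. The sketch below is an assumption about how that could be wired up in Python, not the proxy's actual code. The helper name and the cache location are hypothetical.

```python
import os
from pathlib import Path


def configure_miopen_cache(cache_dir: str) -> str:
    """Point MIOpen's user find-DB at a persistent directory.

    Call this before importing torch so the variable is set before
    MIOpen is first used.
    """
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    # setdefault keeps any value already exported by the service unit.
    os.environ.setdefault("MIOPEN_USER_DB_PATH", str(path))
    return os.environ["MIOPEN_USER_DB_PATH"]


if __name__ == "__main__":
    # "~/.cache/miopen-qwen3" is a hypothetical location.
    print(configure_miopen_cache("~/.cache/miopen-qwen3"))
```

Setting the variable in the systemd unit instead would have the same effect; the Python version is only convenient when running the proxy by hand.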
## Setup

```bash
# Install Python venv + deps
./setup_qwen3_readaloud.sh

# Start the proxy service
systemctl --user start qwen3-tts-proxy.service

# Watch logs
journalctl --user -u qwen3-tts-proxy.service -f
```

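Once the service is running, a quick smoke test is to query the `GET /models` endpoint mentioned in the commit message. The sketch below assumes the endpoint returns JSON; the response shape is not documented here, so the body is returned as parsed. The `opener` parameter is a hypothetical hook that makes the function testable offline.

```python
import json
import urllib.request


def list_models(base_url="http://127.0.0.1:5000", opener=urllib.request.urlopen):
    """GET {base_url}/models and return the decoded JSON body."""
    with opener(f"{base_url}/models") as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print(list_models())
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        print(f"proxy not reachable: {exc}")
```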
## Read-Aloud Extension Settings

In Read-Aloud → Settings → OpenAI:

| Field | Value |
|-------|-------|
| URL | `http://127.0.0.1:5000` |
| API Key | *(leave blank)* |
| Voice list | see below |

```json
[
  {"voice": "alloy", "lang": "en-US", "model": "tts-1"},
  {"voice": "echo", "lang": "en-US", "model": "tts-1"},
  {"voice": "fable", "lang": "en-US", "model": "tts-1"},
  {"voice": "onyx", "lang": "en-US", "model": "tts-1"},
  {"voice": "nova", "lang": "zh-CN", "model": "tts-1"},
  {"voice": "shimmer", "lang": "zh-CN", "model": "tts-1"}
]
```

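Because the voice list is pasted into the extension as raw JSON, it can help to sanity-check it first. The sketch below parses the list and groups voices by language. Treating `voice`, `lang` and `model` as required keys is an assumption drawn from the entries shown, not from Read-Aloud documentation.

```python
import json

VOICE_LIST = """
[
  {"voice": "alloy", "lang": "en-US", "model": "tts-1"},
  {"voice": "echo", "lang": "en-US", "model": "tts-1"},
  {"voice": "fable", "lang": "en-US", "model": "tts-1"},
  {"voice": "onyx", "lang": "en-US", "model": "tts-1"},
  {"voice": "nova", "lang": "zh-CN", "model": "tts-1"},
  {"voice": "shimmer", "lang": "zh-CN", "model": "tts-1"}
]
"""

# Assumption: every entry needs these three keys.
REQUIRED_KEYS = {"voice", "lang", "model"}


def check_voice_list(raw):
    """Parse the voice list and return {lang: [voice, ...]}."""
    by_lang = {}
    for i, entry in enumerate(json.loads(raw)):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing {sorted(missing)}")
        by_lang.setdefault(entry["lang"], []).append(entry["voice"])
    return by_lang


if __name__ == "__main__":
    print(check_voice_list(VOICE_LIST))
```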
## Env vars (systemd service)

| Variable | Default | Notes |
|----------|---------|-------|
| `QWEN_MODEL` | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | HF model id or local path |
| `DEVICE` | `cuda:0` | GPU device (ROCm builds of PyTorch expose AMD GPUs under the `cuda` device name) |
| `HIP_GRAPHS` | `1` | Enable faster-qwen3-tts HIP graphs |
| `AOTRITON` | `0` | AOTriton flash attention; faster for long text (>80 chars), slower for short sentences |
| `PROXY_PORT` | `5000` | Listening port |
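
The table above could be read into a typed config roughly as follows. This is a sketch of one reasonable approach, not the code in `qwen3-proxy/app.py`. The `ProxyConfig` class, the `load_config` helper, and the set of strings accepted as "on" are all assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyConfig:
    model: str
    device: str
    hip_graphs: bool
    aotriton: bool
    port: int


def _flag(value: str) -> bool:
    # Assumption: "1", "true", "yes" and "on" enable a toggle; anything else disables it.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env=None) -> ProxyConfig:
    """Build a ProxyConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ProxyConfig(
        model=env.get("QWEN_MODEL", "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"),
        device=env.get("DEVICE", "cuda:0"),
        hip_graphs=_flag(env.get("HIP_GRAPHS", "1")),
        aotriton=_flag(env.get("AOTRITON", "0")),
        port=int(env.get("PROXY_PORT", "5000")),
    )


if __name__ == "__main__":
    print(load_config())
```

Passing a plain dict instead of `os.environ` makes it easy to try combinations such as `{"AOTRITON": "1"}` for long-form text without touching the service unit.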