
pi-bot-01 d3ca5ab0b2 feat: Qwen3-TTS proxy with HIP graph + CPU decoder optimisations

- OpenAI-compatible Flask proxy (POST /audio/speech, GET /models)
- faster-qwen3-tts HIP graph acceleration: GPU LLM at 1.78x RTF
- CPU speech tokenizer decoder: bypasses MIOpen ConvDirectNaiveConvFwd,
  eliminates 4-40s per-request decode overhead
- attn_implementation=sdpa for transformer attention
- AOTRITON env var toggle (off=short sentences, on=long-form/novel chapters)
- HIP_GRAPHS env var toggle (default on)
- Startup warmup with HIP graph capture (~5s)
- CORS support for browser extension requests
- RTF: 0.9-1.5x on AMD RX 7900 XTX (gfx1100, ROCm 6.3)

Performance vs baseline (CPU-only, ~3 min/sentence):
  12c: 3.2s | 44c: 2.7s | 115c: 6.6s

2026-03-25 21:18:42 -07:00

qwen3-proxy
.gitignore
README.md
setup_qwen3_readaloud.sh

README.md

qwen3-tts-ra

Qwen3-TTS with Read-Aloud browser extension integration.

Components

qwen3-proxy/ — OpenAI-compatible TTS proxy (POST /audio/speech)
Qwen3-TTS/ — Qwen3-TTS library (submodule / clone)
read-aloud/ — Read-Aloud browser extension (submodule / clone)
setup_qwen3_readaloud.sh — Initial environment setup script

Architecture

Read-Aloud extension
  → POST http://localhost:5000/audio/speech
    → qwen3-proxy/app.py (Flask, OpenAI-compatible API)
      → faster-qwen3-tts (HIP graph acceleration, AMD gfx1100)
        → GPU: LLM token generation at ~1.78x RTF
        → CPU: speech tokenizer decode (bypasses MIOpen)
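
As a rough illustration of the first hop in the chain above, the following Python sketch builds the kind of request the extension would send. The JSON field names (`model`, `voice`, `input`) follow the OpenAI speech API; which of them `app.py` actually reads is an assumption, as are the helper names `build_speech_request` and `synthesize`. The response is treated as opaque audio bytes because the README does not state the output format.

```python
import json
import urllib.request

# Endpoint taken from the architecture diagram above.
PROXY_URL = "http://localhost:5000/audio/speech"


def build_speech_request(text, voice="alloy", model="tts-1", url=PROXY_URL):
    """Build (but do not send) an OpenAI-style speech request (hypothetical helper)."""
    # Assumption: the proxy accepts the same JSON fields as the OpenAI speech API.
    payload = {"model": model, "voice": voice, "input": text}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)
```

With the proxy running, `synthesize("Hello world.", "hello.audio")` would write whatever bytes the proxy returns and report their length.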

Performance (AMD Radeon RX 7900 XTX, gfx1100)

Input	Audio length	Generation time	RTF
12 chars ("Hello world.")	~2s	~3s	~0.9x
44-char sentence	~4s	~3s	~1.5x
115-char paragraph	~10s	~7s	~1.5x

RTF here is audio duration divided by generation time, so RTF > 1.0 means audio is generated faster than real time.
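
The ratio can be computed as in the short sketch below. The function name is hypothetical, and the inputs are illustrative rounded figures (audio lengths from the table, timings from the commit message), not new measurements.

```python
def real_time_factor(audio_seconds, generation_seconds):
    """Return audio duration divided by generation time (values above 1.0 beat real time)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Illustrative inputs: (label, audio length in s, generation time in s).
for label, audio_s, gen_s in [("44 chars", 4.0, 2.7), ("115 chars", 10.0, 6.6)]:
    print(f"{label}: RTF = {real_time_factor(audio_s, gen_s):.2f}x")
```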

Key optimisations

HIP Graphs (faster-qwen3-tts) — captures autoregressive decode loop as a static GPU program, eliminating Python overhead per token
CPU speech decoder — moves speech_tokenizer.model to CPU, bypassing MIOpen's slow ConvDirectNaiveConvFwd fallback entirely
attn_implementation=sdpa — PyTorch native SDPA for transformer attention
MIOPEN_USER_DB_PATH — persistent MIOpen find-DB for LLM-side convolutions
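
A minimal sketch of wiring up the last item, assuming the variable is set from Python rather than in the systemd unit. `configure_miopen_cache` and its directory argument are hypothetical; the only name taken from the list above is the `MIOPEN_USER_DB_PATH` environment variable. To be safe, it should run before `torch` is imported so that MIOpen sees the value when it starts up.

```python
import os
from pathlib import Path


def configure_miopen_cache(cache_dir):
    """Point MIOpen's user find-DB at a persistent directory (hypothetical helper)."""
    # Respect a value that is already set, for example by the systemd unit.
    path = Path(os.environ.get("MIOPEN_USER_DB_PATH", cache_dir)).expanduser()
    # Create the directory up front so it exists before MIOpen first uses it.
    path.mkdir(parents=True, exist_ok=True)
    os.environ["MIOPEN_USER_DB_PATH"] = str(path)
    return str(path)
```

Keeping the directory outside any temporary location lets tuning results survive service restarts, which is the point of a persistent find-DB.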

Setup

# Install Python venv + deps
./setup_qwen3_readaloud.sh

# Start the proxy service
systemctl --user start qwen3-tts-proxy.service

# Watch logs
journalctl --user -u qwen3-tts-proxy.service -f
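
After starting the service, a script can wait for the port to open before sending requests. The sketch below only checks that something accepts TCP connections on the default port; whether the proxy starts listening before or after its warmup is not stated here, so treat a successful check as "reachable", not necessarily "warmed up". `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=5000, timeout=60.0, interval=0.5):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```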

Read-Aloud Extension Settings

In Read-Aloud → Settings → OpenAI:

Field	Value
URL	`http://127.0.0.1:5000`
API Key	(leave blank)
Voice list	see below

[
  {"voice": "alloy",   "lang": "en-US", "model": "tts-1"},
  {"voice": "echo",    "lang": "en-US", "model": "tts-1"},
  {"voice": "fable",   "lang": "en-US", "model": "tts-1"},
  {"voice": "onyx",    "lang": "en-US", "model": "tts-1"},
  {"voice": "nova",    "lang": "zh-CN", "model": "tts-1"},
  {"voice": "shimmer", "lang": "zh-CN", "model": "tts-1"}
]
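
If the list needs to change often, it can be generated instead of edited by hand. This is a small sketch using only the voice names, languages, and model shown above; `VOICE_LANGS` and `build_voice_list` are hypothetical names.

```python
import json

# Voice name -> language tag, copied from the list above.
VOICE_LANGS = {
    "alloy": "en-US",
    "echo": "en-US",
    "fable": "en-US",
    "onyx": "en-US",
    "nova": "zh-CN",
    "shimmer": "zh-CN",
}


def build_voice_list(voice_langs, model="tts-1"):
    """Return the voice list in the shape shown above."""
    return [
        {"voice": voice, "lang": lang, "model": model}
        for voice, lang in voice_langs.items()
    ]


# Paste the printed JSON into the extension's voice list field.
print(json.dumps(build_voice_list(VOICE_LANGS), indent=2))
```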

Env vars (systemd service)

Variable	Default	Notes
`QWEN_MODEL`	`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`	HF model id or local path
`DEVICE`	`cuda:0`	GPU device (ROCm builds of PyTorch expose AMD GPUs as `cuda` devices)
`HIP_GRAPHS`	`1`	Enable faster-qwen3-tts HIP graphs
`AOTRITON`	`0`	AOTriton flash attention — faster for long text (>80 chars), slower for short sentences
`PROXY_PORT`	`5000`	Listening port
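
One way a service could read these variables with the defaults from the table is sketched below. This is not the proxy's actual code: `ProxySettings`, `load_settings`, and the rule for parsing `0`/`1` flags are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxySettings:
    qwen_model: str
    device: str
    hip_graphs: bool
    aotriton: bool
    proxy_port: int


def _flag(value):
    """Interpret common truthy strings; the accepted spellings are an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read settings from `env` (defaults to os.environ) using the table's defaults."""
    env = os.environ if env is None else env
    return ProxySettings(
        qwen_model=env.get("QWEN_MODEL", "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice"),
        device=env.get("DEVICE", "cuda:0"),
        hip_graphs=_flag(env.get("HIP_GRAPHS", "1")),
        aotriton=_flag(env.get("AOTRITON", "0")),
        proxy_port=int(env.get("PROXY_PORT", "5000")),
    )
```

Passing a plain dict, for example `load_settings({"AOTRITON": "1"})`, makes it easy to try the long-form configuration without touching the real environment.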