feat: Qwen3-TTS proxy with HIP graph + CPU decoder optimisations

- OpenAI-compatible Flask proxy (POST /audio/speech, GET /models)
- faster-qwen3-tts HIP graph acceleration: GPU LLM at 1.78x RTF
- CPU speech tokenizer decoder: bypasses MIOpen ConvDirectNaiveConvFwd,
  eliminates 4-40s per-request decode overhead
- attn_implementation=sdpa for transformer attention
- AOTRITON env var toggle (off=short sentences, on=long-form/novel chapters)
- HIP_GRAPHS env var toggle (default on)
- Startup warmup with HIP graph capture (~5s)
- CORS support for browser extension requests
- RTF: 0.9-1.5x on AMD RX 7900 XTX (gfx1100, ROCm 6.3)
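
The two env var toggles listed above could be read along these lines. This is only a sketch, not the code in this commit: the helper `env_flag`, the accepted spellings, and the `AOTRITON` default of off are assumptions (the message states a default only for `HIP_GRAPHS`), and the variable names are taken from the bullets as written.

```python
import os

# Accepted spellings are an assumption; the commit does not list them.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_flag(name, default, environ=None):
    """Return a boolean for env var `name`, falling back to `default`.

    `environ` is injectable so the helper can be exercised without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default  # unrecognised value: keep the default


# HIP_GRAPHS defaults to on per the commit message.
USE_HIP_GRAPHS = env_flag("HIP_GRAPHS", default=True)
# The AOTRITON default is not stated in the commit; off is assumed here.
USE_AOTRITON = env_flag("AOTRITON", default=False)
```

Falling back to the default on an unrecognised value keeps a typo in the service environment from silently flipping a toggle.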

Performance vs baseline (CPU-only, ~3 min/sentence):
  12c: 3.2s | 44c: 2.7s | 115c: 6.6s
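
For a rough sense of scale, the timings above can be divided into the baseline. This sketch assumes the approximate "~3 min/sentence" figure (taken as 180 s) applies equally to each listed input, which the message does not state; the labels are copied verbatim and `speedup` is a hypothetical helper.

```python
BASELINE_SECONDS = 180.0  # "~3 min/sentence" from the commit message, approximate

# Labels and timings copied from the commit message.
timings = {"12c": 3.2, "44c": 2.7, "115c": 6.6}


def speedup(baseline_s, measured_s):
    """Ratio of baseline time to measured time (higher means faster)."""
    return baseline_s / measured_s


for label, seconds in timings.items():
    ratio = speedup(BASELINE_SECONDS, seconds)
    print(f"{label}: {seconds}s, roughly {ratio:.0f}x faster than baseline")
```

Because the baseline is itself an estimate, the resulting ratios are order-of-magnitude indications rather than measurements.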
2026-03-25 21:18:42 -07:00
commit d3ca5ab0b2
5 changed files with 627 additions and 0 deletions

.gitignore

@@ -0,0 +1,49 @@
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/
*.egg
.eggs/
# Virtual envs
venv/
.venv/
env/
*.venv
# Model weights / audio output
*.wav
*.mp3
*.bin
*.safetensors
*.pt
*.pth
# HuggingFace cache
.cache/
# Test artifacts
test_output.*
test_simple.py
# OS
.DS_Store
Thumbs.db
# IDE
.vscode/
.idea/
*.swp
*.swo
# Submodule source trees (large, checked out separately)
Qwen3-TTS/
read-aloud/
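# Systemd units are user-specific, generated by setup script (gitignore does not expand variables; the pattern below matches a directory literally named "${HOME_DIR}")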
${HOME_DIR}/