- Created AGENTS.md with architecture documentation - Fixed race conditions and async patterns - Added conversation history to LLM prompts - Fixed TTS audio shape handling - Added buffer limits and graceful shutdown - Fixed client.py with file sending support - Removed duplicate requirements - Added .gitignore
1.7 KiB
1.7 KiB
Audio Chat
Running
uvicorn main:app --host 0.0.0.0 --port 8000
Client (for testing):
python client.py
Architecture
Single-process FastAPI server. On each WebSocket connection a new AudioSession is created with three engines:
| Module | Purpose | Model (default) |
|---|---|---|
engine/stt.py |
Speech-to-text | Systran/faster-whisper-large-v3 |
engine/llm.py |
LLM response generation | Qwen/Qwen2.5-7B-Instruct |
engine/tts.py |
Text-to-speech | facebook/mms-tts-rus |
Models are loaded lazily on first use if initialize() was not called. STT always runs in Russian (language="ru" with VAD).
WebSocket Protocol
| Direction | Format | Meaning |
|---|---|---|
| Client → Server | b"A" + PCM data |
Send audio chunk |
| Client → Server | b"R" |
Reset conversation |
| Server → Client | b"O" + WAV bytes |
LLM response as audio |
| Server → Client | "TEXT:<transcription>" |
Recognized speech |
Audio format: 16-bit PCM mono, 16 kHz input / 24 kHz output.
Configuration
All settings via .env (loaded by config.py). Key vars:
DEVICE—"cuda"or"cpu"(default"auto")AUDIO_BUFFER_SECONDS/CHUNK_SIZE— silence detection thresholdsLLM_MAX_TOKENS/LLM_TEMPERATURE— generation parameters
Gotchas
- No test suite or linting configured.
- Models download on first use; ensure network access to HuggingFace.
AudioSessionholds conversation history (last 6 turns) in memory — each WebSocket reconnect resets it.- Thread pool executor is fixed at 2 workers; concurrent heavy requests will queue.
- TTS pipeline falls back to CPU (
device=-1) if GPU initialization fails silently.