Initial commit: audio-chat with fixes
- Created AGENTS.md with architecture documentation - Fixed race conditions and async patterns - Added conversation history to LLM prompts - Fixed TTS audio shape handling - Added buffer limits and graceful shutdown - Fixed client.py with file sending support - Removed duplicate requirements - Added .gitignore
This commit is contained in:
51
AGENTS.md
Normal file
51
AGENTS.md
Normal file
@@ -0,0 +1,51 @@
|
||||
# Audio Chat
|
||||
|
||||
## Running
|
||||
|
||||
```bash
|
||||
uvicorn main:app --host 0.0.0.0 --port 8000
|
||||
```
|
||||
|
||||
Client (for testing):
|
||||
```bash
|
||||
python client.py
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
Single-process FastAPI server. On each WebSocket connection a new `AudioSession` is created with three engines:
|
||||
|
||||
| Module | Purpose | Model (default) |
|
||||
|--------|---------|-----------------|
|
||||
| `engine/stt.py` | Speech-to-text | Systran/faster-whisper-large-v3 |
|
||||
| `engine/llm.py` | LLM response generation | Qwen/Qwen2.5-7B-Instruct |
|
||||
| `engine/tts.py` | Text-to-speech | facebook/mms-tts-rus |
|
||||
|
||||
Models are loaded lazily on first use if `initialize()` was not called. STT always runs in Russian (`language="ru"` with VAD).
|
||||
|
||||
## WebSocket Protocol
|
||||
|
||||
| Direction | Format | Meaning |
|
||||
|-----------|--------|---------|
|
||||
| Client → Server | `b"A" + PCM data` | Send audio chunk |
|
||||
| Client → Server | `b"R"` | Reset conversation |
|
||||
| Server → Client | `b"O" + WAV bytes` | LLM response as audio |
|
||||
| Server → Client | `"TEXT:<transcription>"` | Recognized speech |
|
||||
|
||||
Audio format: 16-bit PCM mono, 16 kHz input / 24 kHz output.
|
||||
|
||||
## Configuration
|
||||
|
||||
All settings via `.env` (loaded by `config.py`). Key vars:
|
||||
|
||||
- `DEVICE` — `"cuda"` or `"cpu"` (default `"auto"`)
|
||||
- `AUDIO_BUFFER_SECONDS` / `CHUNK_SIZE` — silence detection thresholds
|
||||
- `LLM_MAX_TOKENS` / `LLM_TEMPERATURE` — generation parameters
|
||||
|
||||
## Gotchas
|
||||
|
||||
- No test suite or linting configured.
|
||||
- Models download on first use; ensure network access to HuggingFace.
|
||||
- `AudioSession` holds conversation history (last 6 turns) in memory — each WebSocket reconnect resets it.
|
||||
- Thread pool executor is fixed at 2 workers; concurrent heavy requests will queue.
|
||||
- TTS pipeline falls back to CPU (`device=-1`) if GPU initialization fails silently.
|
||||
Reference in New Issue
Block a user