
noturum 1edfd5d62f Initial commit: audio-chat with fixes

- Created AGENTS.md with architecture documentation
- Fixed race conditions and async patterns
- Added conversation history to LLM prompts
- Fixed TTS audio shape handling
- Added buffer limits and graceful shutdown
- Fixed client.py with file sending support
- Removed duplicate requirements
- Added .gitignore

2026-05-01 13:01:06 +00:00

1.7 KiB


Audio Chat

Running

uvicorn main:app --host 0.0.0.0 --port 8000

Client (for testing):

python client.py

Architecture

Single-process FastAPI server. On each WebSocket connection a new AudioSession is created with three engines:

| Module | Purpose | Model (default) |
|---|---|---|
| `engine/stt.py` | Speech-to-text | Systran/faster-whisper-large-v3 |
| `engine/llm.py` | LLM response generation | Qwen/Qwen2.5-7B-Instruct |
| `engine/tts.py` | Text-to-speech | facebook/mms-tts-rus |

Each engine loads its model lazily on first use if initialize() was not called beforehand. STT always transcribes in Russian (language="ru") with VAD enabled.
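The lazy-loading behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `engine/`: the `LazyEngine` class, its `loader` argument, and `fake_loader` are hypothetical names, and the lock is an assumption about how concurrent first use might be guarded.

```python
import threading


class LazyEngine:
    """Hypothetical wrapper: loads a model on first use unless initialize() ran earlier."""

    def __init__(self, loader):
        self._loader = loader          # callable that builds the model (assumed)
        self._model = None
        self._lock = threading.Lock()  # prevents two threads from loading at once

    def initialize(self):
        # Eager path: call at startup to pay the load cost up front.
        with self._lock:
            if self._model is None:
                self._model = self._loader()
        return self._model

    @property
    def model(self):
        # Lazy path: the first access triggers the load if initialize() was skipped.
        if self._model is None:
            self.initialize()
        return self._model


load_count = 0


def fake_loader():
    # Stand-in for an expensive model load.
    global load_count
    load_count += 1
    return "model"


engine = LazyEngine(fake_loader)
first = engine.model   # triggers the load
second = engine.model  # reuses the already loaded model
```

Calling `initialize()` at startup moves the load cost away from the first request; skipping it keeps startup fast at the price of a slow first call.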

WebSocket Protocol

| Direction | Format | Meaning |
|---|---|---|
| Client → Server | `b"A" + PCM data` | Send audio chunk |
| Client → Server | `b"R"` | Reset conversation |
| Server → Client | `b"O" + WAV bytes` | LLM response as audio |
| Server → Client | `"TEXT:<transcription>"` | Recognized speech |
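A minimal sketch of how the frames in the table above could be built and parsed. The function names are hypothetical and the error handling for unknown tags is an assumption; the actual server and client.py may behave differently.

```python
RESET_FRAME = b"R"  # client -> server: reset conversation


def encode_audio_chunk(pcm: bytes) -> bytes:
    """Client -> server: prefix raw PCM with the b"A" tag."""
    return b"A" + pcm


def parse_client_frame(frame: bytes):
    """Return (kind, payload) for a binary frame sent by the client."""
    if not frame:
        raise ValueError("empty frame")
    tag, payload = frame[:1], frame[1:]
    if tag == b"A":
        return "audio", payload
    if tag == b"R":
        return "reset", b""
    raise ValueError(f"unknown client tag {tag!r}")


def parse_server_message(message):
    """Return (kind, payload) for a message sent by the server.

    Text messages carry the transcription; binary messages carry WAV audio.
    """
    if isinstance(message, str):
        if message.startswith("TEXT:"):
            return "text", message[len("TEXT:"):]
        raise ValueError("unknown text message")
    if message[:1] == b"O":
        return "audio", message[1:]
    raise ValueError("unknown binary message")
```

Slicing with `frame[:1]` rather than indexing with `frame[0]` keeps the tag as `bytes`, so it can be compared directly with `b"A"` and `b"R"`.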

Audio format: 16-bit PCM mono, 16 kHz input / 24 kHz output.
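To make the audio format concrete, here is a small sketch using only the standard library. It wraps 16-bit mono PCM in a WAV container at the 24 kHz output rate and computes the duration of a 16 kHz input chunk. Both helper names are hypothetical; the project's own TTS code may build WAV bytes differently.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # 16-bit PCM


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)                 # mono
        wav.setsampwidth(BYTES_PER_SAMPLE)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def chunk_duration_seconds(pcm: bytes, sample_rate: int = 16000) -> float:
    """Duration of a raw 16-bit mono PCM chunk at the given sample rate."""
    return len(pcm) / (BYTES_PER_SAMPLE * sample_rate)


one_second_of_silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav(one_second_of_silence)
```

At 16 kHz input, one second of audio is 32,000 bytes; at 24 kHz output it is 48,000 bytes plus the WAV header.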

Configuration

All settings are read from .env (loaded by config.py). Key variables:

- DEVICE: "cuda", "cpu", or "auto" (default "auto")
- AUDIO_BUFFER_SECONDS / CHUNK_SIZE: silence detection thresholds
- LLM_MAX_TOKENS / LLM_TEMPERATURE: generation parameters
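A sketch of how these variables might be read, assuming they end up in the process environment after .env is loaded. This is not the contents of config.py: the function names are hypothetical, the numeric defaults (256 and 0.7) are placeholders, and resolving "auto" to "cuda" when a GPU is available is an assumption.

```python
import os


def load_settings(env=None):
    """Read the key settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        "device": env.get("DEVICE", "auto"),
        # The two defaults below are placeholders, not the project's real values.
        "llm_max_tokens": int(env.get("LLM_MAX_TOKENS", "256")),
        "llm_temperature": float(env.get("LLM_TEMPERATURE", "0.7")),
    }


def resolve_device(device: str, cuda_available: bool) -> str:
    """Assumed behavior: "auto" picks the GPU when one is available."""
    if device == "auto":
        return "cuda" if cuda_available else "cpu"
    return device


settings = load_settings({"DEVICE": "cpu", "LLM_MAX_TOKENS": "128"})
```

Passing the mapping explicitly, as in the last line, makes the parsing easy to exercise without touching the real environment.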

Gotchas

- No test suite or linting is configured.
- Models download on first use; ensure network access to HuggingFace.
- AudioSession holds conversation history (last 6 turns) in memory, so each WebSocket reconnect resets it.
- The thread pool executor is fixed at 2 workers, so concurrent heavy requests will queue.
- The TTS pipeline silently falls back to CPU (device=-1) if GPU initialization fails.
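The bounded, in-memory history mentioned above can be illustrated with a `collections.deque`. This is a sketch under assumptions: `ConversationHistory` is a hypothetical class, and it counts each message as one turn, which may not match how AudioSession counts turns.

```python
from collections import deque


class ConversationHistory:
    """Hypothetical per-session history that keeps only the most recent turns."""

    def __init__(self, max_turns: int = 6):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self._turns.append({"role": role, "content": text})

    def reset(self) -> None:
        # Corresponds to the b"R" reset frame or a fresh connection.
        self._turns.clear()

    def as_messages(self) -> list:
        return list(self._turns)


history = ConversationHistory()
for i in range(8):
    history.add("user", f"message {i}")
messages = history.as_messages()  # only the last 6 entries remain
```

Because the history lives on the session object, nothing survives a reconnect; persisting it would need external storage keyed by some client identifier.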
