Building Call-Thing: A 7-Month AI Phone Agent Post-Mortem
Over seven months, I built Call-Thing, a fully autonomous AI phone agent for handling restaurant reservations. It dialed out via SIP, listened, transcribed, generated LLM responses, and spoke back in real time.
Then OpenAI released Voice Mode, and I killed the project.
Here is a look under the hood of what I built before a single API update made it obsolete.
The Architecture: 7 Parallel Processes
To achieve real-time conversational speeds, I split the pipeline into seven parallel worker processes communicating over TCP sockets with a custom 10-byte-header IPC protocol.
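The framing itself is simple. Here is a minimal Python sketch, assuming a header layout of message type, worker id, payload length, and sequence number — the post only fixes the 10-byte size, so the field layout is my guess:

```python
import struct

# Hypothetical 10-byte header: 1-byte message type, 1-byte worker id,
# 4-byte payload length, 4-byte sequence number, network byte order.
HEADER = struct.Struct("!BBII")  # 1 + 1 + 4 + 4 = 10 bytes

def frame(msg_type: int, worker_id: int, seq: int, payload: bytes) -> bytes:
    """Prefix a payload with the fixed-size header for the TCP socket."""
    return HEADER.pack(msg_type, worker_id, len(payload), seq) + payload

def unframe(data: bytes):
    """Split a received buffer back into header fields and payload."""
    msg_type, worker_id, length, seq = HEADER.unpack(data[:HEADER.size])
    return msg_type, worker_id, seq, data[HEADER.size:HEADER.size + length]
```

A fixed-size header like this lets each worker read exactly 10 bytes, learn the payload length, then read exactly that many more bytes — no delimiter scanning on a hot audio path.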
- Audio Capture: Baresip streaming 48 kHz Opus audio.
- VAD & Silence Detection: Silero VAD processing 400ms chunks, paired with a custom silence detector to stop Whisper from hallucinating on dead air.
- Transcription & Agentic LLM: Whisper transcribed speech to text, passing it to a fine-tuned GPT-3.5-turbo. With function calling wired in for live reservation modifications, it was essentially doing "agentic AI" before everyone started using the buzzword.
- Caching: Qdrant checked for semantic similarity (>0.85) to serve cached responses, saving both latency and cost.
- TTS: Google Cloud TTS generated audio for immediate SIP playback.
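The silence gate in front of Whisper is conceptually simple: at 48 kHz, a 400 ms chunk is 19,200 samples, and a chunk below some energy floor never reaches the transcriber, so it can't hallucinate text out of dead air. A rough sketch — the RMS-threshold approach and the floor value are my assumptions, not the actual detector:

```python
SAMPLE_RATE = 48_000
CHUNK_MS = 400
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 19,200 samples per chunk

def rms(samples: list[int]) -> float:
    """Root-mean-square energy of a chunk of 16-bit PCM samples."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_silence(samples: list[int], floor: float = 500.0) -> bool:
    """Chunks below the energy floor are dropped before transcription.
    The floor is a hypothetical value; in practice it would be tuned
    against real line noise from the SIP trunk."""
    return rms(samples) < floor
```

In the real pipeline this ran alongside Silero VAD, which handles the harder problem of distinguishing speech from non-silent noise.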
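Stripped of the Qdrant client, the cache lookup reduces to a nearest-neighbor search with a similarity cutoff. A dependency-free sketch under stated assumptions — cosine similarity as the metric, and embeddings produced elsewhere by whatever model populated the cache:

```python
import math

SIM_THRESHOLD = 0.85  # the cutoff from the pipeline above

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup(query_vec: list[float], cache: list[tuple[list[float], str]]):
    """cache holds (embedding, cached_response) pairs. Returns the cached
    response if the best match clears the threshold, else None (meaning
    the query must go to the LLM)."""
    best_score, best_resp = 0.0, None
    for vec, resp in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_resp = score, resp
    return best_resp if best_score > SIM_THRESHOLD else None
```

A cache hit skips the entire LLM round trip and the TTS generation can be cached alongside it, which is where the latency and cost savings come from.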
The Hardest Battles
Python's GIL: Threading could not beat the Global Interpreter Lock — Python threads time-share a single interpreter, so the CPU-bound stages serialized instead of overlapping. Switching to multiprocessing in spawn mode (required for CUDA, since a forked child cannot safely reuse the parent's CUDA context) was the breakthrough that finally enabled low-latency conversations.
Local vs. Cloud: After weeks of testing local fine-tunes (Gemma, Mistral) on WSL, I pivoted to GPT-3.5-turbo. Relying on local models meant downtime whenever my PC in Germany was turned off—a dealbreaker for availability.
The TTS Rabbit Hole: I spent a good chunk of time trying to get Coqui TTS running locally for faster response times. The quality was promising, but getting it to produce reliable, low-latency output in a real-time pipeline was a battle I eventually lost—Google Cloud TTS won on consistency.
The Docker Diet: Packing ML models and audio libraries together initially resulted in a massive 65GB Docker image. Shrinking that down to 15GB through multi-stage builds and strict dependency pruning was a war story in itself.
The End
By the end, Call-Thing was deployed on Scaleway. A TypeScript Job Coordinator watched a Firestore queue, spawning parallel Docker containers on demand. It supported English, German, Spanish, and Dutch, autonomously detecting and switching languages at runtime.
It was an incredible technical gauntlet—right up until OpenAI's Voice Mode turned seven months of latency optimization and pipeline orchestration into a few lines of code. May it rest in peace.
Maybe one day I'll resurrect it using ElevenLabs or Gemini Live.