I. Introduction
Voice agents are more than a cool demo: they're being used in production across support, sales, and automation. At a high level, a real-time voice agent consists of:
- ASR (Automatic Speech Recognition)
- LLM (Language Model)
- TTS (Text-to-Speech)
But in practice, building one involves deep trade-offs across latency, orchestration, infrastructure, and tooling. This post shares a practitioner's take on how to build and optimize a voice agent, and how to decide between platforms, frameworks, and custom implementations.
II. Core Components & Considerations
1. ASR (Speech-to-Text)
- Providers like Deepgram are excellent for low latency (~150–200ms).
- Challenges:
- Voice Activity Detection (VAD) is hard to tune:
- When do you stop listening?
- Too early = you cut off the user
- Too late = extra silence, perceived lag
- How to detect speech in noisy environments?
- Background noise can be mistaken for active speech (and vice versa)
- Noise suppression helps, but adds to overall latency
- Turn-taking: when should the agent start speaking, and when should it yield to an interruption?
- Tips:
- Combine an energy threshold with pause-based VAD (a minimal sketch follows this list).
- Identify speech "stop words" to signal end of turn.
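
To make the first tip concrete, here is a minimal sketch of an energy-threshold VAD combined with a pause timer for end-of-turn detection. The frame size, RMS threshold, and pause length below are illustrative assumptions, not tuned values; in production you would calibrate them against your microphone and noise floor.

```python
import math
import struct

FRAME_MS = 20              # audio frame size in milliseconds (assumed)
ENERGY_THRESHOLD = 500.0   # RMS threshold; calibrate per mic/environment
END_OF_TURN_MS = 700       # silence long enough to end the user's turn

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

class PauseVAD:
    """Energy-threshold VAD plus a pause timer for end-of-turn detection."""

    def __init__(self) -> None:
        self.in_speech = False
        self.silence_ms = 0

    def feed(self, frame: bytes) -> str:
        """Classify one frame as 'speech', 'silence', or 'end_of_turn'."""
        if rms(frame) >= ENERGY_THRESHOLD:
            self.in_speech = True
            self.silence_ms = 0
            return "speech"
        if self.in_speech:
            self.silence_ms += FRAME_MS
            if self.silence_ms >= END_OF_TURN_MS:
                # A long enough pause after speech: treat the turn as over.
                self.in_speech = False
                self.silence_ms = 0
                return "end_of_turn"
        return "silence"
```

Note that the silence timer only runs after speech has started, which avoids ending the turn during the quiet before the user begins speaking.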
2. LLM (Understanding & Response)
- Longer context (more tokens per request) = higher inference time.
- Optimizations:
- Prompt caching
- Template reuse
- Limit context history
- Key metric: time to first token (TTFT). Don't just measure full response time.
- Streaming inference can significantly improve perceived latency (see the sketch below, which combines streaming, TTFT measurement, and history trimming).
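
Here is a minimal sketch tying these ideas together: it caps context history and measures time to first token on a streaming completion. It assumes the OpenAI Python SDK purely for illustration; `MAX_HISTORY_TURNS`, the system prompt, and the model name are placeholder choices, and the same pattern applies to any provider with a streaming API.

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM = {"role": "system", "content": "You are a concise voice assistant."}
MAX_HISTORY_TURNS = 6  # illustrative cap on how many past messages to send

def respond(history: list[dict], user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # Limit context history: system prompt plus only the most recent turns.
    messages = [SYSTEM] + history[-MAX_HISTORY_TURNS:]

    start = time.perf_counter()
    first_token_at = None
    parts: list[str] = []
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            parts.append(delta)  # hand each delta to TTS as it arrives

    if first_token_at is not None:
        print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
    reply = "".join(parts)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Because each delta can be handed to the TTS engine as it arrives, speech playback can begin at TTFT rather than after the full response completes.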
3. TTS (Text-to-Speech)