I. Introduction
Voice agents are more than a cool demo: they're being used in production across support, sales, and automation. At a high level, a real-time voice agent consists of:
- ASR (Automatic Speech Recognition)
- LLM (Language Model)
- TTS (Text-to-Speech)
But in practice, building one involves deep trade-offs across latency, orchestration, infrastructure, and tooling. This post shares a practitioner's take on how to build and optimize a voice agent, and how to decide between platforms, frameworks, and custom implementations.
II. Core Components & Considerations
1. ASR (Speech-to-Text)
- Providers like Deepgram are excellent for low latency (~150–200ms).
- Challenges:
- Voice Activity Detection (VAD) is hard to tune:
- When do you stop listening?
- Too early = you cut off the user
- Too late = extra silence, perceived lag
- How to detect speech in noisy environments?
- Background noise can be mistaken for active speech (and vice versa)
- Noise suppression helps, but adds to overall latency
- Turn-taking: when should the agent start speaking, and when should it yield to an interruption?
- Tips:
- Combine an energy threshold with pause-based VAD (a minimal sketch follows this list).
- Identify speech "stop words" to signal end of turn.
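
To make the first tip concrete, here is a minimal sketch of an energy-threshold VAD combined with a pause timer for end-of-turn detection. The frame size, RMS threshold, and pause length below are illustrative assumptions, not tuned values; in production you would calibrate them against your microphone and noise floor.

```python
import math
import struct

FRAME_MS = 20              # audio frame size in milliseconds (assumed)
ENERGY_THRESHOLD = 500.0   # RMS threshold; calibrate per mic/environment
END_OF_TURN_MS = 700       # silence long enough to end the user's turn

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

class PauseVAD:
    """Energy-threshold VAD plus a pause timer for end-of-turn detection."""

    def __init__(self) -> None:
        self.in_speech = False
        self.silence_ms = 0

    def feed(self, frame: bytes) -> str:
        """Classify one frame as 'speech', 'silence', or 'end_of_turn'."""
        if rms(frame) >= ENERGY_THRESHOLD:
            self.in_speech = True
            self.silence_ms = 0
            return "speech"
        if self.in_speech:
            self.silence_ms += FRAME_MS
            if self.silence_ms >= END_OF_TURN_MS:
                # A long enough pause after speech: treat the turn as over.
                self.in_speech = False
                self.silence_ms = 0
                return "end_of_turn"
        return "silence"
```

Note that the silence timer only runs after speech has started, which avoids ending the turn during the quiet before the user begins speaking.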
2. LLM (Understanding & Response)
- Longer context (more tokens per request) = higher inference time.
- Optimizations:
- Prompt caching
- Template reuse
- Limit context history
- Key metric: time to first token (TTFT). Don't just measure full response time.
- Streaming inference can significantly improve perceived latency (see the sketch below, which combines streaming, TTFT measurement, and history trimming).
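
Here is a minimal sketch tying these ideas together: it caps context history and measures time to first token on a streaming completion. It assumes the OpenAI Python SDK purely for illustration; `MAX_HISTORY_TURNS`, the system prompt, and the model name are placeholder choices, and the same pattern applies to any provider with a streaming API.

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM = {"role": "system", "content": "You are a concise voice assistant."}
MAX_HISTORY_TURNS = 6  # illustrative cap on how many past messages to send

def respond(history: list[dict], user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # Limit context history: system prompt plus only the most recent turns.
    messages = [SYSTEM] + history[-MAX_HISTORY_TURNS:]

    start = time.perf_counter()
    first_token_at = None
    parts: list[str] = []
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            parts.append(delta)  # hand each delta to TTS as it arrives

    if first_token_at is not None:
        print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
    reply = "".join(parts)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Because each delta can be handed to the TTS engine as it arrives, speech playback can begin at TTFT rather than after the full response completes.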
3. TTS (Text-to-Speech)