Voice Agent — AI Voice Receptionist

Phone AI receptionist built on two switchable architectures: Gemini Live (native audio-in/audio-out) and a Sequential STT→LLM→TTS pipeline. Sub-800ms end-to-end latency in Live mode, with local VAD barge-in.

// AI Systems / Real-Time / Telephony

Dual architecture · local VAD barge-in · full audio resampling chain · sub-800ms latency

Built a real-time AI voice receptionist capable of handling inbound phone calls, booking appointments, and sending confirmations without human intervention.
Achieves sub-800ms end-to-end latency in Live mode — from caller speech to agent audio response.
Two switchable call handler architectures:
- LiveCallHandler: Gemini Live API — native audio-in/audio-out, no STT/TTS chaining, sub-800ms latency
- SequentialCallHandler: Deepgram STT → Gemini 2.5 Flash → Deepgram Aura TTS — reliable fallback with 4s silence reprompting
- Toggle between them via a single import swap in server.ts
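A minimal sketch of how a single import swap could select the handler, assuming both classes satisfy one shared contract. The `CallHandler` interface, its method names, and the `./handlers/...` paths in the comments are hypothetical; the source only names the two classes and server.ts.

```typescript
// Hypothetical shared contract; the real handlers' method names are not given in the source.
interface CallHandler {
  readonly mode: "live" | "sequential";
  onCallerAudio(mulawFrame: Uint8Array): void;
  close(): void;
}

class LiveCallHandler implements CallHandler {
  readonly mode = "live" as const;
  onCallerAudio(_mulawFrame: Uint8Array): void {
    // Would forward resampled caller audio to the Gemini Live WebSocket.
  }
  close(): void {}
}

class SequentialCallHandler implements CallHandler {
  readonly mode = "sequential" as const;
  onCallerAudio(_mulawFrame: Uint8Array): void {
    // Would stream caller audio to STT, then run the LLM and TTS steps.
  }
  close(): void {}
}

// In server.ts the swap would be one import line (paths are assumptions), e.g.
//   import { LiveCallHandler as ActiveCallHandler } from "./handlers/live";
//   import { SequentialCallHandler as ActiveCallHandler } from "./handlers/sequential";
// Both classes sit in one file here, so a constant stands in for that import.
const ActiveCallHandler: new () => CallHandler = LiveCallHandler;

// The rest of the server only depends on the CallHandler contract.
function handleIncomingCall(): CallHandler {
  return new ActiveCallHandler();
}
```

Because the server code only sees `CallHandler`, changing the one aliased line is enough to move every new call onto the other architecture.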
Implemented a fallback STT → LLM → TTS pipeline (Deepgram + Gemini) to support reliability and flexible deployment modes.
Local Silero VAD via @ericedouard/vad-node-realtime for barge-in with no added network latency:
- Runs inference locally, so there is no cloud round-trip and detection fires in under a millisecond
- On interrupt: sends a clear command to Twilio to drop queued audio and, at the same time, sends turnComplete: true over the Gemini WebSocket to stop the model mid-speech
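The interrupt path can be sketched as two small message builders plus one function that fires both. The Twilio `clear` event with a `streamSid` follows the Media Streams message format; the `clientContent` envelope around `turnComplete` is an assumption about how the flag is wrapped on the Gemini WebSocket, and the `Sendable` interface and function names are hypothetical.

```typescript
// Minimal stand-in for a WebSocket: anything with send(string). Hypothetical.
interface Sendable {
  send(data: string): void;
}

// Twilio Media Streams "clear" message: drops audio already queued for playback.
function buildTwilioClear(streamSid: string): string {
  return JSON.stringify({ event: "clear", streamSid });
}

// Assumed envelope for the turnComplete flag sent to Gemini Live.
function buildGeminiTurnComplete(): string {
  return JSON.stringify({ clientContent: { turnComplete: true } });
}

// Intended to be called from the VAD's speech-start callback when the
// caller talks over the agent. Both sends are issued back to back.
function handleBargeIn(twilio: Sendable, gemini: Sendable, streamSid: string): void {
  twilio.send(buildTwilioClear(streamSid));
  gemini.send(buildGeminiTurnComplete());
}
```

Keeping the builders pure makes the wire format easy to check without opening real sockets.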
Worked around Gemini Live's persistent context limitation using injected [SYSTEM STATE UPDATE] user messages after each tool execution — allowing dynamic system prompt behavior inside a stateful session.
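One way the injected message could be built is shown below. Only the `[SYSTEM STATE UPDATE]` marker and the use of a user-role message come from the description above; the function name, the key/value layout of the text, and the `clientContent` envelope are assumptions. Whether `turnComplete` should be true or false here is not stated, so it is left as a parameter.

```typescript
// Builds a user-role message that carries updated state into the live session.
// `facts` is a hypothetical bag of values produced by the last tool call.
function buildStateUpdateMessage(
  state: string,
  facts: Record<string, string>,
  turnComplete: boolean
): string {
  const factLines = Object.keys(facts).map((key) => `${key}: ${facts[key]}`);
  const text = ["[SYSTEM STATE UPDATE]", `state: ${state}`, ...factLines].join("\n");
  return JSON.stringify({
    clientContent: {
      turns: [{ role: "user", parts: [{ text }] }],
      turnComplete,
    },
  });
}

// Example: after a hypothetical availability check succeeds.
const exampleUpdate = buildStateUpdateMessage(
  "CONFIRMING",
  { slot: "2025-01-15 10:00", name: "Alex" },
  false
);
```

The model sees the update as ordinary conversation history, which is what lets it stand in for a system prompt change in the middle of a session.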
Developed a state-driven conversation engine (GREETING → COLLECTING_INFO → CONFIRMING → BOOKED) with dynamic system prompts for context-aware responses.
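The four states above can be modeled as a small transition table with one prompt per state. The state names come from the description; the event names, the rejection path back to COLLECTING_INFO, and the prompt wording are illustrative assumptions.

```typescript
type CallState = "GREETING" | "COLLECTING_INFO" | "CONFIRMING" | "BOOKED";

// Hypothetical events; the source does not name what triggers each transition.
type CallEvent = "CALLER_GREETED" | "DETAILS_COMPLETE" | "CALLER_CONFIRMED" | "CALLER_REJECTED";

const transitions: Record<CallState, Partial<Record<CallEvent, CallState>>> = {
  GREETING: { CALLER_GREETED: "COLLECTING_INFO" },
  COLLECTING_INFO: { DETAILS_COMPLETE: "CONFIRMING" },
  CONFIRMING: { CALLER_CONFIRMED: "BOOKED", CALLER_REJECTED: "COLLECTING_INFO" },
  BOOKED: {},
};

// Events that are not valid in the current state leave it unchanged.
function nextState(state: CallState, event: CallEvent): CallState {
  return transitions[state][event] ?? state;
}

// Illustrative per-state instructions appended to a base prompt.
const stateInstructions: Record<CallState, string> = {
  GREETING: "Greet the caller and ask how you can help.",
  COLLECTING_INFO: "Collect the caller's name, phone number, and preferred time.",
  CONFIRMING: "Read the details back and ask the caller to confirm.",
  BOOKED: "Tell the caller the appointment is booked and say goodbye.",
};

function systemPromptFor(state: CallState, basePrompt: string): string {
  return `${basePrompt}\nCurrent state: ${state}\n${stateInstructions[state]}`;
}
```

A table keeps the allowed transitions in one place, so adding a state means adding one row and one prompt rather than touching branching logic.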
Built end-to-end booking flow:
- captures user details via voice
- checks availability
- persists appointments (Supabase/PostgreSQL)
- sends SMS confirmations via Twilio
Full audio resampling chain across both architectures:
- Ingress: Twilio 8kHz µ-law → alawmulaw decode → PCM → upsample to 16kHz for Gemini Live ingestion
- Egress: Gemini Live outputs 24kHz PCM → downsample → 8kHz µ-law → chunked into 160-byte buffers for Twilio's 20ms frames
- Sequential path: Deepgram nova-2 STT → Gemini 2.5 Flash text → Deepgram Aura TTS → same µ-law chunking
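The Live-mode chain above can be sketched end to end with hand-written helpers. This is an illustration under stated assumptions, not the project's code: the µ-law codec is a plain G.711 implementation standing in for alawmulaw, the resamplers are naive (linear interpolation up, 3-sample averaging down) where a production path would likely use a proper low-pass filter, and padding the final frame with µ-law silence is a choice made here. All function names are hypothetical.

```typescript
// G.711 mu-law byte -> 16-bit linear PCM sample.
function muLawDecode(byte: number): number {
  const u = ~byte & 0xff;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) !== 0 ? -magnitude : magnitude;
}

// 16-bit linear PCM sample -> G.711 mu-law byte.
function muLawEncode(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0;
  const biased = Math.min(Math.abs(sample), 32635) + 0x84;
  let exponent = 7;
  for (let mask = 0x4000; (biased & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (biased >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// 8 kHz -> 16 kHz by inserting the midpoint between neighbouring samples.
function upsample2x(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const cur = input[i];
    const next = i + 1 < input.length ? input[i + 1] : cur;
    out[2 * i] = cur;
    out[2 * i + 1] = (cur + next) >> 1;
  }
  return out;
}

// 24 kHz -> 8 kHz by averaging each group of three samples.
function downsample3x(input: Int16Array): Int16Array {
  const out = new Int16Array(Math.floor(input.length / 3));
  for (let i = 0; i < out.length; i++) {
    out[i] = Math.round((input[3 * i] + input[3 * i + 1] + input[3 * i + 2]) / 3);
  }
  return out;
}

// Ingress: Twilio 8 kHz mu-law bytes -> 16 kHz PCM for Gemini Live.
function twilioToGeminiPcm(mulaw: Uint8Array): Int16Array {
  const pcm8k = Int16Array.from(mulaw, muLawDecode);
  return upsample2x(pcm8k);
}

// Egress: 24 kHz PCM -> 8 kHz mu-law, split into 160-byte frames.
// 160 bytes at 8000 one-byte samples per second is exactly 20 ms.
function geminiPcmToTwilioFrames(pcm24k: Int16Array, frameSize = 160): Uint8Array[] {
  const mulaw = Uint8Array.from(downsample3x(pcm24k), muLawEncode);
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < mulaw.length; offset += frameSize) {
    const frame = new Uint8Array(frameSize).fill(0xff); // 0xff is mu-law silence
    frame.set(mulaw.subarray(offset, offset + frameSize));
    frames.push(frame);
  }
  return frames;
}
```

Each frame would then be base64-encoded into a Twilio media message; in the Sequential path only the final encode-and-chunk step applies, since Deepgram handles the audio on both sides of the LLM.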
Managed session state and real-time interactions using Upstash Redis with automatic cleanup on call termination.
Designed the system for production readiness with scalable WebSocket handling and extensible tool-based architecture.

// tech

Twilio · Gemini Live · Deepgram · Hono · WebSockets · Supabase · Redis · Node.js · Silero VAD · alawmulaw · TypeScript