Voice Agent — AI Voice Receptionist

Phone AI receptionist built on two switchable architectures: Gemini Live (native audio-in/audio-out) and a sequential STT→LLM→TTS pipeline. Sub-800ms end-to-end latency in Live mode, with local VAD barge-in.


// AI Systems / Real-Time / Telephony
Voice Agent UI: active call, real-time transcription, booking and SMS tool flows
Dual architecture · local VAD barge-in · full audio resampling chain · sub-800ms latency
  • Built a real-time AI voice receptionist capable of handling inbound phone calls, booking appointments, and sending confirmations without human intervention.
  • Achieves sub-800ms end-to-end latency in Live mode — from caller speech to agent audio response.
  • Two switchable call handler architectures:
    • LiveCallHandler: Gemini Live API — native audio-in/audio-out, no STT/TTS chaining, sub-800ms latency
    • SequentialCallHandler: Deepgram STT → Gemini 2.5 Flash → Deepgram Aura TTS — reliable fallback with 4s silence reprompting
    • Toggle between them via a single import swap in server.ts
  • Implemented a fallback STT → LLM → TTS pipeline (Deepgram + Gemini) to support reliability and flexible deployment modes.
  • Local Silero VAD via @ericedouard/vad-node-realtime for zero-latency barge-in:
    • Runs inference locally: with no cloud round-trip, speech detection fires in under a millisecond
    • On interrupt: sends a clear command to Twilio to drop queued audio and, at the same time, sends turnComplete: true over the Gemini WebSocket to stop the model mid-speech
  • Worked around Gemini Live's persistent session context (the system instruction is fixed at session setup) by injecting [SYSTEM STATE UPDATE] user messages after each tool execution, which gives dynamic system-prompt behavior inside a stateful session.
  • Developed a state-driven conversation engine (GREETING → COLLECTING_INFO → CONFIRMING → BOOKED) with dynamic system prompts for context-aware responses.
  • Built end-to-end booking flow:
    • captures user details via voice
    • checks availability
    • persists appointments (Supabase/PostgreSQL)
    • sends SMS confirmations via Twilio
  • Full audio resampling chain across both architectures:
    • Ingress: Twilio 8kHz µ-law → alawmulaw decode → PCM → upsample to 16kHz for Gemini Live ingestion
    • Egress: Gemini Live outputs 24kHz PCM → downsample → 8kHz µ-law → chunked into 160-byte buffers for Twilio's 20ms frames
    • Sequential path: Deepgram nova-2 STT → Gemini 2.5 Flash text → Deepgram Aura TTS → same µ-law chunking
  • Managed session state and real-time interactions using Upstash Redis with automatic cleanup on call termination.
  • Designed the system for production readiness with scalable WebSocket handling and extensible tool-based architecture.
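
The barge-in step listed above (a clear command to Twilio plus turnComplete: true to Gemini) could look roughly like the sketch below. The names Sendable and handleBargeIn are hypothetical. The Twilio message follows the Media Streams "clear" event; the exact Gemini message shape is an assumption, since the description only says that turnComplete: true is sent.

```typescript
// Sketch only: `Sendable` and `handleBargeIn` are hypothetical names.

/** Minimal surface of a WebSocket that this sketch relies on. */
interface Sendable {
  send(data: string): void;
}

/**
 * Intended to run when the local VAD reports that the caller started
 * speaking while the agent is still talking.
 */
function handleBargeIn(twilioWs: Sendable, geminiWs: Sendable, streamSid: string): void {
  // Twilio Media Streams: "clear" drops audio that is buffered but not yet played.
  twilioWs.send(JSON.stringify({ event: "clear", streamSid }));

  // Assumed message shape; the description only states that turnComplete: true is sent.
  geminiWs.send(JSON.stringify({ clientContent: { turnComplete: true } }));
}
```

Both sends happen in the same tick, so the caller stops hearing queued audio while the model is told to stop generating.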
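
The conversation states and the [SYSTEM STATE UPDATE] injection described above can be illustrated with a small state table. This is a minimal sketch under assumptions: the state names come from the list above, while the linear transitions, the prompt wording, and the function names are placeholders.

```typescript
// Sketch only: transitions, prompt wording and function names are hypothetical.

type CallState = "GREETING" | "COLLECTING_INFO" | "CONFIRMING" | "BOOKED";

// Linear happy path through the states named above; BOOKED is terminal.
const NEXT_STATE: Record<CallState, CallState | null> = {
  GREETING: "COLLECTING_INFO",
  COLLECTING_INFO: "CONFIRMING",
  CONFIRMING: "BOOKED",
  BOOKED: null,
};

// Per-state instructions (placeholder wording).
const STATE_PROMPT: Record<CallState, string> = {
  GREETING: "Greet the caller and ask how you can help.",
  COLLECTING_INFO: "Collect the caller's name and preferred time.",
  CONFIRMING: "Read the details back and ask for confirmation.",
  BOOKED: "Confirm the booking and say goodbye.",
};

/** Move to the next state, staying put once the terminal state is reached. */
function advance(state: CallState): CallState {
  return NEXT_STATE[state] ?? state;
}

/** Text of the user message injected after a tool call. */
function stateUpdateText(state: CallState): string {
  return `[SYSTEM STATE UPDATE] state=${state}. ${STATE_PROMPT[state]}`;
}

/** Wraps the text in an assumed Gemini Live clientContent user turn. */
function stateUpdateMessage(state: CallState): string {
  return JSON.stringify({
    clientContent: {
      turns: [{ role: "user", parts: [{ text: stateUpdateText(state) }] }],
      turnComplete: true,
    },
  });
}
```

For example, after a booking tool returns, a handler might call advance on the current state and send stateUpdateMessage for the new state over the Gemini socket.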
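
The sample-rate arithmetic behind the resampling chain above can be sketched as follows. The function names are hypothetical, and the resamplers are deliberately naive (linear interpolation and 3-sample averaging); a production path might use a proper low-pass filter. µ-law encode/decode is left out here because the project uses alawmulaw for that step.

```typescript
// Sketch only: function names are hypothetical, not the project's actual helpers.

// 8000 samples/s * 0.020 s * 1 byte per µ-law sample = 160 bytes per frame.
const FRAME_BYTES = 160;

/** Ingress: 8 kHz PCM16 -> 16 kHz PCM16 via linear interpolation. */
function upsample8kTo16k(input: Int16Array): Int16Array {
  const out = new Int16Array(input.length * 2);
  for (let i = 0; i < input.length; i++) {
    const cur = input[i];
    const next = i + 1 < input.length ? input[i + 1] : cur; // hold the last sample
    out[2 * i] = cur;
    out[2 * i + 1] = (cur + next) >> 1; // midpoint between neighbours
  }
  return out;
}

/** Egress: 24 kHz PCM16 -> 8 kHz PCM16 by averaging each group of 3 samples. */
function downsample24kTo8k(input: Int16Array): Int16Array {
  const out = new Int16Array(Math.floor(input.length / 3));
  for (let i = 0; i < out.length; i++) {
    const j = i * 3;
    out[i] = Math.round((input[j] + input[j + 1] + input[j + 2]) / 3);
  }
  return out;
}

/** Split encoded µ-law bytes into 20 ms frames; the final frame may be shorter. */
function chunkMulaw(bytes: Uint8Array, frameBytes: number = FRAME_BYTES): Uint8Array[] {
  const frames: Uint8Array[] = [];
  for (let off = 0; off < bytes.length; off += frameBytes) {
    frames.push(bytes.subarray(off, off + frameBytes));
  }
  return frames;
}
```

Each frame returned by chunkMulaw would then be base64-encoded and sent to Twilio as one media message.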

// tech

Twilio · Gemini Live · Deepgram · Hono · WebSockets · Supabase · Redis · Node.js · Silero VAD · alawmulaw · TypeScript