
Building Natural AI Voice Conversation

| 7 min read | Oskars Tūns
AI Systems & Engineering

Most AI voice demos work like a walkie-talkie. You press a button, say your piece, release, and wait. The AI responds. Then you go again. It's orderly, predictable, and nothing like how people actually talk to each other.

Real conversation is messier. People interrupt. They trail off mid-thought. They start answering before the other person finishes. They talk over background noise. We wanted to build something that handled all of this gracefully — so we built Dialogo, a series of experiments in natural bidirectional voice using the Gemini Live API.

The First Insight: Never Mute the Mic

The first problem was capture. While the AI is speaking, the user's mic is effectively blocked: feeding audio to the model during bot playback causes feedback and confusion. The naive fix is to mute the mic, but then you lose everything the user says while the bot talks.

Our v1 solution was simple: never mute. Instead, buffer the PCM audio locally while the bot speaks. The moment the bot finishes, flush that buffer directly to the model. The user's words were already captured — they don't wait, they don't repeat themselves.

The user can start talking at any point. Their audio is always captured. The system decides when to send it.
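The never-mute approach can be sketched as a small buffer that either forwards mic chunks in real time or holds them until bot playback ends. This is an illustrative sketch, not the Gemini SDK's interface; `MicBuffer`, `on_mic_chunk`, and the `send` callback are all hypothetical names.

```python
import collections

class MicBuffer:
    """Capture PCM chunks continuously. While the bot is speaking,
    hold them; the moment playback ends, flush everything at once.
    Illustrative sketch -- names are not from the Gemini SDK."""

    def __init__(self):
        self._chunks = collections.deque()
        self.bot_speaking = False

    def on_mic_chunk(self, pcm: bytes, send) -> None:
        if self.bot_speaking:
            self._chunks.append(pcm)   # hold the audio, never drop it
        else:
            send(pcm)                  # pass through in real time

    def on_bot_finished(self, send) -> None:
        self.bot_speaking = False
        while self._chunks:            # flush everything captured
            send(self._chunks.popleft())
```

The key property is that capture and sending are decoupled: the mic callback always runs, and only the destination of each chunk changes.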

The Latency Wall

Buffering solved the capture problem, but there was a second problem: latency. Even with the buffer flushed immediately, there was a painful 4–7 second gap between when the user stopped speaking and when the AI started responding.

The breakdown looked roughly like this:

User finishes speaking
→ VAD silence window: 700ms (mandatory)
→ Model processing: ~1500ms
→ Audio generation: ~800ms
→ First audio heard: ~3000ms+

The silence_duration_ms VAD window is the biggest culprit. The model waits for 700ms of silence before deciding the user has finished speaking. That's intentional — it avoids cutting off speech mid-sentence — but it also means the model sits idle for nearly a second after every utterance.
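For reference, that silence window is set when the session is opened. A sketch of the relevant config fragment, assuming the google-genai SDK's snake_case dict form; the exact field names may differ between SDK versions, so treat this as illustrative rather than authoritative:

```python
# Hypothetical Live API session config sketch -- field names are
# assumptions and may vary by SDK version.
live_config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {
            # The model waits this long in silence before deciding
            # the user's turn is over.
            "silence_duration_ms": 700,
        }
    },
}
```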

Worse, this silence-based detection breaks completely in real environments. Background speech, a TV in another room, traffic outside — any sustained noise prevents the silence window from ever completing. The system just waits indefinitely.

The Shadow Session: Measuring Intent, Not Silence

The v3 solution was to stop measuring silence and start measuring intent. We run a second GeminiLiveSession in parallel — a lightweight "shadow" session — whose only job is to classify utterances in real time.

While the user speaks, the shadow session receives the same audio stream. After each natural pause in the user's speech, it outputs either DONE (complete thought, ready to respond) or WAIT (still speaking, or background noise). When it says DONE, we send an explicit turn_complete signal to the main session — bypassing the silence window entirely.

The model receives a complete utterance and starts generating immediately. The 700ms VAD window vanishes from the critical path. Background noise that previously stalled the whole conversation is now correctly classified as WAIT and ignored.

The Architecture

Main session ──────────────── receives audio + turn_complete
Shadow session ── classifies utterances ──→ DONE / WAIT

User speaks
Shadow: "WAIT" → keep listening
Shadow: "DONE" → send turn_complete to main session
Main: starts generating immediately (no VAD wait)
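The flow above reduces to one decision per pause. A minimal sketch, where `classify` stands in for the shadow session (in practice a second model call over the same audio stream) and `send_turn_complete` stands in for signaling the main session; both interfaces are hypothetical:

```python
def handle_pause(classify, utterance_so_far: str, send_turn_complete) -> bool:
    """Called after each natural pause in the user's speech.

    `classify` plays the role of the shadow session: it returns "DONE"
    for a complete thought or "WAIT" for an unfinished one (or noise).
    Returns True if the turn was closed out.
    """
    verdict = classify(utterance_so_far)
    if verdict == "DONE":
        send_turn_complete()   # bypass the 700 ms VAD silence window
        return True
    return False               # WAIT: keep listening, send nothing
```

Because "WAIT" is simply a no-op, background noise never advances the turn; the main session only ever sees explicit turn boundaries.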

Getting a Head Start: Early Flush

One more piece. While the bot is talking, the user is often already formulating their next response. They might finish speaking — then wait in silence — while the bot is still mid-answer.

We detect this with client-side energy analysis on the mic buffer. If the user has gone quiet in the buffer and the bot still has more than ~400ms of audio queued to play, we flush the buffer early. The model starts processing the user's next question during the current bot response. By the time the bot finishes speaking, the answer is nearly ready.

After an early flush, we immediately re-arm the buffer so any subsequent speech is captured — the user never loses words.
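The early-flush check combines two signals: mic energy and remaining bot playback. A sketch assuming 16-bit little-endian PCM; the RMS threshold and the 400 ms queue cutoff are illustrative numbers, and `flush` is a hypothetical callback:

```python
import struct

QUIET_RMS = 500.0        # assumed threshold for "user has gone quiet"
MIN_BOT_QUEUE_MS = 400   # only flush early if this much bot audio remains

def rms(pcm: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM chunk."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def maybe_early_flush(tail_chunk: bytes, bot_queued_ms: int, flush) -> bool:
    """If the tail of the mic buffer is quiet while the bot still has
    plenty of audio queued, flush the buffer so the model gets a head
    start on the user's next question."""
    if rms(tail_chunk) < QUIET_RMS and bot_queued_ms > MIN_BOT_QUEUE_MS:
        flush()
        return True
    return False
```

Running the check only while the bot's playback queue is long keeps the flush from racing the normal end-of-turn path.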

The Small Human Touch

One last detail: if the user has been silent for 10 seconds, the system gently asks "Still there?" — not a timeout error, not a disconnection. Just a natural check-in. We send a [silent] token to the model and let it respond naturally in whatever tone fits the conversation.
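The check-in is just a silence timer that nudges the model instead of raising an error. A minimal sketch; `send_text` stands in for whatever pushes a text turn to the model, and the injectable clock is there only to make the logic testable:

```python
import time

SILENCE_CHECKIN_S = 10.0  # seconds of silence before a gentle nudge

class CheckIn:
    """Track the last time user audio arrived; after 10 s of silence,
    send a [silent] token and let the model respond in its own voice."""

    def __init__(self, send_text, now=time.monotonic):
        self._send = send_text
        self._now = now
        self._last_audio = now()

    def on_user_audio(self) -> None:
        self._last_audio = self._now()

    def poll(self) -> None:
        """Call periodically from the event loop."""
        if self._now() - self._last_audio >= SILENCE_CHECKIN_S:
            self._send("[silent]")
            self._last_audio = self._now()  # re-arm, don't spam
```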

The real lesson isn't technical. Natural conversation is messy, overlapping, and tolerant of noise. Building systems that feel natural means building systems that handle that messiness — not ones that demand clean input.

If you're building voice-first AI systems and want to discuss the architecture, we're happy to talk through what we learned.

Tags: Gemini Live API, Voice AI, Real-time Audio, Turn Detection, AI Systems
