Flavio Espinoza
I use different AIs for different jobs -- Gemini for research, Claude for coding, ChatGPT for brainstorming. None of them had a hands-free voice interface I could actually use every day: they all have voice, but each is locked inside its own app and caps or throttles you mid-thought. I wanted one that was mine -- running on hardware I already pay for, uncapped, and swappable between models depending on which AI I want to think with.
So I built a model-agnostic AI voice interface: one audio pipe -- browser mic, voice-activity detection, speech-to-text, an AI at the seam, text-to-speech -- where the model is deliberately swappable. The live system runs on the Claude Agent SDK, the same engine as the Claude Code CLI, with the full toolset (a real Bash shell, file access). The first version was actually wrong, and how I found out is the story: v1 was a Python middleware that quietly neutered Claude -- three tools, a system prompt it ignored, hallucinated timestamps because it had no shell. So I deleted the middleware and ran the Agent SDK directly. The variant won, and it was not close -- it became v2, Node and TypeScript end to end (React, a Node WebSocket server, the Agent SDK as the brain), the same stack Firecrawl ships on, and the whole thing was built in a single day.
Most of the real work was not the pipeline but the thousand small frictions between a transcript and a natural conversation -- a pre-roll buffer so the first word is not clipped, a cooldown so the mic does not trigger on its own audio, a code-fence stripper so Claude does not read code aloud character by character. You only find those by using the thing every day and fixing what annoys you, which is the whole point: a tool you build for yourself is the strongest proof you can ship and iterate without anyone handing you a ticket. Phase 2 swaps Gemini and ChatGPT in behind the same pipe -- the platform is the point; Claude is just the first inhabitant.
The Architecture
The system is one conversation loop, hands-free: speak, get heard, get answered, hear the answer, mic re-opens. v1 was the Python middleware -- now archived as the losing control. v2 is the live system: a browser audio front end, a Node/TypeScript WebSocket server, and the Claude Agent SDK as the brain, with Deepgram as both the ears and the voice (speech-to-text and text-to-speech). When playback ends the mic re-opens automatically, so the conversation continues with no clicking between turns.
Browser Audio Capture | v2 — The front end is a React and Vite app in TypeScript. Its only job on the audio path is to be a clean, dumb pipe -- capture raw microphone PCM and stream it, do no processing in the browser.
- A. An AudioWorklet (`pcm-recorder`) captures 16 kHz mono int16 PCM in 100 ms chunks and sends each chunk as a binary WebSocket frame -- raw audio out, no encoding or VAD in the browser.
- B. The React UI is a full chat surface, not a toy: streamed reply bubbles, a live "karaoke" highlight that tracks the sentence currently being spoken, a one-click voice toggle, and a context-usage indicator in the footer.
- C. Playback is event-driven -- the browser plays each mp3 the server sends, and on the audio `ended` event it posts `playback_done` back to the server, which is the signal that re-opens the mic. The loop closes itself.
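The capture step in item A can be sketched roughly as follows. This is a minimal illustration, not the project's actual `pcm-recorder` code: it assumes the audio context already runs at 16 kHz and that the worklet hands over small Float32 render blocks, and the names `floatToInt16` and `PcmChunker` are hypothetical.

```typescript
// Figures taken from the text: 16 kHz mono, 100 ms chunks.
const CAPTURE_SAMPLE_RATE = 16_000;
const CAPTURE_CHUNK_MS = 100;
const CAPTURE_CHUNK_SAMPLES = (CAPTURE_SAMPLE_RATE * CAPTURE_CHUNK_MS) / 1000; // 1600

// Convert Web Audio float samples (-1..1) to signed 16-bit PCM.
function floatToInt16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i] ?? 0)); // clamp before scaling
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Accumulates arbitrarily sized blocks and emits fixed 100 ms chunks.
// (Hypothetical helper: the emitted chunk would be sent as one binary frame.)
class PcmChunker {
  private pending = new Int16Array(CAPTURE_CHUNK_SAMPLES);
  private filled = 0;
  private onChunk: (chunk: Int16Array) => void;

  constructor(onChunk: (chunk: Int16Array) => void) {
    this.onChunk = onChunk;
  }

  push(samples: Int16Array): void {
    let offset = 0;
    while (offset < samples.length) {
      const take = Math.min(CAPTURE_CHUNK_SAMPLES - this.filled, samples.length - offset);
      this.pending.set(samples.subarray(offset, offset + take), this.filled);
      this.filled += take;
      offset += take;
      if (this.filled === CAPTURE_CHUNK_SAMPLES) {
        this.onChunk(this.pending.slice()); // copy so the buffer can be reused
        this.filled = 0;
      }
    }
  }
}
```

In a setup like this the callback passed to `PcmChunker` would be the only place that touches the WebSocket, which keeps the browser side a dumb pipe.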
Silero VAD | v2 — Voice-activity detection runs server-side so the browser stays a dumb pipe. It decides when I started talking and, more importantly, when I stopped -- the hard part of a hands-free interface.
- A. Silero VAD (ONNX) is the primary detector, with a hand-written RMS energy detector as an automatic fallback if the ONNX runtime is unavailable -- the loop never hard-fails on a missing model.
- B. A rolling pre-roll buffer keeps the last ~500 ms of audio before speech is detected and prepends it to the utterance, so the first word is never clipped.
- C. The endpoint silence threshold is set at 2500 ms on purpose -- it gives me room to pause and think mid-sentence without being cut off, a deliberate tradeoff of latency for not getting interrupted.
- D. Tuned by use, not by guess -- the energy fallback threshold was moved from 800 (too sensitive, ghost triggers) to 1200 (missed soft words) and settled at 900; a post-playback cooldown stops the speaker's own audio tail from re-triggering the mic.
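The fallback path, the pre-roll buffer, and the endpoint rule can be sketched together. This is a sketch of the RMS energy fallback only (not Silero), using the numbers quoted above; the class name `EnergyEndpointer` and its shape are assumptions, and the post-playback cooldown is left out.

```typescript
// Numbers quoted in the text; everything else here is illustrative.
const VAD_CHUNK_MS = 100;
const PRE_ROLL_MS = 500;
const END_SILENCE_MS = 2500;
const RMS_THRESHOLD = 900;

// Root-mean-square energy of one int16 PCM chunk.
function chunkRms(chunk: Int16Array): number {
  if (chunk.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < chunk.length; i++) {
    const s = chunk[i] ?? 0;
    sum += s * s;
  }
  return Math.sqrt(sum / chunk.length);
}

class EnergyEndpointer {
  private preRoll: Int16Array[] = [];
  private utterance: Int16Array[] = [];
  private speaking = false;
  private silenceMs = 0;

  // Feed one 100 ms chunk. Returns the whole utterance (pre-roll included)
  // when the silence endpoint fires, otherwise null.
  push(chunk: Int16Array): Int16Array[] | null {
    const voiced = chunkRms(chunk) >= RMS_THRESHOLD;

    if (!this.speaking) {
      if (!voiced) {
        // Keep only the most recent ~500 ms while idle.
        this.preRoll.push(chunk);
        if (this.preRoll.length > PRE_ROLL_MS / VAD_CHUNK_MS) this.preRoll.shift();
        return null;
      }
      // Speech onset: prepend the pre-roll so the first word is not clipped.
      this.utterance = [...this.preRoll, chunk];
      this.preRoll = [];
      this.speaking = true;
      this.silenceMs = 0;
      return null;
    }

    this.utterance.push(chunk);
    this.silenceMs = voiced ? 0 : this.silenceMs + VAD_CHUNK_MS;
    if (this.silenceMs >= END_SILENCE_MS) {
      const done = this.utterance;
      this.utterance = [];
      this.speaking = false;
      this.silenceMs = 0;
      return done;
    }
    return null;
  }
}
```

Any voiced chunk resets the silence counter, which is what lets a mid-sentence pause shorter than 2500 ms pass without ending the turn.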
Deepgram STT | v2 — The chosen ears of the system, after the mid-build correction away from local Parakeet. Speech-to-text runs as a streaming WebSocket to Deepgram so transcription happens live, not after I stop talking.
- A. Streams PCM to Deepgram nova-3 with interim results on, so the UI shows a live "typing" partial while I am still speaking and commits finalized segments as they settle.
- B. Keyterm prompting -- a `config/stt-keywords.txt` file feeds proper nouns (project names, tooling) to nova-3 so domain words transcribe correctly instead of becoming gibberish.
- C. Hardened against race conditions -- audio that arrives before the Deepgram socket is ready is buffered and flushed on open, and a short finalize grace window lets short utterances settle before the stream closes.
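The buffer-then-flush guard in item C might look something like this. It is a sketch against a minimal, hypothetical `SocketLike` shape rather than any real Deepgram client, and `BufferedAudioSender` is an invented name.

```typescript
// Minimal socket shape so the sketch does not depend on a particular client library.
interface SocketLike {
  readyState: number;
  send(data: Uint8Array): void;
}

const SOCKET_OPEN = 1; // same numeric value as WebSocket.OPEN

class BufferedAudioSender {
  private queue: Uint8Array[] = [];
  private socket: SocketLike;

  constructor(socket: SocketLike) {
    this.socket = socket;
  }

  // Send immediately if the socket is open, otherwise hold the chunk.
  send(chunk: Uint8Array): void {
    if (this.socket.readyState !== SOCKET_OPEN) {
      this.queue.push(chunk);
      return;
    }
    this.flush(); // preserve ordering if anything is still queued
    this.socket.send(chunk);
  }

  // Call from the socket's open handler to drain early audio in arrival order.
  flush(): void {
    for (const chunk of this.queue) this.socket.send(chunk);
    this.queue = [];
  }
}
```

Flushing inside `send` as well as on open means a chunk that arrives between the two events cannot jump ahead of older audio.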
The AI at the Seam | v2 — The heart of the v1-to-v2 win, and the architectural decision that makes the platform model-agnostic. The seam is deliberately narrow: a transcript goes in, text deltas come out. That contract is the same whether Claude, Gemini, or ChatGPT is sitting behind it.
- A. v2 runs on the Claude Agent SDK -- the same engine as the Claude Code CLI, not the basic chat API. It has full tool access: Bash, Read, Write, Edit, Grep, Glob. Claude can explore the repo, run commands, and read real file contents instead of hallucinating them. The anti-hallucination guards in the v1 Python middleware became dead code the moment the SDK had a real shell.
- B. Picks up project context automatically -- the SDK loads `CLAUDE.md` from user and project setting sources the way the CLI does, so the voice instance obeys the same rules as every other Claude in my workflow with zero custom plumbing.
- C. Session continuity -- each turn captures the SDK session id and resumes it on the next turn, so the conversation has real memory across turns and survives a page reload.
- D. Designed for Opus 4.8 with 1M context -- the deep model is the point, since I pay for the Max plan precisely to get it. There is a live bug where the SDK falls back to Haiku despite the request, which is why the server already instruments a `MODEL_FALLBACK` log and a `cost_alert` to the UI the moment the returned model is not the one I asked for. The fix is pending; the detector that catches it is already shipped.
- E. The system prompt is a small piece of TTS engineering in its own right -- it teaches Claude to write text that sounds right when read aloud by Aura-2: real newlines between list items (never "One.Two.Three." mashed together), no markdown syntax, acronyms without internal periods, and the "C2 not C.2" rule that sidesteps the voice engine's decimal parser.
- F. Phase 2 swaps the model at the seam -- Gemini for research sessions, ChatGPT for brainstorming. Same VAD, same STT, same TTS pipeline, same persistence layer. Swapping the AI is one integration change, not a rebuild. The audio platform is the durable investment; the model behind it is a choice made per task.
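The narrow seam and the fallback detector can be illustrated with a small sketch. The interface, the `EchoSeam` stand-in, and the payload shape of the alert are all assumptions made for illustration. Only the `MODEL_FALLBACK` and `cost_alert` labels come from the text, and no vendor SDK call is shown.

```typescript
// The whole contract: a transcript goes in, text deltas stream out.
interface ModelSeam {
  readonly label: string;
  reply(transcript: string): AsyncIterable<string>;
}

// Stand-in model for this sketch only; a real adapter would wrap a vendor SDK.
class EchoSeam implements ModelSeam {
  readonly label = "echo";
  async *reply(transcript: string): AsyncGenerator<string> {
    for (const word of transcript.split(/\s+/).filter(Boolean)) {
      yield word + " ";
    }
  }
}

// The pipeline sees only the interface, so swapping models is one argument.
async function runTurn(
  seam: ModelSeam,
  transcript: string,
  onDelta: (delta: string) => void,
): Promise<string> {
  let full = "";
  for await (const delta of seam.reply(transcript)) {
    full += delta;
    onDelta(delta); // e.g. forward to the chat bubble and the TTS pipeline
  }
  return full;
}

// Hypothetical alert payload; the field names are assumptions.
interface CostAlert {
  type: "cost_alert";
  requested: string;
  returned: string;
}

// Compare the model that answered with the model that was requested.
function detectModelFallback(requested: string, returned: string): CostAlert | null {
  if (returned === requested) return null;
  console.warn(`MODEL_FALLBACK requested=${requested} returned=${returned}`);
  return { type: "cost_alert", requested, returned };
}
```

Under this shape a Gemini or ChatGPT adapter would be another class implementing `ModelSeam`, with the VAD, STT, and TTS code left untouched.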
Streaming TTS Pipeline | v2 — The output half, and the part with the most hard-won detail. Claude's reply streams back token by token; this pipeline turns that stream into natural speech without waiting for the whole reply to finish.
- A. Sentence-boundary streaming -- the pipeline buffers deltas, splits on sentence punctuation, and fires each finished sentence to Deepgram Aura-2 as its own synthesis call, so audio starts playing while Claude is still writing.
- B. Code is seen, not spoken -- a stateful filter strips fenced and inline code from the spoken path (carrying fence state across chunk boundaries) while the raw text still streams to the chat bubble, so you read the code and hear the prose.
- C. Deferred mode -- if a reply looks like a code request, or a backtick appears before speech has started, the pipeline holds all audio until the bubble has fully rendered, then reads from the top, so the voice and the visible code never drift apart.
- D. Paragraph pacing -- a double newline becomes a 1.2-second audible pause, giving real breathing room between topics instead of one run-on wall of speech.
- E. Interruptible and resumable -- pressing stop aborts generation, captures the partial text, and feeds it back as context on the next turn, so saying "keep going" makes Claude continue from exactly where it was cut off rather than starting over.
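Items A and B can be combined in one small sketch: a filter that drops code from the spoken path while carrying fence state across deltas, and emits a sentence as soon as it is complete. The class name `SpokenTextFilter` and the exact splitting rule (sentence punctuation followed by whitespace) are assumptions. Deferred mode and paragraph pauses are not covered here.

```typescript
class SpokenTextFilter {
  private inFence = false;  // inside a ``` block
  private inInline = false; // inside `inline code`
  private carry = "";       // trailing backticks held until the next delta arrives
  private buffer = "";      // prose waiting for a sentence boundary

  // Feed one streamed delta; returns any sentences that are now complete.
  push(delta: string): string[] {
    let text = this.carry + delta;
    this.carry = "";
    // A trailing run of backticks may be half of a fence split across deltas.
    const tail = text.match(/`+$/);
    if (tail) {
      this.carry = tail[0];
      text = text.slice(0, text.length - tail[0].length);
    }
    this.consume(text);
    return this.drain(false);
  }

  // Call when the reply ends; returns whatever prose is left.
  flush(): string[] {
    this.consume(this.carry);
    this.carry = "";
    return this.drain(true);
  }

  // Copy prose into the buffer, skipping fenced and inline code.
  private consume(text: string): void {
    let i = 0;
    while (i < text.length) {
      if (text.startsWith("```", i)) {
        this.inFence = !this.inFence;
        i += 3;
      } else if (!this.inFence && text.charAt(i) === "`") {
        this.inInline = !this.inInline;
        i += 1;
      } else {
        if (!this.inFence && !this.inInline) this.buffer += text.charAt(i);
        i += 1;
      }
    }
  }

  // Split on sentence punctuation followed by whitespace, so "1.5" stays whole.
  private drain(final: boolean): string[] {
    const out: string[] = [];
    const sentenceEnd = /[.!?]+(?=\s)/;
    let m = sentenceEnd.exec(this.buffer);
    while (m !== null) {
      const end = m.index + m[0].length;
      const sentence = this.buffer.slice(0, end).replace(/\s+/g, " ").trim();
      if (sentence) out.push(sentence);
      this.buffer = this.buffer.slice(end);
      m = sentenceEnd.exec(this.buffer);
    }
    if (final) {
      const rest = this.buffer.replace(/\s+/g, " ").trim();
      if (rest) out.push(rest);
      this.buffer = "";
    }
    return out;
  }
}
```

Each string returned by `push` would become its own synthesis call, while the unfiltered delta still goes to the chat bubble.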
Persistence + Instrumentation | v2 — The plumbing that makes it a tool I trust daily rather than a demo. Every conversation is durable, searchable, and measured.
- A. Dual-format persistence -- every turn auto-saves the chat as both machine-readable JSON and human-readable Markdown, with an auto-generated title from the first substantial message; restart the app and the conversation is still there.
- B. A small REST surface (`/api/chats`, fetch-by-id, full-text search) over the saved chats, plus a per-connection session log that records every state transition and event to disk for debugging the real-time loop.
- C. A live context-usage indicator -- the server estimates tokens actually in play from the running chat record and pushes it to the footer each turn, so I can see context filling up. Paired with the model-fallback cost alert, the system watches its own spend and its own model choice and tells me when either drifts.
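The dual-format save in item A could be approximated like this. It is a sketch only: the `ChatTurn` shape, the 12-character cutoff for a "substantial" message, and the 60-character title limit are all assumptions, and writing the two strings to disk is left out.

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  text: string;
}

const MIN_TITLE_CHARS = 12; // assumed cutoff for a "substantial" message
const MAX_TITLE_CHARS = 60; // assumed title length limit

// Title comes from the first user message long enough to be meaningful.
function deriveTitle(turns: ChatTurn[]): string {
  const first = turns.find(
    (t) => t.role === "user" && t.text.trim().length >= MIN_TITLE_CHARS,
  );
  const raw = (first ? first.text : "Untitled chat").trim().replace(/\s+/g, " ");
  return raw.length > MAX_TITLE_CHARS ? raw.slice(0, MAX_TITLE_CHARS - 3) + "..." : raw;
}

// Machine-readable form.
function chatToJson(turns: ChatTurn[]): string {
  return JSON.stringify({ title: deriveTitle(turns), turns }, null, 2);
}

// Human-readable form of the same record.
function chatToMarkdown(turns: ChatTurn[]): string {
  const lines: string[] = [`# ${deriveTitle(turns)}`, ""];
  for (const t of turns) {
    const speaker = t.role === "user" ? "You" : "Assistant";
    lines.push(`**${speaker}:** ${t.text}`, "");
  }
  return lines.join("\n");
}
```

Deriving both outputs from the same in-memory record on every turn is what would keep the JSON and Markdown copies from drifting apart.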
The Loop, Restated — Control, variant, measure, ship the winner -- one change at a time. v1's Python middleware was the control; deleting it for the Agent SDK was the variant that won and became v2, the new control. Inside v2 the same loop runs at small scale: a VAD threshold moved 800 to 1200 to 900 until soft speech registered without ghost triggers; the wrong STT engine swapped out mid-build in thirty-five minutes; a TTS pipeline tuned sentence by sentence until reading code aloud stopped sounding broken. Phase 2 is the same loop at model scale: run Gemini at the seam for research, run ChatGPT for brainstorming, measure which AI is actually better for each cognitive mode when the friction of switching is zero. Build one thing that works, make it the control, test every new idea against it, and promote only what beats it. Then begin again.