voice, text, AI --- a synthesis
A synthesis of a longer exchange with Claude, in two passes.
The first pass asks what voice and text actually are to each other, once you stop ranking them. The second applies that framing to current voice AI and asks why so much of it feels hollow.
voice and text
- Voice and text aren’t a hierarchy; they’re different media with different affordances, and “superior” is a category error outside a specific task context.
- Production speed favors voice (~150 wpm spoken vs. ~40–80 wpm typed), but consumption speed favors text (~250+ wpm read vs. ~150–200 wpm listened to before comprehension degrades); a toy calculation after this list makes the asymmetry concrete.
- Writing forces an editing pass that speech skips, so the same content is typically denser in text than in transcribed speech.
- Text wins on error rate in precision-critical domains: code, legal, technical specs, anything with unique spellings. Speech has no backspace.
- Pleasantness is individual, not general --- introverts, autistic folks, and people with auditory processing differences often find text less taxing than speech.
- Voice carries prosody, breath, timbre, hesitation, pace --- information that dies in transcription. "I never said she stole the money" has seven meanings spoken (one for each word you stress) and one written; the transcript flattens the space of intended meanings.
- Voice is ephemeral by default; text is persistent by default. Both properties have unique value depending on the use case.
- Some value is inherent to voice: a voiceprint of a dead relative, the cracks in a therapy session, the breath before an admission. Text cannot carry these.
- Some value is inherent to text: searchability, revisability, skimmability, the discipline of revision, the permanence of record.
- The honest question isn’t which medium wins but what is lost when we substitute one for the other --- because substitution is happening constantly and mostly unexamined.
- New media aren’t strictly superior to what they displace. Email wasn’t better mail; video calls aren’t better meetings; voice isn’t better text.
- Voice-produced / text-consumed hybrids (dictation, transcripts) win on both production and consumption speed, but lose prosody and change behavior because the record becomes permanent.
- Text-produced / voice-rendered hybrids (TTS, audiobooks) gain accessibility but add fake prosody --- the narrator’s interpretive choices become invisible to the listener.
- Cognition can be voice-shaped, text-shaped, or pre-linguistic, and sophisticated thinkers shift substrate by task rather than living in one mode.
- Extended-mind framing: the medium isn’t just a tool thought uses, it may partly constitute the thought. Writing something is different from thinking it.
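
To make the production/consumption asymmetry concrete, here is a toy calculation using rough midpoints of the rates quoted above; the figures are illustrative assumptions, not measurements.

```python
# Toy comparison: time to produce vs. consume a 1,000-word message in each medium.
# Rates are rough midpoints of the ranges quoted above, not measured values.
WORDS = 1_000

SPEAK_WPM  = 150   # producing by voice
TYPE_WPM   = 60    # producing by keyboard (midpoint of 40-80)
LISTEN_WPM = 175   # consuming by ear (midpoint of 150-200)
READ_WPM   = 250   # consuming by eye

def minutes(words: int, wpm: float) -> float:
    """Minutes needed to move `words` through a channel running at `wpm`."""
    return words / wpm

print(f"produce by voice:   {minutes(WORDS, SPEAK_WPM):5.1f} min")
print(f"produce by typing:  {minutes(WORDS, TYPE_WPM):5.1f} min")
print(f"consume by ear:     {minutes(WORDS, LISTEN_WPM):5.1f} min")
print(f"consume by eye:     {minutes(WORDS, READ_WPM):5.1f} min")
# Voice cuts production time by more than half; text cuts consumption time by about 30%.
```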
voice and AI
- Most current voice AI is voice-produced, text-mediated, voice-rendered --- a voice wrapper around text-shaped cognition.
- Every transition in that pipeline drops information: speech-to-text strips prosody; text processing has none; text-to-speech synthesizes fake prosody from content that never had it. A minimal sketch of this pipeline follows the list.
- This creates an uncanny valley not because the voice is bad but because the medium and the cognition mismatch, and the human brain detects the mismatch.
- A human voice triggers prosodic inference machinery automatically; when nothing feeds that machinery, the interaction feels hollow. That’s the Siri problem.
- Text chat with an LLM doesn’t trigger this because the reading engine expects text-shaped cognition. Medium matches processing. No valley.
- Genuine prosody-native AI requires treating acoustic features as semantic content, not as a transcription bottleneck to push past.
- End-to-end speech models (no text intermediate) are the architectural path toward this, but they’re a research bet, not yet a product category.
- Frontier labs have the capacity but awkward incentives: prosody-native models are hard to benchmark and hard to monetize against competitors that look just as good on text benchmarks.
- GPT-4o’s advanced voice mode gestures at this but is still mostly presentation over substance.
- Real-time human-to-human voice systems (Roblox, Discord, phone calls) succeed by being transparent --- moving audio cleanly enough that humans do prosodic inference themselves.
- Voice AI has the opposite problem: it must be opaque (there’s cognition in the middle) while pretending to be transparent. Much harder product surface.
- Latency, packet loss, jitter, codec quality, and spatial audio are the dominant constraints for player-to-player voice at scale --- not semantic modeling. A rough latency budget after this list shows where the milliseconds go.
- Voice AI inherits text's limitations (no genuine prosodic understanding, no timing-as-content) without text's benefits (searchability, revision, a persistent scannable record).
- The honest design choices today are either to drop the voice pretense and let users type, or to commit to a speech-native architecture and accept reduced capability.
- The worst option is the current default: voice chrome, text brain, pretend it’s a conversation.
- Prosody-native AI is also a harder evaluation problem than text --- there's no ground truth for "understood the sarcasm correctly," which slows research progress.
- Hands-free and accessibility use cases are where voice AI is most defensible now, because the producer-side bottleneck is real and the semantic loss is acceptable.
- Dictation is a solved-enough problem to be useful; genuine voice conversation isn’t.
- The first prosody-native systems probably emerge from frontier labs treating it as a research bet, not from product teams treating it as a feature.
- Until then, the mature stance toward voice AI is: know which hybrid you’re using, know what’s being lost in each transition, and pick the tool that matches the task --- the same conclusion we reached about voice and text in general.
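
As referenced above, a minimal sketch of the voice-produced / text-mediated / voice-rendered pipeline. Every name here is a hypothetical stand-in rather than any vendor's API; the point is the shape of the thing: two lossy transitions wrapped around a text-only model.

```python
# Minimal sketch of the "voice chrome, text brain" pipeline.
# All names are hypothetical stand-ins, not a real ASR/LLM/TTS API; the structure
# is the point: prosody is dropped on the way in and guessed on the way out.
from dataclasses import dataclass, field

@dataclass
class Utterance:
    text: str                                    # the words
    prosody: dict = field(default_factory=dict)  # pitch, pace, emphasis, pauses...

def speech_to_text(u: Utterance) -> str:
    # Transition 1: the transcript keeps the words and discards u.prosody entirely.
    return u.text

def text_model(prompt: str) -> str:
    # Transition 2: text-shaped cognition; it never sees pitch, pace, or hesitation.
    return f"Reply shaped only by the words: {prompt!r}"

def text_to_speech(reply: str) -> Utterance:
    # Transition 3: prosody is synthesized from content that never carried any,
    # so emphasis and timing are the synthesizer's guess, not the speaker's intent.
    return Utterance(text=reply, prosody={"synthesized": True})

def voice_turn(user: Utterance) -> Utterance:
    return text_to_speech(text_model(speech_to_text(user)))

turn = voice_turn(Utterance(
    text="I never said she stole the money",
    prosody={"stressed_word": "said", "pause_before_stole_ms": 400},
))
print(turn.text)     # the words survive all three transitions
print(turn.prosody)  # the user's emphasis does not
```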
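
And for the transport side: a back-of-envelope mouth-to-ear latency budget for player-to-player voice. The component values are illustrative assumptions (the 20 ms frame matches a common Opus configuration), and the usual rule of thumb is to keep one-way latency near or under about 150 ms.

```python
# Back-of-envelope mouth-to-ear latency budget for real-time player voice.
# Component values are illustrative assumptions, not measurements.
budget_ms = {
    "capture + encode (one 20 ms audio frame + work)": 25,
    "packetization + send":                             5,
    "network one-way":                                 60,
    "jitter buffer":                                   40,
    "decode + playout":                                15,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<50}{ms:>4} ms")
print(f"{'total mouth-to-ear':<50}{total:>4} ms")  # 145 ms, under the ~150 ms rule of thumb

# None of this budget is spent on semantic modeling; it is all transport, which is
# why these systems can afford to stay transparent and let the humans at either end
# do the prosodic inference themselves.
```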