How Generative Voice AI Actually Works (in Plain English)

Generative voice AI has a genuinely impressive trick: you type a sentence, and a few seconds later a warm, breathing human voice reads it back to you. No microphone, no actor, no studio. The speed is real, and so is the quality. But the magic feeling is also a kind of fog, and the fog hides something useful. Once you understand the assembly line that turns letters into sound, you stop being dazzled and start being able to predict exactly where the line will jam, where a name will come out wrong, and where a sentence that scanned fine on screen will land flat in the ear.

Think of the whole system as a small factory with four stations, with your typed text as the raw material. The first station is text normalization: it reads the messy things people actually write and decides how to say them out loud. "$5" has to become "five dollars," "Dr. Lee" has to choose between "Doctor" and "Drive," and "2008" has to become "two thousand eight" rather than the digits two, zero, zero, eight. This stage has no glamour, but it is where a surprising share of embarrassing errors are born, because it is essentially a giant pile of rules and guesses about what a human meant.
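To make the "pile of rules and guesses" concrete, here is a minimal sketch of a normalizer that handles only the three examples above. The function names and the rules themselves are assumptions made for illustration; production normalizers are far larger and often partly learned.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0..9999 as a cardinal number (toy range, illustration only)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return ONES[thousands] + " thousand" + tail


def normalize(text: str) -> str:
    """Rewrite a few written forms into the words a voice should say."""
    # Rule 1: "$5" -> "five dollars" (whole-dollar amounts only).
    def dollars(match):
        amount = int(match.group(1))
        unit = " dollar" if amount == 1 else " dollars"
        return number_to_words(amount) + unit
    text = re.sub(r"\$(\d{1,4})\b", dollars, text)

    # Rule 2 (a guess): "Dr." followed by a capitalized word is a title,
    # any other "Dr." is treated as a street suffix.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    text = re.sub(r"\bDr\.", "Drive", text)

    # Rule 3: remaining bare numbers are read as cardinals, not digit by digit.
    text = re.sub(r"\b\d{1,4}\b",
                  lambda m: number_to_words(int(m.group())), text)
    return text


print(normalize("Dr. Lee paid $5 in 2008."))
```

Rule 2 is exactly the kind of guess that goes wrong: in "Meet me on Elm Dr. Then call," the capitalized word after "Dr." would turn the street into a doctor.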

The second station is the one most people have never heard of and is quietly the most important: grapheme-to-phoneme conversion, or G2P. A grapheme is a written character; a phoneme is a unit of sound. This stage translates spelling into pronunciation, the same skill you used as a child sounding out a word you had never seen. For English it sorts out that "read" rhymes with either "red" or "reed" depending on the sentence. For Chinese it faces the polyphone problem head-on, where one character carries several readings and only the context decides which is right. Get this station wrong and everything downstream sounds confidently incorrect.
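A toy version of this station, assuming a tiny hand-written lexicon, might look like the sketch below. The phoneme symbols follow the ARPAbet convention; the lexicon, the cue-word list, and the `<unk>` marker are hypothetical, and real G2P systems typically combine a large dictionary with a trained model.

```python
# Hypothetical mini-lexicon: word -> phonemes in ARPAbet notation.
LEXICON = {
    "i": ["AY"],
    "will": ["W", "IH", "L"],
    "have": ["HH", "AE", "V"],
    "the": ["DH", "AH"],
    "book": ["B", "UH", "K"],
}

# Words whose pronunciation depends on context.
HETERONYMS = {
    "read": {"present": ["R", "IY", "D"],   # rhymes with "reed"
             "past": ["R", "EH", "D"]},     # rhymes with "red"
}

# Crude context cue (an assumption): these words right before "read"
# suggest the past form.
PAST_CUES = {"have", "has", "had", "was", "already"}


def g2p(sentence: str) -> list:
    """Return one phoneme list per word, using the previous word as context."""
    words = sentence.lower().replace(".", "").split()
    result = []
    for i, word in enumerate(words):
        if word in HETERONYMS:
            previous = words[i - 1] if i > 0 else ""
            tense = "past" if previous in PAST_CUES else "present"
            result.append(HETERONYMS[word][tense])
        elif word in LEXICON:
            result.append(LEXICON[word])
        else:
            # An unseen word, such as a rare surname, has no entry at all.
            result.append(["<unk>"])
    return result


print(g2p("I will read the book")[2])   # the "reed" reading
print(g2p("I have read the book")[2])   # the "red" reading
print(g2p("Nguyen"))                    # unknown word
```

A real system would not stop at `<unk>`; it would guess a pronunciation from the spelling, which is where an unfamiliar name comes out confidently wrong.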

The third and fourth stations are where modern AI does the heavy lifting. The acoustic model takes that string of sounds and plans how the speech should feel: the rhythm, the rises and falls, where a voice speeds up or leans on a word. It outputs not audio but a kind of visual blueprint of sound called a spectrogram, essentially a heat-map of which frequencies are loud at each instant. The final station, the neural vocoder, is the artist that turns that blueprint into an actual waveform you can hear. The pioneering vocoder, DeepMind's WaveNet, builds audio one tiny slice at a time, 16,000 samples for every single second of speech, and each sample has to wait for the one before it. That is why early versions were slow, and why newer designs like HiFi-GAN, which generate many samples in parallel, can paint the same second of audio far faster than real time.
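The heat-map idea and the sample arithmetic can both be shown in a few lines. The sketch below is not an acoustic model or a vocoder; it only builds a bare-bones spectrogram from two test tones with a naive Fourier transform, and counts how many samples a sample-by-sample generator has to produce. The 25 ms frame size and the helper names are assumptions; production systems typically use overlapping, windowed frames on a mel scale.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # samples per second, the figure quoted above
FRAME = 400            # 25 ms of audio per spectrogram column (assumption)


def tone(freq_hz: float, n_samples: int) -> list:
    """A pure sine tone, standing in for real speech."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]


def frame_magnitudes(frame: list) -> list:
    """Naive DFT: loudness of each frequency bin in one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]


def spectrogram(samples: list) -> list:
    """One column of bin loudness per non-overlapping frame."""
    return [frame_magnitudes(samples[i:i + FRAME])
            for i in range(0, len(samples) - FRAME + 1, FRAME)]


def loudest_hz(column: list) -> float:
    """Frequency of the loudest bin in one spectrogram column."""
    k = max(range(len(column)), key=column.__getitem__)
    return k * SAMPLE_RATE / FRAME


def sequential_steps(seconds: float) -> int:
    """Samples a one-at-a-time generator must produce, in order."""
    return int(seconds * SAMPLE_RATE)


# 50 ms at 440 Hz followed by 50 ms at 880 Hz.
signal = tone(440, 800) + tone(880, 800)
print([loudest_hz(col) for col in spectrogram(signal)])
# -> [440.0, 440.0, 880.0, 880.0]
print(sequential_steps(5))   # -> 80000 samples for a five-second sentence
```

Each column is one instant and each bin is one frequency band, so the hot spot moving from 440 Hz to 880 Hz is the "blueprint" a vocoder would be asked to turn back into sound.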

So why, with all that machinery, does the voice still trip over a name or slide into the wrong accent? The honest answer is that every one of these models learned by imitation, not by understanding, and it can only confidently reproduce what it saw a lot of. One landmark system, VALL-E, was trained on 60,000 hours of recorded speech — hundreds of times more than earlier systems — and yet scale is exactly the trap. A model trained mostly on standard American or British English will faithfully, and incorrectly, stamp that accent onto everything, because those are the accents most common in the data it ate. A surname it has never encountered is a coin flip. The model is not malfunctioning; it is doing precisely what its training distribution taught it to do.

The polyphone problem is the sharpest version of this for Chinese, and it is measurable. Researchers built classifiers specifically to guess the right reading of a multi-pronunciation character from context, and a well-known study lifted polyphone accuracy to 96.35%, up from 81.22% for a naive frequency-based guess. That is a real leap, but also a quiet admission. At 96.35%, even a strong system still misreads roughly 3 or 4 of every 100 polyphonic characters, and in Taiwan Mandarin or Cantonese, where readings and tones carry the meaning, a single wrong syllable can turn a clear sentence into a puzzle. A listener who knows the language hears it instantly. The software, by design, does not.
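The gap between a frequency-based guess and a context-aware one can be sketched with a single character. The example uses 行, read "xing2" in 旅行 (travel) and "hang2" in 银行 (bank). The lookup tables below are hand-written assumptions standing in for a trained classifier; they are not the method or the data from the study mentioned above.

```python
# Assumption for this sketch: "xing2" is treated as the default reading.
MOST_FREQUENT = {"行": "xing2"}

# Hand-written neighbor rules, a stand-in for a learned context classifier.
CONTEXT_RULES = {
    "行": {
        ("prev", "银"): "hang2",   # 银行, bank
        ("next", "业"): "hang2",   # 行业, industry
    },
}


def baseline_reading(text: str, i: int) -> str:
    """Frequency-based guess: always pick the default reading."""
    return MOST_FREQUENT[text[i]]


def context_reading(text: str, i: int) -> str:
    """Check the neighboring characters first, then fall back to the default."""
    char = text[i]
    prev_char = text[i - 1] if i > 0 else ""
    next_char = text[i + 1] if i + 1 < len(text) else ""
    rules = CONTEXT_RULES.get(char, {})
    return (rules.get(("prev", prev_char))
            or rules.get(("next", next_char))
            or MOST_FREQUENT[char])


print(baseline_reading("银行", 1))   # xing2, the wrong reading for "bank"
print(context_reading("银行", 1))    # hang2
print(context_reading("旅行", 1))    # xing2
```

Any word the rules (or a trained model) never saw falls straight back to the default reading, which is how a strong system still ends up wrong on a slice of characters.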

These are not vague worries; they are four specific, locatable failure points: a normalization stage that guesses at intent, a G2P stage that can pick the wrong reading, a name it never learned, and an accent the data tilted toward. This is exactly the map of where a native speaker's ear earns its keep. At Onyx, every AI-generated delivery is checked by a native speaker of the target language before it ships: someone who catches the polyphone that flipped, the surname that came out garbled, the Taiwan or Hong Kong line that drifted into a mainland cadence. The machine produces the draft in seconds; the human guarantees it is the version you would actually say out loud.

That is the whole reason behind our line, "AI-Generated. Human-Perfected." We are not nostalgic about keeping a person in the loop, and we are not skeptical of the technology. We use it every day, and we love what it can do at speed and scale. We simply know the pipeline well enough to know its four blind spots, and we know that in Taiwan Mandarin, Cantonese, and the 40-plus languages we work in, the difference between impressive and correct is a native ear. If you want voice that is fast and right, that is the promise: the machine gives you the speed, and a human who actually speaks the language gives you the trust.
