Tokens contain content
A direct held-out token-content probe found the semantic stream is far above a majority baseline.
65.2% phoneme frame accuracy
Timpani Phase 5 is testing a compact speech representation: tokenize WavLM features at about 9.375 Hz, add a small residual stream, train a language model over the pair, then render back to mel and audio.
Instead of asking a reconstruction codec to discover semantic units as a side effect, Phase 5 starts from a strong speech teacher. WavLM features already carry phonetic and speaker/prosody information, so the experiment compresses them directly into a low-frequency token stream.
The semantic stream carries content. The residual stream is intentionally small and captures local acoustic detail that the semantic token alone misses. The LM predicts the next semantic token first, then the residual token conditioned on that semantic choice.
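The factorization described above, p(semantic, residual | context) = p(semantic | context) · p(residual | semantic, context), can be sketched as a per-frame loss. This is a minimal illustration and not the project's training code: the function names, the vocabulary sizes (256 and 16), and the assumption that the residual logits were computed with the true semantic token as input (teacher forcing) are all hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis (natural log)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def factorized_bits_per_frame(sem_logits, res_logits, sem_targets, res_targets):
    """Mean cross-entropy in bits/frame for a two-stream factorized LM.

    sem_logits: (T, V_sem) logits for the semantic token at each frame.
    res_logits: (T, V_res) logits for the residual token at each frame,
                assumed to be conditioned on the true semantic token.
    """
    frames = np.arange(len(sem_targets))
    sem_nll = -log_softmax(sem_logits)[frames, sem_targets]
    res_nll = -log_softmax(res_logits)[frames, res_targets]
    # Sum the two conditional terms per frame, then convert nats to bits.
    return float((sem_nll + res_nll).mean() / np.log(2.0))

# Toy check with hypothetical vocabulary sizes: a model that outputs
# uniform logits pays log2(256) + log2(16) bits on every frame.
T, V_SEM, V_RES = 4, 256, 16
targets_s = np.array([3, 7, 7, 1])
targets_r = np.array([0, 5, 2, 2])
uniform_bits = factorized_bits_per_frame(
    np.zeros((T, V_SEM)), np.zeros((T, V_RES)), targets_s, targets_r
)
print(uniform_bits)  # approximately 12 bits/frame
```

Reporting the sum of both terms per frame keeps the number comparable across different splits of capacity between the semantic and residual vocabularies.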
The current checkpoint is a compact 10-second speech-token experiment, trained on LibriSpeech clips and evaluated on a separate 2k held-out cache.
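To make a comparison against a majority baseline concrete, here is a minimal sketch of how a token-content probe could be scored on held-out frames. The source does not describe the probe itself, so the lookup-table probe, the function names, and the toy data below are assumptions for illustration only.

```python
import numpy as np

def fit_lookup_probe(tokens, labels, n_tokens, n_labels):
    """Map each token id to its most frequent phoneme label in the training frames."""
    counts = np.zeros((n_tokens, n_labels), dtype=np.int64)
    np.add.at(counts, (tokens, labels), 1)
    table = counts.argmax(axis=1)
    # Tokens never seen in training fall back to the overall majority label.
    table[counts.sum(axis=1) == 0] = counts.sum(axis=0).argmax()
    return table

def frame_accuracy(pred, gold):
    """Fraction of frames whose predicted label matches the reference label."""
    return float((np.asarray(pred) == np.asarray(gold)).mean())

def majority_baseline(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(np.asarray(train_labels), return_counts=True)
    majority = values[counts.argmax()]
    return float((np.asarray(eval_labels) == majority).mean())

# Toy frames: 3 token ids, 3 phoneme labels (purely illustrative).
train_tokens = np.array([0, 0, 1, 1, 2, 2, 2])
train_labels = np.array([0, 0, 1, 1, 2, 2, 0])
eval_tokens = np.array([0, 1, 2, 2])
eval_labels = np.array([0, 1, 2, 0])

table = fit_lookup_probe(train_tokens, train_labels, n_tokens=3, n_labels=3)
probe_acc = frame_accuracy(table[eval_tokens], eval_labels)
base_acc = majority_baseline(train_labels, eval_labels)
print(probe_acc, base_acc)  # 0.75 0.5
```

The gap between the two numbers, not the probe accuracy alone, is what indicates that the tokens carry phonetic content.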
A plain factorized cross-entropy LM strongly improves over unigram prediction.
9.206 bits/frame best CE
Teacher-token reconstructions and short rollouts often produce intelligible, content-related speech.
0.6185 held-out mel L1
These held-out examples show the actual conditional test. The model sees the prefix, then we compare the true future suffix, the decoder's teacher-token rendering of that suffix, and the language model's autoregressive suffix. Transcripts are automated Gemini /listen outputs, so they are approximate.
Example 1
Prefix: But the feelings which made such a composure a disgrace, left her in no danger of incurring it. She was awake
True suffix: the whole night, and she wept the greatest part of it.
Teacher-token rendering: the whole night, and she wept the greatest cry.
LM rollout: looking up and said, but

Example 2
Prefix: considering her sister's youth, and urged the matter farther, but in vain. Common sense, common care, common
True suffix: prudence were all sunk in Mrs. Dashwood's romantic
Teacher-token rendering: brilliance, we all sunk in Mrs. Dashwood's good.
LM rollout: and then he had

Example 3
Prefix: Amongst the objects in the scene, they soon discovered an animated one. It was a man on horseback riding towards
True suffix: In a few minutes, they could distinguish him to be a gentleman.
Teacher-token rendering: in a tournament they could distinguish him.
LM rollout: day and the man of the third of the man
Generated audio can have a persistent high-pitched buzz or crackle. It appears in both the token-based and the continuous-WavLM renderers.
Reference mels vocoded through BigVGAN are clean, so the base vocoder path is probably not the primary failure.
Band-swap diagnostics point mostly at low/mid predicted mel structure, with high bands only secondary.
Delta losses, low-motion losses, calibration, vocoder adaptation, and a first flow decoder probe did not remove the artifact in human listening checks.
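The band-swap diagnostic mentioned above can be sketched as follows: replace one frequency band of the predicted mel with the same band from the reference mel, vocode each hybrid, and listen for whether the buzz goes away. This is an illustrative sketch only. The 80-bin mel layout, the band edges, and the function names are assumptions, and the vocoding step is left out.

```python
import numpy as np

# Hypothetical band edges for an 80-bin mel; the actual split is not given in the source.
BANDS = {"low": (0, 27), "mid": (27, 54), "high": (54, 80)}

def band_swap(pred_mel, ref_mel, lo, hi):
    """Return a copy of pred_mel with mel bins [lo, hi) taken from ref_mel.

    Both inputs have shape (n_mels, n_frames).
    """
    if pred_mel.shape != ref_mel.shape:
        raise ValueError("predicted and reference mels must have the same shape")
    out = pred_mel.copy()
    out[lo:hi, :] = ref_mel[lo:hi, :]
    return out

def band_swap_variants(pred_mel, ref_mel, bands=BANDS):
    """Build one hybrid mel per band, each with only that band swapped in."""
    return {name: band_swap(pred_mel, ref_mel, lo, hi)
            for name, (lo, hi) in bands.items()}

# Toy data standing in for a reference mel and a noisy prediction of it.
rng = np.random.default_rng(0)
ref = rng.normal(size=(80, 50))
pred = ref + 0.1 * rng.normal(size=(80, 50))

variants = band_swap_variants(pred, ref)
for name, mel in variants.items():
    # Remaining L1 distance to the reference after swapping in this band.
    print(name, float(np.abs(mel - ref).mean()))
```

If the artifact disappears mainly in the hybrids where the low or mid band comes from the reference, that points at those bands of the predicted mel, which matches the direction of the finding reported above.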
The artifact is easiest to hear by comparing the reference with generated mels on the same sample, then comparing token and continuous WavLM renderers.
Target mel through BigVGAN; this is the clean control.
Generated mel through BigVGAN; this exposes the buzz/crackle.
Quantized token renderer from the vocoder interface audit.
Continuous WavLM/PCA renderer; useful for separating tokenization from rendering.
The representation is promising: low-rate WavLM-derived tokens are content-rich, compact, and learnable by an LM. The remaining bottleneck is the audio renderer: predicted mel trajectories are close enough to carry words, but not yet clean enough to vocode without artifacts. The next useful diagnostics are alternate vocoders on the same predicted mels, residual/refiner-style mel detail models, and stronger vocoder-sensitive trajectory objectives.