Low-rate speech tokens are starting to look modelable.

Timpani Phase 5 is testing a compact speech representation: tokenize WavLM features at 9.375 Hz, add a small residual stream, train a language model over the pair, then render back to mel and audio.

Pipeline at a glance (WavLM/PCA semantic tokens, 9.375 Hz token rate, k1024 semantic + k64 residual, LM baseline 9.206 bits/frame):

Speech audio: 10-second LibriSpeech clips, 24 kHz
WavLM features: PCA192, pooled to a slow grid, 9.375 Hz
Semantic tokens: k-means codebook over WavLM/PCA, k1024
Residual stream: small acoustic correction token, k64
Factorized LM → mel decoder → BigVGAN → audio: predict, render, listen

The core idea

Instead of asking a reconstruction codec to discover semantic units as a side effect, Phase 5 starts from a strong speech teacher. WavLM features already carry phonetic and speaker/prosody information, so the experiment compresses them directly into a low-frequency token stream.
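As a rough illustration of "compress teacher features into a low-frequency token stream", the sketch below projects features with PCA, averages them onto a slower time grid, and assigns each pooled frame to its nearest codebook entry. Everything here is a toy stand-in: the feature matrix is random noise rather than WavLM output, the PCA is fit on a single clip rather than a corpus, the codebook is random rather than k-means-trained, and the function names (`pca_project`, `pool_to_grid`, `assign_tokens`) are hypothetical. The real pipeline's pooling and codebook training are not described in enough detail to reproduce here.

```python
import numpy as np

def pca_project(feats, n_components):
    # Center the features, then project onto the top principal directions (via SVD).
    centered = feats - feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def pool_to_grid(feats, n_out):
    # Average consecutive frames into n_out roughly equal time bins.
    edges = np.linspace(0, len(feats), n_out + 1).round().astype(int)
    return np.stack([feats[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])

def assign_tokens(feats, codebook):
    # Nearest-centroid lookup using squared Euclidean distance to every code.
    d2 = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(500, 32))   # toy stand-in for teacher features
projected = pca_project(teacher_feats, 8)    # toy size; the post uses PCA192
pooled = pool_to_grid(projected, 94)         # 94 frames per 10 s clip
codebook = rng.normal(size=(16, 8))          # toy size; the post uses k1024
tokens = assign_tokens(pooled, codebook)
```

With a trained codebook in place of the random one, `tokens` would play the role of the semantic stream for one clip.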

The semantic stream carries content. The residual stream is intentionally small and captures local acoustic detail that the semantic token alone misses. The LM predicts the next semantic token first, then the residual token conditioned on that semantic choice.
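The semantic-then-residual ordering amounts to factorizing each frame as p(s, r) = p(s) · p(r | s), so the per-frame cost in bits is the sum of the two cross-entropies. A minimal sketch of that scoring, assuming the logits arrive as plain arrays (the function names and shapes are illustrative, not the project's actual code):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def factorized_bits_per_frame(sem_logits, res_logits, sem_tokens, res_tokens):
    """Mean of -log2 p(s_t) - log2 p(r_t | s_t) over frames.

    sem_logits: (T, K_sem) logits for the semantic token at each frame.
    res_logits: (T, K_res) logits for the residual token, assumed to have been
                computed after conditioning on the true semantic token.
    """
    t = np.arange(len(sem_tokens))
    sem_lp = log_softmax(sem_logits)[t, sem_tokens]
    res_lp = log_softmax(res_logits)[t, res_tokens]
    nats = -(sem_lp + res_lp).mean()
    return nats / np.log(2.0)

# Sanity check: uniform predictions cost log2(1024) + log2(64) = 16 bits/frame.
T = 94
rng = np.random.default_rng(0)
sem = rng.integers(0, 1024, size=T)
res = rng.integers(0, 64, size=T)
uniform_bits = factorized_bits_per_frame(
    np.zeros((T, 1024)), np.zeros((T, 64)), sem, res
)
```

The 16 bits/frame uniform ceiling gives a fixed reference point for reading any bits/frame figure on this token pair.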

Experiment setup

The current checkpoint is a compact 10-second speech-token experiment, trained on LibriSpeech clips and evaluated on a separate 2k held-out cache.

Training data: 128,006 ten-second crops from the codebook-train split; about 355.6 hours represented.
Token stream: 94 low-rate frames per clip at 9.375 Hz; k1024 WavLM/PCA semantic token plus k64 residual token.
Language model: factorized transformer, d512/l8/h8, 26.6M params, 8k steps; predicts semantic then residual.
Renderer: token-to-mel decoder, d512/l8/h8 plus 4 conv layers, 31.3M params, then BigVGAN vocoding.
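The numbers in this setup can be cross-checked with a few lines of arithmetic. The sketch below assumes the 94 frames come from rounding 93.75 up (padding or a different framing rule would also give 94), and treats the uniform-codebook cost as a simple upper reference rather than anything the experiment reported.

```python
import math

TOKEN_RATE_HZ = 9.375
CLIP_SECONDS = 10
N_CLIPS = 128_006
K_SEM, K_RES = 1024, 64
LM_BITS_PER_FRAME = 9.206  # best CE reported in this post

# 10 s * 9.375 Hz = 93.75, rounded up to whole frames (assumption).
frames_per_clip = math.ceil(CLIP_SECONDS * TOKEN_RATE_HZ)

# Total audio represented by the training crops, in hours.
hours = N_CLIPS * CLIP_SECONDS / 3600

# Cost of coding both tokens with no model at all (uniform over each codebook).
uniform_bits = math.log2(K_SEM) + math.log2(K_RES)

# Bits per second of speech: raw token stream vs. LM cross-entropy.
raw_bitrate = TOKEN_RATE_HZ * uniform_bits
lm_bitrate = TOKEN_RATE_HZ * LM_BITS_PER_FRAME

print(frames_per_clip, round(hours, 1), uniform_bits, raw_bitrate, round(lm_bitrate, 1))
```

So the raw token pair is a 150 bits/s stream, and the LM's cross-entropy corresponds to roughly 86 bits/s.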

What seems to be working

Tokens contain content

A direct token-content probe on held-out data found that the semantic stream predicts phonemes far above a majority-class baseline.

65.2% phoneme frame accuracy
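The probe itself is not described here, so the following is only one plausible reading: a lookup-table probe that maps each token to its most frequent phoneme on training frames, scored against an always-predict-the-top-phoneme baseline. The data is synthetic and the function names are hypothetical.

```python
import numpy as np

def fit_token_to_phoneme(tokens, phonemes, n_tokens, n_phonemes):
    # Count token/phoneme co-occurrences, then map each token to its most
    # frequent phoneme. Tokens never seen in training fall back to phoneme 0.
    counts = np.zeros((n_tokens, n_phonemes), dtype=np.int64)
    np.add.at(counts, (tokens, phonemes), 1)
    return counts.argmax(axis=1)

def frame_accuracy(pred, target):
    return float((pred == target).mean())

def majority_baseline(train_phonemes, test_phonemes, n_phonemes):
    # Always predict the most common training phoneme.
    top = np.bincount(train_phonemes, minlength=n_phonemes).argmax()
    return float((test_phonemes == top).mean())

# Synthetic data: 5 phonemes, 10 tokens, each phoneme owning two tokens.
def toy_split(n_frames):
    idx = np.arange(n_frames)
    phonemes = idx % 5
    tokens = phonemes * 2 + idx % 2
    return tokens, phonemes

train_tok, train_ph = toy_split(100)
test_tok, test_ph = toy_split(50)

table = fit_token_to_phoneme(train_tok, train_ph, n_tokens=10, n_phonemes=5)
probe_acc = frame_accuracy(table[test_tok], test_ph)
base_acc = majority_baseline(train_ph, test_ph, n_phonemes=5)
```

On this toy data the lookup is perfect by construction; on real tokens the gap between `probe_acc` and `base_acc` is the quantity of interest.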

The stream is trainable

A plain factorized cross-entropy LM strongly improves over unigram prediction.

9.206 bits/frame best CE

Rendered content is plausible

Teacher-token reconstructions and short rollouts often produce intelligible, content-related speech.

0.6185 held-out mel L1

LM rollout listening checks

These held-out examples show the actual conditional test. The model sees the prefix, then we compare the true future suffix, the decoder's teacher-token rendering of that suffix, and the language model's autoregressive suffix. Transcripts are automated Gemini /listen outputs, so they are approximate.
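The autoregressive suffix is produced by sampling one frame at a time: semantic token first, then the residual token given that choice. A minimal sketch of that loop, with the two model heads replaced by hypothetical callables (`sem_logits_fn`, `res_logits_fn`) and a deterministic toy model so the behavior is easy to follow:

```python
import numpy as np

def sample(logits, rng):
    # Softmax sampling from unnormalized logits.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def rollout(prefix, n_new, sem_logits_fn, res_logits_fn, rng):
    """Extend a prefix of (semantic, residual) pairs by n_new sampled frames."""
    seq = list(prefix)
    for _ in range(n_new):
        s = sample(sem_logits_fn(seq), rng)     # semantic token first
        r = sample(res_logits_fn(seq, s), rng)  # residual conditioned on it
        seq.append((s, r))
    return seq

K_SEM, K_RES = 1024, 64

def toy_sem_logits(seq):
    # Toy stand-in: puts all probability mass on "previous semantic token + 1".
    logits = np.full(K_SEM, -1e9)
    logits[(seq[-1][0] + 1) % K_SEM] = 0.0
    return logits

def toy_res_logits(seq, s):
    # Toy stand-in: residual is a fixed function of the chosen semantic token.
    logits = np.full(K_RES, -1e9)
    logits[s % K_RES] = 0.0
    return logits

rng = np.random.default_rng(0)
out = rollout([(5, 5)], 3, toy_sem_logits, toy_res_logits, rng)
```

In the listening checks, the teacher-token rendering skips this loop and decodes the true suffix tokens directly, which is why it isolates renderer quality from LM quality.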

LM rollout, sample 001

active suffix, content drift in AR

Prefix

/listen
But the feelings which made such a composure a disgrace, left her in no danger of incurring it. She was awake

True suffix

/listen
the whole night, and she wept the greatest part of it.

Generated suffix (teacher tokens)

/listen
the whole night, and she wept the greatest cry.

Generated suffix (AR rollout)

/listen
looking up and said, but

LM rollout, sample 017

teacher tokens preserve named content

Prefix

/listen
considering her sister's youth, and urged the matter farther, but in vain. Common sense, common care, common

True suffix

/listen
prudence were all sunk in Mrs. Dashwood's romantic

Generated suffix (teacher tokens)

/listen
brilliance, we all sunk in Mrs. Dashwood's good.

Generated suffix (AR rollout)

/listen
and then he had

LM rollout, sample 027

plausible speech, weak continuation

Prefix

/listen
Amongst the objects in the scene, they soon discovered an animated one. It was a man on horseback riding towards

True suffix

/listen
In a few minutes, they could distinguish him to be a gentleman.

Generated suffix (teacher tokens)

/listen
in a tournament they could distinguish him.

Generated suffix (AR rollout)

/listen
day and the man of the third of the man

The current problem

There is still a buzz

Generated audio can have a persistent high-pitched buzz or crackle. It appears with both the token-based and the continuous-WavLM renderers, whereas target mels vocoded through BigVGAN sound clean.

Not just BigVGAN

Reference mels vocoded through BigVGAN are clean, so the base vocoder path is probably not the primary failure.

Not just high bands

Band-swap diagnostics point mostly at low/mid predicted mel structure, with high bands only secondary.
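How the band-swap diagnostics were run is not spelled out, so this is a sketch of one natural version: build hybrid mels that take a chosen range of mel bins from the prediction and everything else from the reference, then vocode each hybrid and listen. The 100-bin mel and the split at bin 60 are arbitrary illustration values, and `band_swap` is a hypothetical helper.

```python
import numpy as np

def band_swap(reference_mel, predicted_mel, lo_bin, hi_bin):
    """Return the reference mel with bins [lo_bin, hi_bin) taken from the prediction."""
    if reference_mel.shape != predicted_mel.shape:
        raise ValueError("mels must share the same (n_mels, n_frames) shape")
    hybrid = reference_mel.copy()
    hybrid[lo_bin:hi_bin] = predicted_mel[lo_bin:hi_bin]
    return hybrid

n_mels, n_frames, split = 100, 40, 60        # illustration values only
reference = np.zeros((n_mels, n_frames))     # stand-in for the target mel
predicted = np.ones((n_mels, n_frames))      # stand-in for the decoder output

low_mid_from_pred = band_swap(reference, predicted, 0, split)
high_from_pred = band_swap(reference, predicted, split, n_mels)
```

If the buzz follows `low_mid_from_pred` after vocoding and mostly disappears in `high_from_pred`, the predicted low/mid structure is implicated, which matches the finding above.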

Not solved by loss tweaks

Delta losses, low-motion losses, calibration, vocoder adaptation, and a first flow decoder probe were all tried; none removed the buzz in human listening checks.

Buzz diagnostics

The artifact is easiest to hear by comparing the reference with generated mels on the same sample, then comparing token and continuous WavLM renderers.

Artifact sample 005 - reference

Target mel through BigVGAN; this is the clean control.

Artifact sample 005 - prediction

Generated mel through BigVGAN; this exposes the buzz/crackle.

Token renderer

Quantized token renderer from the vocoder interface audit.

Continuous WavLM renderer

Continuous WavLM/PCA renderer; useful for separating tokenization from rendering.

Where this leaves the project

The representation is promising: low-rate WavLM-derived tokens are content-rich, compact, and learnable by an LM. The remaining bottleneck is the audio renderer: predicted mel trajectories are close enough to carry words, but not yet clean enough to vocode without artifacts. The next useful diagnostics are alternate vocoders on the same predicted mels, residual/refiner-style mel detail models, and stronger vocoder-sensitive trajectory objectives.