Low-rate speech tokens are starting to look modelable.

Timpani Phase 5 is testing a compact speech representation: tokenize WavLM features at 9.375 Hz, add a small residual token stream, train a language model over the two streams, then render the tokens back to mel spectrograms and audio.

Summary: WavLM/PCA semantic tokens, 9.375 Hz token rate, k1024 semantic + k64 residual codebooks, LM baseline 9.206 bits/frame.

Speech audio: 10 second LibriSpeech clips (24 kHz)
WavLM features: PCA192, pooled to a slow grid (9.375 Hz)
Semantic tokens: k-means codebook over WavLM/PCA (k1024)
+ Residual stream: small acoustic correction token (k64)
Factorized LM → mel decoder → BigVGAN: predict, render, listen (audio)
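
To put those numbers in context, here is the arithmetic they imply. This is a back-of-envelope sketch, not output from the project's code. It assumes the reported 9.206 bits/frame is the joint cost of the semantic and residual token for one frame, and it compares against a uniform-codebook ceiling, which is a looser reference than the unigram baseline mentioned later.

```python
import math

# Values quoted in the post.
TOKEN_RATE_HZ = 9.375
CLIP_SECONDS = 10.0
SAMPLE_RATE_HZ = 24_000
K_SEMANTIC = 1024
K_RESIDUAL = 64
LM_BITS_PER_FRAME = 9.206  # assumed to be semantic + residual bits for one frame

# One token frame spans this many audio samples at 24 kHz.
samples_per_frame = SAMPLE_RATE_HZ / TOKEN_RATE_HZ

# Frames in one 10 second clip.
frames_per_clip = TOKEN_RATE_HZ * CLIP_SECONDS

# Cost of sending both tokens with no model at all (uniform over each codebook).
uniform_bits_per_frame = math.log2(K_SEMANTIC) + math.log2(K_RESIDUAL)
uniform_bits_per_second = uniform_bits_per_frame * TOKEN_RATE_HZ

# The same rate under the LM's reported cross-entropy.
lm_bits_per_second = LM_BITS_PER_FRAME * TOKEN_RATE_HZ

print(f"{samples_per_frame:.0f} samples/frame, {frames_per_clip:.2f} frames/clip")
print(f"uniform: {uniform_bits_per_frame:.1f} bits/frame = {uniform_bits_per_second:.1f} bits/s")
print(f"LM:      {LM_BITS_PER_FRAME:.3f} bits/frame = {lm_bits_per_second:.1f} bits/s")
```

Under these assumptions each frame covers 2560 samples, a clip is just under 94 frames, and the LM brings the token stream from 150 bits/s down to roughly 86 bits/s.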

The core idea

Instead of asking a reconstruction codec to discover semantic units as a side effect, Phase 5 starts from a strong speech teacher. WavLM features already carry phonetic and speaker/prosody information, so the experiment compresses them directly into a low-frequency token stream.
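
The tokenization step can be sketched as nearest-centroid assignment against a k-means codebook. The snippet below is an illustration only: `assign_tokens` is a hypothetical helper, the codebook is random stand-in data rather than a trained one, Euclidean distance is an assumption, and the WavLM extraction, PCA, and pooling steps are not shown.

```python
import numpy as np

def assign_tokens(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature frame (T, D) to the index of its nearest centroid (K, D)."""
    # Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2, shape (T, K).
    x2 = (features ** 2).sum(axis=1, keepdims=True)
    c2 = (codebook ** 2).sum(axis=1)[None, :]
    d2 = x2 - 2.0 * features @ codebook.T + c2
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)

# Stand-in for a trained k1024 codebook over 192-dimensional PCA features.
codebook = rng.normal(size=(1024, 192))

# Three fake frames, each a slightly perturbed copy of a known centroid.
features = codebook[[3, 500, 1023]] + 0.01 * rng.normal(size=(3, 192))

tokens = assign_tokens(features, codebook)
print(tokens)
```

Each perturbed frame maps back to the centroid it was built from, which is all the quantizer does at inference time: one integer per slow-grid frame.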

The semantic stream carries content. The residual stream is intentionally small and captures local acoustic detail that the semantic token alone misses. The LM predicts the next semantic token first, then the residual token conditioned on that semantic choice.
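
Written out, that ordering factorizes the per-frame distribution as p(semantic, residual | context) = p(semantic | context) × p(residual | semantic, context), so the per-frame cost is the sum of two cross-entropies. The sketch below computes that cost in bits from two sets of logits. The function names and shapes are hypothetical, and it assumes teacher forcing, meaning the residual logits were already produced with the true semantic token as input.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def factorized_bits_per_frame(sem_logits, res_logits, sem_targets, res_targets) -> float:
    """Mean of -log2 p(semantic) - log2 p(residual | semantic) over frames.

    sem_logits: (T, K_sem); res_logits: (T, K_res), conditioned on the true
    semantic token for each frame (teacher forcing, an assumption here).
    """
    frame = np.arange(len(sem_targets))
    sem_nll = -log_softmax(sem_logits)[frame, sem_targets]
    res_nll = -log_softmax(res_logits)[frame, res_targets]
    return float((sem_nll + res_nll).mean() / np.log(2.0))  # nats to bits

# Sanity check: all-zero logits mean uniform predictions over both codebooks.
T = 8
rng = np.random.default_rng(0)
sem_targets = rng.integers(0, 1024, size=T)
res_targets = rng.integers(0, 64, size=T)
uniform_bits = factorized_bits_per_frame(
    np.zeros((T, 1024)), np.zeros((T, 64)), sem_targets, res_targets
)
print(round(uniform_bits, 6))
```

With uniform predictions the cost is log2(1024) + log2(64) = 16 bits/frame, which is the ceiling any trained model should come in under.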

What seems to be working

Tokens contain content

A direct token-content probe on held-out data found that the semantic stream predicts phonemes far better than a majority-class baseline.

65.2% phoneme frame accuracy
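
One simple way to run a probe like this is to map each token to the phoneme it most often co-occurs with on training frames, then score that mapping on held-out frames. The sketch below shows that version next to a majority-class baseline. It is an assumption about the probe's form, not a description of the one actually used, and the function names and toy data are made up for illustration.

```python
import numpy as np

def majority_baseline(train_labels, test_labels, n_classes: int) -> float:
    """Accuracy of always predicting the most frequent training label."""
    most_common = np.bincount(train_labels, minlength=n_classes).argmax()
    return float((test_labels == most_common).mean())

def token_majority_probe(train_tokens, train_labels, test_tokens, test_labels,
                         n_tokens: int, n_classes: int) -> float:
    """Accuracy of predicting each token's most frequent training label."""
    counts = np.zeros((n_tokens, n_classes), dtype=np.int64)
    np.add.at(counts, (train_tokens, train_labels), 1)  # co-occurrence table
    token_to_label = counts.argmax(axis=1)
    return float((token_to_label[test_tokens] == test_labels).mean())

# Toy data: 4 tokens, 2 phoneme classes, tokens {0, 1} -> class 0, {2, 3} -> class 1.
train_tokens = np.array([0, 1, 2, 3, 0, 1])
train_labels = np.array([0, 0, 1, 1, 0, 0])
test_tokens = np.array([0, 2, 3, 1])
test_labels = np.array([0, 1, 1, 0])

baseline_acc = majority_baseline(train_labels, test_labels, n_classes=2)
probe_acc = token_majority_probe(train_tokens, train_labels,
                                 test_tokens, test_labels,
                                 n_tokens=4, n_classes=2)
print(baseline_acc, probe_acc)
```

The gap between the two numbers is the quantity of interest: it measures how much phonetic information a single token carries with no model on top.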

The stream is trainable

A plain factorized cross-entropy LM strongly improves over unigram prediction.

9.206 bits/frame best CE

Rendered content is plausible

Teacher-token reconstructions and short rollouts often produce intelligible, content-related speech.

0.6185 held-out mel L1
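
For reference, a mel L1 score of this kind is usually the mean absolute difference between predicted and target mel values. The sketch below assumes a plain mean over all bins and frames. The mel scaling, normalization, and bin count behind the 0.6185 figure are not stated in this post, so the shapes here are placeholders.

```python
import numpy as np

def mel_l1(pred_mel: np.ndarray, target_mel: np.ndarray) -> float:
    """Mean absolute error between two mel spectrograms of shape (n_mels, T)."""
    if pred_mel.shape != target_mel.shape:
        raise ValueError(f"shape mismatch: {pred_mel.shape} vs {target_mel.shape}")
    return float(np.abs(pred_mel - target_mel).mean())

# Toy check: a prediction that is off by 0.5 in every bin scores exactly 0.5.
target = np.zeros((80, 4))      # hypothetical 80-bin mel, 4 frames
pred = np.full((80, 4), 0.5)
toy_l1 = mel_l1(pred, target)
print(toy_l1)
```

Because the score averages over everything, it says little about where the error sits in time or frequency, which matters for the artifact described below.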

Listen

These are short held-out render samples. The reference clips are target mels vocoded through BigVGAN; the generated clips come from either the token renderer or the continuous renderer.

Sample 000 - reference

Clean reference vocoded from target mel.

Sample 000 - token reconstruction

Semantic k1024 + residual k64 tokens through the mel decoder.

Sample 015 - reference

Another held-out reference clip.

Sample 015 - token reconstruction

A stronger content example from the same renderer.

The current problem

There is still a buzz

Generated audio can have a persistent high-pitched buzz or crackle. It appears in both token-based and continuous-WavLM renderers, while target mels through BigVGAN sound clean.

Not just BigVGAN

Reference mels vocoded through BigVGAN are clean, so the base vocoder path is probably not the primary failure.

Not just high bands

Band-swap diagnostics point mostly at low/mid predicted mel structure, with high bands only secondary.
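
A band swap of this kind can be sketched as replacing a range of mel bins in the predicted spectrogram with the reference bins, vocoding the hybrid, and listening for whether the buzz goes away. The helper below is a hypothetical illustration. The band edges and bin count are placeholders, not the ones used in the diagnostics.

```python
import numpy as np

def band_swap(pred_mel: np.ndarray, ref_mel: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Return a copy of pred_mel with mel bins [lo, hi) taken from ref_mel.

    Both inputs have shape (n_mels, T). The inputs are not modified.
    """
    if pred_mel.shape != ref_mel.shape:
        raise ValueError("pred_mel and ref_mel must have the same shape")
    hybrid = pred_mel.copy()
    hybrid[lo:hi] = ref_mel[lo:hi]
    return hybrid

# Toy spectrograms: predicted is all zeros, reference is all ones.
pred = np.zeros((8, 5))
ref = np.ones((8, 5))

# Reference low/mid bands with predicted high bands.
ref_low_pred_high = band_swap(pred, ref, 0, 4)
# Predicted low/mid bands with reference high bands.
pred_low_ref_high = band_swap(pred, ref, 4, 8)
print(ref_low_pred_high[:, 0], pred_low_ref_high[:, 0])
```

If the hybrid with reference low/mid bands sounds clean while the one with reference high bands still buzzes, the artifact is being driven by the predicted low/mid structure, which matches the finding above.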

Not solved by loss tweaks

Delta losses, low-motion losses, calibration, vocoder adaptation, and a first flow decoder probe have all been tried. None of them removed the buzz as judged by human listening.
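
As an example of the kind of objective in that list, a delta loss penalizes mismatch in frame-to-frame change rather than in absolute level. The sketch below uses an L1 penalty on first differences along time. This is one common form and an assumption here; the exact losses tried in Phase 5 are not specified in this post.

```python
import numpy as np

def delta_l1(pred_mel: np.ndarray, target_mel: np.ndarray) -> float:
    """L1 distance between first-order time differences of (n_mels, T) mels."""
    pred_delta = np.diff(pred_mel, axis=1)
    target_delta = np.diff(target_mel, axis=1)
    return float(np.abs(pred_delta - target_delta).mean())

target = np.arange(12, dtype=float).reshape(2, 6)

# A constant offset changes the level but not the trajectory.
offset_delta = delta_l1(target + 0.5, target)

# Frame-to-frame jitter (every other frame raised by 0.5) is penalized.
jittery = target.copy()
jittery[:, ::2] += 0.5
jitter_delta = delta_l1(jittery, target)
print(offset_delta, jitter_delta)
```

The offset case scores zero while the jitter case does not, which is why this family of losses seemed like a natural first attempt at a frame-rate artifact such as a buzz.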

Buzz diagnostics

The artifact is easiest to hear by comparing the reference with generated mels on the same sample, then comparing token and continuous WavLM renderers.

Artifact sample 005 - reference

Target mel through BigVGAN; this is the clean control.

Artifact sample 005 - prediction

Generated mel through BigVGAN; this exposes the buzz/crackle.

Token renderer

Quantized token renderer from the vocoder interface audit.

Continuous WavLM renderer

Continuous WavLM/PCA renderer; useful for separating tokenization from rendering.

Where this leaves the project

The representation is promising: low-rate WavLM-derived tokens are content-rich, compact, and learnable by an LM. The remaining bottleneck is the audio renderer: predicted mel trajectories are close enough to carry words, but not yet clean enough to vocode without artifacts. The next useful diagnostics are alternate vocoders on the same predicted mels, residual/refiner-style mel detail models, and stronger vocoder-sensitive trajectory objectives.