Tokens contain content
A direct held-out token-content probe found that the semantic stream's phoneme accuracy is far above a majority baseline.
65.2% phoneme frame accuracy
Timpani Phase 5 is testing a compact speech representation: tokenize WavLM features at roughly 9.375 Hz, add a small residual stream, train a language model over the pair, then render back to mel and audio.
Instead of asking a reconstruction codec to discover semantic units as a side effect, Phase 5 starts from a strong speech teacher. WavLM features already carry phonetic and speaker/prosody information, so the experiment compresses them directly into a low-frequency token stream.
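As a rough illustration of compressing teacher features into a semantic token plus a small residual token, here is a minimal two-stage nearest-centroid sketch. It is an assumption, not the project's tokenizer: the page does not say how the residual stream is built, the codebooks below are random rather than trained, and the 16-dimensional frames are stand-ins for downsampled WavLM features. The names `nearest_code` and `tokenize` are hypothetical.

```python
import numpy as np

def nearest_code(x, codebook):
    """Index of the nearest codebook row (squared Euclidean) for each row of x."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed for all pairs at once.
    d = (x ** 2).sum(1, keepdims=True) - 2.0 * x @ codebook.T + (codebook ** 2).sum(1)
    return d.argmin(axis=1)

def tokenize(frames, semantic_cb, residual_cb):
    """Assumed scheme: quantize the frame, then quantize what the first code missed."""
    sem = nearest_code(frames, semantic_cb)
    leftover = frames - semantic_cb[sem]      # detail not captured by the semantic code
    res = nearest_code(leftover, residual_cb)
    return sem, res

rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 16))             # stand-in for low-rate teacher features
semantic_cb = rng.normal(size=(1024, 16))      # hypothetical, untrained codebook
residual_cb = 0.1 * rng.normal(size=(64, 16))  # hypothetical, untrained codebook
sem, res = tokenize(frames, semantic_cb, residual_cb)
print(sem[:5], res[:5])
```

In a real setup the codebooks would be fitted on held-in features (for example with k-means); the sketch only shows the shape of the two streams, one semantic index and one residual index per frame.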
The semantic stream carries content. The residual stream is intentionally small and captures local acoustic detail that the semantic token alone misses. The LM predicts the next semantic token first, then the residual token conditioned on that semantic choice.
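The semantic-then-residual ordering corresponds to factorizing each frame as p(semantic | history) · p(residual | semantic, history), so the per-frame cross-entropy is the sum of two terms. The following is a minimal NumPy sketch of how such a loss, and a unigram baseline to compare it against, could be scored in bits per frame. The function names, the add-one smoothing, and the assumption that the residual logits were produced under teacher forcing on the true semantic token are illustrative choices, not details taken from the experiment.

```python
import numpy as np

def log2_softmax(logits):
    """Numerically stable log-softmax, expressed in bits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return (z - np.log(np.exp(z).sum(axis=-1, keepdims=True))) / np.log(2.0)

def factorized_bits_per_frame(sem_logits, res_logits, sem, res):
    """sem_logits: (T, Ks). res_logits: (T, Kr), assumed conditioned on the true sem token."""
    t = np.arange(len(sem))
    sem_bits = -log2_softmax(sem_logits)[t, sem]
    res_bits = -log2_softmax(res_logits)[t, res]
    return float((sem_bits + res_bits).mean())

def unigram_bits_per_frame(train_tokens, test_tokens, vocab):
    """Cross-entropy of held-out tokens under add-one-smoothed training frequencies."""
    counts = np.bincount(train_tokens, minlength=vocab) + 1.0
    log2_p = np.log2(counts / counts.sum())
    return float(-log2_p[test_tokens].mean())

# Sanity check: uniform logits cost log2(1024) + log2(64) = 16 bits per frame.
T = 5
sem = np.zeros(T, dtype=int)
res = np.zeros(T, dtype=int)
bits = factorized_bits_per_frame(np.zeros((T, 1024)), np.zeros((T, 64)), sem, res)
print(round(bits, 6))
```

A trained model is useful to the extent that its figure falls below both the uniform cost and the summed unigram costs of the two streams.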
A plain factorized cross-entropy LM strongly improves over unigram prediction.
9.206 bits/frame best CE
Teacher-token reconstructions and short rollouts often produce intelligible, content-related speech.
0.6185 held-out mel L1
These are short held-out render samples. The reference clips are target mels through BigVGAN; the generated clips use the token or continuous renderer.
Clean reference vocoded from target mel.
Semantic k1024 + residual k64 tokens through the mel decoder.
Another held-out reference clip.
A stronger content example from the same renderer.
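The codebook sizes named in the captions above, combined with the 9.375 Hz frame rate, give a rough sense of how compact the token stream is. The arithmetic below is a sketch under two assumptions: that "k1024" and "k64" denote codebook sizes, and that the reported 9.206 bits/frame covers the semantic and residual tokens jointly.

```python
import math

FRAME_RATE_HZ = 9.375        # token frame rate stated on this page
SEMANTIC_CODES = 1024        # assumed meaning of "k1024"
RESIDUAL_CODES = 64          # assumed meaning of "k64"
LM_BITS_PER_FRAME = 9.206    # best LM cross-entropy reported on this page

# Upper bound: every frame spends the full index width of both codebooks.
raw_bits_per_frame = math.log2(SEMANTIC_CODES) + math.log2(RESIDUAL_CODES)
raw_bits_per_second = raw_bits_per_frame * FRAME_RATE_HZ

# Rate implied by the LM's cross-entropy, if it covers both streams.
lm_bits_per_second = LM_BITS_PER_FRAME * FRAME_RATE_HZ

print(f"raw: {raw_bits_per_frame:.0f} bits/frame, {raw_bits_per_second:.0f} bits/s")
print(f"LM:  {lm_bits_per_second:.1f} bits/s")
```

Under those assumptions the raw stream is 16 bits per frame, or 150 bits per second, and the LM figure works out to roughly 86 bits per second.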
Generated audio can have a persistent high-pitched buzz or crackle. It appears in both token-based and continuous-WavLM renderers, while target mels through BigVGAN sound clean.
Reference mels vocoded through BigVGAN are clean, so the base vocoder path is probably not the primary failure.
Band-swap diagnostics point mostly at low/mid predicted mel structure, with high bands only secondary.
Delta losses, low-motion losses, calibration, vocoder adaptation, and a first flow decoder probe did not remove it, as judged by human listening.
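A band-swap diagnostic of the kind mentioned above could be set up roughly as follows: copy one frequency band of the predicted mel into an otherwise clean target mel, vocode each hybrid, and listen for which band brings the artifact with it. This is a sketch, not the experiment's script. The swap direction, the (n_mels, frames) layout, the 80-bin mel size, and the band edges are all assumptions, and the vocoder call is left out on purpose.

```python
import numpy as np

def band_swap(target_mel, predicted_mel, lo, hi):
    """Return a copy of target_mel with mel bins [lo, hi) taken from predicted_mel."""
    assert target_mel.shape == predicted_mel.shape
    hybrid = target_mel.copy()
    hybrid[lo:hi, :] = predicted_mel[lo:hi, :]
    return hybrid

def mel_l1(a, b):
    """Mean absolute difference between two mel spectrograms."""
    return float(np.abs(a - b).mean())

rng = np.random.default_rng(0)
target = rng.normal(size=(80, 200))                   # stand-in target mel, (n_mels, frames)
predicted = target + 0.3 * rng.normal(size=(80, 200)) # stand-in predicted mel

bands = {"low": (0, 27), "mid": (27, 54), "high": (54, 80)}  # hypothetical band edges
hybrids = {name: band_swap(target, predicted, lo, hi) for name, (lo, hi) in bands.items()}

for name, hybrid in hybrids.items():
    # Each hybrid would be vocoded and auditioned; L1 is only a numeric side check.
    print(name, round(mel_l1(hybrid, target), 4))
```

If the buzz appears mainly in the low and mid hybrids, that agrees with the reading that predicted low/mid mel structure is the main culprit. The L1 numbers alone do not establish this, since the artifact was judged by ear.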
The artifact is easiest to hear by comparing the reference with generated mels on the same sample, then comparing token and continuous WavLM renderers.
Target mel through BigVGAN; this is the clean control.
Generated mel through BigVGAN; this exposes the buzz/crackle.
Quantized token renderer from the vocoder interface audit.
Continuous WavLM/PCA renderer; useful for separating tokenization from rendering.
The representation is promising: low-rate WavLM-derived tokens are content-rich, compact, and learnable by an LM. The remaining bottleneck is the audio renderer: predicted mel trajectories are close enough to carry words, but not yet clean enough to vocode without artifacts. The next useful diagnostics are alternate vocoders on the same predicted mels, residual/refiner-style mel detail models, and stronger vocoder-sensitive trajectory objectives.