
A Variational Framework for Improving Naturalness in Generative Spoken Language Models


Large language models (LLMs) have revolutionized text processing, but adapting them to speech presents unique challenges due to speech's continuous and complex nature. A new paper titled "A Variational Framework for Improving Naturalness in Generative Spoken Language Models" introduces an innovative approach to enhance speech generation by addressing the limitations of current token-based methods.

The Problem with Token-Based Speech Models

Speech is often discretized into tokens for autoregressive modeling, similar to text. These semantic tokens, derived from self-supervised models like HuBERT, primarily capture linguistic information (e.g., phonetics) but often neglect prosodic features such as pitch, energy, and spectral attributes. As a result, models trained on these tokens can generate speech that sounds robotic or unnatural.
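For context, here is a hedged sketch of the standard tokenization pipeline referred to above: frame-level HuBERT features quantized with k-means. The layer index and cluster count used here are common community choices, not values taken from the paper.

```python
# A minimal sketch of "semantic token" extraction: HuBERT features
# quantized with k-means. Layer 6 and 100 clusters are illustrative
# settings, not the paper's.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

def hubert_features(waveform: torch.Tensor, layer: int = 6) -> torch.Tensor:
    """Return frame-level features of shape (T, 768) from one HuBERT layer."""
    with torch.no_grad():
        feats, _ = hubert.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)

# Fit the quantizer on features pooled from a training corpus
# (a single mono file here, for brevity).
wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
feats = hubert_features(wav)
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())

# Discrete semantic tokens: one cluster id per ~20 ms frame. These keep
# phonetic content but discard most pitch, energy, and spectral detail.
tokens = kmeans.predict(feats.numpy())
```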

Existing solutions, like augmenting tokens with pitch features, are suboptimal because pitch alone doesn’t fully represent paralinguistic attributes. Moreover, manually engineering these features adds complexity and may not capture the full range of speech nuances.
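To make that baseline concrete, here is a hedged sketch of the kind of pitch augmentation the post refers to: log-F0 quantized into a small codebook and paired with each frame's semantic token. The extractor (pYIN), bin count, and hop length are illustrative choices, not taken from the paper.

```python
# Hedged illustration of a "Token-LM + Pitch" style baseline: a
# hand-engineered pitch stream fed alongside the semantic tokens.
import numpy as np
import librosa

def pitch_tokens(wav: np.ndarray, sr: int = 16000,
                 n_bins: int = 32, hop_length: int = 320) -> np.ndarray:
    """Quantize frame-level log-F0 into coarse bins; 0 marks unvoiced frames."""
    f0, voiced, _ = librosa.pyin(
        wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
        sr=sr, hop_length=hop_length,
    )
    f0 = np.nan_to_num(f0, nan=1.0)  # pyin returns NaN on unvoiced frames
    edges = np.linspace(np.log(65.0), np.log(1047.0), n_bins - 1)
    bins = np.digitize(np.log(f0), edges) + 1
    return np.where(voiced, bins, 0)
```

Even with this extra stream, energy and spectral attributes remain unmodeled, which is exactly the gap the learned continuous features are meant to close.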

The Proposed Solution: A Variational Approach

The paper proposes an end-to-end variational framework that automatically learns to encode continuous speech attributes, complementing semantic tokens without requiring hand-engineered features. Here’s how it works:

  1. Variational Autoencoder (VAE) Integration: The model learns continuous latent features (variational features) alongside discrete semantic tokens. These features are optimized to:
  • Reconstruct the input speech accurately.
  • Enhance the autoregressive modeling process by capturing prosodic and other paralinguistic cues.
  2. Balancing Reconstruction and Prediction: The framework introduces two hyperparameters, β and γ, to balance:
  • Reconstruction loss (L_rec): Ensures the variational features retain useful information for speech synthesis.
  • Prediction losses (L_kl^c and L_kl^d): Train the autoregressive model to predict the continuous variational features and the discrete semantic tokens, respectively.
  3. Normalizing Flows for Expressive Priors: A lightweight normalizing flow improves the autoregressive prior's ability to model complex distributions of variational features. (A minimal sketch of the combined objective follows this list.)
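Based on the description above, here is a minimal PyTorch sketch of how such an objective could be assembled. The function name, tensor shapes, Gaussian KL form, and default weights are assumptions for illustration, not the paper's exact code; the flow-based prior is noted in a comment rather than implemented.

```python
# Hedged sketch of the combined objective: reconstruction plus
# β/γ-weighted prediction terms for continuous and discrete streams.
import torch
import torch.nn.functional as F

def total_loss(recon_mel, target_mel,          # decoder output vs. input speech
               token_logits, target_tokens,    # AR prediction of semantic tokens
               post_mu, post_logvar,           # posterior over variational features
               prior_mu, prior_logvar,         # AR prior over variational features
               beta: float = 1.0, gamma: float = 1.0) -> torch.Tensor:
    # L_rec: reconstruct the input from semantic tokens + variational
    # features, so the latents must carry prosody the tokens discard.
    l_rec = F.l1_loss(recon_mel, target_mel)

    # L_kl^c: per-frame Gaussian KL between posterior and AR prior — how
    # predictable the continuous features are from past context. (The paper
    # additionally warps this prior with a lightweight normalizing flow,
    # which would contribute a log-determinant term here.)
    l_kl_c = 0.5 * (
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu) ** 2) / prior_logvar.exp()
        - 1.0
    ).sum(-1).mean()

    # L_kl^d: standard next-token loss for the discrete semantic tokens.
    l_kl_d = F.cross_entropy(token_logits.transpose(1, 2), target_tokens)

    # β and γ set the trade-off the paper tunes: informative latents
    # (low L_rec) vs. latents the AR model can actually predict (low L_kl^c).
    return l_rec + beta * l_kl_c + gamma * l_kl_d
```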

Key Results

The proposed method was evaluated on the LibriSpeech and Libri-Light datasets, comparing it against baseline approaches like Token-LM (semantic tokens only) and Token-LM + Pitch. Key findings include:

  • Improved Naturalness: Human raters preferred speech generated by the variational framework, giving it higher naturalness scores (N-MOS) than baselines.
  • Comparable Meaningfulness: The model maintained linguistic quality (M-MOS) while significantly improving prosody and expressiveness.
  • Better Reconstruction: The variational features helped reconstruct speech more faithfully, as measured by F0-RMSE (pitch accuracy) and MCD (spectral distortion); a sketch of how these metrics are commonly computed follows this list.
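For readers who want to run this kind of evaluation themselves, here is a hedged sketch of the two objective metrics. It uses librosa's pYIN tracker and MFCCs as a stand-in for the mel-cepstra more typically used for MCD; the paper's exact extraction and alignment recipe may differ.

```python
# Hedged sketches of F0-RMSE and MCD; frame settings are illustrative.
import numpy as np
import librosa

def f0_rmse(ref: np.ndarray, syn: np.ndarray, sr: int = 16000) -> float:
    """RMSE of F0 (Hz) over frames where both signals are voiced."""
    kwargs = dict(fmin=librosa.note_to_hz("C2"),
                  fmax=librosa.note_to_hz("C6"), sr=sr)
    f0_r, v_r, _ = librosa.pyin(ref, **kwargs)
    f0_s, v_s, _ = librosa.pyin(syn, **kwargs)
    n = min(len(f0_r), len(f0_s))
    both = v_r[:n] & v_s[:n]
    return float(np.sqrt(np.mean((f0_r[:n][both] - f0_s[:n][both]) ** 2)))

def mcd(ref: np.ndarray, syn: np.ndarray, sr: int = 16000,
        n_mfcc: int = 13) -> float:
    """Mel-cepstral distortion in dB; c0 (energy) is excluded, as is common."""
    c_r = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    c_s = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]
    n = min(c_r.shape[1], c_s.shape[1])
    diff = c_r[:, :n] - c_s[:, :n]
    return float((10.0 / np.log(10.0))
                 * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=0))))
```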

Why This Matters

This work bridges the gap between text-like discrete token modeling and the rich, continuous nature of speech. By learning variational features end-to-end, the model avoids the pitfalls of manual feature engineering and generates more natural-sounding speech. This has implications for:

  • Conversational AI: More expressive and human-like voice assistants.
  • Text-to-Speech (TTS): Higher-quality synthesis with better prosody.
  • Speech Compression: Efficient encoding of both linguistic and paralinguistic information.

Limitations and Future Work

The method’s performance is sensitive to hyperparameters (β and γ), and future work could explore automated tuning. Additionally, the framework hasn’t been tested on non-English languages, which may have different prosodic patterns.

Final Thoughts

This paper presents a significant step toward more natural generative speech models. By combining the strengths of variational autoencoders and token-based language modeling, it offers a flexible and scalable solution for improving speech synthesis. The code and models are available on GitHub, inviting further exploration and application in the field.

For more details, check out the full paper on arXiv.