Emotion Analysis in DAISYS Speech Generation

Author

Team DAISYS

Category

Technical Post

In this technical post, we explore our approach to emotion analysis in speech synthesis and how we complement our in-house methods with external tools to enhance expressiveness and control. We'll take a closer look at this critical element of our pipeline and how it helps us generate speech that sounds more expressive, controllable, and human-like.


Why Emotion Matters in Voice Synthesis

One of our primary goals at DAISYS is to create high-quality synthetic voices that go beyond the usual TTS; we want them to be affective and controllable, capable of delivering speech in a wide range of emotional tones in a reliable and interpretable way.

To do this, we rely on curated datasets that emphasize natural expressiveness. These recordings go beyond a neutral tone of voice: they include variations in mood and delivery that reflect different attitudes, emotional states, and speaking contexts.

But having rich data is only part of the story. Extracting meaningful emotion-related information from voice signals remains a non-trivial challenge.

🔌 From Signal Features to Emotional Dimensions

Initially, our approach to emotion estimation focused on more traditional acoustic features like pitch, energy, and speaking rate, known correlates of emotional speech. A rise in pitch and an increase in speaking energy might, for example, indicate excitement or urgency.
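To make this concrete, here is a minimal sketch of how such first-pass correlates could be extracted with off-the-shelf signal processing. It uses the open-source librosa library and a hypothetical `sample.wav`; the onset rate stands in as a rough proxy for speaking rate, and none of this reflects our actual production feature pipeline.

```python
import librosa
import numpy as np

# Load a mono speech recording (hypothetical file; any short clip works).
y, sr = librosa.load("sample.wav", sr=16000)

# Pitch: frame-wise fundamental frequency via probabilistic YIN.
f0, _, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
mean_pitch_hz = float(np.nanmean(f0))  # pyin returns NaN for unvoiced frames

# Energy: frame-wise root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]
mean_energy = float(np.mean(rms))

# Speaking-rate proxy: acoustic onsets per second.
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
onset_rate = len(onsets) / (len(y) / sr)

print(f"pitch ~{mean_pitch_hz:.1f} Hz, energy ~{mean_energy:.4f}, "
      f"onsets/s ~{onset_rate:.2f}")
```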

Yet expression is not that simple: slowing down, for instance, can also convey emphasis, just in a different way, and these different modes of expression are not easily captured by basic signal analysis. Anger and excitement can both involve raised pitch and energy, but they differ significantly in valence and intent. Similarly, sarcasm may involve low arousal but high dominance, a nuance that simple acoustic metrics don't capture well.

To move beyond this, we integrated more specialized tools into our pipeline.


🛠️ Leveraging the Audeering devAIce Toolkit

One of the tools we now use is the Audeering devAIce SDK, which provides advanced emotion recognition capabilities based on continuous affective dimensions. Rather than only categorizing emotions into discrete buckets (e.g., happy, sad, angry), devAIce models vocal expression in terms of:

  • Arousal: how active or passive a voice sounds

  • Dominance: how confident or authoritative the speech appears

  • Valence: the positivity or negativity of the affect (optional dimension depending on context)

These continuous metrics give us a finer-grained view of emotional content in speech and allow for far more nuanced control in our synthesis process.
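As an illustration of why continuous dimensions are useful downstream, the sketch below represents an analyzed utterance as a small typed structure. The `AffectiveState` class, its field names, and the [0, 1] normalization are our own illustration, not the devAIce SDK's actual data types or output range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectiveState:
    """Continuous affective description of an utterance.

    Scores are assumed to be normalized to [0, 1]; the actual range and
    field names returned by an emotion-recognition toolkit may differ.
    """
    arousal: float    # passive (0.0) .. active (1.0)
    dominance: float  # submissive (0.0) .. authoritative (1.0)
    valence: float    # negative (0.0) .. positive (1.0)

    def __post_init__(self) -> None:
        for name in ("arousal", "dominance", "valence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")


# Two states that a purely discrete classifier might both label "happy":
mild_happiness = AffectiveState(arousal=0.55, dominance=0.40, valence=0.75)
intense_happiness = AffectiveState(arousal=0.90, dominance=0.70, valence=0.85)
```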

For example, when designing speech styles for a virtual assistant or narrative character, we can control not just whether the voice "sounds happy", but also how intense that happiness is, or whether it comes with a more submissive or assertive tone.


🔬 From Analysis to Control

Integrating affective analysis into our system not only improves voice modeling, but also unlocks new capabilities in speech generation. By embedding these affective dimensions into our control interface, we enable developers and end-users to directly specify emotional states in a structured way.
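As a sketch of what such a structured specification could look like, the request below uses a placeholder endpoint and parameter names; it is not the actual DAISYS API, only an illustration of passing affective dimensions alongside the text to be spoken.

```python
import requests

# Hypothetical payload shape: text plus continuous affective targets.
payload = {
    "text": "Your package has arrived!",
    "voice_id": "narrator-01",
    "affect": {
        "arousal": 0.8,    # lively delivery
        "dominance": 0.6,  # confident but not commanding
        "valence": 0.9,    # clearly positive
    },
}

response = requests.post(
    "https://api.example.com/v1/generate_speech",  # placeholder URL
    json=payload,
    timeout=30,
)
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)
```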

This will play an important role in our upcoming text-to-voice system. With it, users will be able to generate expressive speech with emotion levels that are interpretable, reliably reproduced, and adjustable in detail.