Database mining for flexible concatenative text-to-speech

Ellen M. Eide; Raul Fernandez

doi:10.1109/ICASSP.2007.367008

ICASSP 2007

Conference paper

06 Aug 2007

Abstract

In this paper we explore mining a concatenative text-to-speech database to exploit subtle, naturally-occurring stylistic and contextual variability for runtime synthesis. By making a desired style or context known to the search during synthesis, the cost function can be biased toward finding units which satisfy these additional criteria. Having the ability to bias the output of the synthesizer towards a particular voice quality, or other characteristic such as speaking rate, increases its flexibility and potential value. In this paper we illustrate the approach to synthesizing subtle speech variation by focusing on three aspects: prosodic structure (phrase-finalness), prosodic prominence (prosodic accent), and voice quality (breathiness). Target values for the first two of these are automatically generated, while the target value for breathiness is specified by the user. We present results which indicate the value of distinguishing our data along these dimensions, and discuss possible improvements and new uses in the future. © 2007 IEEE.
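The biasing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' cost function, which this page does not give. It assumes a hypothetical additive penalty: each candidate unit carries boolean annotations for phrase-finalness, accent, and breathiness, and every requested attribute the unit fails to match adds a weight to its cost. The names `Unit`, `base_cost`, `biased_cost`, `pick_unit`, and all numeric values are invented for illustration. The sketch also selects a unit for a single slot only, whereas a real unit-selection search would also weigh join costs across a whole sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A candidate database unit with hypothetical annotations."""
    name: str
    base_cost: float    # stand-in for the usual (unbiased) target cost
    phrase_final: bool  # prosodic structure
    accented: bool      # prosodic prominence
    breathy: bool       # voice quality


def biased_cost(unit, target, weights):
    """Add a weighted penalty for each requested attribute the unit misses.

    `target` maps attribute names to desired values; attributes that are
    not requested contribute nothing, so an empty target leaves the cost
    unbiased. The additive form is an assumption, not the paper's method.
    """
    cost = unit.base_cost
    for attr, wanted in target.items():
        if getattr(unit, attr) != wanted:
            cost += weights.get(attr, 0.0)
    return cost


def pick_unit(candidates, target, weights):
    """Return the candidate with the lowest biased cost for one slot."""
    return min(candidates, key=lambda u: biased_cost(u, target, weights))


# Illustrative data only: two candidates for the same slot.
candidates = [
    Unit("a", 1.0, phrase_final=False, accented=False, breathy=False),
    Unit("b", 1.2, phrase_final=True, accented=False, breathy=True),
]
weights = {"phrase_final": 0.5, "accented": 0.5, "breathy": 0.5}

# Request a phrase-final, breathy unit; "b" wins despite its higher base cost.
target = {"phrase_final": True, "breathy": True}
chosen = pick_unit(candidates, target, weights)
```

In this toy setup the weights control how strongly the search prefers matching units. With an empty `target` the selection falls back to the lowest base cost, which mirrors the abstract's description of the style or context as additional criteria layered on the existing search.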

Conference paper

Supervised and unsupervised approaches for controlling narrow lexical focus in sequence-to-sequence speech synthesis

Conference paper

Comparing Prosodic Frameworks: Investigating the Acoustic-Symbolic Relationship in ToBI and RaP

Conference paper

Synthesis of expressive speaking styles with limited training data in a multi-speaker, prosody-controllable sequence-to-sequence architecture

Conference paper

Stable checkpoint selection and evaluation in sequence to sequence speech synthesis
