Corpus building for data-driven TTS systems
Abstract
To build a data-driven Mandarin TTS system, we constructed a large, balanced Mandarin text-and-speech corpus, the IBM Mandarin TTS Corpus. The corpus is designed to support both statistical prosody modeling and the modeling of the context dependence of phonemic features. In the script-design stage, we investigated the choice of a proper synthetic unit. Based on that choice, we developed a numerical criterion for the coverage and balance of the variants of the synthetic units. In the speech-recording stage, we paid attention to speaking style, which is essential to building an effective concatenative speech synthesis system. We formulated a specification of speaking style and guided the speaker to follow it strictly. Corpus processing is another important step, in which we carefully carried out tasks such as pronunciation marking, segment alignment, and prosodic event labeling. We defined a set of hierarchical prosodic layers to describe various prosodic events. Because these steps often involve manual effort, the quality of the processed corpus depends both on proper specifications for each step and on the training of the operating team.