Discriminative training and unsupervised adaptation for labeling prosodic events with limited training data
Abstract
Many applications of spoken-language systems can benefit from having access to annotations of prosodic events. Unfortunately, obtaining human annotations of these events, even in the amounts needed to train a supervised system, is a laborious and costly effort. In this paper we explore applying conditional random fields to automatically label major and minor break indices and pitch accents in a corpus of recorded and transcribed speech, using a large set of fully automatically extracted acoustic and linguistic features. We demonstrate the robustness of these features in a discriminative training framework as the amount of training data is reduced. We also explore adapting the baseline system in an unsupervised fashion to a target dataset for which no prosodic labels are available, and show that, when only limited amounts of labeled data are available, this unsupervised approach can yield up to an additional 3% improvement.
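To make the CRF labeling setup concrete, the sketch below shows sequence labeling of pitch accents with per-word feature dictionaries. It is a minimal illustration only: the sklearn-crfsuite toolkit, the toy feature set (part of speech, z-scored duration, pitch range), the two hand-made utterances, and the binary accent labels are all assumptions for demonstration, not the paper's toolkit, feature set, or corpus.

```python
# Minimal sketch of CRF-based prosodic event labeling.
# NOTE: toolkit choice, features, labels, and data are illustrative
# assumptions; the paper uses a large set of automatically extracted
# acoustic and linguistic features on a real annotated corpus.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_features(utterance, i):
    """Toy per-word features standing in for the paper's feature set."""
    word = utterance[i]
    return {
        "word.lower": word["text"].lower(),
        "pos": word["pos"],                    # part-of-speech tag
        "dur_z": word["duration_z"],           # z-scored word duration
        "f0_range": word["f0_range"],          # pitch range over the word (Hz)
        "is_final": i == len(utterance) - 1,   # utterance-final position
    }

def featurize(utterance):
    return [word_features(utterance, i) for i in range(len(utterance))]

# Two hand-made utterances with made-up acoustic values (hypothetical data).
train_utterances = [
    [
        {"text": "The", "pos": "DT",  "duration_z": -0.8, "f0_range": 5.0},
        {"text": "cat", "pos": "NN",  "duration_z": 1.2,  "f0_range": 40.0},
        {"text": "sat", "pos": "VBD", "duration_z": 0.3,  "f0_range": 25.0},
    ],
    [
        {"text": "Dogs", "pos": "NNS", "duration_z": 1.0, "f0_range": 35.0},
        {"text": "bark", "pos": "VBP", "duration_z": 0.9, "f0_range": 30.0},
    ],
]
train_labels = [
    ["unaccented", "accented", "accented"],
    ["accented", "accented"],
]

X_train = [featurize(u) for u in train_utterances]
y_train = train_labels

# Discriminatively trained linear-chain CRF (L-BFGS with L1/L2 regularization).
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,
    c2=0.1,
    max_iterations=100,
)
crf.fit(X_train, y_train)

# Predict on the training utterances purely to demonstrate the API.
y_pred = crf.predict(X_train)
print(metrics.flat_accuracy_score(y_train, y_pred))
```

In the same spirit, break-index labeling would swap in break-index labels (e.g., minor vs. major breaks) as the target sequence while keeping the feature extraction and training loop unchanged.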