Speaking rate adaptation using continuous frame rate normalization
Abstract
This paper describes a speaking rate adaptation technique for automatic speech recognition. The technique aims to reduce speaking rate variation by applying temporal warping in front-end processing so that the average phone duration, measured in feature frames, remains constant. Speaking rate is estimated from the timing information in unadapted decoding outputs. We implement the proposed continuous frame rate normalization (CFRN) technique on a state-of-the-art speech recognition architecture and evaluate it on the most recent GALE broadcast transcription tasks. Results show that CFRN gives consistent improvements on all four separate systems and in both languages. In fact, the reported numbers represent the best decoding error rates on the corresponding test sets. We further show that the technique is effective without retraining and adds little overhead to the multi-pass recognition pipeline found in state-of-the-art transcription systems. ©2010 IEEE.
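As an illustration only, the idea of holding the average phone duration constant in frames can be sketched as a per-speaker choice of frame shift computed from first-pass phone timings. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `normalized_frame_shift`, the target of 8 frames per phone, and the clamping range of 5 to 15 ms are all hypothetical values chosen for the example.

```python
from typing import List, Tuple


def normalized_frame_shift(
    phone_segments: List[Tuple[float, float]],
    target_frames_per_phone: float = 8.0,  # assumed target, not from the paper
    min_shift: float = 0.005,              # assumed lower clamp, in seconds
    max_shift: float = 0.015,              # assumed upper clamp, in seconds
) -> float:
    """Return a frame shift (seconds) such that the mean phone duration
    spans roughly `target_frames_per_phone` feature frames.

    `phone_segments` holds (start, end) times in seconds, e.g. taken from
    the phone alignment of an unadapted first decoding pass.
    """
    # Ignore empty or inverted segments.
    durations = [end - start for start, end in phone_segments if end > start]
    if not durations:
        raise ValueError("no valid phone segments")

    mean_duration = sum(durations) / len(durations)

    # Fast speech (short phones) yields a smaller shift, i.e. more frames
    # per second; slow speech yields a larger shift.
    shift = mean_duration / target_frames_per_phone

    # Clamp so that unreliable rate estimates cannot produce extreme warping.
    return min(max(shift, min_shift), max_shift)


if __name__ == "__main__":
    typical = [(0.00, 0.08), (0.08, 0.16)]   # mean phone duration 80 ms
    fast = [(0.00, 0.04), (0.04, 0.08)]      # mean phone duration 40 ms
    print(normalized_frame_shift(typical))
    print(normalized_frame_shift(fast))
```

In a multi-pass system, a value like this would be computed once per speaker or segment after the first pass and then used to re-extract features for the later passes. The clamp is a design choice for the sketch, intended to limit the effect of noisy first-pass alignments.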