Publication
ICSLP 2000
Conference paper

Dynamic Selection of Feature Spaces for Robust Speech Recognition

Abstract

Selection of acoustic features for robust speech recognition has been the subject of research for several years. In the past, algorithms that use feature vectors from multiple frequency bands [9] or that switch between multiple feature streams [10] have been reported in the literature to handle robustness under different acoustic conditions. Acoustic models built from different feature sets produce different kinds of recognition errors. In this paper, we propose a likelihood-based scheme that combines the acoustic feature vectors from multiple signal processing schemes within the decoding framework, in order to extract maximum benefit from these different acoustic feature vectors and models. The proposed technique is general enough to be applied to other pattern recognition fields, such as OCR and handwriting recognition. The fundamental idea behind this approach is to pick the set of features that classifies a frame of speech accurately, with no a priori information about the phonetic class or the acoustic channel from which the speech comes. Two methods of merging any set of acoustic features, such as formant-based features, cepstral feature vectors, PLP features, and LDA features, are presented here:

· Use of a weighted set of likelihoods obtained from the several alternative feature sets, and
· Selection of the feature space that ranks best when used in a rank-based recognizer.

These merging algorithms provide a relative error rate reduction of 8% to 15% across a wide variety of wide-band, clean, and noisy large-vocabulary continuous speech recognition tasks. Much of this gain comes from reduced insertion and substitution errors. Using the approach presented in this paper, we achieve improved acoustic modeling without increasing the number of parameters: two 40K-Gaussian systems, when merged, perform better than a 180K-Gaussian system trained on the better of the two feature spaces.
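To make the two merging methods concrete, here is a minimal sketch of both at the level of a single speech frame. The abstract does not specify the scoring model or the combination rule, so the diagonal-covariance GMM scoring, the fixed stream weights, and every function name below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stream_log_likelihood(frame, means, variances, priors):
    """Log-likelihood of one frame under a diagonal-covariance Gaussian
    mixture (one mixture per HMM state; a simplification of a full decoder)."""
    # Per-component Gaussian log-densities, summed over feature dimensions.
    log_densities = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (frame - means) ** 2 / variances,
        axis=1,
    )
    # Log-sum-exp over mixture components, weighted by their priors.
    return np.logaddexp.reduce(np.log(priors) + log_densities)

def merged_log_likelihood(frames, gmms, weights):
    """Method 1: weighted combination of per-stream log-likelihoods.

    frames: the same speech frame represented in each feature space
    (e.g. cepstral, PLP); gmms: one (means, variances, priors) triple
    per stream for the state being scored; weights: stream weights.
    """
    return sum(
        w * stream_log_likelihood(f, *g)
        for w, f, g in zip(weights, frames, gmms)
    )

def select_stream_by_rank(frames, state_gmms_per_stream, hyp_state):
    """Method 2: rank-based selection. For each stream, score every state
    and note the rank of the hypothesized state; pick the stream in which
    that state ranks best (rank 0 = top-scoring)."""
    ranks = []
    for frame, state_gmms in zip(frames, state_gmms_per_stream):
        scores = np.array(
            [stream_log_likelihood(frame, *g) for g in state_gmms]
        )
        # Number of states scoring strictly higher than the hypothesis.
        ranks.append(int(np.sum(scores > scores[hyp_state])))
    return int(np.argmin(ranks))
```

In a real recognizer these per-frame scores would feed the search rather than being used in isolation, and the stream weights in the first method could be tuned on held-out data rather than fixed.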
