Enhancing exemplar-based posteriors for speech recognition tasks
Abstract
Posteriors generated by exemplar-based sparse representation (SR) methods are typically learned to minimize the reconstruction error of the feature vectors. These posteriors are not learned through a discriminative process linked to the word error rate (WER) objective of a speech recognition task. In this paper, we explore modeling exemplar-based posteriors to address this issue. First, we model the posteriors by training a neural network (NN) that takes exemplar-based posteriors as inputs. This produces a new set of posteriors learned to minimize a cross-entropy measure and, indirectly, the frame error rate. Second, we apply a tied-mixture smoothing technique to the new NN posteriors, making them better suited to a speech recognition task. On the TIMIT task, we show that modeling SR posteriors with an NN improves performance by 1.3% absolute, achieving a phonetic error rate (PER) of 19.0%. Applying the further smoothing to these NN posteriors improves the PER to 18.7%, one of the best results reported in the literature on TIMIT.
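As a rough illustration of the first step, the sketch below trains a small feed-forward network that maps SR posteriors to new posteriors under a cross-entropy loss against frame labels. All dimensions, layer sizes, and data here are hypothetical stand-ins, not the paper's actual configuration or architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not from the paper): each
# frame's SR posterior is a distribution over N_CLASSES phone classes.
N_CLASSES = 49     # hypothetical number of phone classes
HIDDEN = 256       # hypothetical hidden-layer size
N_FRAMES = 1024    # hypothetical number of training frames

# A small feed-forward NN mapping SR posteriors to new posteriors.
model = nn.Sequential(
    nn.Linear(N_CLASSES, HIDDEN),
    nn.Sigmoid(),
    nn.Linear(HIDDEN, N_CLASSES),  # logits; softmax is folded into the loss
)

# Stand-in data: random SR posteriors and frame labels. In the paper's
# setup these would come from the exemplar-based SR stage and from
# frame-level alignments, respectively.
sr_posteriors = torch.softmax(torch.randn(N_FRAMES, N_CLASSES), dim=1)
frame_labels = torch.randint(0, N_CLASSES, (N_FRAMES,))

# Cross-entropy training: minimizing CE against frame labels acts as a
# proxy for minimizing frame error rate, as the abstract notes.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):
    optimizer.zero_grad()
    logits = model(sr_posteriors)
    loss = loss_fn(logits, frame_labels)
    loss.backward()
    optimizer.step()

# The refined posteriors that would be passed downstream for decoding.
new_posteriors = torch.softmax(model(sr_posteriors), dim=1)
```

The tied-mixture smoothing applied in the second step operates on these refined posteriors; its details are given in the body of the paper, not in this sketch.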