Stream confidence estimation for audio-visual speech recognition
Abstract
We investigate the use of single-modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class-conditional audio- or visual-only observation probability, raised to an appropriate exponent. We consider such stream exponents as two-dimensional, piecewise-constant functions of the local audio and visual stream confidences, and we estimate them by minimizing the misclassification error on a held-out data set. Three stream confidence measures are investigated, namely the stream entropy, the n-best likelihood ratio average, and an n-best stream likelihood dispersion measure. The latter results in superior audio-visual phonetic classification, as indicated by our experiments on a 260-subject, 40-hour, large-vocabulary, continuous-speech audio-visual dataset. Using local, dispersion-based stream exponents yields a further 20% improvement in phone classification accuracy beyond the improvement that global stream exponents provide over clean audio-only phonetic classification. The performance of the algorithm, however, still falls significantly short of an "oracle" (cheating) confidence estimation scheme.
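For concreteness, a minimal sketch of the exponent-weighted two-stream combination described above follows. The normalization of the exponents and the exact form of the n-best dispersion confidence are illustrative assumptions, not necessarily the formulation adopted in the paper.

% Exponent-weighted two-stream class-conditional score for phone class c,
% given audio observation o_A and visual observation o_V (sketch; the
% dependence of the exponents on the stream confidences is made explicit):
\[
  \log p(o_A, o_V \mid c) \;=\;
  \lambda_A(\gamma_A, \gamma_V)\,\log p_A(o_A \mid c)
  \;+\;
  \lambda_V(\gamma_A, \gamma_V)\,\log p_V(o_V \mid c),
\]
where $\gamma_A$ and $\gamma_V$ denote the local audio and visual stream
confidences, and the exponents $\lambda_A$, $\lambda_V$ are piecewise-constant
functions over a grid of $(\gamma_A, \gamma_V)$ values, estimated by
minimizing the misclassification error on held-out data.

% One plausible form of an n-best likelihood dispersion confidence for
% stream s in {A, V} (an assumed variant: the average pairwise
% log-likelihood ratio among the n best classes of that stream):
\[
  \gamma_s \;=\; \frac{2}{n(n-1)}
  \sum_{i=1}^{n}\sum_{j=i+1}^{n}
  \log \frac{p_s(o_s \mid c_{(i)})}{p_s(o_s \mid c_{(j)})},
\]
with $c_{(1)}, \dots, c_{(n)}$ the $n$ most likely classes for stream $s$.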