Improving lip-reading with feature space transforms for multi-stream audio-visual speech recognition
Abstract
In this paper we investigate feature space transforms to improve lip-reading performance for multi-stream HMM-based audio-visual speech recognition (AVSR). The transforms include the non-linear Gaussianization transform and feature space maximum likelihood linear regression (fMLLR). We apply Gaussianization at various stages of the visual front-end; the results show that Gaussianizing the final visual features achieves the best performance, with an 8% gain on lip-reading and a 14% gain on AVSR. We also compare speaker-based Gaussianization with global Gaussianization. Without fMLLR adaptation, speaker-based Gaussianization yields larger improvements on both lip-reading and multi-stream AVSR. With fMLLR adaptation, however, global Gaussianization performs better, achieving an 18% gain over the fMLLR-adapted baseline for AVSR.
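As background, the Gaussianization transform referred to above is commonly realized as a rank-based mapping: each feature value is sent through the empirical CDF of the (speaker-specific or global) data and then through the inverse standard-normal CDF, so the transformed distribution is approximately N(0, 1). The sketch below illustrates this standard construction for a single feature dimension; it is an assumption about the general technique, not the authors' exact implementation.

```python
# Minimal sketch of rank-based feature Gaussianization (one feature dimension).
# Assumption: the standard empirical-CDF construction, not the paper's code.
from statistics import NormalDist

def gaussianize(values):
    """Map each value to the standard-normal quantile of its empirical CDF rank."""
    n = len(values)
    std_normal = NormalDist()  # standard normal, mean 0 and stdev 1
    # Sort indices by value to obtain empirical ranks.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the CDF estimate strictly inside (0, 1),
        # so inv_cdf is finite at both extremes.
        out[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return out

# Example: five raw visual feature values for one dimension.
feats = [0.2, 3.1, -1.5, 0.9, 2.2]
print(gaussianize(feats))
```

Applying this per speaker (speaker-based Gaussianization) normalizes away speaker-specific feature distributions, whereas fitting one mapping over all training data (global Gaussianization) only reshapes the overall distribution; this distinction underlies the comparison in the abstract.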