Mutual information based visual feature selection for lipreading
Abstract
Image transforms, such as the discrete cosine transform, are widely used to extract visual features from the speaker's mouth region for automatic speechreading and audio-visual speech recognition. Typically, the spatial frequency components with the highest energy in the transform space are retained for recognition. This paper proposes an alternative technique for selecting such features, based on the mutual information criterion instead. The mutual information between each individual spatial frequency component and the speech classes of interest is employed as a measure of the component's suitability for speech classification. The components with the highest mutual information are then selected as visual speech features. Extensions of this scheme that use the joint mutual information between candidate feature pairs and the classes are also considered. The algorithm is tested on visual-only speech recognition of connected-digit strings, using an appropriate audio-visual database. For low-dimensional visual feature vectors, the proposed method significantly outperforms energy-based feature selection, reducing the word error rate by as much as 20% relative. These gains diminish, however, at higher feature dimensionalities.
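As an illustration of the selection step summarized above, the following sketch ranks individual transform coefficients by their estimated mutual information with discrete speech-class labels and retains the top-scoring components. The function names (`mutual_information`, `select_mi_features`), the histogram-based plug-in estimator, and the toy data are illustrative assumptions, not the paper's actual implementation or database.

```python
import numpy as np

def mutual_information(x, labels, n_bins=16):
    """Estimate I(X; C) between a scalar feature x and discrete class labels
    using histogram binning (a simple plug-in estimator chosen here for
    illustration; the paper may use a different estimator)."""
    # Discretize the continuous coefficient values into equal-width bins.
    bins = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
    mi = 0.0
    for b in np.unique(bins):
        p_b = np.mean(bins == b)
        for c in np.unique(labels):
            p_c = np.mean(labels == c)
            p_bc = np.mean((bins == b) & (labels == c))
            if p_bc > 0.0:
                mi += p_bc * np.log2(p_bc / (p_b * p_c))
    return mi

def select_mi_features(coeffs, labels, n_select):
    """Score each spatial-frequency component by its mutual information with
    the speech classes and return the indices of the n_select best ones."""
    scores = np.array([mutual_information(coeffs[:, j], labels)
                       for j in range(coeffs.shape[1])])
    return np.argsort(scores)[::-1][:n_select]

# Toy usage: 500 mouth-region frames, 64 transform coefficients, 10 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))          # stand-in for DCT coefficients
y = rng.integers(0, 10, size=500)       # stand-in for speech-class labels
selected = select_mi_features(X, y, n_select=12)
print("Selected component indices:", selected)
```

The energy-based baseline mentioned in the abstract would simply replace the mutual-information score with each component's variance (or mean squared magnitude) over the training frames, keeping the rest of the ranking procedure unchanged.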