Improving viseme recognition with GAN-based multi-view mapping
Abstract
Speech recognition technologies in the visual domain can currently identify only words and sentences in still images. Identifying visemes (i.e., the smallest visual units of spoken text) is useful when no language models or dictionaries are available, which is often the case for languages other than English; it is challenging, however, because temporal information cannot be extracted. In parallel, previous work has demonstrated that exploring data acquired simultaneously under multiple views can improve recognition accuracy in comparison to single-view data. For many applications, however, most of the available audio-visual datasets are acquired from a single view, essentially due to acquisition limitations. In this work, we address viseme recognition in still images and explore the synthetic generation of additional views to improve overall accuracy. To that end, we use Generative Adversarial Networks (GANs) trained with synthetic data to map mouth images acquired in a single arbitrary view to frontal and side views, in which the face is rotated about the vertical axis by approximately 30°, 45°, and 60°. We then use a state-of-the-art Convolutional Neural Network to classify the visemes and compare its performance when trained only on the original single-view images versus trained with the additional views artificially generated by the GANs. We run experiments on three audio-visual corpora acquired under different conditions (the GRID, AVICAR, and OuluVS2 datasets), and our results indicate that the additional views synthesized by the GANs improve viseme recognition accuracy in all tested scenarios.
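To make the described pipeline concrete, the following is a minimal sketch (not the authors' code) of the augmentation-then-classification setup: single-view mouth crops are passed through one GAN generator per target view, and the classifier is trained on the enlarged set. The toy generator and CNN architectures, the 64x64 image size, the number of viseme classes, and the choice of PyTorch are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of GAN-based multi-view augmentation for viseme classification.
# All architectures and sizes below are placeholders, not the paper's models.
import torch
import torch.nn as nn

class ViewGenerator(nn.Module):
    """Toy image-to-image generator standing in for one trained GAN mapper
    (e.g., arbitrary view -> frontal, or arbitrary view -> 30-degree side view)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class VisemeCNN(nn.Module):
    """Small CNN classifier over mouth images (the paper uses a state-of-the-art CNN)."""
    def __init__(self, n_visemes=14):  # number of viseme classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_visemes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def augment_with_synthetic_views(images, generators):
    """Return the original batch plus one GAN-mapped copy per target view."""
    views = [images] + [g(images) for g in generators]
    return torch.cat(views, dim=0)

if __name__ == "__main__":
    # Pretend batch of single-view mouth crops in [-1, 1] with viseme labels.
    images = torch.rand(8, 3, 64, 64) * 2 - 1
    labels = torch.randint(0, 14, (8,))

    # One (untrained, illustrative) generator per additional view:
    # frontal, 30, 45, and 60 degrees.
    generators = [ViewGenerator().eval() for _ in range(4)]
    with torch.no_grad():
        augmented = augment_with_synthetic_views(images, generators)
    augmented_labels = labels.repeat(len(generators) + 1)

    # The paper's comparison is between training on the original single-view
    # images only and training with the GAN-generated views added.
    model = VisemeCNN()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = nn.CrossEntropyLoss()(model(augmented), augmented_labels)
    optim.zero_grad()
    loss.backward()
    optim.step()
    print("one training step on GAN-augmented batch, loss =", float(loss))
```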