Assessing face and speech consistency for monologue detection in video
Abstract
This paper considers schemes for determining which of a set of faces on screen, if any, is producing the speech in a video soundtrack. Whilst motivated by the TREC 2002 Video Retrieval Track monologue detection task, the schemes are also applicable to voice- and face-based biometric systems, to assessing lip synchronization quality in movie editing and computer animation, and to speaker localization in video. Several approaches are discussed: two implementations of a generic mutual-information-based measure of the degree of synchrony between signals, which can be used with or without prior speech and face detection, and a stronger model-based scheme which follows speech and face detection with an assessment of face and lip-movement plausibility. The schemes are compared on a corpus of 1016 test cases containing multiple faces and multiple speakers, a test set 200 times larger than the nearest comparable set of which we are aware. The most successful scheme, which is also the computationally cheapest, obtains an accuracy of 82% on the task of picking the "consistent" speaker from a set that includes three confusers. A final experiment demonstrates the potential utility of the scheme for speaker localization in video.
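As a rough illustration of the mutual-information-based synchrony idea mentioned above, the sketch below estimates I(A; V) between an aligned audio feature stream and a per-face visual feature stream, and picks the face whose movement shares the most information with the audio. The feature choices (audio energy, lip-region intensity change), bin count, and toy data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mutual_information(audio_feat, visual_feat, bins=8):
    """Estimate I(A; V) in bits from two aligned 1-D feature streams."""
    # Joint histogram -> joint probability estimate p(a, v).
    joint, _, _ = np.histogram2d(audio_feat, visual_feat, bins=bins)
    p_av = joint / joint.sum()
    p_a = p_av.sum(axis=1, keepdims=True)   # marginal p(a)
    p_v = p_av.sum(axis=0, keepdims=True)   # marginal p(v)
    nz = p_av > 0                           # avoid log(0) on empty bins
    return float(np.sum(p_av[nz] * np.log2(p_av[nz] / (p_a @ p_v)[nz])))

# Toy usage: score each on-screen face against the soundtrack; the face with
# the highest mutual information is the most plausible speaker.
rng = np.random.default_rng(0)
audio = rng.normal(size=500)                        # stand-in audio energy track
speaking_face = audio + 0.3 * rng.normal(size=500)  # lip motion correlated with audio
silent_face = rng.normal(size=500)                  # motion independent of audio
scores = {name: mutual_information(audio, v)
          for name, v in [("face_1", speaking_face), ("face_2", silent_face)]}
print(max(scores, key=scores.get))  # -> "face_1"
```

One appeal of such a measure, consistent with the abstract's claim, is that it needs no explicit speech or face model: any reasonably informative audio and visual feature streams can be compared directly.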