Robust video scene detection using multimodal fusion of optimally grouped features
Abstract
Video scene detection, the task of temporally dividing a video into its semantic sections, is an important process for the effective analysis of heterogeneous video content. With the growing amount of video available for consumption, video scene detection becomes increasingly important, as it provides a means for effective video summarization, search and retrieval, browsing, and video understanding. We formulate video scene detection as a generic optimization problem that partitions a video given a set of features derived from multiple modalities. By optimally grouping consecutive shots into scenes, our method offers an effective and efficient solution for dividing a video into sections using a unique dynamic programming scheme. Unlike existing methods, it directly yields temporally consistent video scene detection and has the advantage of being parameter-free, making it robust and applicable to various types of video content. Experimental results show that the proposed multimodal approach provides a significant gain over using a single modality alone (e.g., either video or audio). In addition, our method outperforms the state of the art in video scene detection, clearly demonstrating its effectiveness. As part of this work, we also provide a significant extension to our Open Video Scene Detection (OVSD) dataset, which comprises openly licensed videos freely available for academic and industrial use. This extension, which increases the dataset's cumulative duration from the original 2.5 hours to over 17 hours, makes OVSD the most extensive evaluation benchmark for video scene detection.
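To make the kind of dynamic programming formulation referenced above concrete, the following is a minimal sketch of optimally partitioning a sequence of consecutive shots into contiguous scenes. It assumes a simple sum-of-squared-deviations segment cost and a fixed number of scenes; the paper's actual multimodal cost function and its parameter-free selection of the scene count are not reproduced here, and the function names (segment_cost, partition_shots) are hypothetical.

```python
import numpy as np

def segment_cost(features, start, end):
    """Cost of grouping shots [start, end) into one scene: sum of squared
    distances of each shot feature to the segment mean (an illustrative
    stand-in for the paper's cost, not the authors' actual measure)."""
    seg = features[start:end]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())

def partition_shots(features, num_scenes):
    """Partition n consecutive shot features into num_scenes contiguous
    scenes by dynamic programming, minimizing the total segment cost."""
    n = len(features)
    INF = float("inf")
    # dp[k][i]: best cost of splitting the first i shots into k scenes
    dp = [[INF] * (n + 1) for _ in range(num_scenes + 1)]
    cut = [[0] * (n + 1) for _ in range(num_scenes + 1)]
    dp[0][0] = 0.0
    for k in range(1, num_scenes + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cand = dp[k - 1][j] + segment_cost(features, j, i)
                if cand < dp[k][i]:
                    dp[k][i], cut[k][i] = cand, j
    # Recover scene boundaries by backtracking the stored cut points.
    bounds, i = [], n
    for k in range(num_scenes, 0, -1):
        bounds.append(i)
        i = cut[k][i]
    return sorted(bounds)  # index where each scene ends

# Toy usage: 8 shots with 4-dim features forming two well-separated scenes.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (4, 4)), rng.normal(5, 0.1, (4, 4))])
print(partition_shots(feats, 2))  # expected boundaries: [4, 8]
```

Because scenes must consist of consecutive shots, this O(K·n²) recursion over cut points yields a globally optimal, temporally consistent partition, which is the property that distinguishes such a scheme from local boundary-detection heuristics.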