Publication
MM 2004
Conference paper

Optimal multimodal fusion for multimedia data analysis

Abstract

Considerable research has been devoted to utilizing multimodal features for better understanding multimedia data. However, two core research issues have not yet been adequately addressed. First, given a set of features extracted from multiple media sources (e.g., extracted from the visual, audio, and caption track of videos), how do we determine the best modalities? Second, once a set of modalities has been identified, how do we best fuse them to map to semantics? In this paper, we propose a two-step approach. The first step finds statistically independent modalities from raw features. In the second step, we use super-kernel fusion to determine the optimal combination of individual modalities. We carefully analyze the tradeoffs between three design factors that affect fusion performance: modality independence, curse of dimensionality, and fusion-model complexity. Through analytical and empirical studies, we demonstrate that our two-step approach, which achieves a careful balance of the three design factors, can improve class-prediction accuracy over traditional techniques.
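To make the two-step approach concrete, the following is a minimal, hypothetical sketch in Python with scikit-learn. FastICA stands in for the first step (deriving statistically independent modalities from raw features), and a second-level kernel machine over per-modality SVM outputs stands in for super-kernel fusion. The dataset, component grouping, kernels, and parameters are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of the two-step pipeline described in the abstract.
# Step 1 is illustrated with FastICA; step 2 feeds per-modality SVM outputs
# into a second-level ("super-kernel") SVM. All names and parameters are
# illustrative, not taken from the paper.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-ins for raw features extracted from visual, audio, and caption tracks.
n = 400
X_raw = rng.normal(size=(n, 60))   # concatenated raw features
y = rng.integers(0, 2, size=n)     # binary semantic label

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, random_state=0)

# Step 1: derive statistically independent components, then group them into
# a few "modalities" (the grouping strategy here is an assumption).
ica = FastICA(n_components=30, random_state=0)
Z_tr = ica.fit_transform(X_tr)
Z_te = ica.transform(X_te)
modalities = np.array_split(np.arange(Z_tr.shape[1]), 3)

# Step 2a: train one kernel machine per modality.
meta_tr, meta_te = [], []
for idx in modalities:
    svm = SVC(kernel="rbf", probability=True).fit(Z_tr[:, idx], y_tr)
    # In practice the meta-features would come from held-out folds to avoid
    # overfitting; reusing the training set keeps this sketch short.
    meta_tr.append(svm.predict_proba(Z_tr[:, idx])[:, 1])
    meta_te.append(svm.predict_proba(Z_te[:, idx])[:, 1])

# Step 2b: super-kernel fusion -- a second-level kernel machine learns the
# combination of the per-modality outputs.
fusion = SVC(kernel="rbf").fit(np.column_stack(meta_tr), y_tr)
print("fused accuracy:", fusion.score(np.column_stack(meta_te), y_te))

The stacked design reflects the balance the abstract describes: per-modality models keep each kernel machine's input dimensionality low, while the second-level learner controls fusion-model complexity.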
