Publication
TRECVID 2012
Conference paper

IBM Research and Columbia University TRECVID-2012 multimedia event detection (MED), multimedia event recounting (MER), and semantic indexing (SIN) systems

Abstract

For this year's TRECVID Multimedia Event Detection task, our team studied high-level visual and audio semantic features, mid-level visual attributes, and sophisticated low-level features. In addition, a range of new modeling strategies were studied, including those that take into account temporal dynamics of event semantics, optimize fusion of system components, provide linear approximations of non-linear kernels, and generate synthetic data for the limited exemplar condition. For the Pre-Specified task, we submitted 4 runs: Run 1 involved the fusion of a broad array of sophisticated low-level features. Run 2 used the same set of low-level features to model the events under the limited exemplar condition. Run 3 involved the fusion of all our semantic system components. Run 4 was composed of the fusion of all low-level and semantic features used in Runs 1-3, in addition to event models built from techniques for linear approximation of non-linear kernels. For Ad Hoc, we submitted 2 runs: Run 5 fused Linear Temporal Pyramids of visual semantics with event models built directly on low-level features. Run 6 was our limited exemplar run, which used both Linear Temporal Pyramids of visual semantics and a method for generating synthetic training data. Our experiments suggest the following: 1) Semantic modeling improves the event modeling performance of the low-level features it is based on. 2) Mid-level visual attributes contribute complementary information. 3) Event videos demonstrate temporal patterns. 4) Linear approximations of non-linear kernels perform similarly to the original non-linear kernels, and hold promise to improve event modeling performance by allowing a scaling up to a broader array of models.
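As an illustration of what a linear approximation of a non-linear kernel can look like, the sketch below uses random Fourier features to approximate an RBF kernel with an explicit feature map, so that a plain dot product stands in for the kernel evaluation. This is only an assumed example: the abstract does not say which approximation technique or which kernels the system used, and the names `rbf_kernel`, `make_rff_map`, and `dot`, as well as the values of `gamma` and `n_features`, are hypothetical choices made for illustration.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_rff_map(dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map z(.) with z(x) . z(y) ~ k(x, y).

    Assumption for illustration: an RBF kernel. Frequencies are drawn
    from N(0, 2 * gamma * I) and phases from U(0, 2 * pi).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        # One cosine feature per sampled (frequency, phase) pair.
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return feature_map


def dot(u, v):
    """Plain inner product, which is all a linear model needs."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    gamma = 0.5
    x = [0.2, -0.1, 0.4]
    y = [0.0, 0.3, 0.1]
    z = make_rff_map(dim=3, n_features=4000, gamma=gamma, seed=0)
    print("exact kernel:      ", rbf_kernel(x, y, gamma))
    print("linear approximate:", dot(z(x), z(y)))
```

Once features are mapped this way, a linear classifier trained on `z(x)` behaves approximately like a kernel machine with the original kernel, while training and scoring costs grow linearly with the number of examples. That cost profile is the kind of property that makes it practical to scale up to a broader array of models.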