A factor graph framework for semantic video indexing
Abstract
Video query by semantic keywords is one of the most challenging research issues in video data management. To go beyond low-level similarity and access video content by semantics, we need to bridge the gap between low-level representations and high-level semantics, a difficult multimedia understanding problem. We formulate this problem as a probabilistic pattern-recognition problem, modeling semantics in terms of concepts and context. To map low-level features to high-level semantics, we propose probabilistic multimedia objects (multijects). Examples of multijects in movies include explosion, mountain, beach, outdoor, and music. Semantic concepts in videos interact and appear in context. To model this interaction explicitly, we propose a network of multijects (a multinet). To model the multinet computationally, we propose a factor graph framework that can enforce spatio-temporal constraints. Using probabilistic models for the multijects rocks, sky, snow, water-body, and forestry/greenery, and using a factor graph as the multinet, we demonstrate the application of this framework to semantic video indexing. We show how detection performance can be significantly improved by using the multinet to take inter-conceptual relationships into account: our experiments on a large video database consisting of clips from several movies, with this set of five semantic concepts, reveal an improvement in detection performance of over 22%. We also show how the multinet can be extended to take temporal correlation into account; by constructing a dynamic multinet, detection performance is further enhanced by as much as 12%. With this framework, we show how keyword-based querying and semantic filtering are possible for a predetermined set of concepts.
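As an illustrative sketch of the factorization such a factor-graph multinet implies (the symbols and the pairwise form of the local functions below are our assumptions; the abstract does not specify them), the joint posterior over N binary concept variables can be written as

\begin{equation}
  P(c_1,\dots,c_N \mid \mathbf{x}) \;\propto\;
  \prod_{i=1}^{N} \phi_i(c_i,\mathbf{x}_i)
  \prod_{(i,j)} \psi_{ij}(c_i,c_j),
\end{equation}

where c_i indicates the presence of the i-th concept (multiject), \mathbf{x}_i denotes its local low-level features, each \phi_i is a per-multiject feature likelihood, and each \psi_{ij} is a contextual compatibility function over a pair of concepts. Per-concept marginals P(c_i \mid \mathbf{x}), and hence detection scores, can then be computed with the sum-product algorithm on the factor graph.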