Large-Scale Video Event Detection

Abstract

Because of the rapid growth of large-scale video recording and sharing, there is a pressing need for robust and scalable solutions for analyzing video content. Detecting and recognizing video events that capture real-world activities is one of the key and most challenging problems. This thesis aims at developing robust and efficient solutions for large-scale video event detection. In particular, we investigate the problem in two areas: first, event detection with automatically discovered event-specific concepts organized into an ontology, and second, event detection with multi-modality representations and multi-source fusion.

Existing event detection works use various low-level features with statistical learning models and achieve promising performance. However, such approaches lack the capability to interpret the abundant semantic content associated with complex video events. Mid-level semantic concept representation of complex events has therefore emerged as a promising method for understanding video events. Existing works in this area fall into two groups: those that manually define a specialized concept set for a specific event, and those that directly apply a general concept lexicon borrowed from existing object, scene, and action concept libraries. The first approach requires tremendous manual effort, whereas the second is often insufficient for capturing the rich semantics contained in video events. In this work, we propose an automatic event-driven concept discovery method and build a large-scale event and concept library with a well-organized ontology, called EventNet. Unlike past work, this method neither relies on a generic concept library independent of the target event nor requires tedious manual annotation. Extensive experiments on the zero-shot event retrieval task, where no training samples are available, show that the proposed EventNet library consistently and significantly outperforms state-of-the-art methods.

Although concept-based event representation can interpret the semantic content of video events, achieving high detection accuracy also requires considering and combining features across different modalities and levels. On the one hand, we observe that joint cross-modality patterns (e.g., audio-visual patterns) often exist in videos and provide strong multi-modal cues for detecting video events. We propose a joint audio-visual codeword representation, called bi-modal words, to discover cross-modality correlations. On the other hand, combining features from multiple sources often produces performance gains, especially when the features complement each other. Existing multi-source late fusion methods usually combine confidence scores from different sources directly. This is limiting because heterogeneous sources often produce incomparable confidence scores at different scales, which makes direct late fusion unreliable. Based on these considerations, we propose a robust late fusion method with rank minimization that not only achieves isotonicity (consistent ordering) among scores from different sources, but also recovers a robust prediction score for each test sample. We experimentally show that the proposed multi-modality representation and multi-source fusion methods achieve promising results compared with benchmark baselines.
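To make the bi-modal word idea concrete, the following is a minimal sketch of joint audio-visual codeword construction: aligned audio and visual descriptors are concatenated and clustered, and each video is encoded as a bag-of-bi-modal-words histogram. This is an illustrative simplification, not the thesis's exact pipeline; the function names, descriptor shapes, and codebook size are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bimodal_codebook(audio_desc, visual_desc, n_words=1000, seed=0):
    """Cluster concatenated audio-visual descriptors into joint 'bi-modal words'.

    audio_desc, visual_desc: arrays of shape (n_segments, d_a) and (n_segments, d_v),
    extracted from the same temporal segments so each row is an aligned audio-visual pair.
    """
    joint = np.hstack([audio_desc, visual_desc])   # joint audio-visual feature space
    return KMeans(n_clusters=n_words, random_state=seed).fit(joint)

def encode_video(codebook, audio_desc, visual_desc):
    """Represent one video as an L1-normalized bag-of-bi-modal-words histogram."""
    joint = np.hstack([audio_desc, visual_desc])
    words = codebook.predict(joint)                # assign each segment to a bi-modal word
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Clustering in the concatenated space is what lets a codeword capture a recurring audio-visual co-occurrence (e.g., a cheering crowd together with a sports field), rather than a pattern in either modality alone.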
The main contributions of the thesis include the following.

1. Large-scale event and concept ontology: a) propose an automatic framework for discovering event-driven concepts; b) build the largest video event ontology, EventNet, which includes 500 complex events and 4,490 event-specific concepts; c) build the first interactive system that allows users to explore high-level events and associated concepts in videos through event browsing, search, and tagging functions.

2. Event detection with multi-modality representations and multi-source fusion: a) propose a novel bi-modal codeword construction for discovering multi-modality correlations; b) propose a novel robust late fusion with rank minimization method for combining information from multiple sources.

The two parts of the thesis are complementary. Concept-based event representation provides rich semantic information for video events, while cross-modality features provide complementary information from multiple sources. Combining the two parts in a unified framework offers great potential for advancing the state of the art in large-scale event detection.
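Complementing contribution 2b, the following is a simplified sketch of rank-minimization-based late fusion. It follows the general low-rank-plus-sparse decomposition idea: per-source scores are rank-normalized (for isotonicity), turned into scale-free pairwise comparison matrices, and the averaged matrix is split into a low-rank consensus plus sparse disagreements. This is not the thesis's exact optimization; all names and parameters here are illustrative.

```python
import numpy as np

def _svt(M, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def _shrink(M, tau):
    """Elementwise soft thresholding (proximal operator of the L1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def robust_late_fusion(score_lists, lam=None, mu=1.0, n_iter=200):
    """Fuse score vectors from multiple sources over the same n test samples.

    Raw scores from different sources may live on incomparable scales, so each
    source is first rank-normalized and converted into a pairwise comparison
    matrix T[i, j] = r[i] - r[j]; the averaged matrix is then decomposed as
    low-rank consensus plus sparse, source-specific errors.
    """
    n = len(score_lists[0])
    if lam is None:
        lam = 1.0 / np.sqrt(n)
    T = np.zeros((n, n))
    for s in score_lists:
        r = np.argsort(np.argsort(s)) / (n - 1)   # ranks scaled to [0, 1]
        T += r[:, None] - r[None, :]
    T /= len(score_lists)
    # Inexact ALM / ADMM iterations for T = L (low-rank) + E (sparse).
    L = np.zeros_like(T)
    E = np.zeros_like(T)
    Y = np.zeros_like(T)
    for _ in range(n_iter):
        L = _svt(T - E + Y / mu, 1.0 / mu)
        E = _shrink(T - L + Y / mu, lam / mu)
        Y += mu * (T - L - E)
    # Row means of the consensus matrix recover fused scores up to a constant shift.
    return L.mean(axis=1)
```

Because the comparison matrices depend only on per-source rankings, incomparable score scales are neutralized before fusion, and the sparse term absorbs samples on which an individual source strongly disagrees with the consensus.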
