Large-Scale Video Event Detection
With the rapid growth of large-scale video recording and sharing, there is a growing need for robust and scalable solutions for analyzing video content. The ability to detect and recognize video events that capture real-world activities is one of the key and most challenging problems. This thesis aims at developing robust and efficient solutions for large-scale video event detection systems. In particular, we investigate the problem in two areas: first, event detection with automatically discovered, event-specific concepts organized in an ontology, and second, event detection with multi-modality representations and multi-source fusion.
Existing event detection works use various low-level features with statistical learning models and achieve promising performance. However, such approaches lack the capability to interpret the abundant semantic content associated with complex video events. Therefore, mid-level semantic concept representation of complex events has emerged as a promising method for understanding video events. In this area, existing works can be categorized into two groups: those that manually define a specialized concept set for a specific event, and those that directly apply a general concept lexicon borrowed from existing object, scene, and action concept libraries. The first approach requires tremendous manual effort, whereas the second is often insufficient for capturing the rich semantics contained in video events. In this work, we propose an automatic event-driven concept discovery method and build a large-scale event and concept library with a well-organized ontology, called EventNet. Unlike past work that applies a generic concept library independent of the target event, this method discovers event-specific concepts without requiring tedious manual annotation. Extensive experiments on the zero-shot event retrieval task, where no training samples are available, show that the proposed EventNet library consistently and significantly outperforms state-of-the-art methods.
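To make the concept-based zero-shot retrieval idea concrete, the sketch below scores unlabeled videos against a textual event query using only pre-computed concept detector outputs. The concept vocabulary, the detector score matrix, and the query-to-concept weighting are hypothetical placeholders for illustration, not the actual EventNet pipeline.

```python
import numpy as np

# Hypothetical concept vocabulary and pre-computed detector scores:
# concept_scores[i, k] is the confidence that concept k appears in video i.
concepts = ["dog", "grooming table", "clipper", "birthday cake", "candles"]
concept_scores = np.random.rand(100, len(concepts))   # 100 unlabeled videos

def event_concept_weights(query_terms, concepts):
    """Very simple stand-in for event-to-concept association: weight a
    concept by how many query terms it shares. (EventNet instead discovers
    event-specific concepts automatically.)"""
    weights = np.array([
        sum(term in concept.split() for term in query_terms)
        for concept in concepts
    ], dtype=float)
    norm = np.linalg.norm(weights)
    return weights / norm if norm > 0 else weights

def zero_shot_retrieve(query_terms, concept_scores, concepts, top_k=5):
    """Rank videos by the match between their concept detector scores and
    the event's concept weights -- no event-level training samples needed."""
    w = event_concept_weights(query_terms, concepts)
    video_scores = concept_scores @ w
    return np.argsort(-video_scores)[:top_k]

print(zero_shot_retrieve(["dog", "grooming"], concept_scores, concepts))
```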
Although concept-based event representation can interpret the semantic content of video events, achieving high accuracy in event detection also requires considering and combining features from different modalities and/or across different levels. On one hand, we observe that joint cross-modality patterns (e.g., audio-visual patterns) often exist in videos and provide strong multi-modal cues for detecting video events. We propose a joint audio-visual bi-modal codeword representation, called bi-modal words, to discover cross-modality correlations. On the other hand, combining features from multiple sources often produces performance gains, especially when the features complement each other. Existing multi-source late fusion methods usually combine confidence scores from different sources directly. This is limiting because heterogeneous results from various sources often produce incomparable confidence scores at different scales, making direct late fusion inappropriate. Based on these considerations, we propose a robust late fusion method with rank minimization that not only achieves isotonicity among the scores from different sources, but also recovers a robust prediction score for each test sample. We experimentally show that the proposed multi-modality representation and multi-source fusion methods achieve promising results compared with benchmark baselines.
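As a rough illustration of the bi-modal codeword idea, the sketch below clusters concatenated audio and visual descriptors extracted from the same short video segments, so that each codeword captures a joint audio-visual pattern. The feature dimensions, descriptor choices, and clustering setup are illustrative assumptions, not the thesis's exact construction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical per-segment descriptors: each row describes one short video
# segment with a visual part (e.g., pooled SIFT) and an audio part (e.g., MFCC stats).
visual_feats = rng.normal(size=(5000, 128))
audio_feats = rng.normal(size=(5000, 60))

# Concatenate the two modalities so clustering finds *joint* audio-visual patterns.
bimodal_feats = np.hstack([visual_feats, audio_feats])

# Each cluster center acts as one "bi-modal word".
kmeans = KMeans(n_clusters=256, n_init=4, random_state=0).fit(bimodal_feats)

def bimodal_bow(segment_feats, kmeans):
    """Quantize a video's segments against the bi-modal codebook and
    return a normalized bag-of-bimodal-words histogram."""
    words = kmeans.predict(segment_feats)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Example: represent one video that has 40 segments.
video_repr = bimodal_bow(bimodal_feats[:40], kmeans)
print(video_repr.shape)   # (256,)
```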
The main contributions of the thesis include the following.
1. Large scale event and concept ontology: a) propose an automatic framework for discovering event-driven concepts; b) build the largest video event ontology, EventNet, which includes 500 complex events and 4,490 event-specific concepts; c) build the first interactive system that allows users to explore high-level events and associated concepts in videos with event browsing, search, and tagging functions.
2. Event detection with multi-modality representations and multi-source fusion: a) propose novel bi-modal codeword construction for discovering multi-modality correlations; b) propose novel robust late fusion with rank minimization method for combining information from multiple sources.
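The thesis's robust late fusion solves a rank-minimization problem over score matrices; the sketch below is a much simpler stand-in that only illustrates the motivating issue, namely making heterogeneous scores from different sources comparable (here via per-source rank normalization) before combining them. It is not the thesis's method.

```python
import numpy as np

def rank_normalize(scores):
    """Map one source's raw confidence scores to [0, 1] by their ranks,
    so sources with very different score scales become comparable."""
    ranks = np.argsort(np.argsort(scores))
    return ranks / (len(scores) - 1.0)

def naive_late_fusion(score_matrix):
    """score_matrix: (n_sources, n_test_samples) raw scores.
    Returns one fused score per test sample. This is NOT the thesis's
    rank-minimization method, only a comparable-scale averaging baseline."""
    normalized = np.vstack([rank_normalize(s) for s in score_matrix])
    return normalized.mean(axis=0)

# Example: three sources scoring five test videos on very different scales.
scores = np.array([
    [0.1, 0.9, 0.4, 0.8, 0.2],      # probabilities
    [-2.0, 3.5, 0.1, 2.2, -1.0],    # SVM margins
    [10.0, 95.0, 40.0, 70.0, 20.0], # percent-style scores
])
print(naive_late_fusion(scores))
```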
The two parts of the thesis are complementary. Concept-based event representation provides rich semantic information for video events, while cross-modality features provide complementary information from multiple sources. Combining the two parts in a unified framework offers great potential for advancing the state of the art in large-scale event detection.
Experiments of Image Retrieval Using Weak Attributes
Searching images based on descriptions of image attributes is an intuitive process that can be easily understood by humans and has recently been made feasible by a few promising works in both the computer vision and multimedia communities. In this report, we describe experiments on image retrieval methods that utilize weak attributes.
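As a minimal illustration of attribute-based retrieval (not the report's specific weak-attribute method), the sketch below ranks images by the agreement between a query's desired attributes and each image's predicted attribute scores. The attribute names and score matrix are hypothetical.

```python
import numpy as np

# Hypothetical predicted attribute scores: rows are images, columns are
# attributes such as "furry" or "outdoor" (values in [0, 1]).
attribute_names = ["furry", "outdoor", "metallic", "red", "round"]
image_attr_scores = np.random.rand(1000, len(attribute_names))

def retrieve_by_attributes(query, image_attr_scores, attribute_names, top_k=10):
    """Rank images by the dot product between the query's attribute weights
    (+1 desired, -1 undesired, 0 ignored) and the predicted attribute scores."""
    q = np.array([query.get(name, 0.0) for name in attribute_names])
    scores = image_attr_scores @ q
    return np.argsort(-scores)[:top_k]

# Example query: a furry, outdoor, non-metallic object.
print(retrieve_by_attributes(
    {"furry": 1.0, "outdoor": 1.0, "metallic": -1.0},
    image_attr_scores, attribute_names))
```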
Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance
Recognizing visual content in unconstrained videos has become a very important problem for many applications. Existing corpora for video analysis lack scale and/or content diversity, and have thus limited the needed progress in this critical area. In this paper, we describe and release a new database called CCV, containing 9,317 web videos over 20 semantic categories, including events like “baseball” and “parade”, scenes like “beach”, and objects like “cat”. The database was collected with extra care to ensure relevance to consumer interest and originality of video content without post-editing. Such videos typically have very little textual annotation and thus can benefit from the development of automatic content analysis techniques. We used the Amazon MTurk platform to perform manual annotation, and studied the behaviors and performance of human annotators on MTurk. We also compared the abilities of humans and machines in understanding consumer video content. For the latter, we implemented automatic classifiers using a state-of-the-art multi-modal approach that achieved top performance in a recent TRECVID multimedia event detection task. Results confirm that classifiers fusing audio and visual features significantly outperform single-modality solutions. We also found that humans are much better at understanding categories of non-rigid objects such as “cat”, while current automatic techniques are relatively close to humans in recognizing categories that have distinctive background scenes or audio patterns.
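To illustrate the kind of multi-modal fusion evaluated in the paper, the sketch below trains separate classifiers on visual and audio features and averages their probabilistic outputs (simple late fusion). The feature matrices and the choice of classifier are assumptions for illustration, not the paper's exact system.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical pre-computed features for the same set of training videos.
visual_train = rng.normal(size=(400, 500))   # e.g., bag-of-SIFT histograms
audio_train = rng.normal(size=(400, 60))     # e.g., pooled MFCC statistics
labels = rng.integers(0, 2, size=400)        # one binary category, e.g., "baseball"

visual_clf = SVC(kernel="rbf", probability=True).fit(visual_train, labels)
audio_clf = SVC(kernel="rbf", probability=True).fit(audio_train, labels)

def fused_score(visual_feats, audio_feats):
    """Late fusion: average the positive-class probabilities of the
    visual-only and audio-only classifiers."""
    v = visual_clf.predict_proba(visual_feats)[:, 1]
    a = audio_clf.predict_proba(audio_feats)[:, 1]
    return 0.5 * (v + a)

# Score a batch of test videos.
print(fused_score(rng.normal(size=(10, 500)), rng.normal(size=(10, 60))))
```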
Joint Audio-Visual Signatures for Web Video Analysis
Presentation of a video classification project, including the TRECVID MED2010 system.
VAEFL: Integrating variational autoencoders for privacy preservation and performance retention in federated learning
Federated Learning (FL) heralds a paradigm shift in the training of artificial intelligence (AI) models by fostering collaborative model training while safeguarding client data privacy. In sectors where data sensitivity and AI model security are of paramount importance, such as fintech and biomedicine, maintaining the utility of models without compromising privacy is crucial as the application of AI technologies grows. Therefore, the adoption of FL is attracting significant attention. However, traditional FL methods are susceptible to Deep Leakage from Gradients (DLG) attacks, and typical defensive strategies in current research, such as secure multi-party computation and differential privacy, often lead to excessive computational costs or significant decreases in model accuracy. To address DLG attacks in FL, this study introduces VAEFL, an innovative FL framework that incorporates Variational Autoencoders (VAEs) to enhance privacy protection without undermining the predictive prowess of the models. VAEFL strategically partitions the model into a private encoder and a public decoder. The private encoder, remaining local, transmutes sensitive data into a latent space fortified for privacy, while the public decoder and classifier, through collaborative training across clients, learn to derive precise predictions from the encoded data. This bifurcation ensures that sensitive data attributes are not disclosed, circumventing gradient leakage attacks while simultaneously allowing the global model to benefit from the diverse knowledge of client datasets. Comprehensive experiments demonstrate that VAEFL not only surpasses standard FL benchmarks in privacy preservation but also maintains competitive performance in predictive tasks. VAEFL thus establishes a novel equilibrium between data privacy and model utility, offering a secure and efficient FL approach for sensitive applications in the financial domain.
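A minimal sketch of the encoder/decoder partition described above, assuming a PyTorch implementation: each client keeps its encoder private, and only the decoder and classifier parameters are shared with the server. The layer sizes, dimensions, and aggregation step are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PrivateEncoder(nn.Module):
    """Stays on the client; never shared, so raw-data gradients never leave."""
    def __init__(self, in_dim=32, latent_dim=8):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

class SharedHead(nn.Module):
    """Public decoder + classifier; only these parameters are aggregated by the server."""
    def __init__(self, latent_dim=8, in_dim=32, n_classes=2):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, z):
        return self.decoder(z), self.classifier(z)

# Only the shared head's weights would be sent for federated averaging.
encoder, head = PrivateEncoder(), SharedHead()
x = torch.randn(16, 32)                      # a batch of local (sensitive) samples
z, mu, logvar = encoder(x)
recon, logits = head(z)
shared_update = {k: v.detach().clone() for k, v in head.state_dict().items()}
print(recon.shape, logits.shape, len(shared_update))
```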
The FOXK1-CCDC43 Axis Promotes the Invasion and Metastasis of Colorectal Cancer Cells
Background/Aims: The CCDC43 gene is conserved in human, rhesus monkey, mouse and zebrafish. Bioinformatics studies have demonstrated the abnormal expression of the CCDC43 gene in colorectal cancer (CRC). However, the role and molecular mechanism of CCDC43 in CRC remain unknown. Methods: The functional roles of CCDC43 and FOXK1 in epithelial-mesenchymal transition (EMT) were determined using immunohistochemistry, flow cytometry, western blot, EdU incorporation, luciferase, chromatin immunoprecipitation (ChIP) and cell invasion assays. Results: The CCDC43 gene was overexpressed in human CRC. High expression of CCDC43 protein was associated with tumor progression and poor prognosis in patients with CRC. Moreover, the induction of EMT by CCDC43 occurred through TGF-β signaling. Furthermore, a positive correlation between the expression patterns of CCDC43 and FOXK1 was observed in CRC cells. Promoter assays demonstrated that FOXK1 directly bound and activated the human CCDC43 gene promoter. In addition, CCDC43 was necessary for FOXK1-mediated EMT and metastasis in vitro and in vivo. Taken together, this work identified that CCDC43 promotes EMT and is a direct transcriptional target of FOXK1 in CRC cells. Conclusion: The FOXK1-CCDC43 axis might be helpful for developing drugs for the treatment of CRC.
Large-Scale Video Hashing via Structure Learning
Recently, learning-based hashing methods have become popular for indexing large-scale media data. Hashing methods map high-dimensional features to compact binary codes that are efficient to match and robust in preserving the original similarity. However, most existing hashing methods treat videos as a simple aggregation of independent frames and index each video by combining the indexes of its frames. The structural information of videos, e.g., discriminative local visual commonality and temporal consistency, is often neglected in the design of hash functions. In this paper, we propose a supervised method that explores structure learning techniques to design efficient hash functions. The proposed video hashing method formulates a minimization problem over a structure-regularized empirical loss. In particular, the structure regularization exploits the common local visual patterns occurring in video frames that are associated with the same semantic class, and simultaneously preserves the temporal consistency over successive frames from the same video. We show that the minimization objective can be efficiently solved by an Accelerated Proximal Gradient (APG) method. Extensive experiments on two large video benchmark datasets (up to around 150K video clips with over 12 million frames) show that the proposed method significantly outperforms state-of-the-art hashing methods.
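For readers unfamiliar with learning-based hashing, the sketch below shows the basic mapping from high-dimensional features to compact binary codes and the Hamming-distance matching step. The projection here is a simple random-projection-plus-thresholding baseline, not the paper's supervised, structure-regularized hash functions or its APG solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame/video features (rows) in a high-dimensional space.
features = rng.normal(size=(10000, 512))

# A simple (unsupervised) hash function: random projection + per-bit median
# thresholding. The paper instead *learns* the projection with a
# structure-regularized objective solved by Accelerated Proximal Gradient.
n_bits = 64
projection = rng.normal(size=(512, n_bits))
projected = features @ projection
thresholds = np.median(projected, axis=0)
codes = (projected > thresholds).astype(np.uint8)   # (10000, 64) binary codes

def hamming_search(query_feat, codes, top_k=5):
    """Hash the query with the same function and rank the database
    by Hamming distance between binary codes."""
    q_code = ((query_feat @ projection) > thresholds).astype(np.uint8)
    dists = np.count_nonzero(codes != q_code, axis=1)
    return np.argsort(dists)[:top_k]

print(hamming_search(rng.normal(size=512), codes))
```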
Columbia-UCF TRECVID 2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching
TRECVID Multimedia Event Detection offers an interesting but very challenging task: detecting high-level complex events (Figure 1) in user-generated videos. In this paper, we present an overview and comparative analysis of our results, which achieved top performance among all 45 submissions in TRECVID 2010. Our aim is to answer the following questions. What kind of feature is most effective for multimedia event detection? Are features from different modalities (e.g., audio and visual) complementary for event detection? Can we benefit from generic concept detection of background scenes, human actions, and audio concepts? Are sequence matching and event-specific object detectors critical? Our findings indicate that the spatial-temporal feature is very effective for event detection and is also very complementary to other features such as static SIFT and audio features. As a result, our baseline run combining these three features already achieves very impressive results, with a mean minimal normalized cost (MNC) of 0.586. Incorporating the generic concept detectors using a graph diffusion algorithm provides marginal gains (mean MNC 0.579). Sequence matching with Earth Mover's Distance (EMD) further improves the results (mean MNC 0.565). The event-specific detector ("batter"), however, did not prove useful in our current re-ranking tests. We conclude that it is important to combine strong complementary features from multiple modalities for multimedia event detection, and that cross-frame matching is helpful in coping with temporal-order variation. Leveraging contextual concept detectors and foreground activities remains a very attractive direction requiring further research.
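To make the EMD-based sequence matching step concrete, the sketch below computes the Earth Mover's Distance between two videos represented as sets of frame features, by solving the underlying transportation linear program with SciPy. Uniform frame weights and Euclidean ground distance are assumptions for illustration, not the exact setup of the system.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd(frames_a, frames_b):
    """Earth Mover's Distance between two frame-feature sets with uniform
    weights, solved as a transportation LP: minimize sum_ij f_ij * cost_ij
    subject to row sums = a_i, column sums = b_j, f_ij >= 0."""
    m, n = len(frames_a), len(frames_b)
    cost = cdist(frames_a, frames_b)                   # ground distance matrix
    a = np.full(m, 1.0 / m)
    b = np.full(n, 1.0 / n)

    # Equality constraints on the flattened flow variables f (m*n of them).
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0               # row marginals
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                        # column marginals
    b_eq = np.concatenate([a, b])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Two short videos with different frame counts but similar content.
rng = np.random.default_rng(0)
video_a = rng.normal(size=(8, 128))
video_b = video_a[::2] + 0.01 * rng.normal(size=(4, 128))
print(emd(video_a, video_b))
```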