    The AXES submissions at TrecVid 2013

    The AXES project participated in the interactive instance search task (INS), the semantic indexing task (SIN), the multimedia event recounting task (MER), and the multimedia event detection task (MED) for TRECVid 2013. Our interactive INS focused this year on using classifiers trained at query time with positive examples collected from external search engines. Our INS experiments were carried out by students and researchers at Dublin City University. Our best INS runs performed on par with the top-ranked INS runs in terms of P@10 and P@30, and around the median in terms of mAP. For SIN, MED, and MER, we use systems based on state-of-the-art local low-level descriptors for motion, image, and sound, as well as high-level features that capture speech and text from the audio and visual streams, respectively. The low-level descriptors are aggregated by means of Fisher vectors into high-dimensional video-level signatures, while the high-level features are aggregated into bag-of-words histograms. Using these features, we train linear classifiers and use early and late fusion to combine the different features. Our MED system achieved the best score of all submitted runs in the main track, as well as in the ad-hoc track. This paper describes in detail our INS, MER, and MED systems and the results and findings of our experiments.
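
The gap between the P@10/P@30 and mAP results above is easier to see with the metrics written out. The following is a minimal sketch of the textbook definitions of precision at k and average precision. The function names are illustrative, and the official TRECVid evaluation tools may differ in detail (for example, in how ties or unjudged shots are handled).

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant.

    relevance[i] is True when the item at rank i + 1 is relevant.
    Ranks beyond the end of the list count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[bool], num_relevant: int) -> float:
    """Average of the precision values at the rank of each relevant item.

    num_relevant is the total number of relevant items for the query,
    including those the ranked list never retrieved.
    """
    if num_relevant <= 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / num_relevant


def mean_average_precision(queries: Sequence[tuple]) -> float:
    """Mean of average_precision over (relevance, num_relevant) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, n) for r, n in queries) / len(queries)
```

For instance, a ranked list that places two relevant shots first but retrieves none of the remaining eight scores a precision of 1.0 at rank 2 and an average precision of only 0.2, which is the pattern of strong early precision and modest mAP described above.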

    Prioritizing Content of Interest in Multimedia Data Compression

    Image and video compression techniques make data transmission and storage in digital multimedia systems more efficient and feasible for the system's limited storage and bandwidth. Many generic image and video compression techniques such as JPEG and H.264/AVC have been standardized and are now widely adopted. Despite their great success, we observe that these standard compression techniques are not the best solution for data compression in special types of multimedia systems such as microscopy videos and low-power wireless broadcast systems. In these application-specific systems, where the content of interest in the multimedia data is known and well-defined, we should rethink the design of the data compression pipeline. We hypothesize that by identifying and prioritizing the content of interest in multimedia data, new compression methods can be invented that are far more effective than standard techniques. In this dissertation, a set of new data compression methods based on the idea of prioritizing the content of interest is proposed for three different kinds of multimedia systems. I will show that the key to designing efficient compression techniques in these three cases is to prioritize the content of interest in the data. The definition of the content of interest of multimedia data depends on the application. First, I show that for microscopy videos, the content of interest is defined as the spatial regions in the video frame whose pixels contain more than noise. Keeping the data in those regions at high quality and discarding other information yields a novel microscopy video compression technique. Second, I show that for a Bluetooth low energy beacon based system, practical multimedia data storage and transmission is possible by prioritizing the content of interest. I designed custom image compression techniques that preserve edges in a binary image, or foreground regions of a color image of indoor or outdoor objects.
Last, I present a new indoor Bluetooth low energy beacon based augmented reality system that integrates a 3D moving object compression method that prioritizes the content of interest.
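
As a rough illustration of the microscopy case, the sketch below marks a block as content of interest when its pixel variance exceeds an assumed noise variance, and flattens every other block to a constant so that any downstream codec spends almost no bits on it. The variance test, the block size, and the function names are assumptions made for illustration; they are not the dissertation's actual criterion or pipeline.

```python
from statistics import pvariance
from typing import List

Frame = List[List[int]]


def block_mask(frame: Frame, block: int, noise_var: float) -> List[List[bool]]:
    """Per-block mask: True where the block's variance exceeds noise_var.

    Assumption: a block whose variance stays at or below the noise
    variance is treated as containing only noise.
    """
    height, width = len(frame), len(frame[0])
    mask = []
    for by in range(0, height, block):
        row = []
        for bx in range(0, width, block):
            pixels = [
                frame[y][x]
                for y in range(by, min(by + block, height))
                for x in range(bx, min(bx + block, width))
            ]
            row.append(pvariance(pixels) > noise_var)
        mask.append(row)
    return mask


def suppress_background(frame: Frame, block: int, noise_var: float,
                        fill: int = 0) -> Frame:
    """Keep content-of-interest blocks unchanged and flatten the rest to fill."""
    mask = block_mask(frame, block, noise_var)
    return [
        [
            pixel if mask[y // block][x // block] else fill
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(frame)
    ]
```

The design choice here is to separate the decision of what matters from the encoding itself: once the noise-only regions are constant, a standard encoder can be applied unchanged to the result.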

    Large-scale interactive exploratory visual search

    Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search. We also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit rate visual search, which sends compressed visual words consisting of a vocabulary tree histogram and descriptor orientations rather than descriptors. Compact representation of video data is achieved by identifying the keyframes of a video, which can also help users comprehend visual content efficiently. We propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is one of the key issues for large-scale visual search, since there exist a large number of nearly identical images and videos. We propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection has been one of the solutions for bridging the semantic gap in visual search. We focus in particular on event detection centred on human actions. We propose an enhanced sparse coding scheme to model human actions. Our proposed approach is able to significantly reduce computational cost while achieving recognition accuracy highly comparable to that of state-of-the-art methods. Finally, we propose an integrated solution for addressing the prime challenges raised by large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search. It provides users with more robust results to support their exploration.

    EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

    We introduce EgoSchema, a very long-form video question-answering dataset and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that the length of the video clip alone does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks and datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%) on the EgoSchema multi-choice question answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io
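
The abstract does not spell out how a temporal certificate is measured. The sketch below rests on one reading, stated here as an assumption: a certificate is annotated as a set of (start, end) subclips in seconds that a verifier must watch to confirm an answer, and its length is the total time those subclips cover, with overlaps counted once. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def certificate_length(subclips: Iterable[Tuple[float, float]]) -> float:
    """Total duration covered by the subclips, counting overlaps once.

    Assumption: each subclip is a (start, end) pair in seconds with
    start <= end; this is an illustrative reading of a certificate set.
    """
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(subclips):
        if end < start:
            raise ValueError("subclip end must not precede its start")
        if current_end is None or start > current_end:
            # A new disjoint interval begins; close the previous one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching interval; extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

Under this reading, a three-minute clip whose answer can be verified from a single five-second moment has a short certificate regardless of the clip length, which is the distinction the abstract draws between clip length and temporal difficulty.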

    Video Understanding with Deep Networks

    Video understanding is one of the fundamental problems in computer vision. Videos provide more information to the image recognition task by adding a temporal component through which motion and other information can be additionally used. Encouraged by the success of deep convolutional neural networks (CNNs) on image classification, we extend the deep convolutional networks to video understanding by modeling both spatial and temporal information. To effectively utilize deep networks, we need a comprehensive understanding of convolutional neural networks. We first study the network on the domain of image retrieval. We show that for instance-level image retrieval, lower layers often perform better than the last layers in convolutional neural networks. We present an approach for extracting convolutional features from different layers of the networks and adopt VLAD encoding to encode features into a single vector for each image. Our work provides guidance for transferring deep convolutional networks to other tasks. We then propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose, we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Next, we propose a multitask learning model ActionFlowNet to train a single stream network directly from raw pixels to jointly estimate optical flow while recognizing actions with convolutional neural networks, capturing both appearance and motion in a single model. 
Experiments show that our model effectively learns video representations from motion information on unlabeled videos. While recent deep models for videos show improvement by incorporating optical flow or aggregating high-level appearance across frames, they focus on modeling either long-term temporal relations or short-term motion. We propose Temporal Difference Networks (TDN) that model both long-term relations and short-term motion from videos. We leverage a simple but effective motion representation, the difference of CNN features in our network, and jointly model the motion at multiple scales in a single CNN.
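
Two of the operations mentioned above can be sketched on plain per-frame feature vectors: pooling features over time into one video-level vector, and taking differences of features between consecutive frames as a motion cue. This is a minimal sketch in which lists of floats stand in for CNN activations; in the dissertation these operations sit inside trained networks, and the multi-scale modeling is not reproduced here.

```python
from typing import List, Sequence

Features = Sequence[Sequence[float]]


def temporal_max_pool(frame_features: Features) -> List[float]:
    """Element-wise maximum over frames, giving one video-level vector."""
    if not frame_features:
        raise ValueError("at least one frame is required")
    return [max(values) for values in zip(*frame_features)]


def feature_differences(frame_features: Features) -> List[List[float]]:
    """Difference between each frame's features and the previous frame's.

    Returns one vector per consecutive pair, so N frames give N - 1 vectors.
    """
    return [
        [curr_value - prev_value for prev_value, curr_value in zip(prev, curr)]
        for prev, curr in zip(frame_features, frame_features[1:])
    ]
```

Max pooling discards frame order, whereas the difference vectors keep a short-term notion of change between neighbouring frames; an ordered model such as an LSTM would consume the per-frame vectors in sequence instead.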

    Signature-based videos’ visual similarity detection and measurement

    The quantity of digital videos is huge, due to technological advances in video capture, storage and compression. However, the usefulness of these enormous volumes is limited by the effectiveness of content-based video retrieval (CBVR) systems, which still require time-consuming annotation/tagging to feed text-based search. Visual similarity is the core of these CBVR systems, where videos are matched based on their respective visual features and their evolution across video frames. It also acts as an essential foundational layer for inferring semantic similarity at a later stage, in combination with metadata. Furthermore, handling such amounts of video data, especially in the compressed domain, poses certain challenges for CBVR systems: speed, scalability and genericness. The situation is even more challenging given the availability of non-pixel features produced by compression, e.g. DC/AC coefficients and motion vectors, which require sophisticated processing. Thus, careful feature selection is important to realize visual similarity based matching within the boundaries of the aforementioned challenges. Matching speed is crucial because most current research is biased towards accuracy and leaves speed lagging behind, which in many cases limits practical use. Scalability is the key to benefiting from the enormous amounts of available video. Genericness is essential for developing systems that are applicable to both compressed and uncompressed videos. This thesis presents a signature-based framework for efficient visual similarity based video matching. The proposed framework represents a vital component for search and retrieval systems, where it could be used in three possible ways: (1) Directly for CBVR systems, where a user submits a query video and the system retrieves a ranked list of visually similar ones. (2) For text-based video retrieval systems, e.g. 
YouTube, where a user submits a textual description and the system retrieves a ranked list of relevant videos. The retrieval in this case works by finding videos that were manually assigned similar textual descriptions (annotations). For this scenario, the framework could be used to enhance the annotation process. This is achievable by suggesting a set of annotations for newly uploaded videos. These annotations are derived from other visually similar videos that can be retrieved by the proposed framework. In this way, the framework could make annotations more relevant to video contents (compared to the manual way), which improves the overall CBVR systems' performance as well. (3) The top-N matched list obtained by the framework could be used as an input to higher layers, e.g. semantic analysis, where it is easier to perform complex processing on this limited set of videos. The proposed framework addresses the aforementioned problems, i.e. speed, scalability and genericness, by encoding a given video shot into a single compact fixed-length signature. This signature is able to robustly encode the shot contents for later speedy matching and retrieval tasks. This is in contrast with the current research trend of using exhaustive, complex features/descriptors, e.g. dense trajectories. Moreover, towards a higher matching speed, the framework operates over a sequence of tiny images (DC-images) rather than full-size frames. This limits the need to fully decompress compressed videos, as the DC-images are extracted directly from the compressed stream. The DC-image is highly useful for complex processing, due to its small size compared to the full-size frame. In addition, it could be generated from uncompressed videos as well, while the proposed framework is still applicable in the same manner (genericness aspect). 
Furthermore, to capture visual similarity robustly, scene and motion information are extracted independently, to better address their different characteristics. Scene information is captured using a statistical representation of the scene's key colour profiles, while motion information is captured using a graph-based structure. The scene and motion information are then fused together to generate an overall video signature. The signature's compact, fixed-length form contributes to scalability. This is because compact fixed-length signatures are highly indexable entities, which facilitates the retrieval process over large-scale video data. The proposed framework is adaptive and provides two different fixed-length video signatures. Both work in a speedy and accurate manner, but with different degrees of matching speed and retrieval accuracy. Such granularity of the signatures is useful to accommodate different applications' trade-offs between speed and accuracy. The proposed framework was extensively evaluated using black-box tests for the overall fused signatures and white-box tests for its individual components. The evaluation was done on multiple challenging large datasets against a diverse set of state-of-the-art baselines. The results of the quantitative evaluation demonstrated the promise of the proposed framework for supporting real-time applications.
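
For the uncompressed case mentioned above, a DC-image can be approximated by block averaging. In block-DCT codecs such as JPEG and MPEG, the DC coefficient of an intra-coded 8x8 block is proportional to the mean intensity of that block, so averaging each 8x8 block of a decoded frame gives an equivalent tiny image. The sketch below assumes a single-channel frame stored as a list of rows; the function name is illustrative and this is not the thesis's extraction code.

```python
from typing import List, Sequence


def dc_image(frame: Sequence[Sequence[float]], block: int = 8) -> List[List[float]]:
    """Downsample a frame by replacing each block x block tile with its mean.

    Assumption: the frame dimensions are non-zero multiples of block,
    as is the case for frames padded to the codec's block grid.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    if height == 0 or width == 0 or height % block or width % block:
        raise ValueError("frame dimensions must be non-zero multiples of block")
    area = block * block
    return [
        [
            sum(
                frame[y][x]
                for y in range(by, by + block)
                for x in range(bx, bx + block)
            ) / area
            for bx in range(0, width, block)
        ]
        for by in range(0, height, block)
    ]
```

With 8x8 blocks the result has 1/64 of the original pixels, which is why operating on DC-images rather than full-size frames favours matching speed.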

    Compact and low-complexity binary feature descriptor and Fisher Vectors for video analytics

    In this paper, we propose a compact and low-complexity binary feature descriptor for video analytics. Our binary descriptor encodes the motion information of a spatio-temporal support region into a low-dimensional binary string. The descriptor is based on a binning strategy and a construction that binarizes separately the horizontal and vertical motion components of the spatio-temporal support region. We pair our descriptor with a novel Fisher Vector (FV) scheme for binary data to project a set of binary features into a fixed length vector in order to evaluate the similarity between feature sets. We test the effectiveness of our binary feature descriptor with FVs for action recognition, which is one of the most challenging tasks in computer vision, as well as gait recognition and animal behavior clustering. Several experiments on the KTH, UCF50, UCF101, CASIA-B, and TIGdog datasets show that the proposed binary feature descriptor outperforms the state-of-the-art feature descriptors in terms of computational time and memory and storage requirements. When paired with FVs, the proposed feature descriptor attains a very competitive performance, outperforming several state-of-the-art feature descriptors and some methods based on convolutional neural networks.

    StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

    Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption often used in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions. Finally, we establish guidelines for human evaluation of video stories, and reaffirm the need for better automatic metrics for video generation. StoryBench aims at encouraging future research efforts in this exciting new area.
