152 research outputs found

    Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor

    Get PDF
    We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams can also provide for selective, motion-adaptive, MB pixel decoding (a.k.a., MB texture decoding). This in turn allows for the derivation of spatio-temporal video activity regions at extremely high speed in comparison to conventional full-frame decoding followed by optical flow estimation. In order to evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive to the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature.Comment: Accepted in IEEE Transactions on Circuits and Systems for Video Technology. Extension of ICIP 2017 conference pape

    H4VDM: H.264 Video Device Matching

    Full text link
    Methods that can determine if two given video sequences are captured by the same device (e.g., mobile telephone or digital camera) can be used in many forensics tasks. In this paper we refer to this as "video device matching". In open-set video forensics scenarios it is easier to determine if two video sequences were captured with the same device than identifying the specific device. In this paper, we propose a technique for open-set video device matching. Given two H.264 compressed video sequences, our method can determine if they are captured by the same device, even if our method has never encountered the device in training. We denote our proposed technique as H.264 Video Device Matching (H4VDM). H4VDM uses H.264 compression information extracted from video sequences to make decisions. It is more robust against artifacts that alter camera sensor fingerprints, and it can be used to analyze relatively small fragments of the H.264 sequence. We trained and tested our method on a publicly available video forensics dataset consisting of 35 devices, where our proposed method demonstrated good performance

    Deep Learning for Action Understanding in Video

    Get PDF
    Action understanding is key to automatically analyzing video content and thus is important for many real-world applications such as autonomous driving car, robot-assisted care, etc. Therefore, in the computer vision field, action understanding has been one of the fundamental research topics. Most conventional methods for action understanding are based on hand-crafted features. Like the recent advances seen in image classification, object detection, image captioning, etc, deep learning has become a popular approach for action understanding in video. However, there remain several important research challenges in developing deep learning based methods for understanding actions. This thesis focuses on the development of effective deep learning methods for solving three major challenges. Action detection at fine granularities in time: Previous work in deep learning based action understanding mainly focuses on exploring various backbone networks that are designed for the video-level action classification task. These did not explore the fine-grained temporal characteristics and thus failed to produce temporally precise estimation of action boundaries. In order to understand actions more comprehensively, it is important to detect actions at finer granularities in time. In Part I, we study both segment-level action detection and frame-level action detection. Segment-level action detection is usually formulated as the temporal action localization task, which requires not only recognizing action categories for the whole video but also localizing the start time and end time of each action instance. To this end, we propose an effective multi-stage framework called Segment-CNN consisting of three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments that may contain actions; (2) a classification network learns one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. In another approach, frame-level action detection is effectively formulated as the per-frame action labeling task. We combine two reverse operations (i.e. convolution and deconvolution) into a joint Convolutional-De-Convolutional (CDC) filter, which simultaneously conducts downsampling in space and upsampling in time to jointly model both high-level semantics and temporal dynamics. We design a novel CDC network to predict actions at frame-level and the frame-level predictions can be further used to detect precise segment boundary for the temporal action localization task. Our method not only improves the state-of-the-art mean Average Precision (mAP) result on THUMOS’14 from 41.3% to 44.4% for the per-frame labeling task, but also improves mAP for the temporal action localization task from 19.0% to 23.3% on THUMOS’14 and from 16.4% to 23.8% on ActivityNet v1.3. Action detection in the constrained scenarios: The usual training process of deep learning models consists of supervision and data, which are not always available in reality. In Part II, we consider the scenarios of incomplete supervision and incomplete data. For incomplete supervision, we focus on the weakly-supervised temporal action localization task and propose AutoLoc which is the first framework that can directly predict the temporal boundary of each action instance with only the video-level annotations available during training. To enable the training of such a boundary prediction model, we design a novel Outer-Inner-Contrastive (OIC) loss to help discover the segment-level supervision and we prove that the OIC loss is differentiable to the underlying boundary prediction model. Our method significantly improves mAP on THUMOS14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. For the scenario of incomplete data, we formulate a novel task called Online Detection of Action Start (ODAS) in streaming videos to enable detecting the action start time on the fly when a live video action is just starting. ODAS is important in many applications such as early alert generation to allow timely security or emergency response. Specifically, we propose three novel methods to address the challenges in training ODAS models: (1) hard negative samples generation based on Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicitly modeling the temporal consistency between data around action start and data succeeding action start, and (3) adaptive sampling strategy to handle the scarcity of training data. Action understanding in the compressed domain: The mainstream action understanding methods including the aforementioned techniques developed by us require first decoding the compressed video into RGB image frames. This may result in significant cost in terms of storage and computation. Recently, researchers started to investigate how to directly perform action understanding in the compressed domain in order to achieve high efficiency while maintaining the state-of-the-art action detection accuracy. The key research challenge is developing effective backbone networks that can directly take data in the compressed domain as input. Our baseline is to take models developed for action understanding in the decoded domain and adapt them to attack the same tasks in the compressed domain. In Part III, we address two important issues in developing the backbone networks that exclusively operate in the compressed domain. First, compressed videos may be produced by different encoders or encoding parameters, but it is impractical to train a different compressed-domain action understanding model for each different format. We experimentally analyze the effect of video encoder variation and develop a simple yet effective training data preparation method to alleviate the sensitivity to encoder variation. Second, motion cues have been shown to be important for action understanding, but the motion vectors in compressed video are often very noisy and not discriminative enough for directly performing accurate action understanding. We develop a novel and highly efficient framework called DMC-Net that can learn to predict discriminative motion cues based on noisy motion vectors and residual errors in the compressed video streams. On three action recognition benchmarks, namely HMDB-51, UCF101 and a subset of Kinetics, we demonstrate that our DMC-Net can significantly shorten the performance gap between state-of-the-art compressed video based methods with and without optical flow, while being two orders of magnitude faster than the methods that use optical flow. By addressing the three major challenges mentioned above, we are able to develop more robust models for video action understanding and improve performance in various dimensions, such as (1) temporal precision, (2) required levels of supervision, (3) live video analysis ability, and finally (4) efficiency in processing compressed video. Our research has contributed significantly to advancing the state of the art of video action understanding and expanding the foundation for comprehensive semantic understanding of video content

    Methods and Tools for Image and Video Quality Assessment

    Get PDF
    Disertační práce se zabývá metodami a prostředky pro hodnocení kvality obrazu ve videosekvencích, což je velmi aktuální téma, zažívající velký rozmach zejména v souvislosti s digitálním zpracováním videosignálů. Přestože již existuje relativně velké množství metod a metrik pro objektivní, tedy automatizované měření kvality videosekvencí, jsou tyto metody zpravidla založeny na porovnání zpracované (poškozené, například komprimací) a originální videosekvence. Metod pro hodnocení kvality videosekvení bez reference, tedy pouze na základě analýzy zpracovaného materiálu, je velmi málo. Navíc se takové metody převážně zaměřují na analýzu hodnot signálu (typicky jasu) v jednotlivých obrazových bodech dekódovaného signálu, což je jen těžko aplikovatelné pro moderní komprimační algoritmy jako je H.264/AVC, který používá sofistikovené techniky pro odstranění komprimačních artefaktů. V práci je nejprve podán stučný přehled dostupných metod pro objektivní hodnocení komprimovaných videosekvencí se zdůrazněním rozdílného principu metod využívajících referenční materiál a metod pracujících bez reference. Na základě analýzy možných přístupů pro hodnocení video sekvencí komprimovaných moderními komprimačními algoritmy je v dalším textu práce popsán návrh nové metody určené pro hodnocení kvality obrazu ve videosekvencích komprimovaných s využitím algoritmu H.264/AVC. Nová metoda je založena na sledování hodnot parametrů, které jsou obsaženy v transportním toku komprimovaného videa, a přímo souvisí s procesem kódování. Nejprve je provedena úvaha nad vlivem některých takových parametrů na kvalitu výsledného videa. Následně je navržen algoritmus, který s využitím umělé neuronové sítě určuje špičkový poměr signálu a šumu (peak signal-to-noise ratio -- PSNR) v komprimované videosekvenci -- plně referenční metrika je tedy nahrazována metrikou bez reference. Je ověřeno několik konfigurací umělých neuronových sítí od těch nejjednodušších až po třívrstvé dopředné sítě. Pro učení sítí a následnou analýzu jejich výkonnosti a věrnosti určení PSNR jsou vytvořeny dva soubory nekomprimovaných videosekvencí, které jsou následně komprimovány algoritmem H.264/AVC s proměnným nastavením kodéru. V závěrečné části práce je proveden rozbor chování nově navrženého algoritmu v případě, že se změní vlastnosti zpracovávaného videa (rozlišení, střih), případně kodéru (formát skupiny současně kódovaných snímků). Chování algoritmu je analyzováno až do plného vysokého rozlišení zdrojového signálu (full HD -1920 x 1080 obrazových bodů).The doctoral thesis is focused on methods and tools for image quality assessment in video sequences, which is a very up-to-date theme, undergoing a rapid evolution with respect to digital video signal processing, in particular. Although a variety of metrics for objective (automated) video sequence quality measurement has been developed recently, these methods are mostly based on comparison of the processed (damaged, e.g. with compression) and original video sequences. There are very few methods operating without reference, i.e. only on the processed video material. Moreover, such methods are usually analyzing signal values (typically luminance) in picture elements of the decoded signal, which is hardly applicable for modern compression algorithms such as the H.264/AVC as they use sophisticated techniques to remove compression artifacts. The thesis first gives a brief overview of the available metrics for objective quality measurements of compressed video sequences, emphasizing the different approach of full-reference and no-reference methods. Based on an analysis of possible ideas for measuring quality of video sequences compressed using modern compression algorithms, the thesis describes the design process of a new quality metric for video sequences compressed with the H.264/AVC algorithm. The new method is based on monitoring of several parameters, present in the transport stream of the compressed video and directly related to the encoding process. The impact of bitstream parameters on the video quality is considered first. Consequently, an algorithm is designed, employing an artificial neural network to estimate the peak signal-to-noise ratios (PSNR) of the compressed video sequences -- a full-reference metric is thus replaced by a no--reference metric. Several neural network configurations are verified, reaching from the simplest to three-layer feedforward networks. Two sets of video sequences are constructed to train the networks and analyze their performance and fidelity of estimated PSNRs. The sequences are compressed using the H.264/AVC algorithm with variable encoder configuration. The final part of the thesis deals with an analysis of behavior of the newly designed algorithm, provided the properties of the processed video are changed (resolution, cut) or encoder configuration is altered (format of group of pictures coded together). The analysis is done on video sequences with resolution up to full HD (1920 x 1080 pixels, progressive)
    corecore