
    Feature extraction using MPEG-CDVS and Deep Learning with application to robotic navigation and image classification

    The main contributions of this thesis are the evaluation of the MPEG Compact Descriptors for Visual Search (CDVS) standard in the context of indoor robotic navigation and the introduction of a new method for training Convolutional Neural Networks (CNNs) with applications to object classification. The choice of image descriptor in a visual navigation system is not straightforward. Visual descriptors must be distinctive enough to allow correct localisation while still offering low matching complexity and a short descriptor size for real-time applications. MPEG CDVS is a low-complexity image descriptor standard that offers several levels of compromise between descriptor distinctiveness and size. In this work, we describe how these trade-offs can be used for efficient loop detection in a typical indoor environment. We first describe a probabilistic approach to loop detection based on the standard's suggested similarity metric. We then evaluate the performance of the CDVS compression modes in terms of matching speed, feature extraction time, and storage requirements, and compare them with the state-of-the-art SIFT descriptor for five different types of indoor floors. In the second part of this thesis we focus on the new paradigm in machine learning and computer vision called Deep Learning. Under this paradigm, visual features are no longer extracted using fine-grained, highly engineered feature extractors, but rather using Convolutional Neural Networks (CNNs) that extract hierarchical features learned directly from data, at the cost of long training periods. In this context, we propose a method for speeding up the training of CNNs by exploiting the spatial scaling property of convolutions. This is done by first training a CNN with kernels of smaller resolution for a few epochs, then rescaling its kernels to the target's original dimensions and continuing training at full resolution.
    We show that the overall training time of a target CNN architecture can be reduced by exploiting the spatial scaling property of convolutions during the early stages of learning. Moreover, by rescaling the kernels at different epochs, we identify a trade-off between total training time and maximum obtainable accuracy. Finally, we propose a method for choosing when to rescale kernels and evaluate our approach on recent architectures, showing savings in training time of nearly 20% while test-set accuracy is preserved.
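The kernel-rescaling step can be sketched in a few lines; this is a minimal numpy illustration under stated assumptions (the thesis does not specify its implementation, and the bilinear interpolation and 3×3 → 7×7 sizes below are choices made for the example):

```python
import numpy as np

def rescale_kernel(k, new_h, new_w):
    """Bilinearly rescale one 2-D convolution kernel to (new_h, new_w)."""
    h, w = k.shape
    ys = np.linspace(0.0, h - 1.0, new_h)
    xs = np.linspace(0.0, w - 1.0, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Interpolate along x on the top and bottom rows, then along y.
    top = k[np.ix_(y0, x0)] * (1 - wx) + k[np.ix_(y0, x1)] * wx
    bot = k[np.ix_(y1, x0)] * (1 - wx) + k[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

# A bank of 3x3 kernels trained in the cheap early phase ...
small_bank = np.random.randn(64, 3, 3)
# ... is rescaled to the target 7x7 resolution before full-resolution training.
large_bank = np.stack([rescale_kernel(k, 7, 7) for k in small_bank])
```

Training then simply resumes with the rescaled bank as the initial weights of the full-resolution layer.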

    Who is the director of this movie? Automatic style recognition based on shot features

    We show how low-level formal features, such as shot duration (the length of a camera take) and shot scale (the apparent distance between the camera and the subject), are distinctive of a director's style in art movies. Until now, such features were thought to lack sufficient variety to be distinctive of an author. However, our investigation of the full filmographies of six different authors (Scorsese, Godard, Tarr, Fellini, Antonioni, and Bergman), 120 movies in total analysed second by second, confirms that these shot-related features do not appear as random patterns in movies from the same director. For feature extraction we adopt methods based on both conventional and deep learning techniques. Our findings suggest that sequential feature patterns, i.e., how features evolve in time, are at least as important as the related feature distributions. To the best of our knowledge this is the first study dealing with automatic attribution of movie authorship, which opens up interesting lines of cross-disciplinary research on the impact of style on the aesthetic and emotional effects on viewers.
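The distinction between distribution and sequential features can be illustrated on toy data; the per-second annotations and the 1-5 coding of shot scales below are hypothetical, invented for the example:

```python
import numpy as np

# Hypothetical per-second shot-scale labels for one film, coded
# 1 (close-up) .. 5 (long shot); real annotations would span hours.
scale_per_second = np.array([1, 1, 2, 2, 2, 5, 5, 1, 3, 3])

# Distribution feature: the fraction of screen time spent at each scale.
distribution = np.bincount(scale_per_second, minlength=6)[1:] / scale_per_second.size

# Sequential feature: normalized transition counts between consecutive
# seconds, capturing how the scale evolves in time rather than how
# often each value occurs.
transitions = np.zeros((5, 5))
for a, b in zip(scale_per_second[:-1], scale_per_second[1:]):
    transitions[a - 1, b - 1] += 1
transitions /= transitions.sum()
```

Two directors could share the same `distribution` yet differ sharply in `transitions`, which is why the sequential view adds discriminative power.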

    ARCHANGEL: Tamper-proofing Video Archives using Temporal Content Hashes on the Blockchain

    We present ARCHANGEL, a novel distributed-ledger-based system for assuring the long-term integrity of digital video archives. First, we describe a novel deep network architecture for computing compact temporal content hashes (TCHs) from audio-visual streams with durations of minutes or hours. Our TCHs are sensitive to accidental or malicious content modification (tampering) but invariant to the codec used to encode the video. This is necessary due to the curatorial requirement for archives to format-shift video over time to ensure future accessibility. Second, we describe how the TCHs (and the models used to derive them) are secured via a proof-of-authority blockchain distributed across multiple independent archives. We report on the efficacy of ARCHANGEL within the context of a trial deployment in which the national government archives of the United Kingdom, Estonia and Norway participated. Comment: Accepted to the CVPR Blockchain Workshop 201
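The ledger side of the design can be pictured with a toy hash chain; this is a hypothetical sketch only (ARCHANGEL uses a proof-of-authority blockchain across archives, and the record fields below are invented for the example):

```python
import hashlib
import json

def append_block(chain, tch_hex, video_id):
    """Append a temporal content hash (TCH) record to a toy hash-chained ledger.

    Each block commits to the previous block's hash, so altering any
    stored TCH invalidates every later block.
    """
    prev = chain[-1]["block_hash"] if chain else "0" * 64
    payload = {"video_id": video_id, "tch": tch_hex, "prev": prev}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    chain.append({**payload, "block_hash": digest})

# Each participating archive would append the TCH of a newly ingested video.
ledger = []
append_block(ledger, "ab" * 32, "archive/video-001")
```

Verification later recomputes the TCH from the (possibly format-shifted) video and checks it against the immutable ledger entry.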

    Visual Analysis Algorithms for Embedded Systems

    Visual search systems are very popular applications, but online versions in 3G wireless environments suffer from network constraints, such as unstable or limited bandwidth, that entail latency in query delivery and significantly degrade the user's experience. An alternative is to exploit the ability of the newest mobile devices to perform heterogeneous activities: not only creating but also processing images. Visual feature extraction and compression can be performed on on-board Graphics Processing Units (GPUs), making smartphones capable of detecting a generic object exactly (matching) or of performing a classification activity. The latest trends in visual search have resulted in dedicated standardization efforts in MPEG, namely the MPEG CDVS (Compact Descriptors for Visual Search) standard, an ISO/IEC standard for extracting a compressed descriptor. As regards classification, in recent years neural networks have gained remarkable importance and have been applied to several domains. This thesis focuses on the use of deep neural networks to classify images by means of deep learning. Implementing visual search algorithms and deep-learning-based classification on embedded environments is not a mere code-porting activity. Recent embedded devices, such as development boards, are equipped with powerful but limited resources, including general-purpose GPUs (GPGPUs). GPU architectures fit particularly well because they allow many operations to be executed in parallel, following the SIMD (Single Instruction, Multiple Data) paradigm. Nonetheless, good design choices are necessary for the best use of the available hardware and memory. For visual search, following the MPEG CDVS standard, the contribution of this thesis is an efficient feature computation phase and a parallel CDVS detector, completely implemented on embedded devices supporting the OpenCL framework.
    Algorithmic choices and implementation details targeting the intrinsic characteristics of the selected embedded platforms are presented and discussed. Experimental results on several GPUs show that the GPU-based solution is up to 7× faster than the CPU-based one. This speed-up opens new visual search scenarios that exploit entirely on-board, real-time computation with no data transfer. As regards the use of deep convolutional neural networks for off-line image classification, their computational and memory requirements are huge, which is an issue on embedded devices. Most of the complexity derives from the convolutional layers, and in particular from the matrix multiplications they entail. The contribution of this thesis is a self-contained implementation of image classification providing the common layers used in neural networks. The approach relies on a heterogeneous CPU-GPU scheme for performing convolutions in the transform domain. Experimental results show that the heterogeneous scheme described in this thesis boasts a 50× speed-up over the CPU-only reference and outperforms a GPU-based reference by 2×, while cutting power consumption by nearly 30%.
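The transform-domain idea behind the heterogeneous scheme can be sketched in a few lines; this numpy version only illustrates the principle (the thesis splits the work across CPU and GPU, and the padding choice here is an assumption):

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Full 2-D linear convolution via pointwise products in the frequency domain.

    Spatial convolution becomes element-wise multiplication of spectra,
    replacing the per-pixel multiply-accumulate loops of a direct
    convolution with two FFTs and one inverse FFT.
    """
    h = image.shape[0] + kernel.shape[0] - 1   # pad to the full output size
    w = image.shape[1] + kernel.shape[1] - 1
    spec = np.fft.rfft2(image, s=(h, w)) * np.fft.rfft2(kernel, s=(h, w))
    return np.fft.irfft2(spec, s=(h, w))
```

For the large kernels and many channels of a convolutional layer, the asymptotic saving of the transform route is what makes the CPU-GPU split pay off.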

    Pornography detection in videos using deep learning techniques and motion information

    Advisors: Anderson de Rezende Rocha, Vanessa Testoni. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Abstract: With the exponential growth of video footage available online, manual human moderation of sensitive scenes, e.g., pornography, violence and crowds, has become infeasible, increasing the need for automated filtering. In this vein, a great number of works have explored the pornography detection problem, using approaches ranging from skin and nudity detection to local features and bags of visual words. Yet these techniques suffer from ambiguous cases (e.g., beach scenes, wrestling), producing too many false positives.
This is possibly related to the fact that these approaches are somewhat outdated, and that few authors have used the motion information present in videos, which could be crucial for the visual disambiguation of these cases. Setting forth to overcome these issues, in this work we explore deep learning solutions to the problem of pornography detection in videos, taking into account both the static and the motion information available for each questioned video. When incorporating the complementary static and motion features, the proposed method outperforms the existing solutions in the literature. Although deep learning approaches, more specifically Convolutional Neural Networks (CNNs), have achieved striking results on other vision-related problems, such promising methods are still not sufficiently explored for pornography detection with motion information incorporated. We also propose novel ways of combining the static and the motion information using CNNs that have not been explored for pornography detection, nor in other action recognition tasks before. More specifically, we explore two distinct sources of motion information: Optical Flow displacement fields, which have traditionally been used for video classification; and MPEG Motion Vectors. Although Motion Vectors have already been used for pornography detection in the literature, in this work we adapt them by finding an appropriate visual representation before feeding them to a convolutional neural network for feature learning and extraction. Our experiments show that although the MPEG Motion Vectors technique performs worse in isolation than its Optical Flow counterpart, it yields a similar performance when complementing the static information, with the advantage of being present, by construction, in the video while decoding the frames, avoiding the need for the more expensive Optical Flow computation.
Our best approach outperforms existing methods in the literature on different datasets. For the Pornography 800 dataset, it yields a classification accuracy of 97.9%, an error reduction of 64.4% when compared to the state of the art (94.1% on this dataset). Finally, on the more challenging Pornography 2k dataset, our best method yields a classification accuracy of 96.4%, reducing the classification error by 14.3% compared to the state of the art (95.8% on the same dataset).
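One plausible way to turn block-level motion vectors into a CNN-friendly image, in the spirit described above, is to map direction and magnitude to separate channels; the dissertation's exact encoding is not reproduced here, and this particular mapping is an assumption made for illustration:

```python
import numpy as np

def mv_to_image(mv):
    """Map a (H, W, 2) motion-vector field to a 3-channel uint8 image.

    Channel 0 encodes direction, channel 1 encodes normalized magnitude,
    channel 2 is a constant plane, so a standard RGB-input CNN can
    consume the motion field without architectural changes.
    """
    mag = np.linalg.norm(mv, axis=2)
    ang = (np.arctan2(mv[..., 1], mv[..., 0]) + np.pi) / (2.0 * np.pi)
    norm_mag = mag / (mag.max() + 1e-8)
    img = np.stack([ang, norm_mag, np.ones_like(mag)], axis=2)
    return (img * 255.0).astype(np.uint8)

field = np.random.randn(45, 80, 2)     # one vector per macroblock
frame = mv_to_image(field)             # ready to feed a CNN
```

Because the vectors come for free during decoding, this representation avoids the optical-flow computation entirely.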

    Learning to Detect Violent Videos using Convolutional Long Short-Term Memory

    Developing a technique for the automatic analysis of surveillance videos in order to identify the presence of violence is of broad interest. In this work, we propose a deep neural network for recognizing violent videos. A convolutional neural network is used to extract frame-level features from a video. The frame-level features are then aggregated using a variant of the long short-term memory (LSTM) that uses convolutional gates. The convolutional neural network, together with the convolutional LSTM, is capable of capturing localized spatio-temporal features, which enables the analysis of local motion taking place in the video. We also propose to use adjacent frame differences as the input to the model, thereby forcing it to encode the changes occurring in the video. The performance of the proposed feature extraction pipeline is evaluated on three standard benchmark datasets in terms of recognition accuracy. Comparison with state-of-the-art techniques reveals the promising capability of the proposed method in recognizing violent videos. Comment: Accepted to the International Conference on Advanced Video and Signal-based Surveillance (AVSS 2017).
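The adjacent-frame-difference input described above amounts to a one-line preprocessing step; a minimal sketch, with clip shapes assumed for the example:

```python
import numpy as np

def frame_differences(video):
    """Adjacent frame differences for a clip shaped (T, H, W, C).

    A static scene maps to all-zero inputs, so the network is forced
    to encode change (motion) rather than static appearance.
    """
    video = video.astype(np.float32)
    return video[1:] - video[:-1]

clip = np.random.rand(16, 64, 64, 3)   # 16-frame clip
diffs = frame_differences(clip)        # (15, 64, 64, 3): what the network sees
```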

    Structural learning for large scale image classification

    To leverage large-scale collaboratively-tagged (loosely-tagged) images for training a large number of classifiers to support large-scale image classification, we need to develop new frameworks that deal with the following issues: (1) spam tags, i.e., tags that are not relevant to the semantics of the images; (2) loose object tags, i.e., multiple object tags given loosely at the image level without their locations in the images; (3) missing object tags, i.e., object tags missed due to incomplete tagging; (4) inter-related object classes, i.e., object classes that are visually correlated and whose classifiers need to be trained jointly instead of independently; (5) large-scale object classes, which require limiting the computational time complexity of classifier training algorithms as well as the storage space for intermediate results. To deal with these issues, we propose a structural learning framework consisting of the following key components: (1) cluster-based junk image filtering to address the issue of spam tags; (2) automatic tag-instance alignment to address the issue of loose object tags; (3) automatic missing object tag prediction; (4) an object correlation network for inter-class visual correlation characterization to address the issue of missing tags; (5) large-scale structural learning with the object correlation network to enhance the discrimination power of object classifiers. To obtain enough labeled training images, our proposed framework leverages abundant web images and their social tags. To make those web images usable, tag cleansing has to be done to neutralize the noise from user tagging preferences, in particular junk tags, loose tags and missing tags.
    Then a discriminative learning algorithm is developed to train a large number of inter-related classifiers for achieving large-scale image classification, e.g., learning a large number of classifiers for categorizing large-scale image collections into a large number of inter-related object classes and image concepts. A visual concept network is first constructed to organize the numerous object classes and image concepts according to their inter-concept visual correlations. The visual concept network is further used to: (a) identify inter-related learning tasks for classifier training; (b) determine groups of visually-similar object classes and image concepts; and (c) estimate the learning complexity for classifier training. A large-scale discriminative learning algorithm is developed to support multi-class classifier training and achieve accurate inter-group discrimination and effective intra-group separation. Our discriminative learning algorithm can significantly enhance the discrimination power of the classifiers and dramatically reduce the computational cost of large-scale classifier training.
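The visual concept network step can be illustrated as follows; a minimal sketch assuming each class is summarized by a mean feature vector and that inter-concept correlation is measured by cosine similarity (the actual correlation measure used in the thesis may differ):

```python
import numpy as np

def concept_network(prototypes, threshold=0.8):
    """Cosine-similarity graph over per-class feature prototypes.

    prototypes: (n_classes, d) array, one mean feature vector per class.
    Returns the inter-concept correlation matrix and a boolean adjacency
    of inter-related classes obtained by thresholding it; connected
    classes are candidates for joint classifier training.
    """
    unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    corr = unit @ unit.T
    adjacency = corr >= threshold
    return corr, adjacency

protos = np.random.randn(10, 128)   # hypothetical prototypes for 10 classes
corr, adj = concept_network(protos)
```

Groups read off `adj` then drive both the grouping of visually-similar classes and the identification of inter-related learning tasks.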