744 research outputs found

    An In-Depth Analysis of Visual Tracking with Siamese Neural Networks

    This survey presents a deep analysis of the learning and inference capabilities of nine popular trackers. It is intended neither to study the whole literature nor to review all kinds of neural networks proposed for visual tracking. We focus instead on Siamese neural networks, which are a promising starting point for studying the challenging problem of tracking. These networks efficiently integrate feature learning and temporal matching and have so far shown state-of-the-art performance. In particular, the branches of Siamese networks, the layers connecting these branches, specific aspects of training, and the embedding of these networks into the tracker are highlighted. Quantitative results from existing papers are compared, with the conclusion that the current evaluation methodology shows problems with the reproducibility and comparability of results. The paper proposes a novel Lisp-like formalism for a better comparison of trackers. This assumes a certain functional design and functional decomposition of trackers. The paper tries to give a foundation for tracker design by formulating the problem based on the theory of machine learning and by interpreting a tracker as a decision function. The work concludes with promising lines of research and suggests future work.
    Comment: submitted to IEEE TPAM

    Extraction and Classification of Diving Clips from Continuous Video Footage

    Due to recent advances in technology, the recording and analysis of video data have become an increasingly common component of athlete training programmes. Today it is easy and affordable to set up a fixed camera and record athletes in a wide range of sports, such as diving, gymnastics, golf, and tennis. However, the manual analysis of the obtained footage is a time-consuming task which involves isolating actions of interest and categorizing them using domain-specific knowledge. In order to automate this kind of task, three challenging sub-problems are often encountered: 1) temporally cropping events/actions of interest from continuous video; 2) tracking the object of interest; and 3) classifying the events/actions of interest. Most previous work has focused on solving just one of the above sub-problems in isolation. In contrast, this paper provides a complete solution to the overall action monitoring task in the context of a challenging real-world exemplar. Specifically, we address the problem of diving classification. This is a challenging problem since the person (diver) of interest typically occupies fewer than 1% of the pixels in each frame. The model is required to learn the temporal boundaries of a dive, even though other divers and bystanders may be in view. Finally, the model must be sensitive to subtle changes in body pose over a large number of frames to determine the classification code. We provide effective solutions to each of the sub-problems which combine to provide a highly functional solution to the task as a whole. The techniques proposed can be easily generalized to video footage recorded from other sports.
    Comment: To appear at CVsports 201

    Multiple People Tracking Using Hierarchical Deep Tracklet Re-identification

    The task of multiple people tracking in monocular videos is challenging because of the numerous difficulties involved: occlusions, varying environments, crowded scenes, camera parameters and motion. In the tracking-by-detection paradigm, most approaches adopt person re-identification techniques based on computing the pairwise similarity between detections. However, these techniques are less effective in handling long-term occlusions. By contrast, tracklet (a sequence of detections) re-identification can improve association accuracy since tracklets offer a richer set of visual appearance and spatio-temporal cues. In this paper, we propose a tracking framework that employs a hierarchical clustering mechanism for merging tracklets. To this end, tracklet re-identification is performed by utilizing a novel multi-stage deep network that can jointly reason about the visual appearance and spatio-temporal properties of a pair of tracklets, thereby providing a robust measure of affinity. Experimental results on the challenging MOT16 and MOT17 benchmarks show that our method significantly outperforms state-of-the-art methods.
    Comment: 13 pages (8 main + 2 bibliography + 5 appendices

    Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey

    Deep learning has recently achieved very promising results in a wide range of areas such as computer vision, speech recognition and natural language processing. It aims to learn hierarchical representations of data by using deep architecture models. In a smart city, a lot of data (e.g. videos captured from many distributed sensors) need to be automatically processed and analyzed. In this paper, we review the deep learning algorithms applied to video analytics for smart cities in terms of different research topics: object detection, object tracking, face recognition, image classification and scene labeling.
    Comment: 8 pages, 18 figure

    Unsupervised Learning Layers for Video Analysis

    This paper presents two unsupervised learning layers (UL layers) for label-free video analysis: one for fully connected layers, and the other for convolutional ones. The proposed UL layers can play two roles: they can serve as the cost function layer, providing a global training signal; they can also be added to any regular neural network layer to provide local training signals, which are combined with the training signals backpropagated from upper layers to extract both slow- and fast-changing features at layers of different depths. Therefore, the UL layers can be used in either purely unsupervised or semi-supervised settings. Both a closed-form solution and an online learning algorithm for the two UL layers are provided. Experiments with unlabeled synthetic and real-world videos demonstrated that neural networks equipped with UL layers and trained with the proposed online learning algorithm can extract shape and motion information from video sequences of moving objects. The experiments demonstrated the potential applications of the UL layers and the online learning algorithm to head orientation estimation and moving object localization.

    Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking

    Most thermal infrared (TIR) tracking methods are discriminative, treating the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is coupled well to the objective of the tracking task. We propose a TIR tracker via a Hierarchical Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces the multiple hierarchical convolutional layers. Then, we propose a spatial-aware network to enhance the discriminative ability of the coalesced hierarchical feature. Subsequently, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the candidate that is most similar to the tracked target. Extensive experimental results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed method achieves favourable performance compared to the state-of-the-art methods.
    Comment: 20 pages, 7 figure

    End-to-end Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning

    The majority of contemporary object-tracking approaches do not model interactions between objects. This contrasts with the fact that objects' paths are not independent: a cyclist might abruptly deviate from a previously planned trajectory in order to avoid colliding with a car. Building upon HART, a neural class-agnostic single-object tracker, we introduce a multi-object tracking method, MOHART, capable of relational reasoning. Importantly, the entire system, including the understanding of interactions and relations between objects, is class-agnostic and learned simultaneously in an end-to-end fashion. We explore a number of relational reasoning architectures and show that permutation-invariant models outperform non-permutation-invariant alternatives. We also find that architectures using a single permutation-invariant operation like DeepSets, despite, in theory, being universal function approximators, are nonetheless outperformed by a more complex architecture based on multi-headed attention. The latter better accounts for complex physical interactions in a challenging toy experiment. Further, we find that modelling interactions leads to consistent performance gains in tracking as well as future trajectory prediction on three real-world datasets (MOTChallenge, UA-DETRAC, and Stanford Drone dataset), particularly in the presence of ego-motion, occlusions, crowded scenes, and faulty sensor inputs.

    A Review on Deep Learning Techniques Applied to Semantic Segmentation

    Image semantic segmentation is of increasing interest to computer vision and machine learning researchers. Many emerging applications need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems, to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation and scene understanding. This paper provides a review of deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as mandatory background concepts. Next, the main datasets and challenges are presented to help researchers decide which ones best suit their needs and their targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. Finally, quantitative results are given for the described methods and the datasets on which they were evaluated, followed by a discussion of the results. Lastly, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.
    Comment: Submitted to TPAMI on Apr. 22, 201

    Downstream Task Self-Supervised Learning for Object Recognition and Tracking

    This dissertation addresses three limitations of deep learning methods in image and video understanding-based machine vision applications. Firstly, although deep convolutional neural networks (CNNs) are efficient for image recognition applications such as object detection and segmentation, they perform poorly under perspective distortions. In real-world applications, the camera perspective is a common problem that we can address by annotating large amounts of data, thus limiting the applicability of the deep learning models. Secondly, the typical approach for single-camera tracking problems is to use separate motion and appearance models, which are expensive in terms of computations and training data requirements. Finally, conventional multi-camera video understanding techniques use supervised learning algorithms to determine temporal relationships among objects. In large-scale applications, these methods are also limited by the requirement of extensive manually annotated data and computational resources. To address these limitations, we develop an uncertainty-aware self-supervised learning (SSL) technique that captures a model's instance or semantic segmentation uncertainty from overhead images and guides the model to learn the impact of the new perspective on object appearance. The test-time data augmentation-based pseudo-label refinement technique continuously trains a model until convergence on new perspective images. The proposed method can be applied for both self-supervision and semi-supervision, thus increasing the effectiveness of a deep pre-trained model in new domains. Extensive experiments demonstrate the effectiveness of the SSL technique in both object detection and semantic segmentation problems. In video understanding applications, we introduce simultaneous segmentation and tracking as an unsupervised spatio-temporal latent feature clustering problem. The jointly learned multi-task features leverage the task-dependent uncertainty to generate discriminative features in multi-object videos. Experiments have shown that the proposed tracker outperforms several state-of-the-art supervised methods. Finally, we propose an unsupervised multi-camera tracklet association (MCTA) algorithm to track multiple objects in real time. MCTA leverages the self-supervised detector model for single-camera tracking and solves the multi-camera tracking problem using multiple pair-wise camera associations modeled as a connected graph. The graph optimization method generates a global solution for partially or fully overlapping camera networks.

    Spatiotemporal Filtering for Event-Based Action Recognition

    In this paper, we address the challenging problem of action recognition using event-based cameras. To recognise most gestural actions, higher temporal precision is often required for sampling visual information. Actions are defined by motion, and therefore, when using event-based cameras, it is often unnecessary to re-sample the entire scene. Neuromorphic, event-based cameras have presented an alternative to visual information acquisition by asynchronously time-encoding pixel intensity changes through temporally precise spikes (10 microsecond resolution), making them well equipped for action recognition. However, other challenges exist which are intrinsic to event-based imagers, such as higher signal-to-noise ratio and spatiotemporally sparse information. One option is to convert event data into frames, but this could result in significant temporal precision loss. In this work we introduce spatiotemporal filtering in the spike-event domain as an alternative way of channeling spatiotemporal information through to a convolutional neural network. The filters are local spatiotemporal weight matrices, learned from the spike-event data in an unsupervised manner. We find that appropriate spatiotemporal filtering significantly improves CNN performance beyond state-of-the-art on the event-based DVS Gesture dataset. On our newly recorded action recognition dataset, our method shows significant improvement when compared with other, standard ways of generating the spatiotemporal filters.
    Comment: Submitted to IEEE Transactions in Pattern Analysis and Machine Intelligenc