
    Robust Visual Tracking via Hierarchical Convolutional Features

    In this paper, we propose to exploit the rich hierarchical features of deep convolutional neural networks to improve the accuracy and robustness of visual tracking. Deep neural networks trained on object recognition datasets consist of multiple convolutional layers. These layers encode target appearance at different levels of abstraction. For example, the outputs of the last convolutional layers encode the semantic information of targets, and such representations are invariant to significant appearance variations. However, their spatial resolution is too coarse to precisely localize the target. In contrast, features from earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchical features of convolutional layers as a nonlinear counterpart of an image pyramid representation and explicitly exploit these multiple levels of abstraction to represent target objects. Specifically, we learn adaptive correlation filters on the outputs of each convolutional layer to encode the target appearance. We infer the maximum response of each layer to locate targets in a coarse-to-fine manner. To further handle scale estimation and to re-detect targets after tracking failures caused by heavy occlusion or out-of-view movement, we conservatively learn another correlation filter that maintains a long-term memory of target appearance and serves as a discriminative classifier. We apply the classifier to two types of object proposals: (1) proposals with a small step size, tightly around the estimated location, for scale estimation; and (2) proposals with a large step size, across the whole image, for target re-detection. Extensive experimental results on large-scale benchmark datasets show that the proposed algorithm performs favorably against state-of-the-art tracking methods. Comment: To appear in T-PAMI 2018; project page at https://sites.google.com/site/chaoma99/hcft-trackin
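
    A minimal NumPy sketch of the mechanism described above: a closed-form correlation filter is learnt on each convolutional layer's output, and the per-layer responses are fused to locate the target. The feature maps, layer weights, and Gaussian label are illustrative stand-ins, and the weighted sum is a simplification of the paper's coarse-to-fine inference.

        import numpy as np

        def learn_filter(x, y, lam=1e-4):
            # Closed-form linear correlation filter in the Fourier domain.
            X = np.fft.fft2(x)
            return (np.conj(X) * np.fft.fft2(y)) / (np.conj(X) * X + lam)

        def response(w_f, z):
            # Correlation response of a learnt filter on a new feature map.
            return np.real(np.fft.ifft2(w_f * np.fft.fft2(z)))

        h = 64
        gy, gx = np.mgrid[0:h, 0:h]
        label = np.exp(-((gy - h // 2) ** 2 + (gx - h // 2) ** 2) / 18.0)
        label = np.roll(label, (-(h // 2), -(h // 2)), axis=(0, 1))  # peak at (0, 0)

        # Stand-ins for per-layer CNN features, resized to a common grid;
        # deeper (more semantic) layers get larger fusion weights.
        feats = {"conv3": np.random.randn(h, h),
                 "conv4": np.random.randn(h, h),
                 "conv5": np.random.randn(h, h)}
        weights = {"conv3": 0.25, "conv4": 0.5, "conv5": 1.0}

        filters = {name: learn_filter(f, label) for name, f in feats.items()}

        # Weighted fusion of per-layer responses; the argmax gives the new position.
        fused = sum(weights[k] * response(filters[k], feats[k]) for k in feats)
        dy, dx = np.unravel_index(fused.argmax(), fused.shape)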

    Recurrent Filter Learning for Visual Tracking

    Recently, convolutional neural networks (CNNs) have gained popularity in visual tracking due to their robust feature representation of images. Recent methods perform online tracking by fine-tuning a pre-trained CNN model to the specific target object using stochastic gradient descent (SGD) back-propagation, which is usually time-consuming. In this paper, we propose a recurrent filter generation method for visual tracking. We directly feed the target's image patch to a recurrent neural network (RNN) to estimate an object-specific filter for tracking. Since a video sequence is spatiotemporal data, we extend the matrix multiplications of the fully-connected layers of the RNN to convolution operations on feature maps, which preserves the target's spatial structure and is also memory-efficient. The tracked object in subsequent frames is fed back into the RNN to adapt the generated filters to appearance variations of the target. Note that once the offline training of our network is finished, there is no need to fine-tune the network for specific objects, which makes our approach more efficient than methods that use iterative fine-tuning to learn the target online. Extensive experiments conducted on the widely used OTB and VOT benchmarks demonstrate encouraging results compared to other recent methods. Comment: ICCV2017 Workshop on VOT
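
    A hedged PyTorch sketch of the filter-generation idea: target-patch features update a convolutional recurrent state (the RNN's fully-connected update extended to a convolution), which is collapsed into an object-specific filter and correlated with search-region features. The tiny cell, pooling step, and all shapes are illustrative assumptions, not the paper's architecture.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class FilterGenerator(nn.Module):
            """Recurrently turns target-patch features into a tracking filter."""
            def __init__(self, c=32, ksize=5):
                super().__init__()
                self.ksize = ksize
                # Fully-connected RNN update extended to a convolution, so the
                # hidden state keeps the target's spatial structure.
                self.update = nn.Conv2d(2 * c, c, 3, padding=1)
                self.pool = nn.AdaptiveAvgPool2d(ksize)

            def forward(self, target_feat, state):
                state = torch.tanh(self.update(torch.cat([target_feat, state], dim=1)))
                filt = self.pool(state)          # (1, c, ksize, ksize)
                return filt, state

        gen = FilterGenerator()
        state = torch.zeros(1, 32, 25, 25)
        target_feat = torch.randn(1, 32, 25, 25)   # features of the target patch
        search_feat = torch.randn(1, 32, 50, 50)   # features of the search region

        # Generate an object-specific filter and correlate it with the search
        # region; feeding later tracked patches back in adapts the filter.
        filt, state = gen(target_feat, state)
        resp = F.conv2d(search_feat, filt, padding=gen.ksize // 2)  # (1, 1, 50, 50)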

    Kernalised Multi-resolution Convnet for Visual Tracking

    Visual tracking is intrinsically a temporal problem. Discriminative Correlation Filters (DCF) have demonstrated excellent performance for high-speed generic visual object tracking. Built upon this seminal work, there has been a plethora of recent improvements relying on a convolutional neural network (CNN) pretrained on ImageNet as a feature extractor for visual tracking. However, most of these works rely on ad hoc analysis to design the weights for different layers, using either boosting or hedging techniques to form an ensemble tracker. In this paper, we go beyond the conventional DCF framework and propose a Kernalised Multi-resolution Convnet (KMC) formulation that utilises hierarchical response maps to directly output the target movement. When the learnt network is directly deployed to predict on an unseen, challenging UAV tracking dataset without any weight adjustment, the proposed model consistently achieves excellent tracking performance. Moreover, the transferred multi-resolution CNN can be integrated into an RNN temporal learning framework, opening the door to end-to-end temporal deep learning (TDL) for visual tracking. Comment: CVPRW 201
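
    A hedged PyTorch sketch of regressing target movement directly from hierarchical response maps: responses from several layers are upsampled to a common resolution, stacked, and mapped to a displacement. The head architecture and map sizes are illustrative assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MovementHead(nn.Module):
            def __init__(self, n_maps=2, size=31):
                super().__init__()
                self.conv = nn.Conv2d(n_maps, 8, 3, padding=1)
                self.fc = nn.Linear(8 * size * size, 2)   # outputs (dx, dy)

            def forward(self, maps):
                # maps: correlation responses from different layers, upsampled
                # to a common resolution before stacking.
                size = maps[0].shape[-1]
                maps = [F.interpolate(m, size=size, mode="bilinear",
                                      align_corners=False) for m in maps]
                x = F.relu(self.conv(torch.cat(maps, dim=1)))
                return self.fc(x.flatten(1))

        head = MovementHead()
        coarse = torch.randn(1, 1, 16, 16)   # low-resolution, semantic layer
        fine = torch.randn(1, 1, 31, 31)     # high-resolution, early layer
        dxdy = head([fine, coarse])          # predicted movement in pixels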

    Adversarial Feature Sampling Learning for Efficient Visual Tracking

    The tracking-by-detection framework usually consists of two stages: drawing samples around the target object in the first stage and classifying each sample as the target object or background in the second stage. Current popular trackers based on the tracking-by-detection framework typically draw samples in the raw image as the inputs of deep convolutional networks in the first stage, which usually results in a high computational burden and low running speed. In this paper, we propose a new visual tracking method that samples deep convolutional features to address this problem. Only one cropped image around the target object is input into the designed deep convolutional network, and the samples are drawn on the feature maps of the network by spatial bilinear resampling. In addition, a generative adversarial network is integrated into our network framework to augment positive samples and improve the tracking performance. Extensive experiments on benchmark datasets demonstrate that the proposed method achieves performance comparable to state-of-the-art trackers and effectively accelerates tracking-by-detection trackers based on raw-image samples.
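
    A hedged sketch of sampling candidates on feature maps rather than in the raw image: one forward pass over a single crop, then bilinear resampling of each candidate box from the shared feature map. torchvision's roi_align stands in for the paper's spatial bilinear resampling; the toy backbone and box values are assumptions.

        import torch
        import torch.nn as nn
        from torchvision.ops import roi_align

        backbone = nn.Sequential(            # toy feature extractor, stride 4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        search = torch.randn(1, 3, 224, 224)      # one crop around the target
        fmap = backbone(search)                   # (1, 64, 56, 56)

        # Candidate boxes in image coordinates: (batch_index, x1, y1, x2, y2).
        boxes = torch.tensor([[0., 80., 80., 140., 140.],
                              [0., 90., 85., 150., 145.],
                              [0., 70., 75., 130., 135.]])

        # Bilinear resampling of every candidate from the shared feature map;
        # spatial_scale maps image coordinates onto the stride-4 grid.
        samples = roi_align(fmap, boxes, output_size=(7, 7), spatial_scale=0.25)
        print(samples.shape)                      # torch.Size([3, 64, 7, 7])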

    Deep Learning of Appearance Models for Online Object Tracking

    This paper introduces a novel deep learning based approach for vision-based single-target tracking. We address this problem by proposing a network architecture which takes the input video frames and directly computes the tracking score for any candidate target location by estimating the probability distributions of the positive and negative examples. This is achieved by combining a deep convolutional neural network with a Bayesian loss layer in a unified framework. In order to deal with the limited number of positive training examples, the network is pre-trained offline for a generic image feature representation and then fine-tuned in multiple steps. An online fine-tuning step is carried out at every frame to learn the appearance of the target. We adopt a two-stage iterative algorithm to adaptively update the network parameters and maintain a probability density for target/non-target regions. The tracker has been tested on the standard tracking benchmark, and the results indicate that the proposed solution achieves state-of-the-art tracking results.
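
    A minimal NumPy sketch of the scoring idea: the tracking score of a candidate is the posterior probability of the target class under estimated positive/negative example distributions. The one-dimensional Gaussian densities and priors are illustrative assumptions, not the paper's learnt distributions.

        import numpy as np

        def gaussian_pdf(x, mu, sigma):
            return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

        # Online-estimated statistics of a scalar appearance feature for the
        # positive (target) and negative (background) example populations.
        mu_pos, sig_pos = 1.0, 0.3
        mu_neg, sig_neg = -0.5, 0.8
        p_pos, p_neg = 0.3, 0.7               # class priors

        def tracking_score(feature):
            lp = p_pos * gaussian_pdf(feature, mu_pos, sig_pos)
            ln = p_neg * gaussian_pdf(feature, mu_neg, sig_neg)
            return lp / (lp + ln)             # posterior P(target | feature)

        candidate_feats = np.array([0.9, 0.1, -0.4])
        best = candidate_feats[np.argmax(tracking_score(candidate_feats))]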

    Tracking in Aerial Hyperspectral Videos using Deep Kernelized Correlation Filters

    Hyperspectral imaging holds enormous potential to improve the state-of-the-art in aerial vehicle tracking with low spatial and temporal resolutions. Recently, adaptive multi-modal hyperspectral sensors have attracted growing interest due to their ability to record extended data quickly from aerial platforms. In this study, we apply popular concepts from traditional object tracking, namely (1) Kernelized Correlation Filters (KCF) and (2) deep convolutional neural network (CNN) features, to aerial tracking in the hyperspectral domain. We propose the Deep Hyperspectral Kernelized Correlation Filter based tracker (DeepHKCF) to efficiently track aerial vehicles using an adaptive multi-modal hyperspectral sensor. We address low temporal resolution by designing a single-KCF-in-multiple-Regions-of-Interest (ROIs) approach to cover a reasonably large area. To speed up the extraction of deep convolutional features from multiple ROIs, we design an effective ROI mapping strategy. The proposed tracker also provides the flexibility to couple with more advanced correlation filter trackers. The DeepHKCF tracker performs exceptionally well with deep features in a synthetic hyperspectral video generated by the Digital Imaging and Remote Sensing Image Generation (DIRSIG) software. Additionally, we generate a large, synthetic, single-channel dataset using DIRSIG to perform vehicle classification on the Wide Area Motion Imagery (WAMI) platform. This demonstrates the high fidelity of the DIRSIG software, and the large-scale aerial vehicle classification dataset is released to support studies on vehicle detection and tracking on the WAMI platform.
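
    A minimal NumPy sketch of the KCF building block the tracker extends: training and detection with a Gaussian-kernel correlation filter on a single-channel patch (one ROI). Patch contents and hyper-parameters are illustrative.

        import numpy as np

        def gauss_correlation(x, z, sigma=0.5):
            # Kernel correlation over all cyclic shifts, evaluated in the
            # Fourier domain -- the trick behind KCF's speed.
            c = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
            d = (x ** 2).sum() + (z ** 2).sum() - 2 * c
            return np.exp(-np.maximum(d, 0) / (sigma ** 2 * x.size))

        def train(x, y, lam=1e-4):
            # Dual coefficients in the Fourier domain.
            return np.fft.fft2(y) / (np.fft.fft2(gauss_correlation(x, x)) + lam)

        def detect(alpha_f, x, z):
            return np.real(np.fft.ifft2(alpha_f * np.fft.fft2(gauss_correlation(x, z))))

        n = 64
        gy, gx = np.mgrid[0:n, 0:n]
        y = np.exp(-((gy - n // 2) ** 2 + (gx - n // 2) ** 2) / 8.0)
        y = np.roll(y, (-(n // 2), -(n // 2)), axis=(0, 1))   # regression target, peak at (0, 0)

        x = np.random.randn(n, n)                 # template patch for one ROI
        alpha_f = train(x, y)
        z = np.roll(x, (3, 5), axis=(0, 1))       # same patch, shifted
        resp = detect(alpha_f, x, z)
        print(np.unravel_index(resp.argmax(), resp.shape))    # peak near (3, 5)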

    Transferring Rich Feature Hierarchies for Robust Visual Tracking

    Convolutional neural network (CNN) models have demonstrated great success in various computer vision tasks, including image classification and object detection. However, some equally important tasks, such as visual tracking, remain relatively unexplored. We believe that a major hurdle hindering the application of CNNs to visual tracking is the lack of properly labeled training data. While existing applications that liberate the power of CNNs often need an enormous amount of training data, on the order of millions of examples, visual tracking applications typically have only one labeled example in the first frame of each video. We address this research issue here by pre-training a CNN offline and then transferring the rich feature hierarchies learned to online tracking. The CNN is also fine-tuned during online tracking to adapt to the appearance of the tracked target specified in the first video frame. To fit the characteristics of object tracking, we first pre-train the CNN to recognize what an object is, and then propose to generate a probability map instead of producing a simple class label. Using two challenging open benchmarks for performance evaluation, our proposed tracker demonstrates substantial improvement over other state-of-the-art trackers.
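
    A hedged PyTorch sketch of producing a probability map instead of a class label: a fully convolutional head scores every location and is pre-trained with a per-pixel binary cross-entropy. The tiny backbone, shapes, and box-shaped target are illustrative assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        prob_head = nn.Conv2d(32, 1, 1)       # 1x1 conv -> per-pixel logit

        frame = torch.randn(1, 3, 128, 128)
        logits = prob_head(backbone(frame))   # (1, 1, 32, 32) map of logits

        # Pre-training target: 1 inside the labelled object box, 0 elsewhere.
        target = torch.zeros_like(logits)
        target[..., 12:20, 12:20] = 1.0
        loss = F.binary_cross_entropy_with_logits(logits, target)
        loss.backward()

        # At test time the map's peak localizes the target.
        prob = torch.sigmoid(logits)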

    Visual Object Tracking based on Adaptive Siamese and Motion Estimation Network

    Recently, convolutional neural networks (CNNs) have attracted much attention in different areas of computer vision due to their powerful abstract feature representation. Visual object tracking is one of the interesting and important areas in computer vision that has achieved remarkable improvements in recent years. In this work, we aim to improve both the motion and observation models in visual object tracking by leveraging the representation power of CNNs. To this end, a motion estimation network (named MEN) is utilized to seek the most likely locations of the target and provide a further cue in addition to the previous target position. Motion estimation is thus enhanced by generating a small number of candidates near two plausible positions. The generated candidates are then fed into a trained Siamese network to detect the most probable candidate. Each candidate is compared to an adaptable buffer, which is updated under a predefined condition. To take target appearance changes into account, a weighting CNN (called WCNN) adaptively assigns weights to the final similarity scores of the Siamese network using sequence-specific information. Evaluation results on well-known benchmark datasets (OTB100, OTB50 and OTB2013) show that the proposed tracker outperforms the state-of-the-art competitors. Comment: 28 pages, 1 algorithm, 7 figures, 2 tables. Submitted to Elsevier, Image and Vision Computing
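
    A hedged sketch of the candidate-generation step: samples are drawn around two plausible positions, the previous target location and the MEN's estimate, and the best-scoring candidate is kept. The Gaussian spreads and the stand-in scorer are illustrative assumptions; in the paper the scoring is done by the Siamese network.

        import numpy as np

        rng = np.random.default_rng(0)

        def candidates_around(center, n=8, spread=6.0):
            # Gaussian perturbations of a plausible target position (pixels).
            return center + rng.normal(scale=spread, size=(n, 2))

        prev_pos = np.array([120.0, 96.0])   # target position in the last frame
        men_pos = np.array([131.0, 99.0])    # position proposed by the MEN

        cands = np.vstack([candidates_around(prev_pos), candidates_around(men_pos)])

        def score(pos):
            # Stand-in for cropping a patch at `pos` and comparing it against
            # the template buffer with the Siamese network.
            return -np.linalg.norm(pos - men_pos)

        best = cands[np.argmax([score(c) for c in cands])]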

    Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network

    We propose an online visual tracking algorithm that learns a discriminative saliency map using a Convolutional Neural Network (CNN). Given a CNN pre-trained offline on a large-scale image repository, our algorithm takes outputs from hidden layers of the network as feature descriptors, since they show excellent representation performance in various general visual recognition problems. The features are used to learn discriminative target appearance models using an online Support Vector Machine (SVM). In addition, we construct a target-specific saliency map by backpropagating CNN features with the guidance of the SVM, and obtain the final tracking result in each frame based on the appearance model generatively constructed with the saliency map. Since the saliency map effectively visualizes the spatial configuration of the target, it improves target localization accuracy and enables us to achieve pixel-level target segmentation. We verify the effectiveness of our tracking algorithm through extensive experiments on a challenging benchmark, where our method shows outstanding performance compared to the state-of-the-art tracking algorithms.
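
    A hedged PyTorch sketch of the saliency-map construction: the SVM's decision value on CNN features is backpropagated to the input pixels, and the positive part of the gradient forms a target-specific saliency map. The backbone, SVM weights, and patch are illustrative assumptions.

        import torch
        import torch.nn as nn

        backbone = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        svm_w = torch.randn(32)                # weights of the online linear SVM
        svm_b = torch.tensor(0.1)

        patch = torch.randn(1, 3, 64, 64, requires_grad=True)
        score = backbone(patch) @ svm_w + svm_b    # SVM decision value
        score.backward()

        # Positive input gradients mark pixels supporting the target class.
        saliency = patch.grad.clamp(min=0).sum(dim=1)   # (1, 64, 64)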

    Once for All: a Two-flow Convolutional Neural Network for Visual Tracking

    One of the main challenges of visual object tracking comes from the arbitrary appearance of objects. Most existing algorithms try to resolve this problem as an object-specific task, i.e., the model is trained to regenerate or classify a specific object. As a result, the model needs to be initialized and retrained for different objects. In this paper, we propose a more generic approach utilizing a novel two-flow convolutional neural network (named YCNN). The YCNN takes two inputs (one is an object image patch, the other a search image patch), then outputs a response map which predicts how likely the object is to appear at each location. Unlike object-specific approaches, the YCNN is trained to measure the similarity between two image patches, so it is not confined to any specific object. Furthermore, the network can be trained end-to-end to extract both shallow and deep convolutional features dedicated to visual tracking. Once properly trained, the YCNN can be applied to track all kinds of objects without further training or updating. Benefiting from this once-for-all model, our algorithm runs at a very high speed of 45 frames per second. Experiments on 51 sequences also show that our algorithm achieves outstanding performance.
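
    A hedged PyTorch sketch of the two-flow idea: both patches pass through an embedding network, and cross-correlating the object embedding with the search embedding yields the response map. The shared toy embedding is an illustrative stand-in for the YCNN's branches.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        embed = nn.Sequential(                 # shared weights for both flows
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )

        obj = torch.randn(1, 3, 64, 64)        # object image patch
        search = torch.randn(1, 3, 128, 128)   # search image patch

        obj_f = embed(obj)                     # (1, 64, 16, 16)
        search_f = embed(search)               # (1, 64, 32, 32)

        # The object embedding acts as a correlation kernel over the search
        # embedding; the response peak predicts where the object appears.
        resp = F.conv2d(search_f, obj_f)       # (1, 1, 17, 17) response map
        iy, ix = divmod(resp.argmax().item(), resp.shape[-1])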