Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking
Discriminative Correlation Filters (DCF) have demonstrated excellent
performance for visual object tracking. The key to their success is the ability
to efficiently exploit available negative data by including all shifted
versions of a training sample. However, the underlying DCF formulation is
restricted to single-resolution feature maps, significantly limiting its
potential. In this paper, we go beyond the conventional DCF framework and
introduce a novel formulation for training continuous convolution filters. We
employ an implicit interpolation model to pose the learning problem in the
continuous spatial domain. Our proposed formulation enables efficient
integration of multi-resolution deep feature maps, leading to superior results
on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color
(+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate).
Additionally, our approach is capable of sub-pixel localization, crucial for
the task of accurate feature point tracking. We also demonstrate the
effectiveness of our learning formulation in extensive feature point tracking
experiments. Code and supplementary material are available at
http://www.cvl.isy.liu.se/research/objrec/visualtracking/conttrack/index.html. Comment: Accepted at ECCV 2016.
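The paper's continuous formulation is solved with an implicit interpolation model in the Fourier domain; purely as a simplified, hypothetical illustration of the problem it addresses, the sketch below resamples feature maps of different resolutions onto one common grid with cubic interpolation (the function name and interface are assumptions, not the authors' code).

```python
import numpy as np
from scipy import ndimage

def resample_to_common_grid(feature_maps, out_size):
    """Resample feature maps of different spatial resolutions onto one common grid
    (cubic interpolation), so shallow and deep layers can be combined jointly.
    This is only a simplified stand-in for the paper's implicit continuous-domain
    interpolation model."""
    resampled = []
    for fmap in feature_maps:                               # each fmap: (H_l, W_l, C_l)
        zoom = (out_size[0] / fmap.shape[0], out_size[1] / fmap.shape[1], 1.0)
        resampled.append(ndimage.zoom(fmap, zoom, order=3))
    return np.concatenate(resampled, axis=2)                # (H, W, total channels)
```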
Learn what matters: cross-domain imitation learning with task-relevant embeddings
We study how an autonomous agent learns to perform a task from demonstrations in a different domain, such as a different environment or different agent. Such cross-domain imitation learning is required to, for example, train an artificial agent from demonstrations of a human expert. We propose a scalable framework that enables cross-domain imitation learning without access to additional demonstrations or further domain knowledge. We jointly train the learner agent's policy and learn a mapping between the learner and expert domains with adversarial training. We effect this by using a mutual information criterion to find an embedding of the expert's state space that contains task-relevant information and is invariant to domain specifics. This step significantly simplifies estimating the mapping between the learner and expert domains and hence facilitates end-to-end learning. We demonstrate successful transfer of policies between considerably different domains, without extra supervision such as additional demonstrations, and in situations where other methods fail.
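The abstract does not specify the mutual information estimator. Purely as an illustrative assumption, a contrastive (InfoNCE-style) lower bound is one common way to encourage an embedding to retain task-relevant information; a minimal PyTorch sketch with hypothetical tensor names follows.

```python
import torch
import torch.nn.functional as F

def infonce_lower_bound(z_state, z_task, temperature=0.1):
    """Contrastive lower bound on the mutual information between state embeddings and
    a task-relevant signal: matching (state, task) pairs lie on the diagonal and act
    as positives, all other pairs in the batch act as negatives."""
    z_state = F.normalize(z_state, dim=1)                    # (B, D)
    z_task = F.normalize(z_task, dim=1)                      # (B, D)
    logits = z_state @ z_task.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(z_state.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)                   # minimise to maximise the MI bound
```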
Long-Term Visual Object Tracking Benchmark
We propose a new long video dataset (called Track Long and Prosper - TLP) and
benchmark for single object tracking. The dataset consists of 50 HD videos from
real world scenarios, encompassing a duration of over 400 minutes (676K
frames), making it more than 20 times larger in average duration per sequence
and more than 8 times larger in total covered duration than
existing generic datasets for visual tracking. The proposed dataset paves the way
to suitably assess long-term tracking performance and to train better deep
learning architectures (avoiding/reducing augmentation, which may not reflect
real-world behaviour). We benchmark the dataset on 17 state-of-the-art trackers
and rank them according to tracking accuracy and run-time speed. We further
present a thorough qualitative and quantitative evaluation highlighting the
importance of the long-term aspect of tracking. Our most interesting observations
are that (a) existing short-sequence benchmarks fail to bring out the inherent
differences between tracking algorithms, which widen when tracking on long
sequences, and (b) the accuracy of trackers drops abruptly on challenging long
sequences, suggesting the need for further research efforts in the direction
of long-term tracking. Comment: ACCV 2018 (Oral).
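As a minimal sketch of how such a ranking could be reproduced (the threshold and interface below are assumptions, not the benchmark's official protocol), trackers can be scored by a success rate over frame-wise overlaps and then ordered by accuracy and speed:

```python
import numpy as np

def success_rate(ious, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds the threshold."""
    return float(np.mean(np.asarray(ious) > threshold))

def rank_trackers(results):
    """results: {tracker name: (list of per-frame IoUs, frames per second)}.
    Returns (name, success rate, fps) tuples sorted by accuracy, then speed."""
    table = [(name, success_rate(ious), fps) for name, (ious, fps) in results.items()]
    return sorted(table, key=lambda row: (row[1], row[2]), reverse=True)
```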
Long-term Tracking in the Wild: A Benchmark
We introduce the OxUvA dataset and benchmark for evaluating single-object
tracking algorithms. Benchmarks have enabled great strides in the field of
object tracking by defining standardized evaluations on large sets of diverse
videos. However, these works have focused exclusively on sequences that are
just tens of seconds in length and in which the target is always visible.
Consequently, most researchers have designed methods tailored to this
"short-term" scenario, which is poorly representative of practitioners' needs.
Aiming to address this disparity, we compile a long-term, large-scale tracking
dataset of sequences with average length greater than two minutes and with
frequent target object disappearance. The OxUvA dataset is much larger than the
object tracking datasets of recent years: it comprises 366 sequences spanning
14 hours of video. We assess the performance of several algorithms, considering
both the ability to locate the target and to determine whether it is present or
absent. Our goal is to offer the community a large and diverse benchmark to
enable the design and evaluation of tracking methods ready to be used "in the
wild". The project website is http://oxuva.netComment: To appear at ECCV 201
Siamese network based features fusion for adaptive visual tracking
Visual object tracking is a popular but challenging problem in computer vision. The main challenge is the lack of prior knowledge about the tracking target, which may be specified only by a bounding box in the first frame. In addition, tracking suffers from many influences such as scale variation, deformation, partial occlusion, and motion blur. To solve such a challenging problem, a tracking framework is needed that can adapt to different tracking scenes. This paper presents a novel approach for robust visual object tracking by fusing multiple features in a Siamese network. Hand-crafted appearance features and CNN features are combined to compensate for each other's shortcomings and reinforce each other's advantages. The proposed network proceeds as follows. First, different features are extracted from the tracking frames. Second, each extracted feature is used with a correlation filter to learn a corresponding template, which generates a response map. Finally, the multiple response maps are fused into a better response map, which helps locate the target more accurately. Comprehensive experiments are conducted on three benchmarks: Temple-Color, OTB50 and UAV123. Experimental results demonstrate that the proposed approach achieves state-of-the-art performance on these benchmarks.
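As a hedged, simplified sketch of the fusion step (template learning is omitted and all names are illustrative, not the authors' code), per-feature correlation responses can be computed in the Fourier domain and combined by a weighted sum whose maximum gives the target location:

```python
import numpy as np

def correlation_response(template, search):
    """Per-channel cross-correlation of a template with a same-sized search-region
    feature map, computed in the Fourier domain and summed over channels."""
    T = np.fft.fft2(template, axes=(0, 1))
    S = np.fft.fft2(search, axes=(0, 1))
    return np.real(np.fft.ifft2(np.conj(T) * S, axes=(0, 1))).sum(axis=2)

def fuse_responses(responses, weights):
    """Weighted fusion of response maps from different features (e.g. hand-crafted and
    CNN); the maximum of the fused map is taken as the estimated target position."""
    fused = sum(w * r / (np.abs(r).max() + 1e-12) for w, r in zip(weights, responses))
    return fused, np.unravel_index(np.argmax(fused), fused.shape)
```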
Measuring the Accuracy of Object Detectors and Trackers
The accuracy of object detectors and trackers is most commonly evaluated by
the Intersection over Union (IoU) criterion. To date, most approaches are
restricted to axis-aligned or oriented boxes and, as a consequence, many
datasets are only labeled with boxes. Nevertheless, axis-aligned or oriented
boxes cannot accurately capture an object's shape. To address this, a number of
densely segmented datasets have started to emerge in both the object detection
and the object tracking communities. However, evaluating the accuracy of object
detectors and trackers that are restricted to boxes on densely segmented data
is not straightforward. To close this gap, we introduce the relative
Intersection over Union (rIoU) accuracy measure. The measure normalizes the IoU
with the optimal box for the segmentation to generate an accuracy measure that
ranges between 0 and 1 and allows a more precise measurement of accuracies.
Furthermore, it enables an efficient and easy way to understand scenes and the
strengths and weaknesses of an object detection or tracking approach. We
show how the new measure can be efficiently calculated and present an
easy-to-use evaluation framework. The framework is tested on the DAVIS and the
VOT2016 segmentations and has been made available to the community. Comment: 10 pages, 7 figures.
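A minimal sketch of the normalisation idea follows; finding the truly IoU-optimal box requires a search that is omitted here and supplied as a precomputed value, and the helper names are assumptions rather than the authors' implementation.

```python
import numpy as np

def box_mask_iou(box, mask):
    """IoU between an axis-aligned box (x0, y0, x1, y1), in pixel coordinates,
    and a binary segmentation mask."""
    x0, y0, x1, y1 = box
    box_area = max(0, x1 - x0) * max(0, y1 - y0)
    inside = int(mask[y0:y1, x0:x1].sum())
    union = box_area + int(mask.sum()) - inside
    return inside / union if union > 0 else 0.0

def relative_iou(pred_box, mask, optimal_box_iou):
    """rIoU: IoU of the predicted box with the segmentation, normalised by the best IoU
    any axis-aligned box can achieve, so that the best possible box scores 1."""
    return box_mask_iou(pred_box, mask) / optimal_box_iou
```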
3D Hand Movement Measurement Framework for Studying Human-Computer Interaction
In order to develop better touch and gesture user interfaces, it is important to be able to measure how humans move their hands while interacting with technical devices. Recent advances in high-speed imaging technology and in image-based object tracking techniques have made it possible to accurately measure hand movement from videos without the need for data gloves or other sensors that would limit natural hand movements. In this paper, we propose a complete framework to measure hand movements in 3D in human-computer interaction situations. The framework covers composing the measurement setup, selecting the object tracking methods, post-processing the motion trajectories, reconstructing the 3D trajectories, and characterizing and visualizing the movement data. We demonstrate the framework in a context where 3D touch screen usability is studied with 3D stimuli.
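For the 3D trajectory reconstruction step, a standard two-view linear triangulation is one plausible building block; the sketch below is a generic DLT implementation (not the authors' code) that recovers a 3D point from matched 2D track positions in two calibrated cameras.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single 3D point from two calibrated views.
    P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates of the
    tracked point in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]    # homogeneous -> Euclidean coordinates
```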
Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers
This paper improves state-of-the-art visual object trackers that use online
adaptation. Our core contribution is an offline meta-learning-based method to
adjust the initial deep networks used in online adaptation-based tracking. The
meta-learning is driven by the goal of obtaining deep networks that can quickly be
adapted to robustly model a particular target in future frames. Ideally, the
resulting models focus on features that are useful for future frames, and avoid
overfitting to background clutter, small parts of the target, or noise. By
enforcing a small number of update iterations during meta-learning, the
resulting networks train significantly faster. We demonstrate this approach on
top of two high-performance tracking approaches: the tracking-by-detection-based
MDNet and the correlation-based CREST. Experimental results on standard
benchmarks, OTB2015 and VOT2016, show that our meta-learned versions of both
trackers improve speed, accuracy, and robustness. Comment: Code: https://github.com/silverbottlep/meta_tracker
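A rough sketch of an offline meta-update loop in this spirit (MAML-style) is shown below; the `model.loss(batch, params=...)` functional interface is a hypothetical assumption, not the released code.

```python
import torch

def meta_iteration(model, meta_optimizer, tasks, inner_lr=1e-2, inner_steps=1):
    """One meta-iteration: adapt a copy of the initial weights for a small, fixed
    number of steps on each tracking sequence, then update the initialisation from
    the post-adaptation loss so that it becomes quick to adapt."""
    meta_optimizer.zero_grad()
    for support_batch, query_batch in tasks:
        fast = {n: p.clone() for n, p in model.named_parameters()}
        for _ in range(inner_steps):                         # small number of update iterations
            loss = model.loss(support_batch, params=fast)    # assumed functional interface
            grads = torch.autograd.grad(loss, list(fast.values()), create_graph=True)
            fast = {n: w - inner_lr * g for (n, w), g in zip(fast.items(), grads)}
        model.loss(query_batch, params=fast).backward()      # gradient flows to the initial weights
    meta_optimizer.step()
```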