28 research outputs found

Visual object tracking performance measures revisited

Author: Kristan Matej
Leonardis Aleš
Čehovin Luka
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Visual tracking evaluation uses a large variety of performance measures and suffers from a lack of consensus about which measures should be used in experiments. This makes cross-paper tracker comparison difficult. Furthermore, as some measures may be less effective than others, the tracking results may be skewed or biased towards particular tracking aspects. In this paper we revisit the popular performance measures and tracker performance visualizations and analyze them theoretically and experimentally. We show that several measures are equivalent in terms of the information they provide for tracker comparison and, crucially, that some are more brittle than others. Based on our analysis we narrow down the set of potential measures to only two complementary ones, describing accuracy and robustness, thus pushing towards homogenization of the tracker evaluation methodology. These two measures can be intuitively interpreted and visualized and have been employed by the recent Visual Object Tracking (VOT) challenges as the foundation for the evaluation methodology.
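
As a rough illustration of the two measures named in this abstract, the sketch below computes accuracy as the average bounding-box overlap on frames where the tracker still overlaps the target, and robustness as a count of failures. It is a simplified sketch under stated assumptions, not the paper's or the VOT toolkit's implementation: the function names, the `(x, y, width, height)` box format, and the rule that every zero-overlap frame counts as one failure are all assumptions made here (reset-based protocols reinitialize the tracker after a failure, which this sketch does not model).

```python
from typing import List, Optional, Tuple

# Assumed box format: (x, y, width, height), axis-aligned.
Box = Tuple[float, float, float, float]


def overlap(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_robustness(
    predictions: List[Optional[Box]], ground_truth: List[Box]
) -> Tuple[float, int]:
    """Return (average overlap on non-failed frames, number of failures).

    Simplification (assumption): a frame with zero overlap, or with no
    prediction at all, is counted as one failure; no reinitialization
    is simulated.
    """
    overlaps = []
    failures = 0
    for pred, gt in zip(predictions, ground_truth):
        iou = overlap(pred, gt) if pred is not None else 0.0
        if iou == 0.0:
            failures += 1
        else:
            overlaps.append(iou)
    accuracy = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return accuracy, failures


# Hypothetical three-frame sequence with a static target.
gt = [(0.0, 0.0, 10.0, 10.0)] * 3
pred = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0), (20.0, 20.0, 10.0, 10.0)]
acc, fails = accuracy_and_robustness(pred, gt)
print(f"accuracy={acc:.3f}, failures={fails}")
```

Keeping the two numbers separate, rather than folding them into one score, reflects the abstract's point that accuracy and robustness are complementary.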

arXiv.org e-Print Archive

Crossref

University of Birmingham Research Portal

Beyond standard benchmarks: Parameterizing performance evaluation in visual object tracking

Author: Kristan Matej
Leonardis Aleš
Lukežič Alan
Zajc Luka Čehovin
Publication date: 25/03/2017

Object-to-camera motion produces a variety of apparent motion patterns that significantly affect the performance of short-term visual trackers. Although these patterns are crucial for designing robust trackers, their influence is poorly explored in standard benchmarks due to weakly defined, biased and overlapping attribute annotations. In this paper we propose to go beyond pre-recorded benchmarks with post-hoc annotations by presenting an approach that utilizes omnidirectional videos to generate realistic, consistently annotated, short-term tracking scenarios with exactly parameterized motion patterns. We have created an evaluation system and constructed a fully annotated dataset of omnidirectional videos together with generators for typical motion patterns. We provide an in-depth analysis of major tracking paradigms which is complementary to the standard benchmarks and confirms the expressiveness of our evaluation approach.

arXiv.org e-Print Archive

Crossref

University of Birmingham Research Portal

Understanding and Diagnosing Visual Tracking Systems

Author: Jia Jiaya
Shi Jianping
Wang Naiyan
Yeung Dit-Yan
Publication date: 23/04/2015

Several benchmark datasets for visual tracking research have been proposed in recent years. Despite their usefulness, whether they are sufficient for understanding and diagnosing the strengths and weaknesses of different trackers remains questionable. To address this issue, we propose a framework that breaks a tracker down into five constituent parts, namely, motion model, feature extractor, observation model, model updater, and ensemble post-processor. We then conduct ablative experiments on each component to study how it affects the overall result. Surprisingly, our findings are at odds with some common beliefs in the visual tracking research community. We find that the feature extractor plays the most important role in a tracker. On the other hand, although the observation model is the focus of many studies, we find that it often brings no significant improvement. Moreover, the motion model and model updater contain many details that could affect the result. Also, the ensemble post-processor can improve the result substantially when the constituent trackers have high diversity. Based on our findings, we put together some very elementary building blocks to give a basic tracker which is competitive in performance with state-of-the-art trackers. We believe our framework can provide a solid baseline when conducting controlled experiments for visual tracking research.

arXiv.org e-Print Archive

Crossref

The visual object tracking VOT2016 challenge results

Author: Alatan Aydin
Bastos Guilherme
Battistone Francesco
Bischof Horst
Bunyak Filiz
Chang Chang-Ming
Chen Dapeng
Du Dawei
Erdem Aykut
Erdem Erkut
Felsberg Michael
Fernández Gustavo
Garcia-Martin Alvaro
Ghanem Bernard
Gundogdu Erhan
Gupta Abhinav
Han Bohyung
Henriques João F
Häger Gustav
Khan Fahad
Kim Daijin
Kristan Matej
Leonardis Aleš
Li Hongdong
Liu Bin
Lukežič Alan
Ma Andy J
Martinez Brais
Matas Jiří
Medeiros Henry
Memarmoghadam Alireza
Mishra Deepak
Petrosino Alfredo
Pflugfelder Roman
Porikli Fatih
Possegger Horst
Qi Honggang
Robinson Andreas
Roffo Giorgio
Seetharaman Guna
Solís Montero Andrés
Subrahmanyam Gorthi RK Sai
Sun Chong
Torr Philip HS
Varfolomieiev Anton
Vedaldi Andrea
Vojíř Tomáš
Xu Changsheng
Yeung Dit-Yan
Zhao Fei
Zhu Gao
Čehovin Luka
Publication venue: Springer
Publication date: 03/11/2016

The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers having been published at major computer vision conferences and journals in recent years. The number of tested state-of-the-art trackers makes VOT2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).

Oxford University Research Archive

Deformable Object Tracking with Gated Fusion

Author: Chen Dengsheng
Hancke Gerhard P.
He Shengfeng
Lau Rynson W. H.
Liu Wenxi
Song Yibing
Yan Tao
Yu Yuanlong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/04/2019

The tracking-by-detection framework has received growing attention through its integration with Convolutional Neural Networks (CNNs). Existing tracking-by-detection methods, however, fail to track objects with severe appearance variations. This is because the traditional convolutional operation is performed on fixed grids, and thus may not be able to find the correct response while the object is changing pose or under varying environmental conditions. In this paper, we propose a deformable convolution layer to enrich the target appearance representations in the tracking-by-detection framework. We aim to capture the target appearance variations via deformable convolution, which adaptively enhances its original features. In addition, we propose a gated fusion scheme to control how the variations captured by the deformable convolution affect the original appearance. The feature representation enriched through deformable convolution helps the CNN classifier discriminate between the target object and the background. Extensive experiments on the standard benchmarks show that the proposed tracker performs favorably against state-of-the-art methods.

arXiv.org e-Print Archive

Institutional Knowledge at Singapore Management University

Perception Test: A Diagnostic Benchmark for Multimodal Video Models

Author: Carriera Joao
Damen Dima
Patraucean Viorica
Zisserman Andrew
Publication date: 16/12/2023

We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a significant gap in performance (91.4% vs 45.8%), suggesting that there is significant room for improvement in multimodal video understanding. Dataset, baselines code, and challenge server are available at https://github.com/deepmind/perception_tes

Explore Bristol Research