Object Tracking by Reconstruction with View-Specific Discriminative Correlation Filters
Standard RGB-D trackers treat the target as an inherently 2D structure, which
makes modelling appearance changes related even to simple out-of-plane rotation
highly challenging. We address this limitation by proposing a novel long-term
RGB-D tracker - Object Tracking by Reconstruction (OTR). The tracker performs
online 3D target reconstruction to facilitate robust learning of a set of
view-specific discriminative correlation filters (DCFs). The 3D reconstruction
supports two performance-enhancing features: (i) generation of accurate spatial
support for constrained DCF learning from its 2D projection and (ii) point
cloud based estimation of 3D pose change for selection and storage of
view-specific DCFs which are used to robustly localize the target after
out-of-view rotation or heavy occlusion. Extensive evaluation of OTR on the
challenging Princeton RGB-D tracking and STC Benchmarks shows it outperforms
the state of the art by a large margin.
Utilization and experimental evaluation of occlusion aware kernel correlation filter tracker using RGB-D
Unlike deep learning, which requires large training datasets, correlation filter-based trackers such as the Kernelized Correlation Filter (KCF) use implicit properties of tracked images (circulant matrices) for training in real time. Despite their practical application in tracking, a better theoretical, mathematical, and experimental understanding of the fundamentals of KCF is still needed. This thesis first details a working prototype of the tracker and investigates its effectiveness in real-time applications, with supporting visualizations. We further address some of the drawbacks of the tracker in cases of occlusion, scale change, object rotation, out-of-view targets, and model drift with our novel RGB-D Kernel Correlation tracker. We also study the use of a particle filter to improve trackers' accuracy. Our results are experimentally evaluated using a) a standard dataset and b) real-time input from a Microsoft Kinect V2 sensor. We believe this work will set the basis for better understanding the effectiveness of kernel-based correlation filter trackers and for further defining some of their possible advantages in tracking.
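As a rough illustration of the circulant-matrix property mentioned in this abstract, the following sketch trains a 1D correlation filter by ridge regression in the Fourier domain and uses it to recover a cyclic shift. It is a linear-kernel simplification written for this listing, not the thesis's tracker: the function names, the toy signal, the Gaussian label width, and the regularisation value are all assumptions, and a naive DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def gaussian_label(n, sigma=1.0):
    """Regression target peaked at index 0, wrapping around cyclically."""
    return [math.exp(-0.5 * (min(i, n - i) / sigma) ** 2) for i in range(n)]


def cyclic_shift(x, s):
    """Shift x to the right by s samples with wrap-around."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def train_filter(x, y, lam=1e-4):
    """Ridge regression over all cyclic shifts of x, solved per frequency.

    With a linear kernel the dual coefficients are
    alpha_hat = y_hat / (|x_hat|^2 + lam), an element-wise division.
    """
    x_hat, y_hat = dft(x), dft(y)
    return [yh / (abs(xh) ** 2 + lam) for xh, yh in zip(x_hat, y_hat)]


def locate(alpha_hat, x, z):
    """Return the index of the peak response of the filter on search signal z."""
    x_hat, z_hat = dft(x), dft(z)
    response_hat = [
        xh.conjugate() * zh * ah for xh, zh, ah in zip(x_hat, z_hat, alpha_hat)
    ]
    response = [v.real for v in dft(response_hat, inverse=True)]
    return max(range(len(response)), key=response.__getitem__)


# Toy example (hypothetical data): the target moves 3 samples to the right.
template = [4.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
alpha_hat = train_filter(template, gaussian_label(len(template)))
search = cyclic_shift(template, 3)
print(locate(alpha_hat, template, search))  # prints 3
```

Because every cyclic shift of the template is an implicit training sample, the regression reduces to one division per frequency, which is what makes this family of trackers cheap enough to train in real time.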
Visual Prompt Multi-Modal Tracking
Visible-modal object tracking gives rise to a series of downstream
multi-modal tracking tributaries. To inherit the powerful representations of
the foundation model, a natural modus operandi for multi-modal tracking is full
fine-tuning on the RGB-based parameters. Albeit effective, this manner is not
optimal due to the scarcity of downstream data and poor transferability, etc.
In this paper, inspired by the recent success of the prompt learning in
language models, we develop Visual Prompt multi-modal Tracking (ViPT), which
learns the modal-relevant prompts to adapt the frozen pre-trained foundation
model to various downstream multimodal tracking tasks. ViPT finds a better way
to stimulate the knowledge of the RGB-based model that is pre-trained at scale,
meanwhile only introducing a few trainable parameters (less than 1% of model
parameters). ViPT outperforms the full fine-tuning paradigm on multiple
downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event
tracking. Extensive experiments show the potential of visual prompt learning
for multi-modal tracking, and ViPT can achieve state-of-the-art performance
while satisfying parameter efficiency. Code and models are available at
https://github.com/jiawen-zhu/ViPT.
Comment: Accepted by CVPR202
DAL: A Deep Depth-Aware Long-term Tracker
The best RGBD trackers provide high accuracy but are slow to run. On the other hand, the best RGB trackers are fast but clearly inferior on the RGBD datasets. In this work, we propose a deep depth-aware long-term tracker that achieves state-of-the-art RGBD tracking performance and is fast to run. We reformulate the deep discriminative correlation filter (DCF) to embed the depth information into deep features. Moreover, the same depth-aware correlation filter is used for target redetection. Comprehensive evaluations show that the proposed tracker achieves state-of-the-art performance on the Princeton RGBD, STC, and the newly released CDTB benchmarks and runs at 20 fps.
Vision and Depth Based Computerized Anthropometry and Object Tracking
The thesis has two interconnected parts: computerized anthropometry and RGBD (RGB plus Depth) object tracking. In the first part, we start from the mathematical representation of the human body shape model and briefly introduce prior art, from classic human body models to the latest deep neural network based approaches. We describe the performance metrics and popular datasets for evaluating computerized anthropometry estimation algorithms in a unified setting. Our contributions in this part cover two aspects of human body anthropometry research: 1) a statistical method for estimating anthropometric measurements from scans, and 2) a deep neural network based solution for learning anthropometric measurements from binary silhouettes. We also release two body shape datasets to support data-driven learning methods.
In the second part of this thesis, we explore RGBD object tracking. We start from the current state of RGBD tracking compared to RGB tracking and briefly introduce prior art, from methods based on engineered features to deep neural network based methods. We present three deep learning based methods that integrate deep depth features into RGBD object tracking. We also release a unified RGBD tracking benchmark for data-driven RGBD tracking algorithms. Finally, we explore RGBD tracking with deep depth features and demonstrate that depth cues significantly benefit target model learning.
SALSA: A Novel Dataset for Multimodal Group Behavior Analysis
Studying free-standing conversational groups (FCGs) in unstructured social
settings (e.g., a cocktail party) is gratifying due to the wealth of information
available at the group (mining social networks) and individual (recognizing
native behavioral and personality traits) levels. However, analyzing social
scenes involving FCGs is also highly challenging due to the difficulty in
extracting behavioral cues such as target locations, their speaking activity
and head/body pose due to crowdedness and presence of extreme occlusions. To
this end, we propose SALSA, a novel dataset facilitating multimodal and
Synergetic sociAL Scene Analysis, and make two main contributions to research
on automated social interaction analysis: (1) SALSA records social interactions
among 18 participants in a natural, indoor environment for over 60 minutes,
under the poster presentation and cocktail party contexts presenting
difficulties in the form of low-resolution images, lighting variations,
numerous occlusions, reverberations and interfering sound sources; (2) To
alleviate these problems we facilitate multimodal analysis by recording the
social interplay using four static surveillance cameras and sociometric badges
worn by each participant, comprising microphone, accelerometer, Bluetooth
and infrared sensors. In addition to raw data, we also provide annotations
concerning individuals' personality as well as their position, head, body
orientation and F-formation information over the entire event duration. Through
extensive experiments with state-of-the-art approaches, we show (a) the
limitations of current methods and (b) how the recorded multiple cues
synergetically aid automatic analysis of social interactions. SALSA is
available at http://tev.fbk.eu/salsa.
Comment: 14 pages, 11 figures