LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking
In this paper, we propose a novel effective light-weight framework, called
LightTrack, for online human pose tracking. The proposed framework is designed
to be generic for top-down pose tracking and is faster than existing online and
offline methods. Single-person Pose Tracking (SPT) and Visual Object Tracking
(VOT) are incorporated into one unified functioning entity, easily implemented
by a replaceable single-person pose estimation module. Our framework unifies
single-person pose tracking with multi-person identity association and sheds
first light upon bridging keypoint tracking with object tracking. We also
propose a Siamese Graph Convolution Network (SGCN) for human pose matching as a
Re-ID module in our pose tracking system. In contrast to other Re-ID modules,
we use a graphical representation of human joints for matching. The
skeleton-based representation effectively captures human pose similarity and is
computationally inexpensive. It is also robust to sudden camera shifts that
cause tracked humans to drift. To the best of our knowledge, this is the first
paper to propose an online human pose tracking framework in a top-down fashion.
The proposed framework is general enough to fit other pose estimators and
candidate matching mechanisms. Our method outperforms other online methods
while maintaining a much higher frame rate, and is very competitive with the
offline state-of-the-art. We make the code publicly available at:
https://github.com/Guanghan/lighttrack
Comment: 9 pages, 6 figures, 6 tables
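As an illustration of the skeleton-based Re-ID idea, the sketch below implements a small Siamese graph-convolution matcher over pose keypoints in PyTorch. The joint count, skeleton topology, layer sizes, and cosine-similarity scoring are illustrative assumptions, not the paper's actual SGCN configuration.

```python
# Minimal sketch of a Siamese graph-convolution matcher for skeletons (PyTorch).
# Joint count, adjacency, and layer sizes are illustrative, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_JOINTS = 15  # assumed skeleton size

def normalized_adjacency(edges, n=NUM_JOINTS):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A = torch.eye(n)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

class GCNEncoder(nn.Module):
    """Two graph-convolution layers followed by joint-wise mean pooling."""
    def __init__(self, adj, in_dim=2, hidden=64, out_dim=128):
        super().__init__()
        self.register_buffer("adj", adj)
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):               # x: (batch, joints, 2) normalized coordinates
        h = F.relu(self.adj @ self.fc1(x))
        h = F.relu(self.adj @ self.fc2(h))
        return h.mean(dim=1)            # (batch, out_dim) pose embedding

class SiamesePoseMatcher(nn.Module):
    """Shared encoder; cosine similarity as the pose-matching score."""
    def __init__(self, adj):
        super().__init__()
        self.encoder = GCNEncoder(adj)

    def forward(self, pose_a, pose_b):
        za, zb = self.encoder(pose_a), self.encoder(pose_b)
        return F.cosine_similarity(za, zb, dim=-1)

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3)]     # toy skeleton topology
    matcher = SiamesePoseMatcher(normalized_adjacency(edges))
    a, b = torch.rand(1, NUM_JOINTS, 2), torch.rand(1, NUM_JOINTS, 2)
    print(matcher(a, b).item())          # similarity score in [-1, 1]
```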
Horizontal-to-Vertical Video Conversion
With the prevalence of mobile video, the general public increasingly consumes
vertical videos on hand-held devices. To revitalize the exposure of
horizontal contents, we hereby set forth the exploration of automated
horizontal-to-vertical (abbreviated as H2V) video conversion with our proposed
H2V framework, accompanied by an accurately annotated H2V-142K dataset.
Concretely, H2V framework integrates video shot boundary detection, subject
selection and multi-object tracking to facilitate the subject-preserving
conversion, wherein the key step is subject selection. To this end, we propose a
Rank-SS module that detects human objects, then selects the subject-to-preserve
by exploiting location, appearance, and saliency cues. Afterward, the framework
automatically crops the video around the subject to produce vertical contents
from horizontal sources. To build and evaluate our H2V framework, H2V-142K
dataset is densely annotated with subject bounding boxes for 125 videos with
132K frames and 9,500 video covers, upon which we demonstrate superior subject
selection performance compared to traditional saliency-based approaches, and exhibit
promising horizontal-to-vertical conversion performance overall. By publicizing
this dataset as well as our approach, we wish to pave the way for more valuable
endeavors on the horizontal-to-vertical video conversion task.
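The subject-preserving cropping step can be illustrated with a short sketch: given the selected subject's bounding box, compute a vertical window centred on the subject and clamped inside the frame. The 9:16 target ratio and the clamping logic are assumptions for illustration, not the H2V framework's exact procedure.

```python
# Sketch of the subject-preserving crop applied after subject selection:
# given a horizontal frame and the selected subject's box, compute a
# 9:16 vertical crop centred on the subject. The aspect ratio is illustrative.

def vertical_crop_window(frame_w, frame_h, subject_box, target_ratio=9 / 16):
    """Return (x0, y0, x1, y1) of a vertical crop keeping the full frame height."""
    x0, y0, x1, y1 = subject_box                      # subject bounding box
    crop_w = int(round(frame_h * target_ratio))       # width of the vertical window
    crop_w = min(crop_w, frame_w)                     # cannot exceed source width
    cx = (x0 + x1) / 2.0                              # centre the crop on the subject
    left = int(round(cx - crop_w / 2.0))
    left = max(0, min(left, frame_w - crop_w))        # clamp inside the frame
    return left, 0, left + crop_w, frame_h

if __name__ == "__main__":
    # 1920x1080 horizontal frame, subject detected on the right third
    print(vertical_crop_window(1920, 1080, (1300, 200, 1500, 900)))
```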
Tracking by Prediction: A Deep Generative Model for Multi-Person Localisation and Tracking
Current multi-person localisation and tracking systems rely heavily on
appearance models for target re-identification, and almost no
approaches employ a complete deep learning solution for both objectives. We
present a novel, complete deep learning framework for multi-person localisation
and tracking. In this context, we first introduce a light-weight sequential
Generative Adversarial Network architecture for person localisation, which
overcomes issues related to occlusions and noisy detections, typically found in
a multi-person environment. In the proposed tracking framework, we build upon
recent advances in pedestrian trajectory prediction approaches and propose a
novel data association scheme based on predicted trajectories. This removes the
need for computationally expensive person re-identification systems based on
appearance features and generates human-like trajectories with minimal
fragmentation. The proposed method is evaluated on multiple public benchmarks
including both static and dynamic cameras, and achieves outstanding
performance, especially among other recently proposed deep neural
network based approaches.
Comment: To appear in IEEE Winter Conference on Applications of Computer Vision (WACV), 201
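The prediction-driven association idea can be sketched as follows: predicted track positions are matched to current detections with the Hungarian algorithm, subject to a gating distance. A constant-velocity predictor stands in for the paper's learned trajectory predictor, and all parameters are illustrative.

```python
# Sketch of data association driven by predicted positions instead of
# appearance re-ID. A constant-velocity predictor stands in for a learned
# trajectory predictor; the gating threshold is illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def predict_next(track_history):
    """Constant-velocity prediction from the last two positions."""
    if len(track_history) < 2:
        return np.asarray(track_history[-1], dtype=float)
    p_prev = np.asarray(track_history[-2], dtype=float)
    p_last = np.asarray(track_history[-1], dtype=float)
    return p_last + (p_last - p_prev)

def associate(tracks, detections, max_dist=50.0):
    """Hungarian matching on distances between predicted and detected positions."""
    preds = np.stack([predict_next(t) for t in tracks])           # (T, 2)
    dets = np.asarray(detections, dtype=float)                    # (D, 2)
    cost = np.linalg.norm(preds[:, None, :] - dets[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches within the gating distance.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

if __name__ == "__main__":
    tracks = [[(10, 10), (12, 11)], [(100, 50), (98, 52)]]        # past positions per track
    detections = [(96, 54), (14, 12)]                             # current-frame detections
    print(associate(tracks, detections))                          # e.g. [(0, 1), (1, 0)]
```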
Video Labeling for Automatic Video Surveillance in Security Domains
Beyond traditional security methods, unmanned aerial vehicles (UAVs) have
become an important surveillance tool used in security domains to collect the
required annotated data. However, collecting annotated data from videos taken
by UAVs efficiently, and using these data to build datasets that can be used
for learning payoffs or adversary behaviors in game-theoretic approaches and
security applications, is an under-explored research question. This paper
presents VIOLA, a novel labeling application that includes (i) a workload
distribution framework to efficiently gather human labels from videos in a
secured manner; (ii) a software interface with features designed for labeling
videos taken by UAVs in the domain of wildlife security. We also present the
evolution of VIOLA and analyze how the changes made in the development process
relate to the efficiency of labeling, including when seemingly obvious
improvements did not lead to increased efficiency. VIOLA enables collecting
massive amounts of data with detailed information from challenging security
videos such as those collected aboard UAVs for wildlife security. VIOLA will
lead to the development of new approaches that integrate deep learning for
real-time detection and response.
Comment: Presented at the Data For Good Exchange 201
A constrained DMPs framework for robot skills learning and generalization from human demonstrations
The dynamical movement primitives (DMPs) model is a useful tool for efficiently
learning robotic manipulation skills from human demonstrations and then
generalizing these skills to fulfill new tasks. The model has been extended and
applied to cases with multiple constraints, such as obstacle avoidance or
relative-distance limits in multi-agent formations. However, such modified DMPs
must change their additional terms according to the specific constraints of
each task. In this paper, we propose a novel DMPs framework for generalizing
robot skills under constrained conditions. First, we summarize the common
characteristics of previous constrained DMPs and propose a general DMPs
framework covering various classes of constraints. Inspired by barrier Lyapunov
functions (BLFs), an additional acceleration term of the general model is
derived to compensate for tracking errors between the real and desired
trajectories under constraints. Furthermore, we prove convergence of the
generated path and discuss the advantages of the proposed method compared with
the existing literature. Finally, we instantiate the framework in three
experiments, obstacle avoidance in static and dynamic environments and
human-like cooperative manipulation, to verify its effectiveness.
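For context, the sketch below integrates the standard discrete DMP transformation system that such constrained variants build upon. It does not include the paper's BLF-derived compensation term; the gains and the zero forcing term are conventional illustrative choices.

```python
# Sketch of a standard discrete DMP rollout (the formulation that constrained
# variants extend). Gains and the demonstrated forcing term are illustrative;
# the paper's BLF-based compensation term is not reproduced here.
import numpy as np

def dmp_rollout(y0, goal, forcing, tau=1.0, alpha_z=25.0, beta_z=6.25,
                alpha_x=1.0, dt=0.01, steps=300):
    """Integrate  tau*dz = alpha_z*(beta_z*(g - y) - z) + f(x),  tau*dy = z."""
    y, z, x = float(y0), 0.0, 1.0
    path = [y]
    for _ in range(steps):
        f = forcing(x)                                    # learned forcing term f(x)
        dz = (alpha_z * (beta_z * (goal - y) - z) + f) / tau
        dy = z / tau
        dx = -alpha_x * x / tau                           # canonical system decay
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        path.append(y)
    return np.asarray(path)

if __name__ == "__main__":
    # Zero forcing reduces the DMP to a point attractor from 0 to the goal 1.
    traj = dmp_rollout(0.0, 1.0, forcing=lambda x: 0.0)
    print(traj[-1])   # converges towards the goal
```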
Multiple Object Tracking: A Literature Review
Multiple Object Tracking (MOT) is an important computer vision problem which
has gained increasing attention due to its academic and commercial potential.
Although different kinds of approaches have been proposed to tackle this
problem, it still remains challenging due to factors like abrupt appearance
changes and severe object occlusions. In this work, we contribute the first
comprehensive and most recent review on this problem. We inspect the recent
advances in various aspects and propose some interesting directions for future
research. To the best of our knowledge, there has not been any extensive review
on this topic in the community. We endeavor to provide a thorough review on the
development of this problem in recent decades. The main contributions of this
review are fourfold: 1) Key aspects of a multiple object tracking system,
including formulation, categorization, key principles, and evaluation, are
discussed. 2) Instead of enumerating individual works, we discuss existing
approaches according to various aspects, in each of which methods are divided
into different groups, and each group is discussed in detail regarding its principles,
advances, and drawbacks. 3) We examine the experiments of existing publications and
summarize results on popular datasets to provide quantitative comparisons. We
also point to some interesting discoveries by analyzing these results. 4) We
provide a discussion of open issues in MOT research, as well as some interesting
directions that could become potential research efforts in the future.
Teacher-Student Framework Enhanced Multi-domain Dialogue Generation
Dialogue systems that can handle multi-domain tasks are in high demand. How to
record the state remains a key problem in a task-oriented dialogue system.
Normally we use human-defined features as dialogue states and apply a state
tracker to extract these features. However, the performance of such a system is
limited by the error propagation of a state tracker. In this paper, we propose
a dialogue generation model that needs no external state trackers and still
benefits from human-labeled semantic data. By using a teacher-student
framework, several teacher models are first trained in their individual
domains, learning dialogue policies from labeled states. Then the learned
knowledge and experience are merged and transferred to a universal student
model, which takes raw utterances as its input. Experiments show that the
dialogue system trained under our framework outperforms the one that uses a
belief tracker.
Comment: Official Version: arXiv:2005.1045
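A minimal sketch of the teacher-student transfer idea, assuming a PyTorch setup with placeholder linear models: domain teachers provide soft targets and the student, which only sees encoded raw utterances, is trained with a temperature-scaled KL distillation loss. The architectures, domains, and loss choice are illustrative assumptions, not the paper's exact models.

```python
# Sketch of teacher-student transfer: domain teachers (trained with labeled
# states) provide soft targets; a single student trained only on raw-utterance
# features learns to match them. All models here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Placeholder networks: one teacher per domain, one universal student.
teachers = {dom: nn.Linear(32, 10) for dom in ["hotel", "restaurant"]}
student = nn.Linear(32, 10)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# One illustrative training step on a batch from the "hotel" domain.
utterance_features = torch.randn(8, 32)           # stand-in for encoded raw utterances
with torch.no_grad():
    teacher_out = teachers["hotel"](utterance_features)
optimizer.zero_grad()
loss = distillation_loss(student(utterance_features), teacher_out)
loss.backward()
optimizer.step()
print(float(loss))
```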
Temporal Dynamic Appearance Modeling for Online Multi-Person Tracking
Robust online multi-person tracking requires the correct associations of
online detection responses with existing trajectories. We address this problem
by developing a novel appearance modeling approach to provide accurate
appearance affinities to guide data association. In contrast to most existing
algorithms that only consider the spatial structure of human appearances, we
exploit the temporal dynamic characteristics within temporal appearance
sequences to discriminate between different persons. The temporal dynamics form
a strong complement to the spatial structure of varying appearances in the
feature space, which significantly improves the affinity measurement between
trajectories and detections. We propose a feature selection algorithm to
describe the appearance variations with mid-level semantic features, and
demonstrate its usefulness in terms of temporal dynamic appearance modeling.
Moreover, the appearance model is learned incrementally by alternately
evaluating newly observed appearances and adjusting the model parameters to be
suitable for online tracking. Reliable tracking of multiple persons in complex
scenes is achieved by incorporating the learned model into an online
tracking-by-detection framework. Our experiments on the challenging benchmark
MOTChallenge 2015 demonstrate that our method outperforms the state-of-the-art
multi-person tracking algorithms.
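A minimal sketch of the appearance-affinity idea, assuming NumPy: each trajectory keeps a temporally updated appearance template (an exponentially weighted average here stands in for the paper's learned temporal model), scores detections by cosine similarity, and updates incrementally after each confirmed association.

```python
# Sketch of appearance affinity using a trajectory's temporal feature history
# rather than a single snapshot. The exponentially weighted template and the
# cosine affinity are stand-ins for a learned temporal appearance model.
import numpy as np

class TrackAppearance:
    def __init__(self, feature, momentum=0.9):
        self.template = np.asarray(feature, dtype=float)
        self.momentum = momentum

    def affinity(self, detection_feature):
        """Cosine similarity between the temporal template and a detection feature."""
        d = np.asarray(detection_feature, dtype=float)
        return float(self.template @ d /
                     (np.linalg.norm(self.template) * np.linalg.norm(d) + 1e-8))

    def update(self, detection_feature):
        """Incrementally fold the newly observed appearance into the template."""
        d = np.asarray(detection_feature, dtype=float)
        self.template = self.momentum * self.template + (1.0 - self.momentum) * d

if __name__ == "__main__":
    track = TrackAppearance(np.random.rand(128))
    det = np.random.rand(128)
    print(track.affinity(det))   # affinity used to guide data association
    track.update(det)            # online adaptation after a confirmed match
```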
Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey
Deep learning has recently achieved very promising results in a wide range of
areas such as computer vision, speech recognition and natural language
processing. It aims to learn hierarchical representations of data by using deep
architecture models. In a smart city, a lot of data (e.g. videos captured from
many distributed sensors) need to be automatically processed and analyzed. In
this paper, we review the deep learning algorithms applied to video analytics
in a smart city in terms of different research topics: object detection, object
tracking, face recognition, image classification, and scene labeling.
Comment: 8 pages, 18 figures
A Collaborative Computer Aided Diagnosis (C-CAD) System with Eye-Tracking, Sparse Attentional Model, and Deep Learning
There are at least two categories of errors in radiology screening that can
lead to suboptimal diagnostic decisions and interventions: (i) human fallibility
and (ii) the complexity of visual search. Computer-aided diagnostic (CAD) tools are
developed to help radiologists to compensate for some of these errors. However,
despite their significant improvements over conventional screening strategies,
most CAD systems do not go beyond their use as second opinion tools due to
producing a high number of false positives, which human interpreters need to
correct. In parallel with efforts in computerized analysis of radiology scans,
several researchers have examined behaviors of radiologists while screening
medical images to better understand how and why they miss tumors, how they
interact with the information in an image, and how they search for unknown
pathology in the images. Eye-tracking tools have been instrumental in exploring
answers to these fundamental questions. In this paper, we aim to develop a
paradigm-shifting CAD system, called collaborative CAD (C-CAD), that unifies both
of the above-mentioned research lines: CAD and eye-tracking. We design an
eye-tracking interface providing radiologists with a real radiology reading
room experience. Then, we propose a novel algorithm that unifies eye-tracking
data and a CAD system. Specifically, we present a new graph based clustering
and sparsification algorithm to transform eye-tracking data (gaze) into a
signal model to interpret gaze patterns quantitatively and qualitatively. The
proposed C-CAD collaborates with radiologists via eye-tracking technology and
helps them to improve diagnostic decisions. The C-CAD learns radiologists'
search efficiency by processing their gaze patterns. To do this, the C-CAD uses
a deep learning algorithm in a newly designed multi-task learning platform to
segment and diagnose cancers simultaneously.
Comment: Submitted to Medical Image Analysis Journal (MedIA)
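The gaze-to-graph step can be sketched as follows, assuming NumPy: fixation points become graph nodes, Gaussian affinities become edge weights, and weak edges are pruned as a simple sparsification. The distance scale and pruning threshold are illustrative and do not reproduce the paper's clustering algorithm.

```python
# Sketch of turning gaze fixations into a sparse graph: nodes are fixation
# points, edge weights decay with spatial distance, and weak edges are pruned.
# The distance scale and pruning threshold are illustrative choices.
import numpy as np

def gaze_graph(fixations, sigma=50.0, threshold=0.3):
    """Return a sparsified affinity matrix over gaze fixation points."""
    pts = np.asarray(fixations, dtype=float)              # (N, 2) gaze coordinates
    diff = pts[:, None, :] - pts[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    W = np.exp(-dist2 / (2.0 * sigma ** 2))               # Gaussian affinity
    np.fill_diagonal(W, 0.0)
    W[W < threshold] = 0.0                                 # sparsification step
    return W

if __name__ == "__main__":
    fixations = [(100, 120), (110, 125), (400, 300), (405, 310)]
    W = gaze_graph(fixations)
    # Connected components of the sparsified graph act as clusters of attention.
    print((W > 0).astype(int))
```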