Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches which are only trained on
large-scale face image datasets offline, we use the contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply a hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
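To make the triplet-loss objective in the abstract above concrete, here is a minimal sketch in plain Python. The squared Euclidean distance, the hinge form, and the margin value of 0.2 are assumptions chosen for illustration; the abstract states only that a triplet loss is minimized so that Euclidean distances reflect face similarity.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin is an assumed hyperparameter).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`, measured in squared distance.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings: two faces of one person and one face of another.
anchor = [0.0, 0.0]
same_person = [0.1, 0.0]
other_person = [1.0, 0.0]

easy = triplet_loss(anchor, same_person, other_person)  # constraint satisfied
hard = triplet_loss(anchor, other_person, same_person)  # constraint violated
```

Only the violated triplet contributes a non-zero loss, so training pushes on exactly those samples whose ordering in the embedding space is still wrong.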
Deep Learning Algorithms with Applications to Video Analytics for A Smart City: A Survey
Deep learning has recently achieved very promising results in a wide range of
areas such as computer vision, speech recognition and natural language
processing. It aims to learn hierarchical representations of data by using deep
architecture models. In a smart city, large volumes of data (e.g., videos
captured from many distributed sensors) need to be automatically processed and
analyzed. In this paper, we review deep learning algorithms applied to video
analytics for smart cities, organized by research topic: object detection,
object tracking, face recognition, image classification, and scene labeling.
Comment: 8 pages, 18 figures
PaMM: Pose-aware Multi-shot Matching for Improving Person Re-identification
Person re-identification is the problem of recognizing people across
different images or videos with non-overlapping views. Although there has been
much progress in person re-identification over the last decade, it remains a
challenging task because appearances of people can seem extremely different
across diverse camera viewpoints and person poses. In this paper, we propose a
novel framework for person re-identification, called Pose-aware Multi-shot
Matching (PaMM), which analyzes camera viewpoints and person poses: it
robustly estimates people's poses and efficiently conducts multi-shot matching
based on pose information. Experimental results using public person
re-identification datasets show that the proposed methods outperform
state-of-the-art methods and are promising for person re-identification from
diverse viewpoints and pose variances.
Comment: 12 pages, 12 figures, 4 tables
Fast detection of multiple objects in traffic scenes with a common detection framework
Traffic scene perception (TSP) aims to extract accurate on-road environment
information in real time, which involves three phases: detection of objects of
interest, recognition of detected objects, and tracking of objects in motion.
Since recognition and tracking often rely on the results from detection, the
ability to detect objects of interest effectively plays a crucial role in TSP.
In this paper, we focus on three important classes of objects: traffic signs,
cars, and cyclists. We propose to detect all three classes in a single
learning-based detection framework. The proposed framework consists of a
dense feature extractor and detectors for the three classes. Once the
dense features have been extracted, they are shared with all detectors. The
advantage of using one common framework is that detection is much faster,
since the dense features need to be evaluated only once in the testing phase.
In contrast, most previous works have designed specific detectors using
different features for each of these objects. To make the features more
robust to noise and image deformations, we introduce spatially
pooled features as a part of aggregated channel features. In order to further
improve the generalization performance, we propose an object subcategorization
method as a means of capturing intra-class variation of objects. We
experimentally demonstrate the effectiveness and efficiency of the proposed
framework in three detection applications: traffic sign detection, car
detection, and cyclist detection. The proposed framework achieves
performance competitive with state-of-the-art approaches on several benchmark
datasets.
Comment: Appearing in IEEE Transactions on Intelligent Transportation Systems
Modeling and Inferring Human Intents and Latent Functional Objects for Trajectory Prediction
This paper is about detecting functional objects and inferring human
intentions in surveillance videos of public spaces. People in the videos are
expected to intentionally take the shortest paths, subject to obstacles, toward
functional objects where they can satisfy certain needs (e.g., a vending
machine can quench thirst), following one of three possible intent behaviors:
reach a single functional object and stop, sequentially visit several
functional objects, or start moving toward one goal but then change intent and
move toward another. Since detecting functional objects in low-resolution
surveillance videos is typically unreliable, we call them "dark matter"
characterized by the functionality to attract people. We formulate
Agent-based Lagrangian Mechanics, wherein human trajectories are
probabilistically modeled as motions of agents in many layers of "dark-energy"
fields, where each agent can select a particular force field to affect its
motions, and thus define the minimum-energy Dijkstra path toward the
corresponding source "dark matter". For evaluation, we compiled and annotated a
new dataset. The results demonstrate the effectiveness of our approach in
predicting human intent behaviors and trajectories, localizing functional
objects, and discovering distinct functional classes of objects by clustering
human motion behavior in the vicinity of functional objects.
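The minimum-energy Dijkstra path mentioned in the abstract above can be illustrated with a small grid search. This is a sketch under stated assumptions, not the authors' formulation: the grid, the 4-connected neighborhood, and the per-cell entry cost standing in for the "dark-energy" field are all hypothetical.

```python
import heapq


def dijkstra_grid(cost, start, goal):
    """Cheapest 4-connected path on a grid.

    cost[r][c] is the (assumed) energy paid to enter a cell; None marks an
    obstacle. Returns (total_cost, path) or (inf, []) if goal is unreachable.
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        dist, cell = heapq.heappop(heap)
        if cell == goal:
            # Walk parent links back to the start to recover the path.
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return dist, path[::-1]
        if dist > best.get(cell, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and cost[nr][nc] is not None:
                new_dist = dist + cost[nr][nc]
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (new_dist, (nr, nc)))
    return float("inf"), []


# A 3x3 scene with one obstacle in the middle; the goal plays the role of
# a functional object in the far corner.
grid = [
    [1, 1, 1],
    [1, None, 1],
    [1, 1, 1],
]
total, path = dijkstra_grid(grid, (0, 0), (2, 2))
```

The agent walks around the obstacle; changing the per-cell costs changes which detour is cheapest, which is how different force fields would lead to different predicted trajectories in this toy setting.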
Online Multiple Pedestrian Tracking using Deep Temporal Appearance Matching Association
In online multiple pedestrian tracking, it is of great importance to model
appearance and geometric similarity between existing tracks and targets
that appear in a new frame. The appearance model contains discriminative
information of higher dimension than the geometric model. Thanks to
the recent success of deep learning based methods, handling high-dimensional
appearance information has become possible. Among many deep networks, the Siamese
network with triplet loss is popularly adopted as an appearance feature
extractor. Since the Siamese network can extract features of each input
independently, it is possible to update and maintain target-specific features.
However, it is not suitable for multi-object settings that require comparison
with other inputs. In this paper we propose a novel track appearance model
based on joint-inference network to address this issue. The proposed method
enables comparison of two inputs to be used for adaptive appearance modeling.
It contributes to disambiguating the process of target-observation matching and
consolidating identity consistency. Diverse experimental results support
the effectiveness of our method. Our tracker ranked third on
MOTChallenge19, held at CVPR2019.
Comment: 23 pages, 14 figures, 3rd Prize on 4th BMTT MOTChallenge Workshop
held in CVPR2019
Online Metric-Weighted Linear Representations for Robust Visual Tracking
In this paper, we propose a visual tracker based on a metric-weighted linear
representation of appearance. In order to capture the interdependence of
different feature dimensions, we develop two online distance metric learning
methods using proximity comparison information and structured output learning.
The learned metric is then incorporated into a linear representation of
appearance.
We show that online distance metric learning significantly improves the
robustness of the tracker, especially on those sequences exhibiting drastic
appearance changes. In order to bound growth in the number of training samples,
we design a time-weighted reservoir sampling method.
Moreover, we enable our tracker to automatically perform object
identification during the process of object tracking, by introducing a
collection of static template samples belonging to several object classes of
interest. Object identification results for an entire video sequence are
achieved by systematically combining the tracking information and visual
recognition at each frame. Experimental results on challenging video sequences
demonstrate the effectiveness of the method for both inter-frame tracking and
object identification.
Comment: 51 pages. Appearing in IEEE Transactions on Pattern Analysis and
Machine Intelligence
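The abstract above does not say how its time-weighted reservoir sampling works. As a rough illustration of the idea of bounding the training set while favoring recent samples, here is a sketch that uses a well-known weighted reservoir scheme (Efraimidis-Spirakis keys, computed in the log domain) with geometrically growing weights. The weighting rule and the `growth` parameter are assumptions, not the authors' method.

```python
import heapq
import math
import random


def time_weighted_reservoir(stream, capacity, growth=1.05, seed=0):
    """Keep at most `capacity` items from `stream`, biased toward recent ones.

    Each item at arrival index t gets weight growth**t (an assumed rule).
    Its key is log(u) / weight with u uniform in (0, 1]; the items with the
    largest keys are retained, which is the weighted reservoir criterion.
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, arrival_index, item); smallest key is evicted
    for t, item in enumerate(stream):
        weight = growth ** t
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        key = math.log(u) / weight
        if len(heap) < capacity:
            heapq.heappush(heap, (key, t, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, t, item))
    return [item for _, _, item in heap]


# 200 frame indices stand in for training samples collected while tracking.
sample = time_weighted_reservoir(range(200), capacity=10)
```

Memory stays fixed at `capacity` entries no matter how long the stream runs. Because the weights grow geometrically, a very long stream would need the weights rescaled to avoid floating-point overflow.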
Exploring Uncertainty in Conditional Multi-Modal Retrieval Systems
We cast visual retrieval as a regression problem by posing triplet loss as a
regression loss. This enables epistemic uncertainty estimation using dropout as
a Bayesian approximation framework in retrieval. Accordingly, Monte Carlo (MC)
sampling is leveraged to boost retrieval performance. Our approach is evaluated
on two applications: person re-identification and autonomous car driving.
Results comparable to the state of the art are achieved on multiple datasets
for the former application.
We leverage the Honda driving dataset (HDD) for the autonomous car driving
application. It provides multiple modalities and similarity notions for
ego-motion action understanding. Hence, we present a multi-modal conditional
retrieval network. It disentangles embeddings into separate representations to
encode different similarities. This form of joint learning eliminates the need
to train multiple independent networks without any performance degradation.
Quantitative evaluation highlights our approach's competence, achieving a 6%
improvement in a highly uncertain environment.
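The Monte Carlo dropout idea in the abstract above can be sketched with a toy model: run several stochastic forward passes with dropout left on, average the embeddings, and read the spread as an uncertainty estimate. The single linear layer, dropout on the inputs, and all parameter values are assumptions for illustration only; the abstract describes a retrieval network, not this toy.

```python
import random


def embed(x, weights, drop_prob, rng):
    """One stochastic forward pass: inverted dropout on inputs, then a linear map."""
    keep = 1.0 - drop_prob
    masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    return [sum(w * m for w, m in zip(row, masked)) for row in weights]


def mc_dropout(x, weights, drop_prob=0.5, passes=50, seed=0):
    """Mean embedding and per-dimension variance over `passes` dropout samples."""
    rng = random.Random(seed)
    samples = [embed(x, weights, drop_prob, rng) for _ in range(passes)]
    dims = range(len(weights))
    mean = [sum(s[d] for s in samples) / passes for d in dims]
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / passes for d in dims]
    return mean, var


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2x2 embedding weights
query = [1.0, 2.0]

mean_on, var_on = mc_dropout(query, identity, drop_prob=0.5)
mean_off, var_off = mc_dropout(query, identity, drop_prob=0.0)
```

With dropout disabled, every pass is identical and the variance collapses to zero. With dropout enabled, the variance is non-zero, and the averaged embedding is what would be used for retrieval in this sketch.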
Learning Non-Uniform Hypergraph for Multi-Object Tracking
The majority of Multi-Object Tracking (MOT) algorithms based on the
tracking-by-detection scheme do not use higher order dependencies among objects
or tracklets, which makes them less effective in handling complex scenarios. In
this work, we present a new near-online MOT algorithm based on a non-uniform
hypergraph, which can model different degrees of dependencies among tracklets
in a unified objective. The nodes in the hypergraph correspond to the tracklets
and the hyperedges with different degrees encode various kinds of dependencies
among them. Specifically, rather than being set empirically, the weights of
hyperedges with different degrees are learned automatically using the
structural support vector machine (SSVM) algorithm. Several experiments are
carried out on various challenging datasets (i.e., PETS09, ParkingLot sequence,
SubwayFace, and MOT16 benchmark), to demonstrate that our method achieves
favorable performance against the state-of-the-art MOT methods.
Comment: 11 pages, 4 figures, accepted by AAAI 201
Temporally Robust Global Motion Compensation by Keypoint-based Congealing
Global motion compensation (GMC) removes the impact of camera motion and
creates a video in which the background appears static over the progression of
time. Various vision problems, such as human activity recognition, background
reconstruction, and multi-object tracking can benefit from GMC. Existing GMC
algorithms rely on sequentially processing consecutive frames, estimating
the transformation between each pair of frames and composing these into a
transformation to a global motion-compensated coordinate system. Sequential GMC
suffers from temporal drift of frames from the accurate global coordinate, due
to either error accumulation or sporadic failures of motion estimation at a few
frames. We propose a temporally robust global motion compensation (TRGMC)
algorithm which performs accurate and stable GMC, despite complicated and
long-term camera motion. TRGMC densely connects pairs of frames, by matching
local keypoints of each frame. A joint alignment of these frames is formulated
as a novel keypoint-based congealing problem, where the transformation of each
frame is updated iteratively, such that the spatial coordinates for the start
and end points of matched keypoints are identical. Experimental results
demonstrate that TRGMC has superior performance in a wide range of scenarios.
Comment: 14 pages
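The drift problem of sequential GMC described in the abstract above can be seen in a few lines: chaining per-frame transforms also chains their errors. This sketch assumes pure translations written as 3x3 homogeneous matrices and a made-up constant estimation error of 0.1 pixels per frame. It illustrates the sequential baseline only, not TRGMC itself.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(dx, dy):
    """Homogeneous 3x3 transform that shifts points by (dx, dy)."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


def compose_to_global(pairwise):
    """pairwise[i] maps frame i+1 into frame i.

    Returns one transform per frame, each mapping that frame into frame 0,
    built by chaining consecutive estimates as sequential GMC does.
    """
    to_global = [translation(0.0, 0.0)]
    for step in pairwise:
        to_global.append(matmul3(to_global[-1], step))
    return to_global


# True camera motion is 2 px per frame; each estimate is off by 0.1 px.
estimates = [translation(2.1, 0.0) for _ in range(10)]
chained = compose_to_global(estimates)
drift = chained[10][0][2] - 2.0 * 10  # accumulated error after 10 frames
```

After ten frames the small per-frame error has grown to about one pixel, which is the accumulation effect that motivates aligning many frame pairs jointly.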