287 research outputs found
Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches which are only trained on
large-scale face image datasets offline, we use the contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply the hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
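The triplet loss mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the margin value of 0.2, and the use of squared Euclidean distance on plain Python lists are all assumptions made for illustration. The loss is zero once the anchor is closer to the positive (same face) than to the negative (different face) by at least the margin.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin is a hypothetical choice).

    Penalizes triplets in which the anchor-positive distance is not at
    least `margin` smaller than the anchor-negative distance.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# A well-separated triplet incurs no loss; a violated one does.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0])
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
```

In practice such a loss would be minimized over many triplets with a deep learning framework; the sketch only shows the quantity being minimized.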
SOT for MOT
In this paper we present a robust tracker to solve the multiple object
tracking (MOT) problem, under the framework of tracking-by-detection. As the
first contribution, we innovatively combine single object tracking (SOT)
algorithms with multiple object tracking algorithms, and our results show that
SOT is a general way to strongly reduce the number of false negatives,
regardless of the quality of detection. As a second contribution, we show
that a deep-learning-based appearance model makes it easy to associate
detections of the same object both efficiently and accurately. This appearance
model plays an important role in our MOT algorithm to correctly associate
detections into long trajectories, and also in our SOT algorithm to discover
new detections mistakenly missed by the detector. The deep neural network based
model ensures the robustness of our tracking algorithm, which can perform data
association in a wide variety of scenes. We ran comprehensive experiments on a
large-scale and challenging dataset, the MOT16 benchmark, and results showed
that our tracker achieved state-of-the-art performance based on both public and
private detections.
An Automatic System for Unconstrained Video-Based Face Recognition
Although deep learning approaches have achieved performance surpassing humans
for still image-based face recognition, unconstrained video-based face
recognition is still a challenging task due to the large volume of data to be
processed and intra/inter-video variations in pose, illumination, occlusion,
scene, blur, video quality, etc. In this work, we consider challenging
scenarios for unconstrained video-based face recognition from multiple-shot
videos and surveillance videos with low-quality frames. To handle these
problems, we propose a robust and efficient system for unconstrained
video-based face recognition, which is composed of modules for face/fiducial
detection, face association, and face recognition. First, we use multi-scale
single-shot face detectors to efficiently localize faces in videos. The
detected faces are then grouped respectively through carefully designed face
association methods, especially for multi-shot videos. Finally, the faces are
recognized by the proposed face matcher based on an unsupervised subspace
learning approach and a subspace-to-subspace similarity metric. Extensive
experiments on challenging video datasets, such as Multiple Biometric Grand
Challenge (MBGC), Face and Ocular Challenge Series (FOCS), IARPA Janus
Surveillance Video Benchmark (IJB-S) for low-quality surveillance videos and
IARPA JANUS Benchmark B (IJB-B) for multiple-shot videos, demonstrate that the
proposed system can accurately detect and associate faces from unconstrained
videos and effectively learn robust and discriminative features for
recognition.
Multi-Face Tracking by Extended Bag-of-Tracklets in Egocentric Videos
Wearable cameras offer a hands-free way to record egocentric images of daily
experiences, where social events are of special interest. The first step
towards detection of social events is to track the appearance of multiple
persons involved in them. In this paper, we propose a novel method to find
correspondences of multiple faces in low temporal resolution egocentric videos
acquired through a wearable camera. This kind of photo-stream poses
additional challenges for the multi-tracking problem compared with
conventional videos. Because the camera moves freely and has a low
temporal resolution, abrupt changes in the field of view, illumination
conditions, and target location are highly frequent. To overcome such
difficulties, we propose a multi-face tracking method that generates a set of
tracklets through finding correspondences along the whole sequence for each
detected face and takes advantage of tracklet redundancy to deal with
unreliable ones. Similar tracklets are grouped into a so-called extended
bag-of-tracklets (eBoT), which is intended to correspond to a specific person.
Finally, a prototype tracklet is extracted for each eBoT, and the occlusions
that occurred are estimated using a new measure of confidence. We
validated our approach on an extensive dataset of egocentric photo-streams
and compared it to state-of-the-art methods, demonstrating its effectiveness
and robustness.
Comment: 27 pages, 18 figures, submitted to Computer Vision and Image
Understanding journal
Spatial-Temporal Relation Networks for Multi-Object Tracking
Recent progress in multiple object tracking (MOT) has shown that a robust
similarity score is key to the success of trackers. A good similarity score is
expected to reflect multiple cues, e.g. appearance, location, and topology,
over a long period of time. However, these cues are heterogeneous, making them
hard to combine in a unified network. As a result, existing methods usually
encode them in separate networks or require a complex training approach. In
this paper, we present a unified framework for similarity measurement which
can simultaneously encode various cues and perform reasoning across both
spatial and temporal domains. We also study the feature representation of a
tracklet-object pair in depth, showing that a proper design of the pair
features can considerably strengthen trackers. The resulting approach is named spatial-temporal
relation networks (STRN). It runs in a feed-forward way and can be trained in
an end-to-end manner. State-of-the-art accuracy was achieved on all of the
MOT15-17 benchmarks using public detections and online settings.
Real-Time Visual Tracking and Identification for a Team of Homogeneous Humanoid Robots
The use of a team of humanoid robots to collaborate in completing a task is
an increasingly important field of research. One of the challenges in achieving
collaboration is mutual identification and tracking of the robots. This work
presents a real-time vision-based approach to the detection and tracking of
robots of known appearance, based on the images captured by a stationary robot.
A Histogram of Oriented Gradients descriptor is used to detect the robots and
the robot headings are estimated by a multiclass classifier. The tracked robots
report their own heading estimate from magnetometer readings. For tracking, a
cost function based on position and heading is applied to each of the
tracklets, and a globally optimal labeling of the detected robots is found
using the Hungarian algorithm. The complete identification and tracking system
was tested using two igus Humanoid Open Platform robots on a soccer field. We
expect that a similar system can be used with other humanoid robots, such as
Nao and DARwIn-OP.
Comment: 20th RoboCup International Symposium, Leipzig, Germany, 201
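The labeling step described in the abstract above (a cost based on position and heading, followed by a globally optimal assignment) can be sketched as follows. This is an illustrative sketch only: the tuple layout `(x, y, heading)`, the weights `w_pos` and `w_head`, and all function names are assumptions. For brevity it finds the optimal assignment by exhaustive search over permutations rather than by the Hungarian algorithm; both return a minimum-cost labeling, but exhaustive search is only practical for a handful of robots.

```python
import math
from itertools import permutations


def heading_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def cost(track, detection, w_pos=1.0, w_head=0.5):
    """Weighted position + heading cost (weights are hypothetical)."""
    (tx, ty, th), (dx, dy, dh) = track, detection
    return w_pos * math.hypot(tx - dx, ty - dy) + w_head * heading_diff(th, dh)


def best_labeling(tracks, detections):
    """Return a tuple p where detection p[i] is assigned to track i.

    Exhaustive search stands in for the Hungarian algorithm here;
    it assumes len(detections) >= len(tracks).
    """
    n = len(tracks)
    return min(
        permutations(range(len(detections)), n),
        key=lambda p: sum(cost(tracks[i], detections[p[i]]) for i in range(n)),
    )


# Two tracked robots (reported headings) and two detections (estimated headings).
tracks = [(0.0, 0.0, 0.0), (5.0, 0.0, math.pi)]
detections = [(5.1, 0.0, 3.0), (0.2, 0.0, 0.1)]
labeling = best_labeling(tracks, detections)
```

Here `labeling` pairs track 0 with detection 1 and track 1 with detection 0. For larger teams, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice.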
An End-to-End System for Crowdsourced 3d Maps for Autonomous Vehicles: The Mapping Component
Autonomous vehicles rely on precise high definition (HD) 3d maps for
navigation. This paper presents the mapping component of an end-to-end system
for crowdsourcing precise 3d maps with semantically meaningful landmarks such
as traffic signs (6 dof pose, shape and size) and traffic lanes (3d splines).
The system uses consumer-grade parts and, in particular, relies on a single
front-facing camera and a consumer-grade GPS. Using real-time sign and lane
triangulation on-device in the vehicle, with offline sign/lane clustering
across multiple journeys and offline Bundle Adjustment across multiple journeys
in the backend, we construct maps with mean absolute accuracy at sign corners
of less than 20 cm from 25 journeys. To the best of our knowledge, this is the
first end-to-end HD mapping pipeline in global coordinates in the automotive
context using cost-effective sensors.
Self-supervised Multi-view Person Association and Its Applications
Reliable markerless motion tracking of people participating in a complex
group activity from multiple moving cameras is challenging due to frequent
occlusions, strong viewpoint and appearance variations, and asynchronous video
streams. To solve this problem, reliable association of the same person across
distant viewpoints and temporal instances is essential. We present a
self-supervised framework to adapt a generic person appearance descriptor to
the unlabeled videos by exploiting motion tracking, mutual exclusion
constraints, and multi-view geometry. The adapted discriminative descriptor is
used in a tracking-by-clustering formulation. We validate the effectiveness of
our descriptor learning on WILDTRACK [14] and three new complex social scenes
captured by multiple cameras with up to 60 people "in the wild". We report
significant improvement in association accuracy (up to 18%) and stable and
coherent 3D human skeleton tracking (5 to 10 times) over the baseline. Using
the reconstructed 3D skeletons, we cut the input videos into a multi-angle
video where the image of a specified person is shown from the best visible
front-facing camera. Our algorithm detects inter-human occlusion to determine
the camera switching moment while still maintaining the flow of the action
well.
Comment: Accepted to IEEE TPAM
End-to-end Face Detection and Cast Grouping in Movies Using Erdős-Rényi Clustering
We present an end-to-end system for detecting and clustering faces by
identity in full-length movies. Unlike works that start with a predefined set
of detected faces, we consider the end-to-end problem of detection and
clustering together. We make three separate contributions. First, we combine a
state-of-the-art face detector with a generic tracker to extract high quality
face tracklets. We then introduce a novel clustering method, motivated by the
classic graph theory results of Erdős and Rényi. It is based on the
observations that large clusters can be fully connected by joining just a small
fraction of their point pairs, while just a single connection between two
different people can lead to poor clustering results. This suggests clustering
using a verification system with very few false positives but perhaps moderate
recall. We introduce a novel verification method, rank-1 counts verification,
that has this property, and use it in a link-based clustering scheme. Finally,
we define a novel end-to-end detection and clustering evaluation metric
allowing us to assess the accuracy of the entire end-to-end system. We present
state-of-the-art results on multiple video data sets and also on standard face
databases.
Comment: to appear in ICCV 2017 (spotlight
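The observation behind the link-based clustering in the abstract above can be illustrated with a small sketch. It is not the authors' method: the rank-1 counts verifier is not reproduced, and the `links` argument is a hypothetical list of tracklet pairs that some verifier has already accepted. Clusters are simply the connected components of those links, computed with union-find, which shows both halves of the observation: a few links suffice to connect a cluster, and a single false link merges two identities.

```python
def link_clusters(n_items, links):
    """Group items 0..n_items-1 into connected components of `links`.

    `links` is a list of (i, j) pairs accepted by a verifier
    (hypothetical input; the verifier itself is not modeled here).
    """
    parent = list(range(n_items))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two links connect a three-item cluster; item 5 stays a singleton.
clean = link_clusters(6, [(0, 1), (1, 2), (3, 4)])
# One false positive between items 2 and 3 merges two identities.
merged = link_clusters(6, [(0, 1), (1, 2), (3, 4), (2, 3)])
```

This is why a verifier with very few false positives matters more than one with high recall in such a scheme.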