Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches, which are trained only on
large-scale face image datasets offline, we use contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply a hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
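As a rough illustration of the triplet objective described in this abstract, the sketch below pulls same-identity face embeddings together and pushes different-identity embeddings apart by a margin; the margin value, batch size, and embedding dimensionality are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Pull faces of the same person together and push faces of different
    # people apart by at least the margin (hinge on the distance gap).
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage: a batch of 8 triplets of 128-D face embeddings.
anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```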
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper gives futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers
presented at CVPR2015, the premier annual computer vision event held in
June 2015, in order to grasp the trends in the field. Further, we propose
"DeepSurvey" as a mechanism embodying the entire process from reading all the
papers, through the generation of ideas, to the writing of a paper.
Comment: Survey Paper
Person Search in Videos with One Portrait Through Visual and Temporal Links
In real-world applications, e.g. law enforcement and video retrieval, one
often needs to search for a certain person in long videos with just one portrait.
This is much more challenging than the conventional settings for person
re-identification, as the search may need to be carried out in environments
different from where the portrait was taken. In this paper, we aim to tackle
this challenge and propose a novel framework, which takes into account the
identity invariance along a tracklet, thus allowing person identities to be
propagated via both the visual and the temporal links. We also develop a novel
scheme called Progressive Propagation via Competitive Consensus, which
significantly improves the reliability of the propagation process. To promote
the study of person search, we construct a large-scale benchmark, which
contains 127K manually annotated tracklets from 192 movies. Experiments show
that our approach remarkably outperforms mainstream person re-id methods,
raising the mAP from 42.16% to 62.27%.
Comment: European Conference on Computer Vision (ECCV), 2018
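A minimal sketch of how identities could be propagated over visual/temporal links with a competitive (max-over-neighbours) consensus rather than an average is given below; the affinity matrix, renormalisation, and single-step update are illustrative assumptions and do not reproduce the authors' exact Progressive Propagation algorithm.

```python
import numpy as np

def competitive_consensus_step(affinity, probs):
    """One propagation step: for every identity, each tracklet keeps the
    strongest single weighted vote among its neighbours (competitive max)
    instead of averaging all votes, then renormalises the result."""
    votes = affinity[:, :, None] * probs[None, :, :]   # (N, N, K) weighted votes
    new_probs = votes.max(axis=1)                       # max over neighbours
    return new_probs / (new_probs.sum(axis=1, keepdims=True) + 1e-12)

# Toy example: 4 tracklets, 2 identities, tracklet 0 matches the portrait.
affinity = np.random.rand(4, 4)
np.fill_diagonal(affinity, 0.0)                         # no self-votes
probs = np.full((4, 2), 0.5)
probs[0] = [1.0, 0.0]
print(competitive_consensus_step(affinity, probs))
```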
A New Unified Method for Detecting Text from Marathon Runners and Sports Players in Video
Detecting text located on the torsos of marathon runners and sports players
in video is a challenging issue due to poor quality and adverse effects caused
by flexible/colorful clothing, and different structures of human bodies or
actions. This paper presents a new unified method for tackling the above
challenges. The proposed method fuses gradient magnitude and direction
coherence of text pixels in a new way for detecting candidate regions.
Candidate regions are used for determining the number of temporal frame
clusters obtained by K-means clustering on frame differences. This process in
turn detects key frames. The proposed method explores Bayesian probability for
skin portions using color values at both pixel and component levels of temporal
frames, which provides fused images with skin components. Based on skin
information, the proposed method then detects faces and torsos by finding
structural and spatial coherences between them. We further propose an adaptive
pixel-linking deep learning model for text detection from torso regions. The
proposed method is tested on our own dataset collected from marathon/sports
video and three standard datasets, namely, RBNR, MMM and R-ID of marathon
images, to evaluate the performance. In addition, the proposed method is also
tested on the standard natural scene datasets, namely, CTW1500 and MS-COCO text
datasets, to show the objectiveness of the proposed method. A comparative study
with the state-of-the-art methods on bib number/text detection of different
datasets shows that the proposed method outperforms the existing methods.
Comment: Accepted in Pattern Recognition, Elsevier
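One possible way to fuse gradient magnitude with gradient-direction coherence for flagging candidate text pixels is sketched below; the Sobel operators, window size, and thresholds are assumptions for illustration and not the paper's exact formulation.

```python
import numpy as np
from scipy import ndimage

def candidate_text_mask(gray, mag_thresh=0.3, coh_thresh=0.6, win=5):
    # Gradient magnitude, normalised to [0, 1].
    gx = ndimage.sobel(gray.astype(float), axis=1)
    gy = ndimage.sobel(gray.astype(float), axis=0)
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-12
    # Direction coherence: local agreement of the doubled gradient angle,
    # close to 1 where nearby edges share an orientation (stroke-like).
    theta = np.arctan2(gy, gx)
    c = ndimage.uniform_filter(np.cos(2.0 * theta), size=win)
    s = ndimage.uniform_filter(np.sin(2.0 * theta), size=win)
    coherence = np.hypot(c, s)
    # Fuse: keep pixels that are both strong and directionally consistent.
    return (mag > mag_thresh) & (coherence > coh_thresh)

# Toy usage on a random grayscale frame.
print(candidate_text_mask(np.random.rand(64, 64)).sum())
```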
Self-Supervised Learning of Face Representations for Video Face Clustering
Analyzing the story behind TV series and movies often requires understanding
who the characters are and what they are doing. With improving deep face
models, this may seem like a solved problem. However, as face detectors get
better, clustering/identification needs to be revisited to address increasing
diversity in facial appearance. In this paper, we address video face clustering
using unsupervised methods. Our emphasis is on distilling the essential
information, identity, from the representations obtained using deep pre-trained
face networks. We propose a self-supervised Siamese network that can be trained
without the need for video/track based supervision, and thus can also be
applied to image collections. We evaluate our proposed method on three video
face clustering datasets. The experiments show that our methods outperform
current state-of-the-art methods on all datasets. Video face clustering
lacks a common benchmark, as current works are often evaluated with different
metrics and/or different sets of face tracks.
Comment: To appear at International Conference on Automatic Face and Gesture
Recognition (2019) as an Oral. The datasets and code are available at
https://github.com/vivoutlaw/SSIA
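A minimal sketch of the self-supervised idea, with positive/negative pairs mined directly from pre-trained face descriptors and no track labels, is given below; the head architecture, mining rule, and feature dimensionality are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseHead(nn.Module):
    """Small embedding head refined on top of frozen deep face features."""
    def __init__(self, dim_in=256, dim_out=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_out), nn.ReLU(),
                                 nn.Linear(dim_out, dim_out))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def mine_pairs(feats):
    """Pseudo-labels from the base representation: the nearest neighbour of
    each sample is treated as a positive, the farthest as a negative."""
    d = torch.cdist(feats, feats)
    pos = d.topk(2, largest=False).indices[:, 1]   # index 0 is the sample itself
    neg = d.topk(1, largest=True).indices[:, 0]
    return pos, neg

# Toy usage: 16 pre-computed 256-D face descriptors.
feats = torch.randn(16, 256)
pos_idx, neg_idx = mine_pairs(feats)
head = SiameseHead()
emb = head(feats)
print(emb.shape, pos_idx.shape, neg_idx.shape)
```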
FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces
With recent advances in computer vision and graphics, it is now possible to
generate videos with extremely realistic synthetic faces, even in real time.
Countless applications are possible, some of which raise a legitimate alarm,
calling for reliable detectors of fake videos. In fact, distinguishing between
original and manipulated video can be a challenge for humans and computers
alike, especially when the videos are compressed or have low resolution, as
often happens on social networks. Research on the detection of face
manipulations has been seriously hampered by the lack of adequate datasets. To
this end, we introduce a novel face manipulation dataset of about half a
million edited images (from over 1000 videos). The manipulations have been
generated with a state-of-the-art face editing approach. It exceeds all
existing video manipulation datasets by at least an order of magnitude. Using
our new dataset, we introduce benchmarks for classical image forensic tasks,
including classification and segmentation, considering videos compressed at
various quality levels. In addition, we introduce a benchmark evaluation for
creating indistinguishable forgeries with known ground truth; for instance with
generative refinement models.
Comment: Video: https://youtu.be/Tle7YaPkO_
Vehicle Re-Identification in Context
Existing vehicle re-identification (re-id) evaluation benchmarks consider
strongly artificial test scenarios by assuming the availability of high quality
images and fine-grained appearance at an almost constant image scale,
reminiscent of images required for Automatic Number Plate Recognition, e.g.
VeRi-776. Such assumptions are often invalid in realistic vehicle re-id
scenarios where arbitrarily changing image resolutions (scales) are the norm.
This makes the existing vehicle re-id benchmarks limited for testing the true
performance of a re-id method. In this work, we introduce a more realistic and
challenging vehicle re-id benchmark, called Vehicle Re-Identification in
Context (VRIC). In contrast to existing datasets, VRIC is uniquely
characterised by vehicle images subject to more realistic and unconstrained
variations in resolution (scale), motion blur, illumination, occlusion, and
viewpoint. It contains 60,430 images of 5,622 vehicle identities captured by 60
different cameras at heterogeneous road traffic scenes in both day-time and
night-time.
Comment: Dataset available at: http://qmul-vric.github.io. To appear at the German
Conference on Pattern Recognition (GCPR), 2018
Temporal Action Detection by Joint Identification-Verification
Temporal action detection aims at not only recognizing action category but
also detecting start time and end time for each action instance in an untrimmed
video. The key challenge of this task is to accurately classify the action and
determine the temporal boundaries of each action instance. On the temporal action
detection benchmark THUMOS 2014, large variations exist within the same action
category while many similarities exist across different action categories, which
limits the performance of temporal action detection. To address this
problem, we propose to use joint Identification-Verification network to reduce
the intra-action variations and enlarge inter-action differences. The joint
Identification-Verification network is a siamese network based on 3D ConvNets,
which can simultaneously predict the action categories and the similarity
scores for the input pairs of video proposal segments. Extensive experimental
results on the challenging THUMOS 2014 dataset demonstrate the effectiveness of
our proposed method compared to the existing state-of-the-art methods for temporal
action detection in untrimmed videos.
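A minimal sketch of such a joint objective, combining a cross-entropy identification term per proposal segment with a contrastive verification term on the pair, is given below; the margin, term weighting, and feature shapes are illustrative assumptions and not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def joint_id_verif_loss(logits_a, logits_b, labels_a, labels_b,
                        feat_a, feat_b, margin=1.0, w_verif=0.5):
    # Identification: classify the action category of each proposal segment.
    id_loss = F.cross_entropy(logits_a, labels_a) + F.cross_entropy(logits_b, labels_b)
    # Verification: pull same-class segment features together, push
    # different-class features apart by at least the margin.
    dist = F.pairwise_distance(feat_a, feat_b)
    same = (labels_a == labels_b).float()
    verif_loss = (same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)).mean()
    return id_loss + w_verif * verif_loss

# Toy usage: a batch of 4 proposal pairs, 20 action classes, 512-D features.
logits_a, logits_b = torch.randn(4, 20), torch.randn(4, 20)
labels_a, labels_b = torch.randint(0, 20, (4,)), torch.randint(0, 20, (4,))
feat_a, feat_b = torch.randn(4, 512), torch.randn(4, 512)
print(joint_id_verif_loss(logits_a, logits_b, labels_a, labels_b, feat_a, feat_b))
```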
A Richly Annotated Dataset for Pedestrian Attribute Recognition
In this paper, we aim to improve the dataset foundation for pedestrian
attribute recognition in real surveillance scenarios. Recognition of human
attributes, such as gender, and clothes types, has great prospects in real
applications. However, the development of suitable benchmark datasets for
attribute recognition lags behind. Existing human attribute datasets are
collected from various sources or by integrating pedestrian re-identification
datasets. Such heterogeneous collection poses a big challenge for developing
high-quality, fine-grained attribute recognition algorithms. Furthermore,
human attribute recognition is generally severely affected by environmental
or contextual factors, such as viewpoints, occlusions, and body parts, while
existing attribute datasets barely account for them. To tackle
these problems, we build a Richly Annotated Pedestrian (RAP) dataset from real
multi-camera surveillance scenarios with long term collection, where data
samples are annotated with not only fine-grained human attributes but also
environmental and contextual factors. RAP has in total 41,585 pedestrian
samples, each of which is annotated with 72 attributes as well as viewpoint,
occlusion, and body part information. To our knowledge, the RAP dataset is the
largest pedestrian attribute dataset, which is expected to greatly promote the
study of large-scale attribute recognition systems. Furthermore, we empirically
analyze the effects of different environmental and contextual factors on
pedestrian attribute recognition. Experimental results demonstrate that
viewpoint, occlusion, and body part information can substantially assist
attribute recognition in real applications.
Comment: 16 pages, 8 figures