cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV
Tracking Persons-of-Interest via Unsupervised Representation Adaptation
Multi-face tracking in unconstrained videos is a challenging problem as faces
of one person often appear drastically different in multiple shots due to
significant variations in scale, pose, expression, illumination, and make-up.
Existing multi-target tracking methods often use low-level features which are
not sufficiently discriminative for identifying faces with such large
appearance variations. In this paper, we tackle this problem by learning
discriminative, video-specific face representations using convolutional neural
networks (CNNs). Unlike existing CNN-based approaches which are only trained on
large-scale face image datasets offline, we use the contextual constraints to
generate a large number of training samples for a given video, and further
adapt the pre-trained face CNN to specific videos using discovered training
samples. Using these training samples, we optimize the embedding space so that
the Euclidean distances correspond to a measure of semantic face similarity via
minimizing a triplet loss function. With the learned discriminative features,
we apply the hierarchical clustering algorithm to link tracklets across
multiple shots to generate trajectories. We extensively evaluate the proposed
algorithm on two sets of TV sitcoms and YouTube music videos, analyze the
contribution of each component, and demonstrate significant performance
improvement over existing techniques.
Comment: Project page: http://vllab1.ucmerced.edu/~szhang/FaceTracking
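The adaptation step above optimizes the embedding so that Euclidean distances track face similarity by minimizing a triplet loss. A minimal sketch of that loss (the function name, toy embeddings, and margin value are illustrative, not taken from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the anchor toward the positive
    sample and push it away from the negative by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the positive is already much closer to the anchor
# than the negative, so the margin is satisfied and the loss is zero.
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.1])   # same identity
n = np.array([-1.0, 0.0])  # different identity
print(triplet_loss(a, p, n))  # 0.0
```

During adaptation, gradients of this loss with respect to the CNN parameters reshape the embedding so that same-identity tracklet pairs move closer than cross-identity pairs.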
CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification
Urban traffic optimization using traffic cameras as sensors is driving the
need to advance state-of-the-art multi-target multi-camera (MTMC) tracking.
This work introduces CityFlow, a city-scale traffic camera dataset consisting
of more than 3 hours of synchronized HD videos from 40 cameras across 10
intersections, with the longest distance between two simultaneous cameras being
2.5 km. To the best of our knowledge, CityFlow is the largest-scale dataset in
terms of spatial coverage and the number of cameras/videos in an urban
environment. The dataset contains more than 200K annotated bounding boxes
covering a wide range of scenes, viewing angles, vehicle models, and urban
traffic flow conditions. Camera geometry and calibration information are
provided to aid spatio-temporal analysis. In addition, a subset of the
benchmark is made available for the task of image-based vehicle
re-identification (ReID). We conducted an extensive experimental evaluation of
baselines/state-of-the-art approaches in MTMC tracking, multi-target
single-camera (MTSC) tracking, object detection, and image-based ReID on this
dataset, analyzing the impact of different network architectures, loss
functions, spatio-temporal models and their combinations on task effectiveness.
An evaluation server is launched with the release of our benchmark at the 2019
AI City Challenge (https://www.aicitychallenge.org/) that allows researchers to
compare the performance of their newest techniques. We expect this dataset to
catalyze research in this field, propel the state-of-the-art forward, and lead
to deployed traffic optimization in the real world.
Comment: Accepted for oral presentation at CVPR 2019 with review ratings of 2
strong accepts and 1 accept (work done during an internship at NVIDIA)
Fast Spatio-Temporal Residual Network for Video Super-Resolution
Recently, deep learning based video super-resolution (SR) methods have
achieved promising performance. To simultaneously exploit the spatial and
temporal information of videos, employing 3-dimensional (3D) convolutions is a
natural approach. However, directly employing 3D convolutions can lead to
excessively high computational complexity, which restricts the depth of video
SR models and thus undermines performance. In this paper, we present a novel
fast spatio-temporal residual network (FSTRN) to adopt 3D convolutions for the
video SR task in order to enhance the performance while maintaining a low
computational load. Specifically, we propose a fast spatio-temporal residual
block (FRB) that divides each 3D filter into the product of two 3D filters of
considerably lower dimension. Furthermore, we design a cross-space
residual learning that directly links the low-resolution space and the
high-resolution space, which can greatly relieve the computational burden on
the feature fusion and up-scaling parts. Extensive evaluations and comparisons
on benchmark datasets validate the strengths of the proposed approach and
demonstrate that the proposed network significantly outperforms the current
state-of-the-art methods.
Comment: To appear in CVPR 201
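The savings from the FRB factorization can be seen by counting weights. The abstract does not spell out the exact split, but a common choice is a 1×k×k spatial filter followed by a k×1×1 temporal filter; under that assumption, a quick parameter count shows the reduction:

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a single 3D convolution layer (bias ignored)."""
    return c_in * c_out * kt * kh * kw

c = 64  # channels in and out, an arbitrary example width
full = conv3d_params(c, c, 3, 3, 3)                              # one 3x3x3 filter bank
factored = conv3d_params(c, c, 1, 3, 3) + conv3d_params(c, c, 3, 1, 1)
print(full, factored)        # 110592 49152
print(full / factored)       # 2.25x fewer weights
```

The same ratio (27 vs. 9 + 3 weights per filter position) holds for any channel width, which is why the factorization lets the network go deeper at the same cost.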
Person Re-identification in Appearance Impaired Scenarios
Person re-identification is critical in surveillance applications. Current
approaches rely on appearance based features extracted from a single or
multiple shots of the target and candidate matches. These approaches are at a
disadvantage when trying to distinguish between candidates dressed in similar
colors or when targets change their clothing. In this paper we propose a
dynamics-based feature to overcome this limitation. The main idea is to capture
soft biometrics from gait and motion patterns by gathering dense short
trajectories (tracklets) which are Fisher vector encoded. To illustrate the
merits of the proposed features we introduce three new "appearance-impaired"
datasets. Our experiments on the original and the appearance impaired datasets
demonstrate the benefits of incorporating dynamics-based information alongside
appearance-based information in re-identification algorithms.
Comment: 10 pages
Distance-based Camera Network Topology Inference for Person Re-identification
In this paper, we propose a novel distance-based camera network topology
inference method for efficient person re-identification. To this end, we first
calibrate each camera and estimate relative scales between cameras. Using the
calibration results of multiple cameras, we calculate the speed of each person
and infer the distance between cameras to generate distance-based camera
network topology. The proposed distance-based topology can be applied
adaptively to each person according to their speed, and handles the diverse
transition times of people between non-overlapping cameras. To validate the
proposed method, we tested it on an open person re-identification dataset and
compared it to state-of-the-art methods. The experimental results show that
the proposed method is effective for person re-identification in large-scale
camera networks with varied person transition times.
Comment: 10 pages, 11 figures
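The core of the inference above reduces to distance ≈ speed × transition time, aggregated over many observed transitions. A toy sketch with made-up observations (the function name and data are hypothetical, not from the paper):

```python
from statistics import median

def infer_camera_distance(observations):
    """Each observation is (walking_speed_m_per_s, transition_time_s) for
    one person who left camera A and later entered camera B. The distance
    estimate is the median of the speed * time products, which tolerates
    a few outlier transitions (e.g. people who stopped along the way)."""
    return median(speed * t for speed, t in observations)

obs = [(1.4, 30.0), (1.2, 36.0), (1.5, 28.0), (1.3, 120.0)]  # last person loitered
print(infer_camera_distance(obs))  # ~42.6 metres
```

Once a distance is known, each person's own speed yields an adaptive expected transition time, which narrows the candidate set for re-identification in the other camera.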
A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets
Person re-identification (re-id) is a critical problem in video analytics
applications such as security and surveillance. The public release of several
datasets and code for vision algorithms has facilitated rapid progress in this
area over the last few years. However, directly comparing re-id algorithms
reported in the literature has become difficult since a wide variety of
features, experimental protocols, and evaluation metrics are employed. In order
to address this need, we present an extensive review and performance evaluation
of single- and multi-shot re-id algorithms. The experimental protocol
incorporates the most recent advances in both feature extraction and metric
learning. To ensure a fair comparison, all of the approaches were implemented
using a unified code library that includes 11 feature extraction algorithms and
22 metric learning and ranking techniques. All approaches were evaluated using
a new large-scale dataset that closely mimics a real-world problem setting, in
addition to 16 other publicly available datasets: VIPeR, GRID, CAVIAR,
DukeMTMC4ReID, 3DPeS, PRID, V47, WARD, SAIVT-SoftBio, CUHK01, CUHK02, CUHK03,
RAiD, iLIDS-VID, HDA+ and Market1501. The evaluation codebase and results will
be made publicly available for community use.
Comment: Preliminary work on person Re-Id benchmark. S. Karanam and M. Gou
contributed equally. 14 pages, 6 figures, 4 tables. For supplementary
material, see
http://robustsystems.coe.neu.edu/sites/robustsystems.coe.neu.edu/files/systems/supmat/ReID_benchmark_supp.zi
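Benchmarks like this typically rank re-id methods by the cumulative matching characteristic (CMC): the fraction of queries whose true identity appears among the top-k ranked gallery candidates. A minimal sketch of that metric (not the benchmark's actual evaluation code):

```python
def cmc_at_k(ranked_gallery_ids, query_ids, k):
    """Fraction of queries whose correct identity appears among the
    top-k gallery identities returned for that query."""
    hits = sum(1 for ranks, q in zip(ranked_gallery_ids, query_ids)
               if q in ranks[:k])
    return hits / len(query_ids)

# Three queries; each row is one query's gallery ranking by identity.
rankings = [[5, 2, 9], [7, 5, 1], [3, 8, 5]]
queries = [5, 1, 5]
print(cmc_at_k(rankings, queries, 1))  # only the first query matches at rank 1
print(cmc_at_k(rankings, queries, 3))  # 1.0: every match appears in the top 3
```

Sweeping k from 1 to the gallery size traces the full CMC curve, which is how single-shot and multi-shot algorithms are usually compared across the datasets listed above.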
Action Machine: Rethinking Action Recognition in Trimmed Videos
Existing methods in video action recognition mostly do not distinguish the
human body from the environment and easily overfit to scenes and objects. In this
work, we present a conceptually simple, general and high-performance framework
for action recognition in trimmed videos, aiming at person-centric modeling.
The method, called Action Machine, takes as inputs the videos cropped by person
bounding boxes. It extends the Inflated 3D ConvNet (I3D) by adding a branch for
human pose estimation and a 2D CNN for pose-based action recognition, being
fast to train and test. Action Machine benefits from the multi-task training
of action recognition and pose estimation, as well as from the fusion of
predictions from RGB images and poses. On NTU RGB-D, Action Machine achieves
state-of-the-art
performance with top-1 accuracies of 97.2% and 94.3% on cross-view and
cross-subject respectively. Action Machine also achieves competitive
performance on another three smaller action recognition datasets: Northwestern
UCLA Multiview Action3D, MSR Daily Activity3D and UTD-MHAD. Code will be made
available.
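The fusion of RGB and pose predictions mentioned above can be as simple as late fusion: averaging per-class scores from the two streams before taking the argmax. A hypothetical sketch (the paper's actual fusion scheme may differ):

```python
import numpy as np

def fuse_predictions(rgb_scores, pose_scores, w=0.5):
    """Late fusion: weighted average of per-class scores from the RGB
    stream and the pose-based stream, then pick the top class."""
    fused = w * np.asarray(rgb_scores) + (1 - w) * np.asarray(pose_scores)
    return int(np.argmax(fused))

rgb = [0.2, 0.5, 0.3]   # RGB stream slightly favors class 1
pose = [0.1, 0.3, 0.6]  # pose stream strongly favors class 2
print(fuse_predictions(rgb, pose))  # 2 -- fused scores [0.15, 0.40, 0.45]
```

The weight w lets one stream dominate when it is known to be more reliable; equal weighting is the usual default.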
Wavelet Video Coding Algorithm Based on Energy Weighted Significance Probability Balancing Tree
This work presents a 3-D wavelet video coding algorithm. By analyzing the
contribution of each biorthogonal wavelet basis to reconstructed signal's
energy, we weight each wavelet subband according to its basis energy. Based on
distribution of weighted coefficients, we further discuss a 3-D wavelet tree
structure named \textbf{significance probability balancing tree}, which places
the coefficients with similar probabilities of being significant on the same
layer. It is implemented by using hybrid spatial orientation tree and
temporal-domain block tree. Subsequently, a novel 3-D wavelet video coding
algorithm is proposed based on the energy-weighted significance probability
balancing tree. Experimental results illustrate that our algorithm always
achieves good reconstruction quality for different classes of video sequences.
Compared with the asymmetric 3-D orientation tree, the average peak
signal-to-noise ratio (PSNR) gains of our algorithm are 1.24 dB, 2.54 dB and
2.57 dB for the luminance (Y) and chrominance (U, V) components, respectively.
Compared with the temporal-spatial orientation tree algorithm, our algorithm
gains 0.38 dB, 2.92 dB and 2.39 dB higher PSNR for the Y, U, and V components,
respectively. In addition, the proposed algorithm requires lower computational
cost than the above two algorithms.
Comment: 17 pages, 2 figures, submission to Multimedia Tools and Application
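The PSNR figures quoted above follow the standard definition PSNR = 10·log10(MAX² / MSE). A self-contained sketch for 8-bit frames (the toy frames are illustrative):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped frames."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((4, 4), 100, dtype=np.uint8)
rec = ref.copy()
rec[0, 0] = 110  # one pixel off by 10 -> MSE = 100/16 = 6.25
print(round(psnr(ref, rec), 2))  # 40.17
```

Because the scale is logarithmic, the 1–3 dB gains reported for the Y, U, and V components correspond to a substantial reduction in mean squared reconstruction error.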
Survey on RGB, 3D, Thermal, and Multimodal Approaches for Facial Expression Recognition: History, Trends, and Affect-related Applications
Facial expressions are an important way through which humans interact
socially. Building a system capable of automatically recognizing facial
expressions from images and video has been an intense field of study in recent
years. Interpreting such expressions remains challenging and much research is
needed about the way they relate to human affect. This paper presents a general
overview of automatic RGB, 3D, thermal and multimodal facial expression
analysis. We define a new taxonomy for the field, encompassing all steps from
face detection to facial expression recognition, and describe and classify the
state-of-the-art methods accordingly. We also present the important datasets
and the benchmarking of the most influential methods. We conclude with a
general discussion about trends, important questions and future lines of
research.