cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
ReadNet: Towards Accurate ReID with Limited and Noisy Samples
Person re-identification (ReID) is an essential cross-camera retrieval task
to identify pedestrians. However, the number of photos available for each
pedestrian usually differs drastically, so data limitation and imbalance
greatly hinder prediction accuracy. Additionally, in real-world applications,
pedestrian images are captured by different surveillance cameras, and the noisy
camera-related information, such as lighting, perspective, and resolution,
results in inevitable domain gaps for ReID algorithms. These challenges are
difficult for current deep learning methods based on the triplet loss to cope
with. To address them, this paper proposes ReadNet, an adversarial camera
network (ACN) with an angular triplet loss (ATL). In detail, ATL focuses on
learning the angular distance among different identities to mitigate the effect
of data imbalance while also guaranteeing a linear decision boundary, whereas
ACN pits a camera discriminator against the feature extractor to filter out
camera-related information and bridge the multi-camera gaps. ReadNet is
designed to be flexible, so either ATL or ACN can be deployed independently or
simultaneously. Experimental results on various benchmark datasets show that
ReadNet delivers better prediction performance than current state-of-the-art
methods.
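The abstract does not spell out the exact ATL formulation, so the following is only a minimal sketch of one plausible reading: a standard triplet margin applied to angular (arccos-of-cosine) distances between L2-normalized embeddings. The function name and margin value are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def angular_triplet_loss(anchor, positive, negative, margin=0.1):
    # Triplet loss on angular distance: one plausible reading of an
    # "angular triplet loss", not necessarily the paper's exact ATL.
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    # Clamp cosine similarities so arccos stays numerically stable at +/-1.
    ang_ap = torch.acos((a * p).sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7))
    ang_an = torch.acos((a * n).sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7))
    # Pull same-identity pairs to a smaller angle than cross-identity pairs.
    return F.relu(ang_ap - ang_an + margin).mean()

# Toy usage with random 128-d embeddings.
loss = angular_triplet_loss(torch.randn(32, 128), torch.randn(32, 128),
                            torch.randn(32, 128))
print(loss.item())
```

Because the loss depends only on angles between unit vectors, the margin constrains directions rather than magnitudes, which is one way to read the claimed linear decision boundary.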
Person Re-identification Using Visual Attention
Despite recent attempts to solve the person re-identification problem, it
remains a challenging task, since a person's appearance can vary significantly
when large variations in view angle, human pose, and illumination are involved.
In this paper, we propose a novel approach based on a gradient-based
attention mechanism in a deep convolutional neural network for solving the
person re-identification problem. Our model learns to focus selectively on the
parts of the input image to which the network's output is most sensitive and
processes them at high resolution while perceiving the surrounding image at
low resolution. Extensive comparative evaluations demonstrate that the proposed
method outperforms state-of-the-art approaches on the challenging CUHK01,
CUHK03, and Market-1501 datasets.
Comment: Published at IEEE International Conference on Image Processing 201
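As a rough illustration of such a gradient-based attention mechanism, the sketch below locates the image region to which a network's output is most sensitive by backpropagating to the input; the function name, the window scoring via average pooling, and the crop size are my simplifications, not the paper's architecture.

```python
import torch

def most_sensitive_crop(model, image, crop=64):
    # Gradient-based attention sketch: score each crop-sized window by the
    # accumulated input-gradient magnitude and return the top window.
    image = image.clone().requires_grad_(True)      # (1, C, H, W)
    model(image).norm().backward()                  # any differentiable output
    saliency = image.grad.abs().sum(dim=1, keepdim=True)
    # Average pooling sums saliency over every crop-sized window (up to scale).
    scores = torch.nn.functional.avg_pool2d(saliency, crop, stride=1)
    idx = int(scores.flatten().argmax())
    w = scores.shape[-1]
    y, x = idx // w, idx % w
    return image[..., y:y + crop, x:x + crop].detach()

# Toy usage with a random conv layer standing in for the network.
patch = most_sensitive_crop(torch.nn.Conv2d(3, 8, 3), torch.randn(1, 3, 96, 96))
print(patch.shape)  # torch.Size([1, 3, 64, 64])
```

The selected patch could then be processed at high resolution alongside a downsampled view of the full image, as the abstract describes.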
GLAD: Global-Local-Alignment Descriptor for Pedestrian Retrieval
The huge variance of human pose and the misalignment of detected human images
significantly increase the difficulty of person Re-Identification (Re-ID).
Moreover, efficient Re-ID systems are required to cope with the massive visual
data being produced by video surveillance systems. To solve these problems,
this work proposes a Global-Local-Alignment Descriptor (GLAD) and an efficient
indexing and retrieval framework, respectively. GLAD explicitly leverages
local and global cues in the human body to generate a discriminative and
robust representation. It consists of part-extraction and descriptor-learning
modules, where several part regions are first detected and deep neural
networks are then designed for representation learning on both the local and
global regions. A hierarchical indexing and retrieval framework is designed to
eliminate the huge redundancy in the gallery set and accelerate the online
Re-ID procedure. Extensive experimental results show that GLAD achieves
competitive accuracy compared to state-of-the-art methods, and our retrieval
framework significantly accelerates the online Re-ID procedure without loss of
accuracy. Therefore, this work has the potential to perform better on person
Re-ID tasks in real scenarios.
Comment: Accepted by ACM MM2017, 9 pages, 5 figures
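The abstract does not detail the hierarchical index, so here is only a generic coarse-to-fine retrieval sketch of the idea: cluster the gallery offline, probe the few nearest clusters for each query, and rank only those candidates. The function names and parameters (n_probe, top_k) are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(gallery, n_clusters=50):
    # Offline: cluster gallery descriptors into coarse buckets.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(gallery)

def retrieve(query, gallery, index, n_probe=3, top_k=10):
    # Online: compare against centroids first, then rank only the
    # candidates inside the n_probe nearest clusters.
    centroid_d = np.linalg.norm(index.cluster_centers_ - query, axis=1)
    probe = np.argsort(centroid_d)[:n_probe]
    cand = np.where(np.isin(index.labels_, probe))[0]
    d = np.linalg.norm(gallery[cand] - query, axis=1)
    return cand[np.argsort(d)[:top_k]]  # gallery indices, best first

# Toy usage: 10,000 gallery descriptors of dimension 256.
gallery = np.random.randn(10000, 256).astype(np.float32)
index = build_index(gallery)
print(retrieve(np.random.randn(256).astype(np.float32), gallery, index))
```

The online cost drops from scanning the whole gallery to scanning a few cluster centroids plus the probed candidates, which is the redundancy-elimination effect the abstract claims.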
Towards Storytelling from Visual Lifelogging: An Overview
Visual lifelogging consists of acquiring images that capture the daily
experiences of the user by wearing a camera over a long period of time. The
pictures taken offer considerable potential for knowledge mining concerning how
people live their lives, hence, they open up new opportunities for many
potential applications in fields including healthcare, security, leisure and
the quantified self. However, automatically building a story from a huge
collection of unstructured egocentric data presents major challenges. This
paper provides a thorough review of advances made so far in egocentric data
analysis, and in view of the current state of the art, indicates new lines of
research to move us towards storytelling from visual lifelogging.
Comment: 16 pages, 11 figures, submitted to IEEE Transactions on Human-Machine Systems
Towards Automatic Image Editing: Learning to See another You
Learning the distribution of images in order to generate new samples is a
challenging task due to the high dimensionality of the data and the highly
non-linear relations that are involved. Nevertheless, some promising results
have been reported in the literature recently, building on deep network
architectures. In this work, we zoom in on a specific type of image generation:
given an image and knowing the category of objects it belongs to (e.g. faces),
our goal is to generate a similar and plausible image, but with some altered
attributes. This is particularly challenging, as the model needs to learn to
disentangle the effect of each attribute and to apply a desired attribute
change to a given input image, while keeping the other attributes and overall
object appearance intact. To this end, we learn a convolutional network, where
the desired attribute information is encoded and then merged with the encoded
image at feature-map level. We show promising results, both qualitatively and
quantitatively, in the context of a retrieval experiment on two face datasets
(Multi-PIE and CAS-PEAL-R1).
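The sketch below shows one minimal version of merging attribute information with the encoded image at feature-map level, as the abstract describes: the attribute vector is tiled over the spatial grid and concatenated with the encoder's feature maps before decoding. All layer sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AttributeEditor(nn.Module):
    # Minimal encoder/decoder where an attribute code is broadcast and
    # concatenated with image features at feature-map level (a sketch of
    # the described fusion, with made-up layer sizes).
    def __init__(self, n_attrs=10):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128 + n_attrs, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, img, attrs):
        f = self.enc(img)                                   # (B, 128, H/4, W/4)
        a = attrs[:, :, None, None].expand(-1, -1, *f.shape[2:])
        return self.dec(torch.cat([f, a], dim=1))           # fused decode

# Toy usage: edit two 64x64 images with 10 attribute values each.
out = AttributeEditor()(torch.randn(2, 3, 64, 64), torch.rand(2, 10))
print(out.shape)  # torch.Size([2, 3, 64, 64])
```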
A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking
Person re-identification is a challenging retrieval task that requires
matching a person's acquired image across non-overlapping camera views. In this
paper we propose an effective approach that incorporates both the fine and
coarse pose information of the person to learn a discriminative embedding. In
contrast to the recent direction of explicitly modeling body parts or
correcting for misalignment based on these, we show that a rather
straightforward inclusion of acquired camera view and/or the detected joint
locations into a convolutional neural network helps to learn a very effective
representation. To increase retrieval performance, re-ranking techniques based
on computed distances have recently gained much attention. We propose a new
unsupervised and automatic re-ranking framework that achieves state-of-the-art
re-ranking performance. We show that, in contrast to the current
state-of-the-art re-ranking methods, our approach does not require computing
new rank lists for each image pair (e.g., based on reciprocal neighbors) and
performs well using a simple direct rank-list comparison, or even just the
already computed Euclidean distances between the images. We show that
both our learned representation and our re-ranking method achieve
state-of-the-art performance on a number of challenging surveillance image and
video datasets.
The code is available online at:
https://github.com/pse-ecn/pose-sensitive-embedding
Comment: CVPR 2018: v2 (fixes, added new results on PRW dataset)
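For a sense of how re-ranking can work from already computed distances alone, here is a simplified sketch in the spirit of the expanded cross neighborhood idea; the neighbor-expansion depth, the parameters t and q, and the aggregation are my simplifications, not the paper's exact definition.

```python
import numpy as np

def ecn_rerank(dist, t=3, q=8):
    # Re-rank an all-pairs distance matrix using expanded neighborhoods:
    # the new distance between i and j averages the original distances
    # from each image to the other's expanded neighbor set. Assumes a
    # square matrix with zero diagonal (index 0 of each row is the image
    # itself, hence the 1: offsets).
    n = dist.shape[0]
    order = np.argsort(dist, axis=1)
    expanded = [np.unique(np.concatenate(
        [order[i, 1:t + 1]] +
        [order[j, 1:q + 1] for j in order[i, 1:t + 1]]))
        for i in range(n)]
    new = np.zeros_like(dist)
    for i in range(n):
        for j in range(n):
            new[i, j] = dist[expanded[i], j].mean() + dist[i, expanded[j]].mean()
    return new

# Toy usage on an all-pairs Euclidean distance matrix of 50 embeddings.
x = np.random.randn(50, 16)
d = np.linalg.norm(x[:, None] - x[None, :], axis=2)
print(ecn_rerank(d).shape)  # (50, 50)
```

Note how no new rank list is built per image pair: everything is derived from the single precomputed matrix, matching the abstract's claim.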
Review of Person Re-identification Techniques
Person re-identification across different surveillance cameras with disjoint
fields of view has become one of the most interesting and challenging subjects
in the area of intelligent video surveillance. Although several methods have
been developed and proposed, certain limitations and unresolved issues remain.
In all of the existing re-identification approaches, feature vectors are
extracted from segmented still images or video frames. Different similarity or
dissimilarity measures have been applied to these vectors. Some methods have
used simple constant metrics, whereas others have utilised models to obtain
optimised metrics. Some have created models based on local colour or texture
information, and others have built models based on the gait of people. In
general, the main objective of all these approaches is to achieve a higher
accuracy rate and lower computational costs. This study summarises several
developments in the recent literature and discusses the various available
methods used in person re-identification. Specifically, their advantages and
disadvantages are mentioned and compared.
Comment: Published 201
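To make the constant-versus-optimised metric distinction concrete, here is a toy contrast between plain Euclidean distance and a Mahalanobis distance whose matrix is estimated from data; real ReID metric learners such as KISSME or XQDA fit that matrix from labeled pairs instead of the simple inverse covariance used here.

```python
import numpy as np

def euclidean(x, y):
    # Constant metric: no parameters to learn.
    return float(np.linalg.norm(x - y))

def mahalanobis(x, y, M):
    # Optimised metric: d(x, y) = sqrt((x - y)^T M (x - y)) with a
    # data-dependent positive semi-definite matrix M.
    d = x - y
    return float(np.sqrt(d @ M @ d))

feats = np.random.randn(500, 16)
M = np.linalg.inv(np.cov(feats, rowvar=False))  # crude data-driven M
print(euclidean(feats[0], feats[1]), mahalanobis(feats[0], feats[1], M))
```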
Factorized Distillation: Training Holistic Person Re-identification Model by Distilling an Ensemble of Partial ReID Models
Person re-identification (ReID) is aimed at identifying the same person
across videos captured from different cameras. Because networks that extract
global features with ordinary architectures struggle to capture local features
due to their weak attention mechanisms, researchers have proposed many
elaborately designed ReID networks; while these greatly improve accuracy,
their model size and feature-extraction latency also soar. We argue that a
relatively compact ordinary network extracting
globally pooled features has the capability to extract discriminative local
features and can achieve state-of-the-art precision if only the model's
parameters are properly learnt. In order to reduce the difficulty in learning
hard identity labels, we propose a novel knowledge distillation method:
Factorized Distillation, which factorizes both feature maps and retrieval
features of the holistic ReID network to mimic representations of multiple partial
ReID models, thus transferring the knowledge from partial ReID models to the
holistic network. Experiments show that a model trained with the proposed
method can outperform the state of the art with relatively few network
parameters.
Comment: 10 pages, 5 figures
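The abstract does not specify how the factorization is implemented, so the following is only one plausible reading: split the holistic student's feature map into stripes, one per partial-ReID teacher, and penalize the gap between each stripe and that teacher's map.

```python
import torch
import torch.nn.functional as F

def factorized_distillation_loss(student_fmap, partial_fmaps):
    # Hypothetical factorization: chunk the student's (B, C, H, W) map
    # along the height axis into one stripe per partial teacher and match
    # each stripe to the corresponding teacher map with an L2 penalty.
    stripes = torch.chunk(student_fmap, len(partial_fmaps), dim=2)
    return sum(F.mse_loss(s, t) for s, t in zip(stripes, partial_fmaps))

# Toy usage: distill a holistic map from three partial teachers.
student = torch.randn(4, 256, 24, 8)
teachers = [torch.randn(4, 256, 8, 8) for _ in range(3)]
print(factorized_distillation_loss(student, teachers).item())
```

In training, this distillation term would be added to the usual identity-classification loss so the compact holistic network absorbs the partial models' local knowledge.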
A Dual-Source Approach for 3D Human Pose Estimation from a Single Image
In this work we address the challenging problem of 3D human pose estimation
from single images. Recent approaches learn deep neural networks to regress 3D
pose directly from images. One major challenge for such methods, however, is
the collection of training data. Specifically, collecting large amounts of
training data containing unconstrained images annotated with accurate 3D poses
is infeasible. We therefore propose to use two independent training sources.
The first source consists of accurate 3D motion capture data, and the second
source consists of unconstrained images with annotated 2D poses. To integrate
both sources, we propose a dual-source approach that combines 2D pose
estimation with efficient 3D pose retrieval. To this end, we first convert the
motion capture data into a normalized 2D pose space, and separately learn a 2D
pose estimation model from the image data. During inference, we estimate the 2D
pose and efficiently retrieve the nearest 3D poses. We then jointly estimate a
mapping from the 3D pose space to the image and reconstruct the 3D pose. We
provide a comprehensive evaluation of the proposed method and experimentally
demonstrate the effectiveness of our approach, even when the skeleton
structures of the two sources differ substantially.
Comment: Under consideration at Computer Vision and Image Understanding. Extended version of CVPR 2016 paper, arXiv:1509.0672
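The retrieval step lends itself to a compact sketch. Below, each mocap pose is projected to 2D by simply dropping depth (the paper searches over projection directions, which this simplification omits), normalized into a shared 2D pose space, and matched to the estimated 2D pose by nearest-neighbor search; all names are illustrative.

```python
import numpy as np

def normalize_2d(pose2d):
    # Map a (J, 2) pose into a normalized space: root-center and rescale.
    p = pose2d - pose2d.mean(axis=0)
    return p / (np.linalg.norm(p) + 1e-8)

def retrieve_3d(pose2d, mocap3d, k=5):
    # Retrieve the k mocap poses whose (simplified, orthographic) 2D
    # projections are closest to the estimated 2D pose.
    query = normalize_2d(pose2d).ravel()
    proj = np.stack([normalize_2d(m[:, :2]).ravel() for m in mocap3d])
    nearest = np.argsort(np.linalg.norm(proj - query, axis=1))[:k]
    return mocap3d[nearest]

# Toy usage: 14-joint poses, 1,000 mocap entries.
mocap = np.random.randn(1000, 14, 3)
print(retrieve_3d(np.random.randn(14, 2), mocap, k=3).shape)  # (3, 14, 3)
```

The retrieved candidates would then anchor the joint estimation of the camera mapping and the final 3D pose described in the abstract.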