2,751 research outputs found
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in computer vision
community due to it plays an important role in video surveillance. Many
algorithms has been proposed to handle this task. The goal of this paper is to
review existing works using traditional methods or based on deep learning
networks. Firstly, we introduce the background of pedestrian attributes
recognition (PAR, for short), including the fundamental concepts of pedestrian
attributes and corresponding challenges. Secondly, we introduce existing
benchmarks, including popular datasets and evaluation criterion. Thirdly, we
analyse the concept of multi-task learning and multi-label learning, and also
explain the relations between these two learning algorithms and pedestrian
attribute recognition. We also review some popular network architectures which
have widely applied in the deep learning community. Fourthly, we analyse
popular solutions for this task, such as attributes group, part-based,
\emph{etc}. Fifthly, we shown some applications which takes pedestrian
attributes into consideration and achieve better performance. Finally, we
summarized this paper and give several possible research directions for
pedestrian attributes recognition. The project page of this paper can be found
from the following website:
\url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey:
https://sites.google.com/view/ahu-pedestrianattributes
Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in the video. Several attempts have
been made to capture frame-level salient aspects through attention but they
lack the capacity to effectively map the temporal relationships in between the
frames as they only capture a limited span of temporal dependencies. To this
end we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including MERL Shopping, 50 salads, and Georgia Tech
Egocentric datasets, and achieves state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings which are inclusive of
static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings.Comment: Published in Pattern Recognition Letter
Vehicle-Rear: A New Dataset to Explore Feature Fusion for Vehicle Identification Using Convolutional Neural Networks
This work addresses the problem of vehicle identification through
non-overlapping cameras. As our main contribution, we introduce a novel dataset
for vehicle identification, called Vehicle-Rear, that contains more than three
hours of high-resolution videos, with accurate information about the make,
model, color and year of nearly 3,000 vehicles, in addition to the position and
identification of their license plates. To explore our dataset we design a
two-stream CNN that simultaneously uses two of the most distinctive and
persistent features available: the vehicle's appearance and its license plate.
This is an attempt to tackle a major problem: false alarms caused by vehicles
with similar designs or by very close license plate identifiers. In the first
network stream, shape similarities are identified by a Siamese CNN that uses a
pair of low-resolution vehicle patches recorded by two different cameras. In
the second stream, we use a CNN for OCR to extract textual information,
confidence scores, and string similarities from a pair of high-resolution
license plate patches. Then, features from both streams are merged by a
sequence of fully connected layers for decision. In our experiments, we
compared the two-stream network against several well-known CNN architectures
using single or multiple vehicle features. The architectures, trained models,
and dataset are publicly available at https://github.com/icarofua/vehicle-rear
Tracking by Prediction: A Deep Generative Model for Mutli-Person localisation and Tracking
Current multi-person localisation and tracking systems have an over reliance
on the use of appearance models for target re-identification and almost no
approaches employ a complete deep learning solution for both objectives. We
present a novel, complete deep learning framework for multi-person localisation
and tracking. In this context we first introduce a light weight sequential
Generative Adversarial Network architecture for person localisation, which
overcomes issues related to occlusions and noisy detections, typically found in
a multi person environment. In the proposed tracking framework we build upon
recent advances in pedestrian trajectory prediction approaches and propose a
novel data association scheme based on predicted trajectories. This removes the
need for computationally expensive person re-identification systems based on
appearance features and generates human like trajectories with minimal
fragmentation. The proposed method is evaluated on multiple public benchmarks
including both static and dynamic cameras and is capable of generating
outstanding performance, especially among other recently proposed deep neural
network based approaches.Comment: To appear in IEEE Winter Conference on Applications of Computer
Vision (WACV), 201
Real-time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification
Online multi-object tracking is a fundamental problem in time-critical video
analysis applications. A major challenge in the popular tracking-by-detection
framework is how to associate unreliable detection results with existing
tracks. In this paper, we propose to handle unreliable detection by collecting
candidates from outputs of both detection and tracking. The intuition behind
generating redundant candidates is that detection and tracks can complement
each other in different scenarios. Detection results of high confidence prevent
tracking drifts in the long term, and predictions of tracks can handle noisy
detection caused by occlusion. In order to apply optimal selection from a
considerable amount of candidates in real-time, we present a novel scoring
function based on a fully convolutional neural network, that shares most
computations on the entire image. Moreover, we adopt a deeply learned
appearance representation, which is trained on large-scale person
re-identification datasets, to improve the identification ability of our
tracker. Extensive experiments show that our tracker achieves real-time and
state-of-the-art performance on a widely used people tracking benchmark.Comment: ICME 201
Person Search with Natural Language Description
Searching persons in large-scale image databases with the query of natural
language description has important applications in video surveillance. Existing
methods mainly focused on searching persons with image-based or attribute-based
queries, which have major limitations for a practical usage. In this paper, we
study the problem of person search with natural language description. Given the
textual description of a person, the algorithm of the person search is required
to rank all the samples in the person database then retrieve the most relevant
sample corresponding to the queried description. Since there is no person
dataset or benchmark with textual description available, we collect a
large-scale person description dataset with detailed natural language
annotations and person samples from various sources, termed as CUHK Person
Description Dataset (CUHK-PEDES). A wide range of possible models and baselines
have been evaluated and compared on the person search benchmark. An Recurrent
Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to
establish the state-of-the art performance on person search
- …