Discriminatively Learned Hierarchical Rank Pooling Networks
In this work, we present novel temporal encoding methods for action and
activity classification by extending the unsupervised rank pooling temporal
encoding method in two ways. First, we present "discriminative rank pooling" in
which the shared weights of our video representation and the parameters of the
action classifiers are estimated jointly for a given training dataset of
labelled vector sequences using a bilevel optimization formulation of the
learning problem. When the frame-level feature vectors are obtained from a
convolutional neural network (CNN), we rank pool the network activations and
jointly estimate all parameters of the model, including CNN filters and
fully-connected weights, in an end-to-end manner; we call this the "end-to-end
trainable rank pooled CNN". Importantly, this model can make use of any
existing convolutional neural network architecture (e.g., AlexNet or VGG)
without modification or introduction of additional parameters. Then, we extend
rank pooling to a high capacity video representation, called "hierarchical rank
pooling". Hierarchical rank pooling consists of a network of rank pooling
functions, which encode temporal semantics over arbitrarily long video clips
based on rich frame-level features. By stacking non-linear feature functions
and temporal sub-sequence encoders one on top of the other, we build a high
capacity encoding network of the dynamic behaviour of the video. The resulting
video representation is a fixed-length feature vector describing the entire
video clip that can be used as input to standard machine learning classifiers.
We demonstrate our approach on the task of action and activity recognition.
The results are comparable to state-of-the-art methods on three important
activity recognition benchmarks, with classification performance of 76.7% mAP on
Hollywood2, 69.4% on HMDB51, and 93.6% on UCF101.
Comment: International Journal of Computer Visio
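As a rough illustration of the idea behind rank pooling (encoding a sequence by the parameters of a function that orders its frames in time), the sketch below uses a closed-form approximation rather than the bilevel optimization described in the abstract. Taking a single gradient step of a pairwise ranking objective from zero gives a weighted sum of frame features with weights 2t - T - 1. The function names, the `window` and `stride` parameters, and the two-level stacking (which omits the non-linear feature functions) are assumptions made for illustration only.

```python
def approximate_rank_pool(frames):
    """Encode a sequence of feature vectors as one fixed-length vector.

    Assumption: uses fixed weights alpha_t = 2t - T - 1 (one gradient step of
    a pairwise ranking objective from zero), not a learned ranking function.
    """
    total = len(frames)
    pooled = [0.0] * len(frames[0])
    for t, frame in enumerate(frames, start=1):
        alpha = 2 * t - total - 1  # later frames receive larger weights
        for j, value in enumerate(frame):
            pooled[j] += alpha * value
    return pooled


def hierarchical_rank_pool(frames, window, stride):
    """Pool overlapping sub-sequences, then pool the resulting sequence.

    Hypothetical two-level stack; non-linearities between levels are omitted.
    """
    starts = range(0, len(frames) - window + 1, stride)
    clip_codes = [approximate_rank_pool(frames[s:s + window]) for s in starts]
    return approximate_rank_pool(clip_codes)


# The first feature grows over time, the second is constant.
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
print(approximate_rank_pool(frames))  # [4.0, 0.0]
```

In this toy input the constant feature is pooled to zero while the increasing feature gets a positive value, which is the sense in which the encoding captures temporal order rather than appearance alone.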
Deep-Person: Learning Discriminative Deep Features for Person Re-Identification
Many recent person re-identification (Re-ID) methods rely on part-based
feature representations to learn a discriminative pedestrian descriptor.
However, because each part is processed by an independent extractor, the spatial
context between the parts is ignored. In this paper, we propose to apply Long
Short-Term Memory (LSTM) in an end-to-end way to model the pedestrian, seen as
a sequence of body parts from head to foot. Integrating the contextual
information strengthens the discriminative ability of the local representation. We
also leverage the complementary information between local and global features.
Furthermore, we integrate both an identification task and a ranking task in one
network, where a discriminative embedding and a similarity measurement are
learned concurrently. This results in a novel three-branch framework named
Deep-Person, which learns highly discriminative features for person Re-ID.
Experimental results demonstrate that Deep-Person outperforms the
state-of-the-art methods by a large margin on three challenging datasets
including Market-1501, CUHK03, and DukeMTMC-reID. Specifically, when combined with
a re-ranking approach, we achieve 90.84% mAP on Market-1501 under the single
query setting.
Comment: Accepted to Pattern Recognition. The code is released:
https://github.com/zydou/Deep-Perso
Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings
Image-to-video person re-identification identifies a target person, given a probe
image, from large numbers of pedestrian videos captured by non-overlapping cameras.
Despite the great progress achieved, it is still challenging to match in the
multimodal scenario, i.e. between image and video. Currently, state-of-the-art
approaches mainly focus on task-specific data, neglecting the extra
information in different but related tasks. In this paper, we propose an
end-to-end neural network framework for image-to-video person re-identification
that leverages cross-modal embeddings learned from extra information. Concretely,
cross-modal embeddings from image captioning and video captioning
models are reused to help project the learned features into a coordinated
space, where similarity can be computed directly. In addition, training steps from
the fixed model reuse approach are integrated into our framework, which can
incorporate beneficial information and eventually make the target networks
independent of existing models. Furthermore, our proposed framework uses
CNNs and LSTMs to extract visual and spatiotemporal features, and
combines the strengths of identification and verification models to improve the
discriminative ability of the learned features. The experimental results
demonstrate the effectiveness of our framework in narrowing the gap
between heterogeneous data and obtaining an observable improvement in
image-to-video person re-identification.
Comment: under review for Pattern Recognition Letter
Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation
We conduct a large-scale, systematic study to evaluate the existing
evaluation methods for natural language generation in the context of generating
online product reviews. We compare human-based evaluators with a variety of
automated evaluation procedures, including discriminative evaluators that
measure how well machine-generated text can be distinguished from human-written
text, as well as word overlap metrics that assess how similar the generated
text is to human-written references. We determine to what extent these
different evaluators agree on the ranking of a dozen state-of-the-art
generators for online product reviews. We find that human evaluators do not
correlate well with discriminative evaluators, leaving a bigger question of
whether adversarial accuracy is the correct objective for natural language
generation. In general, distinguishing machine-generated text is challenging
even for human evaluators, and human decisions correlate better with lexical
overlaps. We find lexical diversity an intriguing metric that is indicative of
the assessments of different evaluators. A post-experiment survey of
participants provides insights into how to evaluate and improve the quality of
natural language generation systems
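
To make the two families of automated signals mentioned above more concrete, here is a minimal sketch of a word overlap score and a lexical diversity score. Both are simplified stand-ins chosen for illustration (clipped unigram precision and type-token ratio); the abstract does not say which specific metrics the study uses, and the function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference.

    Counts are clipped so a repeated word is credited at most as often as
    it appears in the reference. Tokenization is plain whitespace splitting.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def type_token_ratio(text):
    """Lexical diversity: distinct tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


generated = "the phone is great"
human = "the phone works great"
print(unigram_precision(generated, human))  # 0.75
```

A generator that repeats a few safe words can score well on overlap while scoring poorly on diversity, which is one reason to look at both kinds of signal together.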
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking
Most thermal infrared (TIR) tracking methods are discriminative, treating the
tracking problem as a classification task. However, the objective of the
classifier (label prediction) is not coupled to the objective of the tracker
(location estimation). The classification task focuses on the between-class
difference of the arbitrary objects, while the tracking task mainly deals with
the within-class difference of the same objects. In this paper, we cast the TIR
tracking problem as a similarity verification task, which is coupled well to
the objective of the tracking task. We propose a TIR tracker via a Hierarchical
Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To
obtain both spatial and semantic features of the TIR object, we design a
Siamese CNN that coalesces the multiple hierarchical convolutional layers.
Then, we propose a spatial-aware network to enhance the discriminative ability
of the coalesced hierarchical feature. Subsequently, we train this network end
to end on a large visible video detection dataset to learn the similarity
between paired objects before we transfer the network into the TIR domain.
Next, this pre-trained Siamese network is used to evaluate the similarity
between the target template and target candidates. Finally, we locate the
candidate that is most similar to the tracked target. Extensive experimental
results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed
method achieves favourable performance compared to the state-of-the-art
methods.
Comment: 20 pages, 7 figure
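The final step described above, scoring each candidate against the target template and keeping the most similar one, can be sketched as follows. The embeddings are assumed to have been produced already by the Siamese network; the use of cosine similarity and the names `cosine_similarity` and `locate_target` are illustrative assumptions, not details taken from the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def locate_target(template, candidates):
    """Return (index, score) of the candidate most similar to the template."""
    scores = [cosine_similarity(template, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


template = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
index, score = locate_target(template, candidates)
print(index)  # 1
```

Framing tracking as verification in this way means the network only has to answer "is this the same object as the template?", which matches the within-class comparison the tracker actually needs.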
A Siamese Long Short-Term Memory Architecture for Human Re-Identification
Matching pedestrians across multiple camera views, known as human
re-identification, is a challenging problem in visual
surveillance. In existing works concentrating on feature extraction,
representations are formed locally and independently of other regions. We present
a novel siamese Long Short-Term Memory (LSTM) architecture that can process
image regions sequentially and enhance the discriminative capability of local
feature representation by leveraging contextual information. The feedback
connections and internal gating mechanism of the LSTM cells enable our model to
memorize the spatial dependencies and selectively propagate relevant contextual
information through the network. We demonstrate improved performance compared
to the baseline algorithm with no LSTM units and promising results compared to
state-of-the-art methods on Market-1501, CUHK03 and VIPeR datasets.
Visualization of the internal mechanism of LSTM cells shows meaningful patterns
can be learned by our method
Self Attention Grid for Person Re-Identification
In this paper, we present an attention mechanism to improve the person
re-identification task. Inspired by biology, we propose Self Attention Grid
(SAG) to discover the most informative parts from a high-resolution image using
its internal representation. In particular, given an input image, the proposed
model is fed with two copies of the same image and consists of two branches.
The upper branch processes the high-resolution image and learns a
high-dimensional feature representation, while the lower branch processes the
low-resolution image and learns a filtering attention grid. We apply a max
filter operation to non-overlapping sub-regions of the upper branch's feature
representation before multiplying it element-wise with the output of the second
branch. The feature maps of the second branch are subsequently weighted to
reflect the importance of each patch of the grid using a softmax operation. Our
attention module helps the network learn the most discriminative visual
features of multiple image regions and is specifically optimized to attend to
feature representations at different levels. Extensive experiments on three
large-scale datasets show that our self-attention mechanism significantly
improves the baseline model and outperforms various state-of-the-art models by a
large margin.
Comment: 10 pages, 4 figures, under revie
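The interaction between the two branches can be sketched on a single 2-D feature map: max filtering over non-overlapping sub-regions, a softmax over the attention grid, and an element-wise product. This is a minimal sketch under assumptions: the abstract does not fix the exact order of the softmax and the multiplication, real feature maps have many channels, and the names `max_pool_grid`, `softmax_grid`, and `apply_attention` are hypothetical.

```python
import math


def max_pool_grid(feature_map, block):
    """Max over non-overlapping block x block sub-regions of a 2-D map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(block) for j in range(block))
            for c in range(0, cols - block + 1, block)
        ]
        for r in range(0, rows - block + 1, block)
    ]


def softmax_grid(scores):
    """Normalize grid scores so they are positive and sum to one."""
    peak = max(v for row in scores for v in row)  # for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(v for row in exps for v in row)
    return [[v / total for v in row] for row in exps]


def apply_attention(high_res_features, low_res_scores, block):
    """Weight pooled high-resolution features by the attention grid."""
    pooled = max_pool_grid(high_res_features, block)
    weights = softmax_grid(low_res_scores)
    return [[p * w for p, w in zip(prow, wrow)]
            for prow, wrow in zip(pooled, weights)]


high = [[1, 2, 0, 0], [3, 4, 0, 5], [0, 0, 1, 1], [6, 0, 1, 1]]
print(max_pool_grid(high, 2))  # [[4, 5], [6, 1]]
```

The pooling step brings the high-resolution map down to the size of the attention grid so that the two branch outputs can be multiplied cell by cell.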
Intra-clip Aggregation for Video Person Re-identification
Video-based person re-identification has drawn massive attention in recent
years due to its extensive applications in video surveillance. While deep
learning-based methods have led to significant progress, these methods are
limited by their ineffective use of complementary information, which is attributed to
the data augmentation required in the training process. Data augmentation has been
widely used to mitigate over-fitting and improve the representation ability of
the network. However, previous methods adopt an image-based data
augmentation scheme that processes the input frames individually, which corrupts
the complementary information between consecutive frames and causes performance
degradation. Extensive experiments on three benchmark datasets demonstrate that
our framework outperforms the most recent state-of-the-art methods. We also
perform cross-dataset validation to prove the generality of our method.
Comment: Due to the privacy issue of person re-ID, we require to withdraw the
previous version of this pape
Deep Recurrent Convolutional Networks for Video-based Person Re-identification: An End-to-End Approach
In this paper, we present an end-to-end approach to simultaneously learn
spatio-temporal features and corresponding similarity metric for video-based
person re-identification. Given the video sequence of a person, features from
each frame that are extracted from all levels of a deep convolutional network
can preserve a higher spatial resolution from which we can model finer motion
patterns. These low-level visual percepts are fed into a variant of a
recurrent model to characterize the temporal variation between time-steps.
Features from all time-steps are then summarized using temporal pooling to
produce an overall feature representation for the complete sequence. The deep
convolutional network, recurrent layer, and the temporal pooling are jointly
trained to extract comparable hidden-unit representations from an input pair of
time series and to compute their corresponding similarity value. The proposed
framework combines time series modeling and metric learning to jointly learn
relevant features and a good similarity measure between time sequences of
persons.
Experiments demonstrate that our approach achieves state-of-the-art
performance for video-based person re-identification on iLIDS-VID and PRID
2011, the two primary public datasets for this purpose.
Comment: 11 page
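The last two stages described above, summarizing per-time-step features with temporal pooling and comparing two sequences, can be sketched as below. The per-step features are assumed to come from the recurrent layer already; mean pooling and Euclidean distance are illustrative assumptions, since the abstract names neither the pooling operator nor the similarity function, and the function names are hypothetical.

```python
import math


def temporal_mean_pool(step_features):
    """Average per-time-step feature vectors into one sequence-level vector."""
    steps = len(step_features)
    return [sum(column) / steps for column in zip(*step_features)]


def sequence_distance(seq_a, seq_b):
    """Euclidean distance between the pooled representations of two sequences.

    The sequences may have different lengths; pooling makes them comparable.
    """
    a = temporal_mean_pool(seq_a)
    b = temporal_mean_pool(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


walk_a = [[1.0, 2.0], [3.0, 4.0]]
print(temporal_mean_pool(walk_a))  # [2.0, 3.0]
```

Because pooling yields a fixed-length vector regardless of how many frames a sequence has, a single distance function can compare tracks of different durations.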
Deep Co-attention based Comparators For Relative Representation Learning in Person Re-identification
Person re-identification (re-ID) requires rapid, flexible yet discriminant
representations to quickly generalize to unseen observations on-the-fly and
recognize the same identity across disjoint camera views. Recent effective
methods use a pair-wise similarity learning system to detect a
fixed set of features from distinct regions, which are mapped to vector
embeddings for distance measurement. However, the most relevant and crucial
parts of each image are detected independently, without reference to the
dependency between the two images. Also, these region-based methods
rely on spatial manipulation to position the local features for comparable
similarity measurement. To combat these limitations, in this paper we introduce
the Deep Co-attention based Comparators (DCCs) that fuse the co-dependent
representations of the paired images so as to focus on the relevant parts of
both images and produce their relative representations. Given a pair
of pedestrian images to be compared, the proposed model mimics the foveation of
human eyes to detect distinct regions concurrent on both images, namely
co-dependent features, and alternately attend to relevant regions to fuse
them into the similarity learning. Our comparator is capable of producing
dynamic representations relative to a particular sample every time, and thus
well-suited to the case of re-identifying pedestrians on-the-fly. We perform
extensive experiments to provide the insights and demonstrate the effectiveness
of the proposed DCCs in person re-ID. Moreover, our approach achieves
state-of-the-art performance on three benchmark data sets: DukeMTMC-reID,
CUHK03, and Market-1501