Improving Nighttime Retrieval-Based Localization
Outdoor visual localization is a crucial component of many computer vision
systems. We propose an approach to localization from images that is designed to
explicitly handle the strong variations in appearance happening between daytime
and nighttime. As revealed by recent long-term localization benchmarks, both
traditional feature-based and retrieval-based approaches still struggle to
handle such changes. Our novel localization method combines a state-of-the-art
image retrieval architecture with condition-specific sub-networks allowing the
computation of global image descriptors that are explicitly dependent on the
capturing conditions. We show that our approach improves localization by a
factor of almost 300% compared to the popular VLAD-based methods on nighttime
localization.
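The retrieval step that such localization methods build on can be illustrated with a minimal sketch. It assumes every reference image already has a global descriptor and a known pose; the database layout, the `localize_by_retrieval` name, and the toy descriptors are hypothetical and are not taken from the paper, and the condition-specific sub-networks that would produce the descriptors are not modeled here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def localize_by_retrieval(query_descriptor, database):
    """Return the pose and similarity of the most similar reference image.

    `database` is a list of dicts with hypothetical keys "descriptor" and "pose".
    """
    best = max(
        database,
        key=lambda entry: cosine_similarity(query_descriptor, entry["descriptor"]),
    )
    return best["pose"], cosine_similarity(query_descriptor, best["descriptor"])

# Toy reference set: three images with made-up descriptors and 2D positions.
database = [
    {"descriptor": [1.0, 0.0, 0.0], "pose": (0.0, 0.0)},
    {"descriptor": [0.0, 1.0, 0.0], "pose": (5.0, 2.0)},
    {"descriptor": [0.0, 0.0, 1.0], "pose": (9.0, 4.0)},
]

pose, score = localize_by_retrieval([0.1, 0.9, 0.2], database)
```

The query is assigned the pose of its nearest reference image, so localization quality depends directly on how well the descriptors survive day-to-night appearance changes.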
Semantically-Aware Attentive Neural Embeddings for Image-based Visual Localization
We present an approach that combines appearance and semantic information for
2D image-based localization (2D-VL) across large perceptual changes and time
lags. Compared to appearance features, the semantic layout of a scene is
generally more invariant to appearance variations. We use this intuition and
propose a novel end-to-end deep attention-based framework that utilizes
multimodal cues to generate robust embeddings for 2D-VL. The proposed attention
module predicts a shared channel attention and modality-specific spatial
attentions to guide the embeddings to focus on more reliable image regions. We
evaluate our model against state-of-the-art (SOTA) methods on three challenging
localization datasets. We report an average (absolute) improvement of
over current SOTA for 2D-VL. Furthermore, we present an extensive study
demonstrating the contribution of each component of our model, showing
-- and improvement from adding semantic information and our
proposed attention module. We finally show the predicted attention maps to
offer useful insights into our model.
Comment: Appearing in BMVC 201
Weighted Bilinear Coding over Salient Body Parts for Person Re-identification
Deep convolutional neural networks (CNNs) have demonstrated dominant
performance in person re-identification (Re-ID). Existing CNN based methods
utilize global average pooling (GAP) to aggregate intermediate convolutional
features for Re-ID. However, this strategy only considers the first-order
statistics of local features and treats local features at different locations
as equally important, leading to sub-optimal feature representation. To deal with
these issues, we propose a novel weighted bilinear coding (WBC) framework for
local feature aggregation in CNN networks to pursue more representative and
discriminative feature representations, which can adapt to other
state-of-the-art methods and improve their performance. Specifically, bilinear
coding is used to encode the channel-wise feature correlations to capture
richer feature interactions. Meanwhile, a weighting scheme is applied to the
bilinear coding to adaptively adjust the weights of local features at different
locations based on their importance in recognition, further improving the
discriminability of feature aggregation. To handle the spatial misalignment
issue, we use a salient part net (spatial attention module) to derive salient
body parts, and apply the WBC model on each part. The final representation,
formed by concatenating the WBC encoded features of each part, is both
discriminative and resistant to spatial misalignment. Experiments on three
benchmarks including Market-1501, DukeMTMC-reID and CUHK03 demonstrate the
favorable performance of our method against other outstanding methods.
Comment: 22 pages
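The contrast between global average pooling and a weighted second-order aggregation can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's exact WBC formulation: the function names, the weight normalization, and the final L2 normalization are choices made for this example only.

```python
import math

def global_average_pool(features):
    """First-order aggregation: the mean of the local feature vectors."""
    count = len(features)
    dim = len(features[0])
    return [sum(f[c] for f in features) / count for c in range(dim)]

def weighted_bilinear_pool(features, weights):
    """Second-order aggregation: weighted sum of outer products x x^T.

    The weights are normalized to sum to one, and the flattened result is
    L2-normalized (both are assumptions made for this sketch).
    """
    dim = len(features[0])
    total = sum(weights)
    pooled = [[0.0] * dim for _ in range(dim)]
    for x, w in zip(features, weights):
        for a in range(dim):
            for b in range(dim):
                pooled[a][b] += (w / total) * x[a] * x[b]
    flat = [value for row in pooled for value in row]
    norm = math.sqrt(sum(value * value for value in flat)) or 1.0
    return [value / norm for value in flat]

# Three local features (e.g. three spatial locations) with two channels each.
features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

gap = global_average_pool(features)
uniform = weighted_bilinear_pool(features, [1.0, 1.0, 1.0])
focused = weighted_bilinear_pool(features, [0.0, 0.0, 1.0])
```

With uniform weights every location contributes equally. Putting all the weight on the third location yields a descriptor determined by that location alone, which is the effect a learned importance weighting aims for.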
Deep Learning Driven Visual Path Prediction from a Single Image
Capabilities of inference and prediction are significant components of visual
systems. In this paper, we address one important and challenging such task:
visual path prediction. Its goal is to infer the future path of a visual
object in a static scene. This task is complicated because it requires a
high-level semantic understanding of both the scenes and the motion patterns
underlying video sequences. In practice, cluttered situations also place higher
demands on the effectiveness and robustness of the considered models. Motivated by these
observations, we propose a deep learning framework which simultaneously
performs deep feature learning for visual representation in conjunction with
spatio-temporal context modeling. We then propose a unified path
planning scheme that makes accurate future path predictions based on the
analysis results of the context models. The highly effective visual representation and
deep context models ensure that our framework achieves a deep semantic
understanding of the scene and motion pattern, consequently improving the
performance of the visual path prediction task. In order to comprehensively
evaluate the model's performance on the visual path prediction task, we
construct two large benchmark datasets by adapting video tracking
datasets. The qualitative and quantitative experimental results show that our
approach outperforms existing approaches and generalizes better.
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers
presented at CVPR2015, the premier annual computer vision event held in
June 2015, in order to grasp the trends in the field. Further, we propose
"DeepSurvey" as a mechanism embodying the entire process, from reading
through all the papers and generating ideas to writing the paper.
Comment: Survey Paper
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Self-Supervised Learning for Stereo Matching with Self-Improving Ability
Existing deep-learning-based dense stereo matching methods often rely on
ground-truth disparity maps as the training signals, which, however, are not
always available. In this paper, we design a simple
convolutional neural network architecture that is able to learn to compute
dense disparity maps directly from the stereo inputs. Training is performed in
an end-to-end fashion without the need for ground-truth disparity maps. The idea
is to use image warping error (instead of disparity-map residuals) as the loss
function to drive the learning process, aiming to find a depth-map that
minimizes the warping error. While this is a simple concept well-known in
stereo matching, to make it work in a deep-learning framework, many non-trivial
challenges must be overcome, and in this work we provide effective solutions.
Our network is self-adaptive to different unseen imagery as well as to
different camera settings. Experiments on KITTI and Middlebury stereo benchmark
datasets show that our method outperforms many state-of-the-art stereo matching
methods by a margin, while also being significantly faster.
Comment: 13 pages, 11 figures
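The idea of using image warping error as a training signal can be illustrated on a single scanline. This is a minimal sketch under stated assumptions, not the paper's loss: it uses a plain mean absolute error, linear interpolation with border clamping, and the convention that a left-image pixel at column x appears at column x - d in the right image. All function names are hypothetical.

```python
import math

def sample_linear(row, x):
    """Linearly interpolate `row` at fractional position x, clamped to the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    t = x - x0
    return (1.0 - t) * row[x0] + t * row[x1]

def warp_right_to_left(right_row, disparity):
    """Reconstruct the left scanline by sampling the right one at x - d(x)."""
    return [sample_linear(right_row, x - d) for x, d in enumerate(disparity)]

def warping_loss(left_row, right_row, disparity):
    """Mean absolute difference between the left scanline and its reconstruction."""
    warped = warp_right_to_left(right_row, disparity)
    return sum(abs(l - w) for l, w in zip(left_row, warped)) / len(left_row)

# Toy scanlines: the right row is the left row shifted by two pixels.
left_row = [10.0, 10.0, 10.0, 20.0, 30.0, 0.0]
right_row = [10.0, 20.0, 30.0, 0.0, 0.0, 0.0]

loss_correct = warping_loss(left_row, right_row, [2.0] * 6)
loss_wrong = warping_loss(left_row, right_row, [0.0] * 6)
```

The correct disparity reproduces the left scanline from the right one, so its loss is lower than that of the wrong disparity, and no ground-truth disparity map is needed to tell the two apart.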
Group Re-Identification with Multi-grained Matching and Integration
The task of re-identifying groups of people under different camera views is an
important yet less-studied problem. Group re-identification (Re-ID) is a very
challenging task since it is not only adversely affected by common issues in
traditional single-object Re-ID problems such as viewpoint and human
pose variations, but it also suffers from changes in group layout and group
membership. In this paper, we propose a novel concept of group granularity by
characterizing a group image by multi-grained objects: individual persons and
sub-groups of two and three people within a group. To achieve robust group
Re-ID, we first introduce multi-grained representations which can be extracted
via the development of two separate schemes, i.e., one with hand-crafted
descriptors and another with deep neural networks. The proposed representation
seeks to characterize both appearance and spatial relations of multi-grained
objects, and is further equipped with importance weights which capture
variations in intra-group dynamics. Optimal group-wise matching is facilitated
by a multi-order matching process which, in turn, dynamically updates the
importance weights in an iterative fashion. We evaluated on three multi-camera
group datasets containing complex scenarios and large dynamics, with
experimental results demonstrating the effectiveness of our approach. The
published dataset can be found at
http://min.sjtu.edu.cn/lwydemo/GroupReID.html
Comment: 14 pages, 10 figures, to appear in IEEE Transactions on Cybernetics
Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings
Image-to-video person re-identification identifies a target person, given a probe
image, from a large number of pedestrian videos captured by non-overlapping cameras.
Despite the great progress achieved, it is still challenging to match in the
multimodal scenario, i.e., between image and video. Currently, state-of-the-art
approaches mainly focus on task-specific data, neglecting the extra
information in different but related tasks. In this paper, we propose an
end-to-end neural network framework for image-to-video person re-identification
by leveraging cross-modal embeddings learned from extra information. Concretely,
cross-modal embeddings from image captioning and video captioning
models are reused to help project the learned features into a coordinated
space, where similarity can be directly computed. Besides, training steps from
the fixed model reuse approach are integrated into our framework, which can
incorporate beneficial information and eventually make the target networks
independent of existing models. Apart from that, our proposed framework resorts
to CNNs and LSTMs for extracting visual and spatiotemporal features, and
combines the strengths of identification and verification models to improve the
discriminative ability of the learned features. The experimental results
demonstrate the effectiveness of our framework in narrowing the gap
between heterogeneous data and obtaining observable improvement in
image-to-video person re-identification.
Comment: under review for Pattern Recognition Letters
Orientation Driven Bag of Appearances for Person Re-identification
Person re-identification (re-id) consists of associating individuals across
a camera network, which is valuable for intelligent video surveillance and has
drawn wide attention. Although person re-identification research is making
progress, it still faces some challenges such as varying poses, illumination
and viewpoints. For feature representation in re-identification, existing works
usually use low-level descriptors which do not take full advantage of body
structure information, resulting in low representation ability.
To solve this problem, this paper proposes the mid-level
body-structure based feature representation (BSFR), which introduces a body
structure pyramid for codebook learning and feature pooling in the vertical
direction of the human body. Besides, varying viewpoints in the horizontal
direction of the human body usually cause the missing-data problem, i.e., the
appearances obtained in different orientations of the identical person could
vary significantly. To address this problem, the orientation driven bag of
appearances (ODBoA) is proposed to utilize person orientation information
extracted by an orientation estimation technique. To properly evaluate the proposed
approach, we introduce a new re-identification dataset (Market-1203) based on
the Market-1501 dataset and propose a new re-identification dataset (PKU-Reid).
Both datasets contain multiple images captured in different body orientations
for each person. Experimental results on three public datasets and two proposed
datasets demonstrate the superiority of the proposed approach, indicating the
effectiveness of body structure and orientation information for improving
re-identification performance.
Comment: 13 pages, 15 figures, 3 tables, submitted to IEEE Transactions on
Circuits and Systems for Video Technology