32,425 research outputs found
What Can Help Pedestrian Detection?
Aggregating extra features has been considered as an effective approach to
boost traditional pedestrian detection methods. However, there is still a lack
of studies on whether and how CNN-based pedestrian detectors can benefit from
these extra features. The first contribution of this paper is exploring this
issue by aggregating extra features into CNN-based pedestrian detection
framework. Through extensive experiments, we evaluate the effects of different
kinds of extra features quantitatively. Moreover, we propose a novel network
architecture, namely HyperLearner, to jointly learn pedestrian detection as
well as the given extra feature. By multi-task training, HyperLearner is able
to utilize the information of given features and improve detection performance
without extra inputs in inference. The experimental results on multiple
pedestrian benchmarks validate the effectiveness of the proposed HyperLearner.Comment: Accepted to IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR) 201
DeepBox: Learning Objectness with Convolutional Networks
Existing object proposal approaches use primarily bottom-up cues to rank
proposals, while we believe that objectness is in fact a high level construct.
We argue for a data-driven, semantic approach for ranking object proposals. Our
framework, which we call DeepBox, uses convolutional neural networks (CNNs) to
rerank proposals from a bottom-up method. We use a novel four-layer CNN
architecture that is as good as much larger networks on the task of evaluating
objectness while being much faster. We show that DeepBox significantly improves
over the bottom-up ranking, achieving the same recall with 500 proposals as
achieved by bottom-up methods with 2000. This improvement generalizes to
categories the CNN has never seen before and leads to a 4.5-point gain in
detection mAP. Our implementation achieves this performance while running at
260 ms per image.Comment: ICCV 2015 Camera-ready versio
Deep Shape Matching
We cast shape matching as metric learning with convolutional networks. We
break the end-to-end process of image representation into two parts. Firstly,
well established efficient methods are chosen to turn the images into edge
maps. Secondly, the network is trained with edge maps of landmark images, which
are automatically obtained by a structure-from-motion pipeline. The learned
representation is evaluated on a range of different tasks, providing
improvements on challenging cases of domain generalization, generic
sketch-based image retrieval or its fine-grained counterpart. In contrast to
other methods that learn a different model per task, object category, or
domain, we use the same network throughout all our experiments, achieving
state-of-the-art results in multiple benchmarks.Comment: ECCV 201
Location recognition over large time lags
Would it be possible to automatically associate ancient pictures to modern ones and create fancy cultural heritage city maps? We introduce here the task of recognizing the location depicted in an old photo given modern annotated images collected from the Internet. We present an extensive analysis on different features, looking for the most discriminative and most robust to the image variability induced by large time lags. Moreover, we show that the described task benefits from domain adaptation
Context-Dependent Diffusion Network for Visual Relationship Detection
Visual relationship detection can bridge the gap between computer vision and
natural language for scene understanding of images. Different from pure object
recognition tasks, the relation triplets of subject-predicate-object lie on an
extreme diversity space, such as \textit{person-behind-person} and
\textit{car-behind-building}, while suffering from the problem of combinatorial
explosion. In this paper, we propose a context-dependent diffusion network
(CDDN) framework to deal with visual relationship detection. To capture the
interactions of different object instances, two types of graphs, word semantic
graph and visual scene graph, are constructed to encode global context
interdependency. The semantic graph is built through language priors to model
semantic correlations across objects, whilst the visual scene graph defines the
connections of scene objects so as to utilize the surrounding scene
information. For the graph-structured data, we design a diffusion network to
adaptively aggregate information from contexts, which can effectively learn
latent representations of visual relationships and well cater to visual
relationship detection in view of its isomorphic invariance to graphs.
Experiments on two widely-used datasets demonstrate that our proposed method is
more effective and achieves the state-of-the-art performance.Comment: 8 pages, 3 figures, 2018 ACM Multimedia Conference (MM'18
High-for-Low and Low-for-High: Efficient Boundary Detection from Deep Object Features and its Applications to High-Level Vision
Most of the current boundary detection systems rely exclusively on low-level
features, such as color and texture. However, perception studies suggest that
humans employ object-level reasoning when judging if a particular pixel is a
boundary. Inspired by this observation, in this work we show how to predict
boundaries by exploiting object-level features from a pretrained
object-classification network. Our method can be viewed as a "High-for-Low"
approach where high-level object features inform the low-level boundary
detection process. Our model achieves state-of-the-art performance on an
established boundary detection benchmark and it is efficient to run.
Additionally, we show that due to the semantic nature of our boundaries we
can use them to aid a number of high-level vision tasks. We demonstrate that
using our boundaries we improve the performance of state-of-the-art methods on
the problems of semantic boundary labeling, semantic segmentation and object
proposal generation. We can view this process as a "Low-for-High" scheme, where
low-level boundaries aid high-level vision tasks.
Thus, our contributions include a boundary detection system that is accurate,
efficient, generalizes well to multiple datasets, and is also shown to improve
existing state-of-the-art high-level vision methods on three distinct tasks
- …