2,185 research outputs found
Ensemble of Part Detectors for Simultaneous Classification and Localization
Part-based representation has been proven to be effective for a variety of
visual applications. However, automatic discovery of discriminative parts
without object/part-level annotations is challenging. This paper proposes a
discriminative mid-level representation paradigm based on the responses of a
collection of part detectors, which only requires the image-level labels.
Towards this goal, we first develop a detector-based spectral clustering method
to mine the representative and discriminative mid-level patterns for detector
initialization. The advantage of the proposed pattern-mining technique is that
the detector-based distance metric focuses only on discriminative details, and
a set of such grouped detectors offers an effective way to mine consistent
patterns. Relying on the discovered patterns, we further formulate the
detector learning process as a confidence-loss sparse Multiple Instance
Learning (cls-MIL) task, which accounts for the diversity of the positive
samples while avoiding drift away from the well-localized ones by assigning a
confidence value to each positive sample. The responses of the learned detectors can form
an effective mid-level image representation for both image classification and
object localization. Experiments conducted on benchmark datasets demonstrate
the superiority of our method over existing approaches.
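As an illustration, a confidence-weighted MIL objective along these lines could be sketched as follows; the abstract does not give the exact formulation, so the loss below and all function names are assumptions, not the paper's definition:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cls_mil_loss(pos_scores, confidences, neg_scores):
    """Confidence-weighted MIL loss sketch.

    pos_scores  -- detector scores for instances of a positive bag
    confidences -- per-instance weights in [0, 1]; well-localized samples
                   get higher weight, which limits drift toward noisy ones
    neg_scores  -- detector scores for negative instances
    """
    # Positive term: each positive instance contributes in proportion to
    # its confidence, so diverse but poorly localized samples cannot
    # pull the detector away from the well-localized ones.
    pos = -sum(c * math.log(sigmoid(s))
               for s, c in zip(pos_scores, confidences))
    pos /= max(sum(confidences), 1e-8)
    # Negative term: standard log-loss pushing negative scores down.
    neg = -sum(math.log(1.0 - sigmoid(s)) for s in neg_scores)
    neg /= max(len(neg_scores), 1)
    return pos + neg
```

Setting a sample's confidence to zero removes its influence entirely, which is the mechanism the abstract alludes to for avoiding drift.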
Recycle deep features for better object detection
Aiming at improving the performance of existing detection algorithms
developed for different applications, we propose a region regression-based
multi-stage class-agnostic detection pipeline, whereby the existing algorithms
are employed for providing the initial detection proposals. Better detection is
obtained by exploiting the power of deep learning in the region regression
scheme while avoiding the need for a huge amount of reference data for training
deep neural networks. Additionally, a novel network architecture with recycled
deep features is proposed, which provides superior regression results compared
to the commonly used architectures. As demonstrated on a data set with ~1200
samples of different classes, it is feasible to successfully train a deep
neural network in our proposed architecture and use it to obtain the desired
detection performance. Since only slight modifications are required to common
network architectures and since the deep neural network is trained using the
standard hyperparameters, the proposed detection pipeline is readily
accessible and can be easily adapted to a broad variety of detection tasks.
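The multi-stage region-regression idea can be sketched with the standard R-CNN box-delta parameterization; the per-stage regressors below are hypothetical placeholders standing in for the trained network, not the paper's architecture:

```python
import math

def apply_box_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with predicted deltas
    (dx, dy, dw, dh), using the standard R-CNN box parameterization."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h        # shift the center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale width/height
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def multi_stage_refine(box, stages):
    """Run an initial detection proposal through a cascade of regressors;
    each stage's refined box feeds the next stage."""
    for regress in stages:
        box = apply_box_deltas(box, regress(box))
    return box
```

The initial `box` here plays the role of a proposal from one of the existing detectors the pipeline builds on.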
Where are the Blobs: Counting by Localization with Point Supervision
Object counting is an important task in computer vision due to its growing
demand in applications such as surveillance, traffic monitoring, and counting
everyday objects. State-of-the-art methods use regression-based optimization
where they explicitly learn to count the objects of interest. These often
perform better than detection-based methods that need to learn the more
difficult task of predicting the location, size, and shape of each object.
However, we propose a detection-based method that does not need to estimate the
size and shape of the objects and that outperforms regression-based methods.
Our contributions are three-fold: (1) we propose a novel loss function that
encourages the network to output a single blob per object instance using
point-level annotations only; (2) we design two methods for splitting large
predicted blobs between object instances; and (3) we show that our method
achieves new state-of-the-art results on several challenging datasets including
the Pascal VOC and the Penguins dataset. Our method even outperforms those that
use stronger supervision such as depth features, multi-point annotations, and
bounding-box labels.
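Contribution (2), splitting a large predicted blob between annotated instance points, could look roughly like the nearest-point assignment below; the paper's actual splitting methods are not specified in the abstract, so this is only an illustrative simplification:

```python
def split_blob(blob_pixels, points):
    """Assign each pixel of one predicted blob to its nearest annotated
    point, producing one sub-blob per instance, so the final count equals
    the number of distinct blobs after splitting."""
    parts = {i: [] for i in range(len(points))}
    for px, py in blob_pixels:
        nearest = min(range(len(points)),
                      key=lambda i: (px - points[i][0]) ** 2
                                    + (py - points[i][1]) ** 2)
        parts[nearest].append((px, py))
    return parts
```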
E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
An end-to-end trainable (fully differentiable) method for multi-language
scene text localization and recognition is proposed. The approach is based on a
single fully convolutional network (FCN) with shared layers for both tasks.
E2E-MLT is the first published multi-language OCR for scene text. While
trained in multi-language setup, E2E-MLT demonstrates competitive performance
when compared to other methods trained for English scene text alone. The
experiments show that obtaining accurate multi-language multi-script
annotations is a challenging problem.
Deep Learning for Generic Object Detection: A Survey
Object detection, one of the most fundamental and challenging problems in
computer vision, seeks to locate object instances from a large number of
predefined categories in natural images. Deep learning techniques have emerged
as a powerful strategy for learning feature representations directly from data
and have led to remarkable breakthroughs in the field of generic object
detection. Given this period of rapid evolution, the goal of this paper is to
provide a comprehensive survey of the recent achievements in this field brought
about by deep learning techniques. More than 300 research contributions are
included in this survey, covering many aspects of generic object detection:
detection frameworks, object feature representation, object proposal
generation, context modeling, training strategies, and evaluation metrics. We
finish the survey by identifying promising directions for future research.
Comment: IJCV Mino
Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection
A novel efficient method for extraction of object proposals is introduced.
Its "objectness" function exploits deep spatial pyramid features, a novel
fast-to-compute HoG-based edge statistic and the EdgeBoxes score. The
efficiency is achieved by the use of spatial bins in a novel combination with
sparsity-inducing group normalized SVM. State-of-the-art recall performance is
achieved on Pascal VOC07, significantly outperforming methods with comparable
speed. Interestingly, when only 100 proposals per image are considered the
method attains 78% recall on VOC07. The method improves mAP of the RCNN
state-of-the-art class-specific detector, increasing it by 10 points when only
50 proposals are used in each image. The system trained on twenty classes
performs well on the two hundred class ILSVRC2013 set confirming generalization
capability.
Comment: Accepted to ICCV1
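The sparsity-inducing group-normalized SVM behind the spatial bins can be illustrated by its group (l2,1) penalty, which can zero out entire bins at once; the cue-combination weights below are hypothetical, not the trained model's values:

```python
import math

def group_norm_penalty(weights, groups):
    """Group (l2,1) regularizer used by sparsity-inducing group-normalized
    SVMs: the sum over groups of the l2 norm of each group's weights.
    Driving one group norm to zero removes that spatial bin entirely,
    which is what makes the bins sparse and the scoring fast."""
    return sum(math.sqrt(sum(weights[i] ** 2 for i in g)) for g in groups)

def objectness(deep_score, edge_stat, edgeboxes_score, w=(1.0, 1.0, 1.0)):
    """Toy objectness: a weighted combination of the three cues named in
    the abstract (spatial pyramid features, HoG-based edge statistic,
    EdgeBoxes score); the weights `w` are placeholders."""
    return w[0] * deep_score + w[1] * edge_stat + w[2] * edgeboxes_score
```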
Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition
Recognizing multiple labels of images is a fundamental but challenging task
in computer vision, and remarkable progress has been attained by localizing
semantic-aware image regions and predicting their labels with deep
convolutional neural networks. The step of hypothesis regions (region
proposals) localization in these existing multi-label image recognition
pipelines, however, usually incurs redundant computation, e.g., generating
hundreds of meaningless proposals with non-discriminative information and
extracting their features, while the spatial contextual dependencies among
the localized regions are often ignored or over-simplified. To resolve these
issues, this paper proposes a recurrent attention reinforcement learning
framework to iteratively discover a sequence of attentional and informative
regions that are related to different semantic objects and further predict
label scores conditioned on these regions. In addition, our method explicitly
models long-term dependencies among these attentional regions that help to
capture semantic label co-occurrence and thus facilitate multi-label
recognition. Extensive experiments and comparisons on two large-scale
benchmarks (i.e., PASCAL VOC and MS-COCO) show that our model surpasses
existing state-of-the-art methods in both accuracy and efficiency, while
explicitly associating image-level semantic labels with specific object regions.
Comment: Accepted at AAAI 201
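One plausible way to fuse the per-region predictions into image-level scores is an element-wise max over the attended sequence, so a label fires if any region supports it; the paper's exact fusion rule is not stated in the abstract, so this is an assumption:

```python
def aggregate_label_scores(step_scores):
    """Fuse the label-score vectors predicted at each attended region by
    element-wise max over the sequence of regions."""
    fused = list(step_scores[0])
    for scores in step_scores[1:]:
        fused = [max(a, b) for a, b in zip(fused, scores)]
    return fused
```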
Global-Local Face Upsampling Network
Face hallucination, which is the task of generating a high-resolution face
image from a low-resolution input image, is a well-studied problem that is
useful in widespread application areas. Face hallucination is particularly
challenging when the input face resolution is very low (e.g., 10 x 12 pixels)
and/or the image is captured in an uncontrolled setting with large pose and
illumination variations. In this paper, we revisit the algorithm introduced in
[1] and present a deep interpretation of this framework that achieves
state-of-the-art under such challenging scenarios. In our deep network
architecture the global and local constraints that define a face can be
efficiently modeled and learned end-to-end using training data. Conceptually
our network design can be partitioned into two sub-networks: the first one
implements the holistic face reconstruction according to global constraints,
and the second one enhances face-specific details and enforces local patch
statistics. We optimize the deep network using a new loss function for
super-resolution that combines reconstruction error with a learned face quality
measure in an adversarial setting, producing improved visual results. We conduct
extensive experiments in both controlled and uncontrolled setups and show that
our algorithm improves the state of the art both numerically and visually.
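The combined objective, reconstruction error plus a learned adversarial face-quality term, might be sketched as below; the weight `lam` and the exact adversarial form are assumptions, since the abstract only names the two components:

```python
import math

def sr_loss(sr_pixels, hr_pixels, quality_score, lam=0.01):
    """Combined super-resolution objective: pixel reconstruction error
    plus a learned face-quality term. `quality_score` stands in for the
    adversarial critic's output in (0, 1]; `lam` balances the two."""
    mse = sum((a - b) ** 2 for a, b in zip(sr_pixels, hr_pixels)) / len(hr_pixels)
    adv = -math.log(max(quality_score, 1e-8))  # low quality -> high loss
    return mse + lam * adv
```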
Enhancing Salient Object Segmentation Through Attention
Segmenting salient objects in an image is an important vision task with
ubiquitous applications. The problem becomes more challenging in the presence
of a cluttered and textured background, low resolution and/or low contrast
images. Even though existing algorithms perform well in segmenting most of the
object(s) of interest, they often also segment false positives: background
regions that resemble salient objects. In this work, we tackle this
problem by iteratively attending to image patches in a recurrent fashion and
subsequently enhancing the predicted segmentation mask. Saliency features are
estimated independently for every image patch and are then combined using an
aggregation strategy based on a Convolutional Gated Recurrent Unit (ConvGRU)
network. The proposed approach works in an end-to-end manner, removing
background noise and false positives incrementally. Through extensive
evaluation on various benchmark datasets, we show superior performance to the
existing approaches without any post-processing.
Comment: CVPRW - Deep Vision 201
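A ConvGRU aggregates per-patch saliency features through gated recurrent updates; the toy scalar GRU below (convolutions replaced by scalar weights for brevity, weights chosen arbitrarily) shows the mechanism rather than the paper's model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, wz=1.0, wr=1.0, wh=1.0):
    """One scalar GRU update; a ConvGRU applies the same gating with
    convolutions over whole feature maps."""
    z = sigmoid(wz * (x + h))               # update gate
    r = sigmoid(wr * (x + h))               # reset gate
    h_tilde = math.tanh(wh * (x + r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

def aggregate(patch_features):
    """Fold per-patch saliency features into one hidden state, patch by
    patch, as the recurrent aggregation step does."""
    h = 0.0
    for x in patch_features:
        h = gru_step(h, x)
    return h
```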
Accurate Face Detection for High Performance
Face detection has witnessed significant progress due to the advances of deep
convolutional neural networks (CNNs). A central issue in recent years has been how
to improve the detection performance of tiny faces. To this end, many recent
works propose some specific strategies, redesign the architecture and introduce
new loss functions for tiny object detection. In this report, we start from the
popular one-stage RetinaNet approach and apply some recent tricks to obtain a
high performance face detector. Specifically, we apply the Intersection over
Union (IoU) loss function for regression, employ the two-step classification
and regression for detection, revisit the data augmentation based on
data-anchor-sampling for training, utilize the max-out operation for
classification and use the multi-scale testing strategy for inference. As a
consequence, the proposed face detection method achieves state-of-the-art
performance on the most popular and challenging face detection benchmark WIDER
FACE dataset.
Comment: 9 pages, 3 figures, technical report
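The IoU loss used for regression above can be written directly as one minus the boxes' intersection-over-union; this is the generic formulation, not necessarily the exact variant used by the detector:

```python
def iou_loss(pred, gt):
    """IoU regression loss: 1 - IoU of predicted and ground-truth boxes,
    each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(pred) + area(gt) - inter
    return 1.0 - inter / union if union > 0 else 1.0
```

Unlike per-coordinate L1/L2 regression, this loss optimizes the evaluation metric itself, which is why it helps for tiny faces where small pixel errors cost a large fraction of IoU.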