Feature Selective Anchor-Free Module for Single-Shot Object Detection
We motivate and present the feature selective anchor-free (FSAF) module, a
simple and effective building block for single-shot object detectors. It can be
plugged into single-shot detectors with a feature pyramid structure. The FSAF
module addresses two limitations of conventional anchor-based
detection: 1) heuristic-guided feature selection; 2) overlap-based anchor
sampling. The general concept of the FSAF module is online feature selection
applied to the training of multi-level anchor-free branches. Specifically, an
anchor-free branch is attached to each level of the feature pyramid, allowing
box encoding and decoding in an anchor-free manner at an arbitrary level.
During training, we dynamically assign each instance to the most suitable
feature level. At the time of inference, the FSAF module can work jointly with
anchor-based branches by outputting predictions in parallel. We instantiate
this concept with simple implementations of anchor-free branches and an online
feature selection strategy. Experimental results on the COCO detection track
show that our FSAF module performs better than anchor-based counterparts while
being faster. When working jointly with anchor-based branches, the FSAF module
robustly improves the baseline RetinaNet by a large margin under various
settings, while introducing almost no inference overhead. The resulting
best model achieves a state-of-the-art 44.6% mAP, outperforming all existing
single-shot detectors on COCO. Comment: CVPR 201
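The online feature selection described in the abstract above can be illustrated with a minimal sketch. It assumes that the "most suitable" feature level for an instance is the one where the anchor-free branch incurs the smallest loss; the abstract does not state the criterion, and the function names and loss values below are hypothetical.

```python
from typing import Dict, List


def select_feature_level(per_level_loss: List[float]) -> int:
    """Return the index of the pyramid level with the smallest loss.

    Assumption: "most suitable" means minimal loss for this instance.
    Ties are resolved in favor of the lowest level index.
    """
    return min(range(len(per_level_loss)), key=lambda i: per_level_loss[i])


def assign_instances(losses: Dict[str, List[float]]) -> Dict[str, int]:
    """Map each instance name to its selected pyramid level index."""
    return {name: select_feature_level(vals) for name, vals in losses.items()}


# Hypothetical per-level losses for two instances on a three-level pyramid.
example_losses = {
    "small_car": [0.31, 0.52, 0.90],
    "large_bus": [1.20, 0.66, 0.40],
}
assignment = assign_instances(example_losses)
print(assignment)
```

In a real detector the per-level losses would be recomputed at every training step, so the assignment of an instance can change as the network learns.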
Consistent Optimization for Single-Shot Object Detection
We present consistent optimization for single stage object detection.
Previous single stage object detectors usually rely on regular,
densely sampled anchors to generate hypotheses for the optimization of the model.
Through an examination of the behavior of the detector, we observe that the
misalignment between the optimization target and inference configurations has
hindered the performance improvement. We propose to bridge this gap by
consistent optimization, which is an extension of the traditional single stage
detector's optimization strategy. Consistent optimization focuses on matching
the training hypotheses and the inference quality by utilizing the refined
anchors during training. To evaluate its effectiveness, we examine various
design choices based on the state-of-the-art RetinaNet detector. We demonstrate
that it is the consistent optimization, not the architecture design, that yields the
performance boosts. Consistent optimization is nearly cost-free, and achieves
stable performance gains independent of the model capacities or input scales.
Specifically, utilizing consistent optimization improves RetinaNet from 39.1 AP
to 40.1 AP on the COCO dataset without any bells or whistles, which surpasses
the accuracy of all existing state-of-the-art one-stage detectors when adopting
ResNet-101 as the backbone. The code will be made available. Comment: Technical repor
IPG-Net: Image Pyramid Guidance Network for Small Object Detection
For Convolutional Neural Network-based object detection, there is a typical
dilemma: spatial information is well preserved in the shallow layers, which
unfortunately lack semantic information, while the deep layers carry rich
semantics but have lost much of the spatial information, resulting in a serious
information imbalance. To give the shallow layers enough semantic information,
a Feature Pyramid Network (FPN) is typically used to build a top-down
propagation path. In this paper, in addition to the top-down combination of
information
for shallow layers, we propose a novel network called Image Pyramid Guidance
Network (IPG-Net) to make sure both the spatial information and semantic
information are abundant for each layer. Our IPG-Net has two main parts: the
image pyramid guidance transformation module and the image pyramid guidance
fusion module. Our main idea is to introduce image pyramid guidance into
the backbone stream to solve the information imbalance problem, which
alleviates the vanishing of small object features. The IPG transformation
module ensures that, even in the deepest stage of the backbone, there is enough
spatial information for bounding box regression and classification.
Furthermore, we design an effective fusion module to fuse the features from
the image pyramid with the features from the backbone stream. We apply this
network to both one-stage and two-stage detection models and obtain
state-of-the-art results on the most popular benchmark datasets, i.e., MS
COCO and Pascal VOC. Comment: Accepted by CVPR2020 Anti-UVA worksho
TrackNet: Simultaneous Object Detection and Tracking and Its Application in Traffic Video Analysis
Object detection and object tracking are usually treated as two separate
processes. Significant progress has been made for object detection in 2D images
using deep learning networks. The usual tracking-by-detection pipeline for
object tracking requires that the object is successfully detected in the first
frame and all subsequent frames, and tracking is done by associating detection
results. Performing object detection and object tracking through a single
network remains a challenging open question. We propose a novel network
structure named trackNet that can directly detect a 3D tube enclosing a moving
object in a video segment by extending the Faster R-CNN framework. A Tube
Proposal Network (TPN) inside the trackNet is proposed to predict the
objectness of each candidate tube and location parameters specifying the
bounding tube. The proposed framework is applicable to detecting and tracking
any object; in this paper, we focus on its application to traffic video
analysis. The proposed model is trained and tested on UA-DETRAC, a large
traffic video dataset available for multi-vehicle detection and tracking, and
obtains very promising results.
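For context, the conventional tracking-by-detection step that the abstract above contrasts with (associating per-frame detection results) can be sketched as follows. This is not the trackNet method. It is a minimal illustration that assumes greedy matching by intersection-over-union (IoU) between consecutive frames; the box format (x1, y1, x2, y2), the threshold value, and all names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(prev: List[Box], curr: List[Box],
              thresh: float = 0.3) -> List[Tuple[int, int]]:
    """Greedily pair boxes of consecutive frames, highest IoU first."""
    scored = sorted(
        ((iou(p, c), i, j) for i, p in enumerate(prev)
         for j, c in enumerate(curr)),
        reverse=True,
    )
    used_prev, used_curr, matches = set(), set(), []
    for score, i, j in scored:
        if score < thresh:
            break  # all remaining pairs overlap too little
        if i in used_prev or j in used_curr:
            continue  # each box may be matched at most once
        used_prev.add(i)
        used_curr.add(j)
        matches.append((i, j))
    return sorted(matches)


# Two hypothetical vehicles that each move by one pixel between frames.
frame_a = [(0, 0, 10, 10), (20, 20, 30, 30)]
frame_b = [(21, 21, 31, 31), (1, 1, 11, 11)]
matches = associate(frame_a, frame_b)
print(matches)
```

A pipeline like this fails whenever a detection is missed in a frame, which is the dependency on per-frame detection that the abstract points out.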
Deep Learning for Generic Object Detection: A Survey
Object detection, one of the most fundamental and challenging problems in
computer vision, seeks to locate object instances from a large number of
predefined categories in natural images. Deep learning techniques have emerged
as a powerful strategy for learning feature representations directly from data
and have led to remarkable breakthroughs in the field of generic object
detection. Given this period of rapid evolution, the goal of this paper is to
provide a comprehensive survey of the recent achievements in this field brought
about by deep learning techniques. More than 300 research contributions are
included in this survey, covering many aspects of generic object detection:
detection frameworks, object feature representation, object proposal
generation, context modeling, training strategies, and evaluation metrics. We
finish the survey by identifying promising directions for future research. Comment: IJCV Mino
Scale-Equalizing Pyramid Convolution for Object Detection
The feature pyramid has been an efficient method for extracting features at
different scales. Work on this method has mainly focused on aggregating contextual
information at different levels while seldom touching the inter-level
correlation in the feature pyramid. Early computer vision methods extracted
scale-invariant features by locating the feature extrema in both spatial and
scale dimension. Inspired by this, a convolution across the pyramid level is
proposed in this study, which is termed pyramid convolution and is a modified
3-D convolution. Stacked pyramid convolutions directly extract 3-D (scale and
spatial) features and outperform other meticulously designed feature fusion
modules. Based on the viewpoint of 3-D convolution, an integrated batch
normalization that collects statistics from the whole feature pyramid is
naturally inserted after the pyramid convolution. Furthermore, we also show
that the naive pyramid convolution, together with the design of the RetinaNet
head, is actually best suited to extracting features from a Gaussian pyramid, whose
properties can hardly be satisfied by a feature pyramid. In order to alleviate
this discrepancy, we build a scale-equalizing pyramid convolution (SEPC) that
aligns the shared pyramid convolution kernel only at high-level feature maps.
Being computationally efficient and compatible with the head design of most
single-stage object detectors, the SEPC module brings significant performance
improvement (AP increase on MS-COCO2017 dataset) in state-of-the-art
one-stage object detectors, and a light version of SEPC also has AP
gain with only around 7% inference time increase. The pyramid convolution also
functions well as a stand-alone module in two-stage object detectors and is
able to improve the performance by AP. The source code can be found at
https://github.com/jshilong/SEPC. Comment: Accepted by CVPR202
HAR-Net: Joint Learning of Hybrid Attention for Single-stage Object Detection
Object detection has been a challenging task in computer vision. Although
significant progress has been made in object detection with deep neural
networks, the attention mechanism remains underdeveloped. In this paper, we
propose the hybrid attention mechanism for single-stage object detection.
First, we present the modules of spatial attention, channel attention and
aligned attention for single-stage object detection. In particular, stacked
dilated convolution layers with symmetrically fixed rates are constructed to
learn spatial attention. The channel attention is proposed with the cross-level
group normalization and squeeze-and-excitation module. Aligned attention is
constructed with organized deformable filters. Second, the three kinds of
attention are unified to construct the hybrid attention mechanism. We then
embed the hybrid attention into Retina-Net and propose the efficient
single-stage HAR-Net for object detection. The attention modules and the
proposed HAR-Net are evaluated on the COCO detection dataset. Experiments
demonstrate that hybrid attention can significantly improve the detection
accuracy and that HAR-Net can achieve a state-of-the-art 45.8% mAP,
outperforming existing single-stage object detectors.
Feature Selective Small Object Detection via Knowledge-based Recurrent Attentive Neural Network
At present, the performance of deep neural networks in general object
detection is comparable to or even surpasses that of human beings. However, due
to the limitations of deep learning itself, the small proportion of feature
pixels, and the occurrence of blur and occlusion, the detection of small objects
in complex scenes is still an open question. At the same time, real-time
and accurate object detection is fundamental to automatic perception and to the
subsequent perception-based decision-making and planning tasks of autonomous
driving.
Considering the characteristics of small objects in autonomous driving scenes,
we propose a novel method named KB-RANN, which is based on domain knowledge,
intuitive experience and attentive feature selection. It focuses on
particular parts of the image features, stresses the importance of these
features, and strengthens their learning parameters. Our
comparative experiments on the KITTI and COCO datasets show that the proposed
method achieves considerable results in both speed and accuracy, and
improves small object detection through self-selection of
important features and continuous enhancement of the proposed method. We have
also deployed it in our self-developed autonomous driving car.
WW-Nets: Dual Neural Networks for Object Detection
We propose a new deep convolutional neural network framework that uses object
location knowledge implicit in network connection weights to guide selective
attention in object detection tasks. Our approach is called What-Where Nets
(WW-Nets), and it is inspired by the structure of human visual pathways. In the
brain, vision incorporates two separate streams, one in the temporal lobe and
the other in the parietal lobe, called the ventral stream and the dorsal
stream, respectively. The ventral pathway from the primary visual cortex is
dominated by "what" information, while the dorsal pathway is dominated by
"where" information. Inspired by this structure, we propose an object
detection framework involving the integration of a "What Network" and a "Where
Network". The aim of the What Network is to provide selective attention to the
relevant parts of the input image. The Where Network uses this information to
locate and classify objects of interest. In this paper, we compare this
approach to state-of-the-art algorithms on the PASCAL VOC 2007 and 2012 and
COCO object detection challenge datasets. We also compare our approach to
human "ground-truth" attention. We report the results of an eye-tracking
experiment on human subjects using images from PASCAL VOC 2007, and we
demonstrate interesting relationships between human overt attention and
information processing in our WW-Nets. Finally, we provide evidence that our
proposed method performs favorably in comparison to other object detection
approaches, often by a large margin. The code and the eye-tracking ground-truth
dataset can be found at: https://github.com/mkebrahimpour. Comment: 8 pages, 3 figure
Multiple receptive fields and small-object-focusing weakly-supervised segmentation network for fast object detection
Object detection plays an important role in various visual applications.
However, the precision and speed of a detector usually conflict. One
main reason for the reduced precision of fast detectors is that small objects
are hard to detect. To address this problem, we propose a multiple receptive
field and small-object-focusing weakly-supervised segmentation network
(MRFSWSnet) to achieve fast object detection. In MRFSWSnet, a multiple receptive
fields (MRF) block attends to different spatial locations of the object and its
adjacent background with different weights, enhancing the discriminability of
the features. In addition, a small-object-focusing weakly-supervised
segmentation module, which focuses only on small objects rather than all
objects, is integrated into the detection network for auxiliary training to
improve the precision of small object detection. Extensive experiments show the
effectiveness of our method on both the PASCAL VOC and MS COCO detection
datasets. In particular, at a lower input resolution of 300x300, MRFSWSnet
achieves 80.9% mAP on the VOC2007 test set with an inference speed of 15
milliseconds per frame, making it the state of the art among real-time
detectors.