A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection
A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is
proposed for fast multi-scale object detection. The MS-CNN consists of a
proposal sub-network and a detection sub-network. In the proposal sub-network,
detection is performed at multiple output layers, so that receptive fields
match objects of different scales. These complementary scale-specific detectors
are combined to produce a strong multi-scale object detector. The unified
network is learned end-to-end, by optimizing a multi-task loss. Feature
upsampling by deconvolution is also explored, as an alternative to input
upsampling, to reduce the memory and computation costs. State-of-the-art object
detection performance, at up to 15 fps, is reported on datasets containing a
substantial number of small objects, such as KITTI and Caltech.
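The multi-scale design described above can be pictured as detection branches attached to several backbone stages, so that each branch's receptive field matches a different object scale. The following is a minimal PyTorch sketch of that idea, not the authors' MS-CNN code; the channel counts, anchor count and class count are assumptions.

```python
# Minimal sketch (not the authors' code): detection branches attached to
# several backbone stages, so each branch's receptive field matches a
# different object scale. Channel sizes and counts are assumptions.
import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), num_anchors=3, num_classes=2):
        super().__init__()
        out = num_anchors * (num_classes + 4)   # class scores + box offsets per anchor
        self.heads = nn.ModuleList(
            nn.Conv2d(c, out, kernel_size=3, padding=1) for c in in_channels
        )

    def forward(self, feature_maps):
        # One prediction map per backbone stage; shallow stages handle small
        # objects, deep stages handle large ones.
        return [head(f) for head, f in zip(self.heads, feature_maps)]

# Example with dummy multi-scale features
feats = [torch.randn(1, 128, 32, 32),
         torch.randn(1, 256, 16, 16),
         torch.randn(1, 512, 8, 8)]
preds = MultiScaleHeads()(feats)
print([p.shape for p in preds])
```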
VANETs Meet Autonomous Vehicles: A Multimodal 3D Environment Learning Approach
In this paper, we design a multimodal framework for object detection,
recognition and mapping based on the fusion of stereo camera frames, point
cloud Velodyne Lidar scans, and Vehicle-to-Vehicle (V2V) Basic Safety Messages
(BSMs) exchanged using Dedicated Short Range Communication (DSRC). We merge the
key features of rich texture descriptions of objects from 2D images, depth and
distance between objects provided by 3D point cloud and awareness of hidden
vehicles from BSMs' 3D information. We establish joint pixel-to-point-cloud and
pixel-to-V2V correspondences for objects in frames from the KITTI Vision
Benchmark Suite, using a semi-supervised manifold alignment approach to achieve
camera-Lidar and camera-V2V mapping of recognized objects that share the same
underlying manifold.
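The pixel-to-point-cloud pairing mentioned above rests on projecting Velodyne points into the camera frame. The sketch below illustrates only that geometric step, not the paper's semi-supervised manifold alignment; the calibration matrix names (P2, R0_rect, Tr_velo_to_cam) follow KITTI conventions but are assumptions here.

```python
# Simplified sketch: project LiDAR points into the camera image to obtain
# pixel-to-point correspondences. Matrix names follow KITTI conventions
# (Tr_velo_to_cam: 3x4, R0_rect: 3x3, P2: 3x4); values are placeholders.
import numpy as np

def project_lidar_to_image(points_xyz, Tr_velo_to_cam, R0_rect, P2):
    """points_xyz: (N, 3) LiDAR points; returns (N, 2) pixel coords and depths."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])     # homogeneous (N, 4)
    cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)           # (3, N) rectified camera frame
    cam_h = np.vstack([cam, np.ones((1, n))])            # (4, N)
    img = P2 @ cam_h                                      # (3, N)
    depth = img[2]
    uv = img[:2] / depth                                  # perspective divide
    # Only points with positive depth fall in front of the camera.
    return uv.T, depth
```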
Accurate Face Detection for High Performance
Face detection has witnessed significant progress due to the advances of deep
convolutional neural networks (CNNs). Its central issue in recent years is how
to improve the detection performance of tiny faces. To this end, many recent
works propose some specific strategies, redesign the architecture and introduce
new loss functions for tiny object detection. In this report, we start from the
popular one-stage RetinaNet approach and apply some recent tricks to obtain a
high performance face detector. Specifically, we apply the Intersection over
Union (IoU) loss function for regression, employ the two-step classification
and regression for detection, revisit the data augmentation based on
data-anchor-sampling for training, utilize the max-out operation for
classification and use the multi-scale testing strategy for inference. As a
consequence, the proposed face detection method achieves state-of-the-art
performance on the WIDER FACE dataset, the most popular and challenging face
detection benchmark.
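As a rough illustration of the IoU regression loss mentioned above, the following PyTorch sketch computes 1 - IoU for axis-aligned boxes in (x1, y1, x2, y2) format; it is a generic formulation, not necessarily the exact variant used in the report.

```python
# Generic IoU-based regression loss for axis-aligned boxes (x1, y1, x2, y2).
import torch

def iou_loss(pred, target, eps=1e-7):
    # Intersection rectangle
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    # Union = sum of areas - intersection
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    return (1.0 - iou).mean()   # -log(IoU) is another common form
```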
Improved Selective Refinement Network for Face Detection
As a long-standing problem in computer vision, face detection has attracted
much attention in recent decades for its practical applications. With the
availability of the WIDER FACE benchmark dataset, much progress has been made
by various algorithms in recent years. Among them,
the Selective Refinement Network (SRN) face detector introduces the two-step
classification and regression operations selectively into an anchor-based face
detector to reduce false positives and improve location accuracy
simultaneously. Moreover, it designs a receptive field enhancement block to
provide more diverse receptive fields. In this report, to further improve the
performance of SRN, we exploit some existing techniques via extensive
experiments, including new data augmentation strategy, improved backbone
network, MS COCO pretraining, decoupled classification module, segmentation
branch and Squeeze-and-Excitation block. Some of these techniques bring
performance improvements, while a few do not adapt well to our baseline.
As a consequence, we present an improved SRN face detector by combining these
useful techniques together and obtain the best performance on the widely used
WIDER FACE face detection benchmark.
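One of the components evaluated above, the Squeeze-and-Excitation block, can be sketched as follows; the reduction ratio is an illustrative assumption, and this is the generic block rather than the report's exact implementation.

```python
# Generic Squeeze-and-Excitation block: per-channel gates from global context.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Squeeze: global average pool to a per-channel descriptor
        w = x.mean(dim=(2, 3))
        # Excite: per-channel gates in (0, 1), then rescale the feature map
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w
```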
3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results
3D object detection is one of the most important tasks in the 3D visual
perception system of autonomous vehicles. In this paper, we propose a novel
two-stage 3D object detection method that aims to obtain the optimal object
location in 3D space by regressing two additional 3D object properties with a
deep convolutional neural network and combining them with cascaded geometric
constraints between the 2D and 3D boxes. First, we modify the existing 3D
property regression network by adding two additional components: viewpoint
classification and the center projection of the 3D bounding box's bottom face.
Second, we use the predicted center projection together with a similar-triangle
constraint to acquire an initial 3D bounding box through a closed-form
solution. Then, the location predicted in the previous step is used as the
initial value for the over-determined equations constructed from the 2D and 3D
box fitting constraint, with the configuration determined by the classified
viewpoint. Finally, we use the physical-world information recovered from the 3D
detections to filter out false detections and false alarms among the 2D
detections. Comparisons with state-of-the-art methods on the KITTI dataset show
that, although conceptually simple, our method outperforms more complex and
computationally expensive methods, not only improving the overall precision of
the 3D detections but also increasing the precision of orientation estimation.
Furthermore, our method can deal with truncated objects to some extent and
remove false alarms and false detections in both the 2D and 3D detections.
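The closed-form initialization from the predicted center projection and the similar-triangle constraint might be illustrated roughly as below, using generic pinhole-camera notation (f, cx, cy); this is a simplified reading of the description, not the paper's code.

```python
# Illustrative sketch of a similar-triangle initialization: recover an
# approximate depth from the regressed 3D height and the 2D box height, then
# back-project the predicted center projection. Symbols are generic
# pinhole-camera notation, not the paper's implementation.
import numpy as np

def initial_3d_center(u, v, h2d_pixels, H3d_meters, f, cx, cy):
    """u, v: predicted image projection of the 3D box's bottom-face center."""
    Z = f * H3d_meters / h2d_pixels   # similar triangles: h2d ~ f * H3d / Z
    X = (u - cx) * Z / f              # back-project the pixel to camera coordinates
    Y = (v - cy) * Z / f
    return np.array([X, Y, Z])
```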
Audio-only Bird Species Automated Identification Method with Limited Training Data Based on Multi-Channel Deep Convolutional Neural Networks
Based on transfer learning, we design a bird species identification model
that uses the VGG-16 model (pretrained on ImageNet) for feature extraction,
followed by a classifier consisting of two fully-connected hidden layers and a
Softmax layer. We compare the performance of the proposed model with that of
the original VGG-16 model. The results show that the former has higher
training efficiency but a lower mean average precision (MAP). To improve the
MAP of the proposed model, we investigate result fusion to form a
multi-channel identification model, whose best MAP reaches 0.9998. The number
of model parameters is 13110, only 0.0082% of that of the VGG-16 model. The
required sample size is also decreased.
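A minimal sketch of the described transfer-learning setup follows, assuming a frozen ImageNet-pretrained VGG-16 backbone from torchvision; the hidden-layer widths and class count are illustrative assumptions, not the paper's exact head.

```python
# Sketch: frozen VGG-16 feature extractor + small fully-connected classifier.
# Hidden widths and num_classes are assumptions for illustration only.
import torch.nn as nn
from torchvision import models

num_classes = 10                            # placeholder class count
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
for p in backbone.parameters():
    p.requires_grad = False                 # keep ImageNet features fixed

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, num_classes),             # softmax is applied inside the loss
)
model = nn.Sequential(backbone, head)
```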
Object Detection in Specific Traffic Scenes using YOLOv2
The object detection framework plays a crucial role in autonomous driving. In
this paper, we introduce the real-time object detection framework You Only
Look Once (YOLOv1) and the improvements introduced in YOLOv2. We further
explore the capability of YOLOv2 by applying its pre-trained model to object
detection tasks in some specific traffic scenes. The four artificially
designed traffic scenes include single-car, single-person, frontperson-rearcar
and frontcar-rearperson.
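A hedged sketch of running a pre-trained YOLOv2 model on one such scene with OpenCV's DNN module is shown below; the .cfg/.weights/.jpg file names are placeholders and the 0.5 confidence threshold is an arbitrary choice for illustration.

```python
# Sketch: pre-trained YOLOv2 inference via OpenCV's DNN module.
# File names are placeholders; threshold is illustrative.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov2.cfg", "yolov2.weights")
img = cv2.imread("traffic_scene.jpg")
blob = cv2.dnn.blobFromImage(img, scalefactor=1 / 255.0, size=(416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
out = net.forward()  # one row per box: [x, y, w, h, objectness, class scores...]
detections = [row for row in out if row[5:].max() > 0.5]  # simple confidence filter
print(f"kept {len(detections)} candidate boxes")
```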
Improving Object Detection from Scratch via Gated Feature Reuse
In this paper, we present a simple and parameter-efficient drop-in module for
one-stage object detectors like SSD when learning from scratch (i.e., without
pre-trained models). We call our module GFR (Gated Feature Reuse), which
exhibits two main advantages. First, we introduce a novel gate-controlled
prediction strategy enabled by Squeeze-and-Excitation to adaptively enhance or
attenuate supervision at different scales based on the input object size. As a
result, our model is more effective in detecting diverse sizes of objects.
Second, we propose a feature-pyramids structure to squeeze rich spatial and
semantic features into a single prediction layer, which strengthens feature
representation and reduces the number of parameters to learn. We apply the
proposed structure on DSOD and SSD detection frameworks, and evaluate the
performance on PASCAL VOC 2007, 2012 and COCO datasets. With fewer model
parameters, GFR-DSOD outperforms the baseline DSOD by 1.4%, 1.1%, 1.7% and
0.6%, respectively. GFR-SSD also outperforms the original SSD and SSD with
dense prediction by 3.6% and 2.8% on the VOC 2007 dataset. The work was
accepted at BMVC 2019; code is available at https://github.com/szq0214/GFR-DSOD.
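The gate-controlled prediction and the squeezing of multi-scale features into a single prediction layer might be sketched together as below; this follows the description in the abstract, not the released GFR code, and the channel sizes are assumptions.

```python
# Rough sketch: per-scale features are gated by a learned per-channel scale
# computed from global pooling, then resized and squeezed into one fused map
# that feeds a single prediction layer. Channel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(out_channels, out_channels), nn.Sigmoid())
            for _ in in_channels
        )

    def forward(self, feats, target_size=(32, 32)):
        fused = 0
        for f, proj, gate in zip(feats, self.proj, self.gates):
            f = proj(f)
            # Global context decides how strongly this scale contributes,
            # enhancing or attenuating it depending on the input.
            g = gate(f.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
            fused = fused + F.interpolate(f * g, size=target_size, mode="nearest")
        return fused
```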
Domain Adaptation from Synthesis to Reality in Single-model Detector for Video Smoke Detection
This paper proposes a method for video smoke detection using synthetic smoke
samples. The virtual data can automatically offer precise and rich annotated
samples. However, the learning of smoke representations is hurt by the
appearance gap between real and synthetic smoke samples. Existing research
mainly works on adaptation to samples extracted from the original annotated
samples. These methods treat object detection and domain adaptation as two
independent parts. To train a strong detector with rich synthetic samples, we
construct the adaptation on the detection layer of state-of-the-art
single-model detectors (SSD and MS-CNN). The training procedure is end-to-end,
combining classification, localization and adaptation in the learning. The
performance of the proposed model surpasses the original
baseline in our experiments. Meanwhile, our results show that the detectors
based on the adversarial adaptation are superior to the detectors based on the
discrepancy adaptation. Code will be made publicly available on
http://smoke.ustc.edu.cn. Moreover, the domain adaptation for a two-stage
detector is described in Appendix A.
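Adversarial feature adaptation on a detection layer is commonly realized with a gradient-reversal domain classifier; the sketch below shows that generic pattern, not the paper's exact architecture, and the feature channel count and layer widths are assumptions.

```python
# Sketch: adversarial adaptation on detection features. A gradient-reversal
# layer feeds the features into a domain classifier, so the detector learns
# features the classifier cannot separate into real vs. synthetic smoke.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None   # flip the gradient sign

class DomainClassifier(nn.Module):
    def __init__(self, channels=256, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1),                  # logit: real (1) vs. synthetic (0)
        )

    def forward(self, detection_features):
        return self.net(GradReverse.apply(detection_features, self.lamb))
```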
CrowdHuman: A Benchmark for Detecting Human in a Crowd
Human detection has witnessed impressive progress in recent years. However,
the occlusion issue of detecting human in highly crowded environments is far
from solved. To make matters worse, crowd scenarios are still under-represented
in current human detection benchmarks. In this paper, we introduce a new
dataset, called CrowdHuman, to better evaluate detectors in crowd scenarios.
The CrowdHuman dataset is large, richly annotated and highly diverse.
There are a total of human instances from the train and validation
subsets, and persons per image, with various kinds of occlusions in the
dataset. Each human instance is annotated with a head bounding-box, human
visible-region bounding-box and human full-body bounding-box. Baseline
performance of state-of-the-art detection frameworks on CrowdHuman is
presented. Cross-dataset generalization results of the CrowdHuman dataset
demonstrate state-of-the-art performance on previous datasets, including
Caltech-USA, CityPersons, and Brainwash, without bells and whistles. We hope our
dataset will serve as a solid baseline and help promote future research in
human detection tasks.
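The three-level annotation scheme described above can be represented by a simple data structure like the hypothetical one below; the field names are illustrative only, not the dataset's released annotation format.

```python
# Hypothetical representation of a CrowdHuman-style annotation: each human
# instance carries a head box, a visible-region box, and a full-body box.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x, y, width, height)

@dataclass
class HumanInstance:
    head_box: Box
    visible_box: Box
    full_body_box: Box

@dataclass
class CrowdImage:
    image_id: str
    instances: List[HumanInstance]
```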