DC-SPP-YOLO: Dense Connection and Spatial Pyramid Pooling Based YOLO for Object Detection
Although the YOLOv2 approach is extremely fast at object detection, its backbone
network has limited feature-extraction ability and fails to make full use of
multi-scale local region features, which restricts further improvements in
object detection accuracy. Therefore, this paper proposes DC-SPP-YOLO (Dense
Connection and Spatial Pyramid Pooling Based YOLO) to improve the object
detection accuracy of YOLOv2. Specifically, dense connections between
convolution layers are employed in the YOLOv2 backbone network to strengthen
feature extraction and alleviate the vanishing-gradient problem. Moreover, an
improved spatial pyramid pooling is introduced to pool and concatenate
multi-scale local region features, so that the network can learn object
features more comprehensively. The DC-SPP-YOLO model is built and trained with
a new loss function composed of mean square error and cross entropy.
Experiments demonstrate that the mAP (mean Average Precision) of the proposed
DC-SPP-YOLO on the PASCAL VOC and UA-DETRAC datasets is higher than that of
YOLOv2; by strengthening feature extraction and exploiting multi-scale local
region features, DC-SPP-YOLO achieves better object detection accuracy than
YOLOv2.

Comment: 23 pages, 9 figures, 9 tables
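The two components described above (dense connections in the backbone and multi-scale spatial pyramid pooling) can be illustrated with a short PyTorch sketch; the growth rate, layer count, and pooling window sizes below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connection: each layer receives the concatenation of all
    preceding feature maps, which eases gradient flow."""
    def __init__(self, in_channels, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(channels, growth, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class SpatialPyramidPooling(nn.Module):
    """Pool the same feature map at several scales (assumed window sizes)
    and concatenate the results with the original features."""
    def __init__(self, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```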
Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video
Object detection is considered one of the most challenging problems in the
field of computer vision, as it involves the combination of object
classification and object localization within a scene. Recently, deep neural
networks (DNNs) have been demonstrated to achieve superior object detection
performance compared to other approaches, with YOLOv2 (an improved You Only
Look Once model) being one of the state-of-the-art in DNN-based object
detection methods in terms of both speed and accuracy. Although YOLOv2 can
achieve real-time performance on a powerful GPU, it remains very challenging
to leverage this approach for real-time object detection in video on embedded
computing devices with limited computational power and limited memory. In this
paper, we propose Fast YOLO, a fast You Only Look Once framework which
accelerates YOLOv2 so that it can perform real-time object detection in video
on embedded devices.
First, we leverage the evolutionary deep intelligence framework to evolve the
YOLOv2 network architecture and produce an optimized architecture (referred to
as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To
further reduce power consumption on embedded devices while maintaining
performance, a motion-adaptive inference method is introduced into the proposed
Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2
based on temporal motion characteristics. Experimental results show that the
proposed Fast YOLO framework can reduce the number of deep inferences by an
average of 38.13% and achieve an average speedup of ~3.3X for object detection
in video compared to the original YOLOv2, allowing Fast YOLO to run at an
average of ~18 FPS on an Nvidia Jetson TX1 embedded system.
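The motion-adaptive inference idea (skip the expensive network pass when little has changed between frames and reuse the previous detections) can be sketched as follows; the mean-absolute-difference motion measure and the threshold value are assumptions for illustration, not the exact mechanism used in Fast YOLO.

```python
import numpy as np

def detect_video(frames, detector, motion_threshold=12.0):
    """Run the deep detector only on frames whose pixel-level change relative
    to the last processed frame exceeds motion_threshold; otherwise reuse the
    most recent detections. `detector` stands in for an O-YOLOv2-style model."""
    last_gray = None
    last_detections = []
    results = []
    for frame in frames:                      # frame: HxWx3 uint8/float array
        gray = frame.mean(axis=2)             # crude grayscale conversion
        if last_gray is None or np.abs(gray - last_gray).mean() > motion_threshold:
            last_detections = detector(frame)  # expensive deep inference pass
            last_gray = gray
        results.append(last_detections)
    return results
```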
OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts
Generating captions for images is a task that has recently received
considerable attention. In this work we focus on caption generation for
abstract scenes, or object layouts where the only information provided is a set
of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence
model that encodes a set of objects and their locations as an input sequence
using an LSTM network, and decodes this representation using an LSTM language
model. We show that our model, despite encoding object layouts as a sequence,
can represent spatial relationships between objects, and generate descriptions
that are globally coherent and semantically relevant. We test our approach on
the task of object-layout captioning, using only object annotations as inputs. We
additionally show that our model, combined with a state-of-the-art object
detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score)
on the test benchmark of the standard MS-COCO Captioning task.

Comment: Accepted at EMNLP 2017
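A minimal sketch of the sequence-to-sequence setup described above, with one LSTM encoding (object label, bounding box) pairs and a second LSTM decoding words; the embedding sizes and the way labels and locations are fused are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class Obj2TextSketch(nn.Module):
    def __init__(self, num_objects, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.obj_embed = nn.Embedding(num_objects, embed_dim)
        self.loc_proj = nn.Linear(4, embed_dim)   # (x, y, w, h) per object
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, obj_ids, obj_boxes, caption_ids):
        # Encode the object layout as a sequence of label + location embeddings.
        enc_in = self.obj_embed(obj_ids) + self.loc_proj(obj_boxes)
        _, state = self.encoder(enc_in)
        # Condition the language-model decoder on the final encoder state.
        dec_out, _ = self.decoder(self.word_embed(caption_ids), state)
        return self.out(dec_out)  # per-step logits over the vocabulary
```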
