4,988 research outputs found
Context-Aware Single-Shot Detector
SSD is one of the state-of-the-art object detection algorithms, and it
combines high detection accuracy with real-time speed. However, it is widely
recognized that SSD is less accurate in detecting small objects compared to
large objects, because it ignores the context from outside the proposal boxes.
In this paper, we present CSSD--a shorthand for context-aware single-shot
multibox object detector. CSSD is built on top of SSD, with additional layers
modeling multi-scale contexts. We describe two variants of CSSD, which differ
in their context layers, using dilated convolution layers (DiCSSD) and
deconvolution layers (DeCSSD) respectively. The experimental results show that
the multi-scale context modeling significantly improves the detection accuracy.
In addition, we study the relationship between effective receptive fields
(ERFs) and the theoretical receptive fields (TRFs), particularly on a VGGNet.
The empirical results further strengthen our conclusion that SSD coupled with
context layers achieves better detection results especially for small objects
( on MS-COCO compared to the newest SSD), while
maintaining comparable runtime performance
Large-Scale Optical Neural Networks based on Photoelectric Multiplication
Recent success in deep neural networks has generated strong interest in
hardware accelerators to improve speed and energy consumption. This paper
presents a new type of photonic accelerator based on coherent detection that is
scalable to large () networks and can be operated at high (GHz)
speeds and very low (sub-aJ) energies per multiply-and-accumulate (MAC), using
the massive spatial multiplexing enabled by standard free-space optical
components. In contrast to previous approaches, both weights and inputs are
optically encoded so that the network can be reprogrammed and trained on the
fly. Simulations of the network using models for digit- and
image-classification reveal a "standard quantum limit" for optical neural
networks, set by photodetector shot noise. This bound, which can be as low as
50 zJ/MAC, suggests performance below the thermodynamic (Landauer) limit for
digital irreversible computation is theoretically possible in this device. The
proposed accelerator can implement both fully-connected and convolutional
networks. We also present a scheme for back-propagation and training that can
be performed in the same hardware. This architecture will enable a new class of
ultra-low-energy processors for deep learning.Comment: Text: 10 pages, 5 figures, 1 table. Supplementary: 8 pages, 5,
figures, 2 table
Scalable Object Detection using Deep Neural Networks
Deep convolutional neural networks have recently achieved state-of-the-art
performance on a number of image recognition benchmarks, including the ImageNet
Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on
the localization sub-task was a network that predicts a single bounding box and
a confidence score for each object category in the image. Such a model captures
the whole-image context around the objects but cannot handle multiple instances
of the same object in the image without naively replicating the number of
outputs for each instance. In this work, we propose a saliency-inspired neural
network model for detection, which predicts a set of class-agnostic bounding
boxes along with a single score for each box, corresponding to its likelihood
of containing any object of interest. The model naturally handles a variable
number of instances for each class and allows for cross-class generalization at
the highest levels of the network. We are able to obtain competitive
recognition performance on VOC2007 and ILSVRC2012, while using only the top few
predicted locations in each image and a small number of neural network
evaluations
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Understanding language goes hand in hand with the ability to integrate
complex contextual information obtained via perception. In this work, we
present a novel task for grounded language understanding: disambiguating a
sentence given a visual scene which depicts one of the possible interpretations
of that sentence. To this end, we introduce a new multimodal corpus containing
ambiguous sentences, representing a wide range of syntactic, semantic and
discourse ambiguities, coupled with videos that visualize the different
interpretations for each sentence. We address this task by extending a vision
model which determines if a sentence is depicted by a video. We demonstrate how
such a model can be adjusted to recognize different interpretations of the same
underlying sentence, allowing to disambiguate sentences in a unified fashion
across the different ambiguity types.Comment: EMNLP 201
- …