Learning to Reason: End-to-End Module Networks for Visual Question Answering
Natural language questions are inherently compositional, and many are most
easily answered by reasoning about their decomposition into modular
sub-problems. For example, to answer "is there an equal number of balls and
boxes?" we can look for balls, look for boxes, count them, and compare the
results. The recently proposed Neural Module Network (NMN) architecture
implements this approach to question answering by parsing questions into
linguistic substructures and assembling question-specific deep networks from
smaller modules that each solve one subtask. However, existing NMN
implementations rely on brittle off-the-shelf parsers, and are restricted to
the module configurations proposed by these parsers rather than learning them
from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which
learn to reason by directly predicting instance-specific network layouts
without the aid of a parser. Our model learns to generate network structures
(by imitating expert demonstrations) while simultaneously learning network
parameters (using the downstream task loss). Experimental results on the new
CLEVR dataset targeted at compositional question answering show that N2NMNs
achieve an error reduction of nearly 50% relative to state-of-the-art
attentional approaches, while discovering interpretable network architectures
specialized for each question.
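The decomposition in the "balls and boxes" example can be sketched as a composition of small functions. This is only a toy illustration under stated assumptions: the scene is a plain list of shape labels, and the names `find`, `count`, `equal`, and `answer_equal_number` are hypothetical stand-ins for modules. It is not the N2NMN model, whose modules are learned neural networks and whose layouts are predicted per question.

```python
from typing import List

# Assumption: a scene is just a list of shape labels, e.g. ["ball", "box"].
Scene = List[str]


def find(scene: Scene, shape: str) -> List[str]:
    """Hypothetical 'find' module: keep only the objects of the given shape."""
    return [obj for obj in scene if obj == shape]


def count(objects: List[str]) -> int:
    """Hypothetical 'count' module: number of selected objects."""
    return len(objects)


def equal(a: int, b: int) -> bool:
    """Hypothetical 'compare' module: are the two counts the same?"""
    return a == b


def answer_equal_number(scene: Scene, shape_a: str, shape_b: str) -> bool:
    """Layout for 'is there an equal number of A and B?':
    find both shapes, count each selection, then compare the counts."""
    return equal(count(find(scene, shape_a)), count(find(scene, shape_b)))


if __name__ == "__main__":
    scene = ["ball", "box", "ball", "box", "cone"]
    print(answer_equal_number(scene, "ball", "box"))
```

A different question would reuse the same modules in a different layout, which is the sense in which the network is assembled per question.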
Predictive Coding for Dynamic Visual Processing: Development of Functional Hierarchy in a Multiple Spatio-Temporal Scales RNN Model
The current paper proposes a novel predictive coding type neural network
model, the predictive multiple spatio-temporal scales recurrent neural network
(P-MSTRNN). The P-MSTRNN learns to predict visually perceived human whole-body
cyclic movement patterns by exploiting multiscale spatio-temporal constraints
imposed on network dynamics by using differently sized receptive fields as well
as different time constant values for each layer. After learning, the network
becomes able to proactively imitate target movement patterns by inferring or
recognizing corresponding intentions by means of the regression of prediction
error. Results show that the network can develop a functional hierarchy by
developing a different type of dynamic structure at each layer. The paper
examines how model performance during pattern generation as well as predictive
imitation varies depending on the stage of learning. The number of limit cycle
attractors corresponding to target movement patterns increases as learning
proceeds. Moreover, transient dynamics that develop early in the learning process
successfully perform pattern generation and predictive imitation tasks. The
paper concludes that exploitation of transient dynamics facilitates successful
task performance during early learning periods. Comment: Accepted in Neural Computation (MIT press
Deep Contrast Learning for Salient Object Detection
Salient object detection has recently witnessed substantial progress due to
powerful features extracted using deep convolutional neural networks (CNNs).
However, existing CNN-based methods operate at the patch level instead of the
pixel level. Resulting saliency maps are typically blurry, especially near the
boundary of salient objects. Furthermore, image patches are treated as
independent samples even when they are overlapping, giving rise to significant
redundancy in computation and storage. In this CVPR 2016 paper, we propose an
end-to-end deep contrast network to overcome the aforementioned limitations.
Our deep network consists of two complementary components, a pixel-level fully
convolutional stream and a segment-wise spatial pooling stream. The first
stream directly produces a saliency map with pixel-level accuracy from an input
image. The second stream extracts segment-wise features very efficiently, and
better models saliency discontinuities along object boundaries. Finally, a
fully connected CRF model can be optionally incorporated to improve spatial
coherence and contour localization in the fused result from these two streams.
Experimental results demonstrate that our deep model significantly improves the
state of the art. Comment: To appear in CVPR 201
Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks
It is desirable to train convolutional networks (CNNs) to run more
efficiently during inference. In many cases, however, the computational budget
that the system has for inference cannot be known beforehand during training,
or the inference budget is dependent on the changing real-time resource
availability. Thus, it is inadequate to train just inference-efficient CNNs,
whose inference costs are not adjustable and cannot adapt to varied inference
budgets. We propose a novel approach for cost-adjustable inference in CNNs:
Stochastic Downsampling Point (SDPoint). During training, SDPoint applies
feature map downsampling to a random point in the layer hierarchy, with a
random downsampling ratio. The different stochastic downsampling configurations,
known as SDPoint instances (of the same model), have computational costs
different from each other, while being trained to minimize the same prediction
loss. Sharing network parameters across different instances provides a
significant regularization boost. During inference, one may handpick an SDPoint
instance that best fits the inference budget. The effectiveness of SDPoint, as
both a cost-adjustable inference approach and a regularizer, is validated
through extensive experiments on image classification.
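The training and inference procedure described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation. It assumes average pooling as the downsampling operator, integer ratios of 2 and 4, and a shape-agnostic `relu` standing in for real convolutional layers. The names `downsample`, `forward`, and `sample_instance` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

FeatureMap = List[List[float]]
Layer = Callable[[FeatureMap], FeatureMap]


def downsample(x: FeatureMap, ratio: int) -> FeatureMap:
    """Average-pool a 2-D feature map by an integer ratio (assumed operator)."""
    h, w = len(x) // ratio, len(x[0]) // ratio
    return [
        [
            sum(x[i * ratio + a][j * ratio + b]
                for a in range(ratio) for b in range(ratio)) / ratio ** 2
            for j in range(w)
        ]
        for i in range(h)
    ]


def relu(x: FeatureMap) -> FeatureMap:
    """Stand-in layer that works at any spatial size."""
    return [[max(v, 0.0) for v in row] for row in x]


def forward(x: FeatureMap, layers: Sequence[Layer],
            point: Optional[int] = None, ratio: int = 1) -> FeatureMap:
    """Run the layers, downsampling just before layer `point` if one is given.
    All instances share the same layers, hence the same parameters."""
    for i, layer in enumerate(layers):
        if i == point:
            x = downsample(x, ratio)
        x = layer(x)
    return x


def sample_instance(num_layers: int, ratios: Sequence[int] = (2, 4),
                    rng: Optional[random.Random] = None) -> Tuple[int, int]:
    """Training time: draw a random downsampling point and a random ratio."""
    rng = rng or random.Random()
    return rng.randrange(num_layers), rng.choice(list(ratios))


if __name__ == "__main__":
    layers = [relu, relu, relu]
    x = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    point, ratio = sample_instance(len(layers))      # one training iteration
    print(point, ratio, forward(x, layers, point, ratio))
    print(forward(x, layers, point=0, ratio=2))      # handpicked at inference
```

Downsampling earlier or with a larger ratio shrinks the feature maps seen by every later layer, which is what makes the inference cost adjustable by choosing the instance.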