6,425 research outputs found
RNNs Implicitly Implement Tensor Product Representations
Recurrent neural networks (RNNs) can learn continuous vector representations
of symbolic structures such as sequences and sentences; these representations
often exhibit linear regularities (analogies). Such regularities motivate our
hypothesis that RNNs that show such regularities implicitly compile symbolic
structures into tensor product representations (TPRs; Smolensky, 1990), which
additively combine tensor products of vectors representing roles (e.g.,
sequence positions) and vectors representing fillers (e.g., particular words).
To test this hypothesis, we introduce Tensor Product Decomposition Networks
(TPDNs), which use TPRs to approximate existing vector representations. We
demonstrate using synthetic data that TPDNs can successfully approximate linear
and tree-based RNN autoencoder representations, suggesting that these
representations exhibit interpretable compositional structure; we explore the
settings that lead RNNs to induce such structure-sensitive representations. By
contrast, further TPDN experiments show that the representations of four models
trained to encode naturally-occurring sentences can be largely approximated
with a bag of words, with only marginal improvements from more sophisticated
structures. We conclude that TPDNs provide a powerful method for interpreting
vector representations, and that standard RNNs can induce compositional
sequence representations that are remarkably well approximated by TPRs; at the
same time, existing training tasks for sentence representation learning may not
be sufficient for inducing robust structural representations.Comment: Accepted to ICLR 201
Beyond Short Snippets: Deep Networks for Video Classification
Convolutional neural networks (CNNs) have been extensively applied for image
recognition problems giving state-of-the-art results on recognition, detection,
segmentation and retrieval. In this work we propose and evaluate several deep
neural network architectures to combine image information across a video over
longer time periods than previously attempted. We propose two methods capable
of handling full length videos. The first method explores various convolutional
temporal feature pooling architectures, examining the various design choices
which need to be made when adapting a CNN for this task. The second proposed
method explicitly models the video as an ordered sequence of frames. For this
purpose we employ a recurrent neural network that uses Long Short-Term Memory
(LSTM) cells which are connected to the output of the underlying CNN. Our best
networks exhibit significant performance improvements over previously published
results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101
datasets with (88.6% vs. 88.0%) and without additional optical flow information
(82.6% vs. 72.8%)
Convolutional Drift Networks for Video Classification
Analyzing spatio-temporal data like video is a challenging task that requires
processing visual and temporal information effectively. Convolutional Neural
Networks have shown promise as baseline fixed feature extractors through
transfer learning, a technique that helps minimize the training cost on visual
information. Temporal information is often handled using hand-crafted features
or Recurrent Neural Networks, but this can be overly specific or prohibitively
complex. Building a fully trainable system that can efficiently analyze
spatio-temporal data without hand-crafted features or complex training is an
open challenge. We present a new neural network architecture to address this
challenge, the Convolutional Drift Network (CDN). Our CDN architecture combines
the visual feature extraction power of deep Convolutional Neural Networks with
the intrinsically efficient temporal processing provided by Reservoir
Computing. In this introductory paper on the CDN, we provide a very simple
baseline implementation tested on two egocentric (first-person) video activity
datasets.We achieve video-level activity classification results on-par with
state-of-the art methods. Notably, performance on this complex spatio-temporal
task was produced by only training a single feed-forward layer in the CDN.Comment: Published in IEEE Rebooting Computin
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
We propose a neural sequence-to-sequence model for direction following, a
task that is essential to realizing effective autonomous agents. Our
alignment-based encoder-decoder model with long short-term memory recurrent
neural networks (LSTM-RNN) translates natural language instructions to action
sequences based upon a representation of the observable world state. We
introduce a multi-level aligner that empowers our model to focus on sentence
"regions" salient to the current world state by using multiple abstractions of
the input sentence. In contrast to existing methods, our model uses no
specialized linguistic resources (e.g., parsers) or task-specific annotations
(e.g., seed lexicons). It is therefore generalizable, yet still achieves the
best results reported to-date on a benchmark single-sentence dataset and
competitive results for the limited-training multi-sentence setting. We analyze
our model through a series of ablations that elucidate the contributions of the
primary components of our model.Comment: To appear at AAAI 2016 (and an extended version of a NIPS 2015
Multimodal Machine Learning workshop paper
Recasting Residual-based Local Descriptors as Convolutional Neural Networks: an Application to Image Forgery Detection
Local descriptors based on the image noise residual have proven extremely
effective for a number of forensic applications, like forgery detection and
localization. Nonetheless, motivated by promising results in computer vision,
the focus of the research community is now shifting on deep learning. In this
paper we show that a class of residual-based descriptors can be actually
regarded as a simple constrained convolutional neural network (CNN). Then, by
relaxing the constraints, and fine-tuning the net on a relatively small
training set, we obtain a significant performance improvement with respect to
the conventional detector
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images
We address the problem of fine-grained action localization from temporally
untrimmed web videos. We assume that only weak video-level annotations are
available for training. The goal is to use these weak labels to identify
temporal segments corresponding to the actions, and learn models that
generalize to unconstrained web videos. We find that web images queried by
action names serve as well-localized highlights for many actions, but are
noisily labeled. To solve this problem, we propose a simple yet effective
method that takes weak video labels and noisy image labels as input, and
generates localized action frames as output. This is achieved by cross-domain
transfer between video frames and web images, using pre-trained deep
convolutional neural networks. We then use the localized action frames to train
action recognition models with long short-term memory networks. We collect a
fine-grained sports action data set FGA-240 of more than 130,000 YouTube
videos. It has 240 fine-grained actions under 85 sports activities. Convincing
results are shown on the FGA-240 data set, as well as the THUMOS 2014
localization data set with untrimmed training videos.Comment: Camera ready version for ACM Multimedia 201
- …