Clustering and Recognition of Spatiotemporal Features through Interpretable Embedding of Sequence to Sequence Recurrent Neural Networks
Encoder-decoder recurrent neural network models (RNN Seq2Seq) have achieved
great success in many areas of computation and their applications. They have
been shown to successfully model data with both temporal and spatial
dependencies for translation and prediction tasks. In this study, we propose an embedding
approach to visualize and interpret the representation of data by these models.
Furthermore, we show that the embedding is an effective method for unsupervised
learning and can be utilized to estimate the optimality of model training. In
particular, we demonstrate that embedding space projections of the decoder
states of an RNN Seq2Seq model trained on sequence prediction are organized in
clusters capturing similarities and differences in the dynamics of these
sequences. This behavior amounts to an unsupervised clustering of
spatio-temporal features and can be employed for time-dependent problems such
as temporal segmentation, clustering of dynamic activity, self-supervised
classification, action recognition, failure prediction, etc. We test and
demonstrate the application of the embedding methodology to time-sequences of
3D human body poses. We show that the methodology provides a high-quality
unsupervised categorization of movements.
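The clustering idea above can be sketched in a few lines: collect decoder hidden states, project them into a low-dimensional embedding space, and cluster the projections. The PCA projection, plain k-means, and toy "decoder states" below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def pca_project(states, dim=2):
    """Project state vectors onto their top principal components."""
    centered = states - states.mean(axis=0)
    # Rows of vt span the directions of maximal variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def kmeans(points, k=2, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy "decoder states": two well-separated dynamics should form two clusters.
rng = np.random.default_rng(1)
states = np.vstack([rng.normal(0.0, 0.1, (20, 8)),
                    rng.normal(3.0, 0.1, (20, 8))])
labels = kmeans(pca_project(states), k=2)
```

On this toy data the two simulated dynamics land in separate clusters, which is the behavior the abstract describes for sequence embeddings.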
Modeling Latent Attention Within Neural Networks
Deep neural networks are able to solve tasks across a variety of domains and
modalities of data. Despite many empirical successes, we lack the ability to
clearly understand and interpret the learned internal mechanisms that
contribute to such effective behaviors or, more critically, failure modes. In
this work, we present a general method for visualizing an arbitrary neural
network's inner mechanisms, along with their power and limitations. Our dataset-centric
method produces visualizations of how a trained network attends to components
of its inputs. The computed "attention masks" support improved interpretability
by highlighting which input attributes are critical in determining output. We
demonstrate the effectiveness of our framework on a variety of deep neural
network architectures in domains including computer vision, natural language
processing, and reinforcement learning. The primary contribution of our
approach is an interpretable visualization of attention that provides unique
insights into the network's underlying decision-making process irrespective of
the data modality.
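One simple way to realize an input "attention mask" of this kind is occlusion: measure how much the output changes when each input component is zeroed out. The toy linear network below is an assumption for illustration; the paper's dataset-centric method may differ.

```python
import numpy as np

def network(x, w):
    """A stand-in 'trained network': a fixed linear map plus nonlinearity."""
    return np.tanh(x @ w)

def occlusion_mask(x, w):
    """Per-feature importance: output change when that feature is occluded."""
    base = network(x, w)
    mask = np.zeros_like(x)
    for i in range(len(x)):
        occluded = x.copy()
        occluded[i] = 0.0  # zero out one input attribute
        mask[i] = np.abs(network(occluded, w) - base).sum()
    return mask / (mask.max() + 1e-12)  # normalize to [0, 1]

w = np.array([[5.0], [0.0], [0.0]])   # only the first feature matters
x = np.array([0.2, 1.0, 1.0])
mask = occlusion_mask(x, w)
```

Here the mask correctly highlights the first input attribute as the one determining the output, which is the kind of interpretability signal the abstract describes.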
Trajectory-based Radical Analysis Network for Online Handwritten Chinese Character Recognition
Recently, great progress has been made for online handwritten Chinese
character recognition due to the emergence of deep learning techniques.
However, previous research mostly treated each Chinese character as one class
without explicitly considering its inherent structure, namely the radical
components with complicated geometry. In this study, we propose a novel
trajectory-based radical analysis network (TRAN) that first identifies radicals
and simultaneously analyzes the two-dimensional structures among them, then
recognizes Chinese characters by generating captions based on the analysis of
their internal radicals. The proposed TRAN employs recurrent neural
networks (RNNs) as both an encoder and a decoder. The RNN encoder makes full
use of online information by directly transforming handwriting trajectory into
high-level features. The RNN decoder aims at generating the caption by
detecting radicals and spatial structures through an attention model. The
manner of treating a Chinese character as a two-dimensional composition of
radicals can reduce the size of the vocabulary and enable TRAN to recognize
unseen Chinese character classes, provided that the corresponding radicals have
been seen. Evaluated on the CASIA-OLHWDB database, the
proposed approach significantly outperforms the state-of-the-art
whole-character modeling approach with a relative character error rate (CER)
reduction of 10%. Meanwhile, for the case of recognition of 500 unseen Chinese
characters, TRAN can achieve a character accuracy of about 60% while the
traditional whole-character method is unable to handle them at all.
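The vocabulary-reduction argument can be illustrated with a small data structure: characters described as structure tags over an ordered list of radicals. The decompositions below are simplified illustrations, not a faithful radical dictionary, and the caption format is a hypothetical stand-in for the decoder's output.

```python
# Caption-style decompositions: spatial structure tag + ordered radicals.
decompositions = {
    "好": ("left-right", ["女", "子"]),
    "妈": ("left-right", ["女", "马"]),
    "骂": ("top-bottom", ["口", "口", "马"]),
}

# The radical vocabulary is much smaller than the character vocabulary:
# three characters here are covered by only four distinct radicals.
radical_vocab = sorted({r for _, rads in decompositions.values() for r in rads})

def caption(char):
    """Render a character as a radical 'caption', as a decoder might emit."""
    structure, radicals = decompositions[char]
    return f"{structure}({' '.join(radicals)})"
```

A model that emits such captions can describe an unseen character as long as its radicals and structure tags were seen in training, which is the source of the zero-shot capability the abstract reports.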
Towards Distortion-Predictable Embedding of Neural Networks
Current research in Computer Vision has shown that Convolutional Neural
Networks (CNN) give state-of-the-art performance in many classification tasks
and Computer Vision problems. The embedding of a CNN, which is the internal
representation produced by the last layer, can indirectly learn topological and
relational properties. Moreover, by using a suitable loss function, CNN models
can learn invariance to a wide range of non-linear distortions such as
rotation, viewpoint angle or lighting condition. In this work, new insights are
discovered about CNN embeddings and a new loss function is proposed, derived
from the contrastive loss, that creates models with more predictable mappings
and also quantifies distortions. In typical distortion-dependent methods, there
is no simple relation between the features of one image and the features of its
distorted versions. Therefore, these methods require feed-forwarding inputs
under every distortion in order to find the corresponding feature
representations. Our contribution makes a step towards embeddings
where features of distorted inputs are related and can be derived from each
other by the intensity of the distortion.
Comment: 54 pages, 28 figures. Master project at EPFL (Switzerland) in 2015.
For source code on GitHub, see https://github.com/axel-angel/master-projec
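The core idea can be sketched as a loss that makes embedding distance track distortion intensity, rather than merely pulling an image and its distorted copy together as a plain contrastive loss would. The linear target alpha * intensity below is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def distortion_predictable_loss(f_x, f_xd, intensity, alpha=1.0):
    """Encourage ||f(x) - f(x_d)|| to equal alpha * distortion intensity."""
    dist = np.linalg.norm(f_x - f_xd)
    return (dist - alpha * intensity) ** 2

# If the embedding distance already matches alpha * intensity, the loss is 0;
# any mismatch is penalized quadratically.
f_x = np.array([0.0, 0.0])
f_xd = np.array([3.0, 4.0])          # embedding distance 5
loss_good = distortion_predictable_loss(f_x, f_xd, intensity=5.0)
loss_bad = distortion_predictable_loss(f_x, f_xd, intensity=1.0)
```

Under such a loss, the feature vector of a distorted input becomes predictable from the clean input's features plus the distortion intensity, which is the property the abstract argues for.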
MCRM: Mother Compact Recurrent Memory
LSTMs and GRUs are the most common recurrent neural network architectures
used to solve temporal sequence problems. The two architectures have differing
data flows dealing with a common component called the cell state (also referred
to as the memory). We attempt to enhance the memory by presenting a
modification that we call the Mother Compact Recurrent Memory (MCRM). MCRMs are
a type of a nested LSTM-GRU architecture where the cell state is the GRU hidden
state. The concatenation of the forget gate and input gate interactions from
the LSTM are considered an input to the GRU cell. Because of this type of
nesting, MCRMs have a compact memory pattern consisting of neurons that act
explicitly in both long-term and short-term fashions. For some specific tasks,
empirical results show that MCRMs outperform previously used architectures.
Comment: Submitted to AAAI-1
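A single MCRM step, as described above, can be sketched directly from the definition: an LSTM whose cell state is maintained by an inner GRU, with the concatenated forget-gate and input-gate interactions serving as the GRU's input. The weight shapes and the absence of biases are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcrm_step(x, h, c, params):
    """One MCRM step: the GRU hidden state plays the role of the LSTM cell state."""
    xh = np.concatenate([x, h])
    # Standard LSTM gates.
    f = sigmoid(params["Wf"] @ xh)           # forget gate
    i = sigmoid(params["Wi"] @ xh)           # input gate
    o = sigmoid(params["Wo"] @ xh)           # output gate
    g = np.tanh(params["Wg"] @ xh)           # candidate memory
    u = np.concatenate([f * c, i * g])       # GRU input: gate interactions
    # Inner GRU updating the cell state c.
    z = sigmoid(params["Wz"] @ u + params["Uz"] @ c)      # update gate
    r = sigmoid(params["Wr"] @ u + params["Ur"] @ c)      # reset gate
    c_tilde = np.tanh(params["Wc"] @ u + params["Uc"] @ (r * c))
    c_new = (1 - z) * c + z * c_tilde
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
dx, dh = 3, 4
params = {}
for name in ["Wf", "Wi", "Wo", "Wg"]:
    params[name] = rng.normal(size=(dh, dx + dh))
for name in ["Wz", "Wr", "Wc"]:
    params[name] = rng.normal(size=(dh, 2 * dh))
for name in ["Uz", "Ur", "Uc"]:
    params[name] = rng.normal(size=(dh, dh))
h, c = mcrm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), params)
```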
A Generative Model for Volume Rendering
We present a technique to synthesize and analyze volume-rendered images using
generative models. We use the Generative Adversarial Network (GAN) framework to
compute a model from a large collection of volume renderings, conditioned on
(1) viewpoint and (2) transfer functions for opacity and color. Our approach
facilitates tasks for volume analysis that are challenging to achieve using
existing rendering techniques such as ray casting or texture-based methods. We
show how to guide the user in transfer function editing by quantifying expected
change in the output image. Additionally, the generative model transforms
transfer functions into a view-invariant latent space specifically designed to
synthesize volume-rendered images. We use this space directly for rendering,
enabling the user to explore the space of volume-rendered images. As our model
is independent of the choice of volume rendering process, we show how to
analyze volume-rendered images produced by direct and global illumination
lighting, for a variety of volume datasets.
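The conditioning mechanism can be sketched as concatenating a latent code with the viewpoint parameters and a flattened opacity/color transfer function, then feeding the result to the generator. The tiny MLP and the specific parameterization below are toy stand-ins, not the paper's GAN architecture.

```python
import numpy as np

def condition(latent, viewpoint, transfer_fn):
    """Build the generator input from a latent code + rendering conditions."""
    return np.concatenate([latent, viewpoint, transfer_fn.ravel()])

def toy_generator(z, w1, w2):
    """Tiny MLP standing in for the GAN generator: emits a flat 'image'."""
    return np.tanh(np.maximum(z @ w1, 0.0) @ w2)

rng = np.random.default_rng(0)
latent = rng.normal(size=8)
viewpoint = np.array([0.3, 1.2])            # e.g. azimuth, elevation (assumed)
transfer_fn = rng.uniform(size=(16, 4))     # 16 opacity/color bins x RGBA (assumed)
z = condition(latent, viewpoint, transfer_fn)
w1 = rng.normal(size=(z.size, 32))
w2 = rng.normal(size=(32, 64))
image = toy_generator(z, w1, w2)
```

Holding the transfer function fixed while varying the viewpoint (or vice versa) then probes exactly the kinds of conditioned analyses the abstract describes.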
Parallel Attention Mechanisms in Neural Machine Translation
Recent papers in neural machine translation have proposed the strict use of
attention mechanisms over previous standards such as recurrent and
convolutional neural networks (RNNs and CNNs). We propose that by running the
traditionally stacked encoding branches of encoder-decoder attention-focused
architectures in parallel, even more sequential operations can be removed
from the model, thereby decreasing training time. In particular, we modify the
recently published attention-based architecture called Transformer by Google,
by replacing sequential attention modules with parallel ones, reducing the
amount of training time and substantially improving BLEU scores at the same
time. Experiments over the English to German and English to French translation
tasks show that our model establishes a new state of the art.
Comment: ICMLA 2018, 6 pages
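The parallel-branch idea can be sketched with scaled dot-product attention evaluated over independent projected branches: because the branches share no sequential dependency, they could run concurrently. Averaging the branch outputs is an illustrative merge rule, not necessarily the paper's.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def parallel_attention(q, k, v, branch_projections):
    """Apply attention in each projected branch, then merge by averaging."""
    outputs = [attention(q @ p, k @ p, v @ p) for p in branch_projections]
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
branches = [rng.normal(size=(8, 8)) for _ in range(2)]
out = parallel_attention(q, k, v, branches)
```

Since no branch waits on another's output, the loop over branches is the only sequential artifact of this sketch; in a real implementation the branches would be dispatched in parallel.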
Image Captioning Based on a Hierarchical Attention Mechanism and Policy Gradient Optimization
Automatically generating the descriptions of an image, i.e., image
captioning, is an important and fundamental topic in artificial intelligence,
which bridges the gap between computer vision and natural language processing.
Building on successful deep learning models, especially CNNs and Long
Short-Term Memory networks (LSTMs) with an attention mechanism, we propose a
hierarchical attention model that utilizes both the global CNN features and the local
object features for more effective feature representation and reasoning in
image captioning. The generative adversarial network (GAN), together with a
reinforcement learning (RL) algorithm, is applied to solve the exposure bias
problem in RNN-based supervised training for language problems. In addition,
through the automatic measurement of the consistency between the generated
caption and the image content by the discriminator in the GAN framework and RL
optimization, we make the final generated sentences more accurate and
natural. Comprehensive experiments show the improved performance of the
hierarchical attention mechanism and the effectiveness of our RL-based
optimization method. Our model achieves state-of-the-art results on several
important metrics on the MSCOCO dataset, using only greedy inference.
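The policy-gradient idea used to sidestep exposure bias can be sketched as REINFORCE: sample a caption from the model, score it with a reward, and weight the log-likelihood gradient by (reward - baseline). The toy token-overlap reward below stands in for the GAN discriminator's consistency score and is an assumption for illustration.

```python
def reward(sampled, reference):
    """Toy reward: fraction of reference tokens present in the sample."""
    return len(set(sampled) & set(reference)) / len(set(reference))

def reinforce_weight(sampled, reference, baseline):
    """Scalar that scales grad log p(sampled) in the REINFORCE update."""
    return reward(sampled, reference) - baseline

ref = ["a", "dog", "on", "grass"]
good = reinforce_weight(["a", "dog"], ref, baseline=0.25)
bad = reinforce_weight(["the", "cat"], ref, baseline=0.25)
```

Captions scoring above the baseline get a positive weight (their tokens are reinforced), while those scoring below it are suppressed; crucially, the reward is computed on the model's own samples rather than on teacher-forced prefixes, which is what removes the exposure bias.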
A GRU-based Encoder-Decoder Approach with Attention for Online Handwritten Mathematical Expression Recognition
In this study, we present a novel end-to-end approach based on the
encoder-decoder framework with the attention mechanism for online handwritten
mathematical expression recognition (OHMER). First, the input two-dimensional
ink trajectory information of a handwritten expression is encoded via a gated
recurrent unit based recurrent neural network (GRU-RNN). Then the decoder is
also implemented by the GRU-RNN with a coverage-based attention model. The
proposed approach can simultaneously accomplish the symbol recognition and
structural analysis to output a character sequence in LaTeX format. Validated
on the CROHME 2014 competition task, our approach significantly outperforms the
state-of-the-art with an expression recognition accuracy of 52.43% by only
using the official training dataset. Furthermore, the alignments between the
input trajectories of handwritten expressions and the output LaTeX sequences
are visualized by the attention mechanism to show the effectiveness of the
proposed method.
Comment: Accepted by the ICDAR 2017 conference
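One decoding step of coverage-based attention can be sketched as follows: a coverage vector accumulates past attention weights and penalizes re-attending to encoder positions that are already covered. The subtractive scoring form and the penalty weight beta are common choices and assumptions here, not necessarily the paper's exact formulation.

```python
import numpy as np

def coverage_attention_step(query, keys, coverage, beta=1.0):
    """One decoding step: score each position, discounting covered ones."""
    scores = keys @ query - beta * coverage      # penalize covered positions
    weights = np.exp(scores - scores.max())      # stable softmax
    weights /= weights.sum()
    return weights, coverage + weights           # weights + updated coverage

keys = np.eye(3)                    # three encoder positions (toy features)
query = np.array([1.0, 0.0, 0.0])   # decoder state aligned with position 0
cov = np.zeros(3)
w1, cov = coverage_attention_step(query, keys, cov, beta=5.0)
w2, cov = coverage_attention_step(query, keys, cov, beta=5.0)
```

On the first step attention concentrates on position 0; on the second, the accumulated coverage pushes it away from that position, which is how coverage prevents over- and under-attending to trajectory segments.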
Learning a bidirectional mapping between human whole-body motion and natural language using deep recurrent neural networks
Linking human whole-body motion and natural language is of great interest for
the generation of semantic representations of observed human behaviors as well
as for the generation of robot behaviors based on natural language input. While
there has been a large body of research in this area, most approaches that
exist today require a symbolic representation of motions (e.g. in the form of
motion primitives), which has to be defined a priori or requires complex
segmentation algorithms. In contrast, recent advances in the field of neural
networks and especially deep learning have demonstrated that sub-symbolic
representations that can be learned end-to-end usually outperform more
traditional approaches, for applications such as machine translation. In this
paper we propose a generative model that learns a bidirectional mapping between
human whole-body motion and natural language using deep recurrent neural
networks (RNNs) and sequence-to-sequence learning. Our approach does not
require any segmentation or manual feature engineering and learns a distributed
representation, which is shared for all motions and descriptions. We evaluate
our approach on 2,846 human whole-body motions and 6,187 natural language
descriptions thereof from the KIT Motion-Language Dataset. Our results clearly
demonstrate the effectiveness of the proposed model: We show that our model
generates a wide variety of realistic motions from descriptions in the form of
a single sentence alone. Conversely, our model is also capable of generating
correct and detailed natural language descriptions from human motions.
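The bidirectional mapping rests on a shared distributed representation: both modalities are encoded into one latent space, from which either modality can be decoded. The linear encoders and decoders below are toy stand-ins for the paper's RNN sequence-to-sequence models, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_motion, d_text, d_latent = 12, 6, 4   # assumed toy dimensions

# One encoder and one decoder per modality, all targeting the shared latent.
enc_motion = rng.normal(size=(d_motion, d_latent))
enc_text = rng.normal(size=(d_text, d_latent))
dec_motion = rng.normal(size=(d_latent, d_motion))
dec_text = rng.normal(size=(d_latent, d_text))

# motion -> shared latent -> natural-language side
motion = rng.normal(size=d_motion)
latent = motion @ enc_motion
text_out = latent @ dec_text

# text -> shared latent -> motion side, through the same latent space
text = rng.normal(size=d_text)
motion_out = (text @ enc_text) @ dec_motion
```

Because both directions pass through the same latent space, training either direction shapes the representation used by the other, which is what makes the mapping bidirectional without any manual segmentation.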