MCRM: Mother Compact Recurrent Memory
LSTMs and GRUs are the most common recurrent neural network architectures
used to solve temporal sequence problems. The two architectures have differing
data flows dealing with a common component called the cell state (also referred
to as the memory). We attempt to enhance the memory by presenting a
modification that we call the Mother Compact Recurrent Memory (MCRM). MCRMs are
a type of nested LSTM-GRU architecture where the cell state is the GRU hidden
state. The concatenation of the forget-gate and input-gate interactions from
the LSTM is used as the input to the GRU cell. Because MCRMs have this type
of nesting, they have a compact memory pattern consisting of neurons that act
explicitly in both long-term and short-term fashions. For some specific tasks,
empirical results show that MCRMs outperform previously used architectures.
Comment: Submitted to AAAI-1
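The abstract pins the data flow down closely enough for a rough sketch. Below is a minimal PyTorch cell assuming standard LSTM gate equations; the class name, layer sizes, and exact wiring are chosen here for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class MCRMCell(nn.Module):
    """Sketch of an MCRM-style cell: an LSTM whose cell state is the hidden
    state of an inner GRU fed with the concatenated gate interactions."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Standard LSTM gate projections: input, forget, output, candidate.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        # Inner GRU whose hidden state plays the role of the memory c_t.
        self.inner = nn.GRUCell(2 * hidden_size, hidden_size)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        # The concatenated forget- and input-gate interactions drive the GRU;
        # its new hidden state becomes the memory.
        c = self.inner(torch.cat([f * c, i * g], dim=-1), c)
        h = o * torch.tanh(c)
        return h, (h, c)
```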
Nested LSTMs
We propose Nested LSTMs (NLSTM), a novel RNN architecture with multiple
levels of memory. Nested LSTMs add depth to LSTMs via nesting as opposed to
stacking. The value of a memory cell in an NLSTM is computed by an LSTM cell,
which has its own inner memory cell. Specifically, instead of computing the
value of the (outer) memory cell as $c_t^{outer} = f_t \odot c_{t-1} + i_t \odot g_t$, NLSTM memory
cells use the concatenation $(f_t \odot c_{t-1},\ i_t \odot g_t)$ as input to an inner LSTM (or
NLSTM) memory cell, and set $c_t^{outer} = h_t^{inner}$. Nested LSTMs outperform both stacked and
single-layer LSTMs with similar numbers of parameters in our experiments on
various character-level language modeling tasks, and the inner memories of an
LSTM learn longer term dependencies compared with the higher-level units of a
stacked LSTM.
Comment: Accepted at ACML 201
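The nesting can be sketched much like the MCRM example above, swapping the inner GRU for an LSTM cell and following the abstract's update $c_t^{outer} = h_t^{inner}$; gate layout and sizes are again assumptions.

```python
import torch
import torch.nn as nn

class NLSTMCell(nn.Module):
    """Sketch of a Nested LSTM cell: the additive memory update is replaced
    by an inner LSTM whose hidden state becomes the outer memory. The inner
    cell could itself be another NLSTMCell to nest more deeply."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.inner = nn.LSTMCell(2 * hidden_size, hidden_size)

    def forward(self, x, state):
        h, c_outer, inner_state = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        # (f * c_{t-1}, i * g) feeds the inner LSTM; its hidden state is the new memory.
        h_inner, c_inner = self.inner(torch.cat([f * c_outer, i * g], dim=-1), inner_state)
        c_outer = h_inner
        h = o * torch.tanh(c_outer)
        return h, (h, c_outer, (h_inner, c_inner))
```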
Understanding Recurrent Neural State Using Memory Signatures
We demonstrate a network visualization technique to analyze the recurrent
state inside the LSTMs/GRUs used commonly in language and acoustic models.
Interpreting intermediate state and network activations inside end-to-end
models remains an open challenge. Our method allows users to understand exactly
how much and what history is encoded inside recurrent state in grapheme
sequence models. Our procedure trains multiple decoders that predict prior
input history. Compiling results from these decoders, a user can obtain a
signature of the recurrent kernel that characterizes its memory behavior. We
demonstrate this method's usefulness in revealing information divergence in the
bases of recurrent factorized kernels, visualizing the character-level
differences between the memory of n-gram and recurrent language models, and
extracting knowledge of history encoded in the layers of grapheme-based
end-to-end ASR networks.
Comment: Accepted to 2018 IEEE International Conference on Acoustics, Speech and Signal Processing
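A hedged sketch of the probing idea as described: one small decoder per look-back offset is trained to recover the input seen k steps earlier from the current recurrent state, and the per-offset accuracies or losses form the memory signature. The sizes and the linear-probe choice below are assumptions.

```python
import torch
import torch.nn as nn

# One linear decoder per look-back offset k, trained to recover the input token
# seen k steps earlier from the current recurrent state.
hidden_size, vocab_size, max_lag = 256, 64, 8
decoders = nn.ModuleList([nn.Linear(hidden_size, vocab_size) for _ in range(max_lag)])

def signature_loss(states, tokens):
    # states: (T, B, hidden) recurrent states; tokens: (T, B) input ids, T > max_lag.
    loss = 0.0
    for k, dec in enumerate(decoders, start=1):
        logits = dec(states[k:])                       # state at t predicts token at t-k
        loss = loss + nn.functional.cross_entropy(
            logits.reshape(-1, vocab_size), tokens[:-k].reshape(-1))
    return loss
```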
AirScript - Creating Documents in Air
This paper presents a novel approach, called AirScript, for creating,
recognizing and visualizing documents in air. We present a novel algorithm,
called 2-DifViz, that converts the hand movements in air (captured by a
Myo-armband worn by a user) into a sequence of x, y coordinates on a 2D
Cartesian plane, and visualizes them on a canvas. Existing sensor-based
approaches either do not provide visual feedback or represent the recognized
characters using prefixed templates. In contrast, AirScript stands out by
giving freedom of movement to the user, as well as by providing a real-time
visual feedback of the written characters, making the interaction natural.
AirScript provides a recognition module to predict the content of the document
created in air. To do so, we present a novel approach based on deep learning,
which uses the sensor data and the visualizations created by 2-DifViz. The
recognition module consists of a Convolutional Neural Network (CNN) and two
Gated Recurrent Unit (GRU) Networks. The output from these three networks is
fused to get the final prediction about the characters written in air.
AirScript can be used in highly sophisticated environments like a smart
classroom, a smart factory or a smart laboratory, where it would enable people
to annotate pieces of text wherever they want without any reference surface.
We have evaluated AirScript against various well-known learning models (HMM,
KNN, SVM, etc.) on the data of 12 participants. Evaluation results show that
the recognition module of AirScript largely outperforms all of these models by
achieving an accuracy of 91.7% in a person-independent evaluation and a 96.7%
accuracy in a person-dependent evaluation.
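The abstract names the three sub-networks but not the fusion rule, so the sketch below simply sums their softmax outputs (equivalent to averaging for the argmax); the channel counts, feature shapes, and late-fusion choice are assumptions.

```python
import torch
import torch.nn as nn

num_classes = 10                                    # assumed label set size
cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
gru_emg = nn.GRU(input_size=8, hidden_size=32, batch_first=True)   # 8 EMG channels (assumed)
gru_imu = nn.GRU(input_size=3, hidden_size=32, batch_first=True)   # motion signal (assumed)
head_emg, head_imu = nn.Linear(32, num_classes), nn.Linear(32, num_classes)

def predict(canvas, emg, imu):
    # canvas: (B, 1, H, W) 2-DifViz rendering; emg/imu: (B, T, channels) sensor streams.
    _, h_e = gru_emg(emg)
    _, h_i = gru_imu(imu)
    scores = torch.softmax(cnn(canvas), -1) \
           + torch.softmax(head_emg(h_e[-1]), -1) \
           + torch.softmax(head_imu(h_i[-1]), -1)
    return scores.argmax(dim=-1)                    # fused character prediction
```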
ChronoNet: A Deep Recurrent Neural Network for Abnormal EEG Identification
Brain-related disorders such as epilepsy can be diagnosed by analyzing
electroencephalograms (EEG). However, manual analysis of EEG data requires
highly trained clinicians, and is a procedure that is known to have relatively
low inter-rater agreement (IRA). Moreover, the volume of the data and the rate
at which new data becomes available make manual interpretation a
time-consuming, resource-hungry, and expensive process. In contrast, automated
analysis of EEG data offers the potential to improve the quality of patient
care by shortening the time to diagnosis and reducing manual error. In this
paper, we focus on one of the first steps in interpreting an EEG session -
identifying whether the brain activity is abnormal or normal. To solve this
task, we propose a novel recurrent neural network (RNN) architecture termed
ChronoNet which is inspired by recent developments from the field of image
classification and designed to work efficiently with EEG data. ChronoNet is
formed by stacking multiple 1D convolution layers followed by deep gated
recurrent unit (GRU) layers where each 1D convolution layer uses multiple
filters of exponentially varying lengths and the stacked GRU layers are densely
connected in a feed-forward manner. We used the recently released TUH Abnormal
EEG Corpus dataset for evaluating the performance of ChronoNet. Unlike previous
studies using this dataset, ChronoNet directly takes time-series EEG as input
and learns meaningful representations of brain activity patterns. ChronoNet
outperforms the previously reported best results by 7.79%, thereby setting a new
benchmark for this dataset. Furthermore, we demonstrate the domain-independent
nature of ChronoNet by successfully applying it to classify speech commands.
Comment: 8 pages, 2 figures, 2 tables
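The two ingredients named in the abstract, multi-scale 1D convolutions and densely connected GRU layers, can be sketched as follows; kernel sizes, strides, and layer counts are illustrative, and the exact dense wiring in the paper may differ.

```python
import torch
import torch.nn as nn

class MultiScaleConv1d(nn.Module):
    """Parallel 1D convolutions with exponentially varying kernel lengths,
    concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch, kernels=(2, 4, 8), stride=2):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_ch, out_ch, k, stride=stride, padding=k // 2) for k in kernels])

    def forward(self, x):                            # x: (B, channels, time)
        outs = [b(x) for b in self.branches]
        t = min(o.shape[-1] for o in outs)           # align lengths before concatenating
        return torch.cat([o[..., :t] for o in outs], dim=1)

class DenseGRUStack(nn.Module):
    """GRU layers whose input is the concatenation of all earlier outputs,
    i.e. dense feed-forward connections between recurrent layers."""
    def __init__(self, in_size, hidden, layers=3):
        super().__init__()
        self.grus = nn.ModuleList(
            [nn.GRU(in_size + i * hidden, hidden, batch_first=True) for i in range(layers)])

    def forward(self, x):                            # x: (B, time, features)
        feats = [x]
        for gru in self.grus:
            out, _ = gru(torch.cat(feats, dim=-1))
            feats.append(out)
        return out[:, -1]                            # final state for the normal/abnormal decision
```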
Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features in a video sequence respectively, while the memory module
captures the evolution of objects over time. The module to build a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allow spatial
information to be propagated over time. We evaluate our method extensively on two
benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework.
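A minimal convolutional GRU cell of the kind such a visual memory can be built from, assuming standard GRU gating with the matrix products replaced by 2D convolutions; kernel size and channel counts are placeholders.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU gating with the matrix products replaced by 2D convolutions, so the
    state keeps its spatial layout while being carried across frames."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update/reset gates
        self.hn = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)      # candidate state

    def forward(self, x, h):
        # x: (B, in_ch, H, W) frame features; h: (B, hid_ch, H, W) visual memory.
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        n = torch.tanh(self.hn(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * n                   # updated memory for the next frame
```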
Coupled Recurrent Network (CRN)
Many semantic video analysis tasks can benefit from multiple, heterogeneous
signals. For example, in addition to the original RGB input sequences,
sequences of optical flow are usually used to boost the performance of human
action recognition in videos. To learn from these heterogeneous input sources,
existing methods rely on two-stream architectural designs that contain
independent, parallel streams of Recurrent Neural Networks (RNNs). However,
two-stream RNNs do not fully exploit the reciprocal information contained in
the multiple signals, let alone exploit it in a recurrent manner. To this end,
we propose in this paper a novel recurrent architecture, termed Coupled
Recurrent Network (CRN), to deal with multiple input sources. In CRN, the
parallel streams of RNNs are coupled together. A key design element of CRN is a Recurrent
Interpretation Block (RIB) that supports learning of reciprocal feature
representations from multiple signals in a recurrent manner. Unlike conventional
RNNs, which apply the training loss at each time step or only at the last time step,
we propose an effective and efficient training strategy for CRN. Experiments show
the efficacy of the proposed CRN. In particular, we achieve the new state of
the art on the benchmark datasets of human action recognition and multi-person
pose estimation.
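The abstract does not spell out the Recurrent Interpretation Block, so the following is only one plausible reading of "coupled" streams: at each step every stream's GRU also receives a projection of the other stream's previous hidden state. Feature sizes are placeholders.

```python
import torch
import torch.nn as nn

hid = 128
gru_rgb = nn.GRUCell(2048 + hid, hid)               # 2048-d per-frame features (assumed)
gru_flow = nn.GRUCell(2048 + hid, hid)
to_rgb, to_flow = nn.Linear(hid, hid), nn.Linear(hid, hid)

def coupled_step(x_rgb, x_flow, h_rgb, h_flow):
    # Each stream's update also sees a projection of the other stream's state.
    new_rgb = gru_rgb(torch.cat([x_rgb, to_rgb(h_flow)], dim=-1), h_rgb)
    new_flow = gru_flow(torch.cat([x_flow, to_flow(h_rgb)], dim=-1), h_flow)
    return new_rgb, new_flow
```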
An End-to-End Approach to Automatic Speech Assessment for Cantonese-speaking People with Aphasia
Conventional automatic assessment of pathological speech usually follows two
main steps: (1) extraction of pathology-specific features; (2) classification
or regression on extracted features. Given the great variety of speech and
language disorders, feature design is never a straightforward task, and yet it
is most crucial to the performance of assessment. This paper presents an
end-to-end approach to automatic speech assessment for Cantonese-speaking
People With Aphasia (PWA). The assessment is formulated as a binary
classification task to discriminate PWA with high scores of subjective
assessment from those with low scores. The sequence-to-one Recurrent Neural
Network with Gated Recurrent Unit (GRU-RNN) and Convolutional Neural Network
(CNN) models are applied to realize the end-to-end mapping from fundamental
speech features to the classification result. The pathology-specific features
used for assessment can be learned implicitly by the neural network model.
The Class Activation Mapping (CAM) method is used to visualize how those
features contribute to the assessment result. Our experimental results show
that the end-to-end approach outperforms the conventional two-step approach in
the classification task, and confirm that the CNN model is able to learn
impairment-related features that are similar to human-designed features. The
experimental results also suggest that the CNN model performs better than the
sequence-to-one GRU-RNN model in this specific task.
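A minimal sequence-to-one GRU classifier of the kind described, mapping frame-level speech features to a binary high/low-score decision; the feature dimension and hidden size are illustrative.

```python
import torch
import torch.nn as nn

class Seq2OneGRU(nn.Module):
    """Frame-level speech features in, one binary (high/low score) decision out."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)              # high vs. low subjective score

    def forward(self, frames):                        # frames: (B, T, feat_dim)
        _, h = self.gru(frames)
        return self.head(h[-1])                       # logits from the final state only
```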
Cell-aware Stacked LSTMs for Modeling Sentences
We propose a method of stacking multiple long short-term memory (LSTM) layers
for modeling sentences. In contrast to the conventional stacked LSTMs where
only hidden states are fed as input to the next layer, the suggested
architecture accepts both hidden and memory cell states of the preceding layer
and fuses information from the left and the lower context using the soft gating
mechanism of LSTMs. Thus the architecture modulates the amount of information
to be delivered not only in horizontal recurrence but also in vertical
connections, from which useful features extracted from lower layers are
effectively conveyed to upper layers. We dub this architecture Cell-aware
Stacked LSTM (CAS-LSTM) and show from experiments that our models bring
significant performance gain over the standard LSTMs on benchmark datasets for
natural language inference, paraphrase detection, sentiment classification, and
machine translation. We also conduct extensive qualitative analysis to
understand the internal behavior of the suggested approach.
Comment: ACML 201
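A sketch of the cell-aware idea: in addition to the usual LSTM gates, an extra gate decides how much of the lower layer's cell state enters this layer's memory, so information flows vertically as well as horizontally. The exact gate equations here are a guess consistent with the abstract, not the paper's, and equal layer widths are assumed.

```python
import torch
import torch.nn as nn

class CASLSTMCell(nn.Module):
    """Cell-aware stacked LSTM cell sketch: a fifth gate admits the lower
    layer's cell state into this layer's memory (equal widths assumed)."""
    def __init__(self, size):
        super().__init__()
        self.gates = nn.Linear(2 * size, 5 * size)

    def forward(self, h_below, c_below, state):
        h, c = state                                  # this layer's previous hidden/cell state
        i, f, o, g, l = self.gates(torch.cat([h_below, h], dim=-1)).chunk(5, dim=-1)
        i, f, o, l = map(torch.sigmoid, (i, f, o, l))
        c = f * c + l * c_below + i * torch.tanh(g)   # left and lower memory, both gated
        h = o * torch.tanh(c)
        return h, (h, c)
```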
Delving Deeper into Convolutional Networks for Learning Video Representations
We propose an approach to learn spatio-temporal features in videos from
intermediate visual representations we call "percepts" using
Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts
that are extracted from all levels of a deep convolutional network trained on
the large ImageNet dataset. While high-level percepts contain highly
discriminative information, they tend to have a low-spatial resolution.
Low-level percepts, on the other hand, preserve a higher spatial resolution
from which we can model finer motion patterns. Using low-level percepts can
lead to high-dimensional video representations. To mitigate this effect and
control the number of model parameters, we introduce a variant of the GRU model
that leverages the convolution operations to enforce sparse connectivity of the
model units and share parameters across the input spatial locations.
We empirically validate our approach on both Human Action Recognition and
Video Captioning tasks. In particular, we achieve results equivalent to the
state of the art on the YouTube2Text dataset using a simpler text-decoder model and
without extra 3D CNN features.
Comment: ICLR 201
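A back-of-the-envelope comparison, with assumed sizes, of why swapping the dense GRU recurrences for convolutions (as in the ConvGRU cell sketched earlier) keeps the parameter count manageable when low-level, high-resolution percepts are used.

```python
# Rough weight counts (biases ignored) for one GRU layer over a 7x7x512 percept.
H, W, C, hidden = 7, 7, 512, 256                       # assumed percept and state sizes
dense_in = H * W * C                                   # flattened percept
dense_params = 3 * (dense_in + hidden) * hidden        # 3 gates, full matrices: ~19.5M weights
conv_params = 3 * 3 * 3 * (C + hidden) * hidden        # 3 gates, 3x3 kernels shared over space: ~5.3M
print(f"dense GRU: {dense_params:,} weights, conv GRU: {conv_params:,} weights")
```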