3,492 research outputs found
Sequence Training and Adaptation of Highway Deep Neural Networks
The highway deep neural network (HDNN) is a type of depth-gated feedforward
neural network, which has been shown to be easier to train with more hidden layers
and also to generalise better compared to conventional plain deep neural networks
(DNNs). Previously, we investigated a structured HDNN architecture for speech
recognition, in which the two gate functions were tied across all the hidden
layers, and we were able to train a much smaller model without sacrificing the
recognition accuracy. In this paper, we carry on the study of this architecture
with sequence-discriminative training criterion and speaker adaptation
techniques on the AMI meeting speech recognition corpus. We show that these two
techniques improve speech recognition accuracy on top of the model trained with
the cross entropy criterion. Furthermore, we demonstrate that the two gate
functions that are tied across all the hidden layers are able to control the
information flow over the whole network, and we can achieve considerable
improvements by only updating these gate functions in both sequence training
and adaptation experiments.
Comment: 6 pages, 3 figures, published at IEEE SLT 2016. arXiv admin note:
text overlap with arXiv:1610.0581
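A minimal sketch of the tied-gate idea in the abstract above, assuming the common highway formulation h_l = T(h_{l-1}) * H_l(h_{l-1}) + C(h_{l-1}) * h_{l-1}, where each layer has its own transform H_l but a single transform gate T and a single carry gate C are shared by every layer. The function names, the sigmoid hidden activation, and the hand-set weights are illustrative assumptions, not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def highway_forward(x, layers, transform_gate, carry_gate):
    """Run x through highway layers whose two gates are tied across depth.

    layers:         list of (weights, bias) pairs, one per hidden layer
    transform_gate: single (weights, bias) pair reused by every layer
    carry_gate:     single (weights, bias) pair reused by every layer
    """
    h = x
    for weights, bias in layers:
        candidate = [sigmoid(z) for z in affine(weights, bias, h)]
        t = [sigmoid(z) for z in affine(*transform_gate, h)]
        c = [sigmoid(z) for z in affine(*carry_gate, h)]
        # Gate the new activation and the carried-over input elementwise.
        h = [ti * new + ci * old
             for ti, new, ci, old in zip(t, candidate, c, h)]
    return h

# Hypothetical 2-unit, 3-layer network with all weights set to zero.
zeros = [[0.0, 0.0], [0.0, 0.0]]
layers = [(zeros, [0.0, 0.0])] * 3
closed_transform = (zeros, [-50.0, -50.0])  # T close to 0
open_carry = (zeros, [50.0, 50.0])          # C close to 1
print(highway_forward([0.3, -0.7], layers, closed_transform, open_carry))
```

With the transform gate closed and the carry gate open, the input passes through all three layers essentially unchanged. Because only two small gate parameter sets govern how information moves through every layer, updating the gates alone can change the behaviour of the whole network.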
ACE: Adapting to Changing Environments for Semantic Segmentation
Deep neural networks exhibit exceptional accuracy when they are trained and
tested on the same data distributions. However, neural classifiers are often
extremely brittle when confronted with domain shift---changes in the input
distribution that occur over time. We present ACE, a framework for semantic
segmentation that dynamically adapts to changing environments over time. By
aligning the distribution of labeled training data from the original source
domain with the distribution of incoming data in a shifted domain, ACE
synthesizes labeled training data for environments as it sees them. This
stylized data is then used to update a segmentation model so that it performs
well in new environments. To avoid forgetting knowledge from past environments,
we introduce a memory that stores feature statistics from previously seen
domains. These statistics can be used to replay images in any of the previously
observed domains, thus preventing catastrophic forgetting. In addition to
standard batch training using stochastic gradient descent (SGD), we also
experiment with fast adaptation methods based on adaptive meta-learning.
Extensive experiments are conducted on two datasets from SYNTHIA; the results
demonstrate the effectiveness of the proposed approach when adapting to a
number of tasks.
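The memory of feature statistics described above can be sketched as follows. The abstract does not say which statistics are stored, so this example assumes per-channel mean and standard deviation and re-styles source features by matching those moments, in the manner of adaptive instance normalization. The names `memory`, `remember`, and `replay` are hypothetical.

```python
import math

def channel_stats(values, eps=1e-5):
    """Mean and standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def restyle(source_channel, target_mean, target_std):
    """Shift and scale a source channel so it matches the target moments."""
    mean, std = channel_stats(source_channel)
    return [(v - mean) / std * target_std + target_mean
            for v in source_channel]

memory = {}  # domain name -> list of (mean, std), one pair per channel

def remember(domain, feature_map):
    """Store only the statistics of a domain, not its images."""
    memory[domain] = [channel_stats(channel) for channel in feature_map]

def replay(domain, source_feature_map):
    """Re-style labeled source features to resemble a remembered domain."""
    return [restyle(channel, mean, std)
            for channel, (mean, std) in zip(source_feature_map, memory[domain])]

# One-channel toy example: remember a shifted domain, then replay it later.
remember("fog", [[10.0, 12.0, 14.0, 16.0]])
print(replay("fog", [[1.0, 2.0, 3.0, 4.0]]))
```

Storing a pair of numbers per channel is far cheaper than storing images, which is what makes replaying earlier domains practical in this sketch.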
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progresses made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version with latest literature until ICASSP2018 of
the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning based
Acoustic Models," vol.4, no.3, IEEE/CAA Journal of Automatica Sinica, 201
Semi-tied Units for Efficient Gating in LSTM and Highway Networks
Gating is a key technique used for integrating information from multiple
sources by long short-term memory (LSTM) models and has recently also been
applied to other models such as the highway network. Although gating is
powerful, it is rather expensive in terms of both computation and storage as
each gating unit uses a separate full weight matrix. This issue can be severe
since several gates can be used together in e.g. an LSTM cell. This paper
proposes a semi-tied unit (STU) approach to solve this efficiency issue, which
uses one shared weight matrix to replace those in all the units in the same
layer. The approach is termed "semi-tied" since extra parameters are used to
separately scale each of the shared output values. These extra scaling factors
are associated with the network activation functions and result in the use of
parameterised sigmoid, hyperbolic tangent, and rectified linear unit functions.
Speech recognition experiments using British English multi-genre broadcast data
showed that using STUs can reduce the calculation and storage cost by a factor
of three for highway networks and four for LSTMs, while giving similar word
error rates to the original models.
Comment: To appear in Proc. INTERSPEECH 2018, September 2-6, 2018, Hyderabad,
Indi
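To make the semi-tied idea concrete, the sketch below assumes that one shared affine transform produces a single pre-activation vector and that each unit type then applies its own per-dimension scaling factors inside its activation. For brevity every unit uses a scaled sigmoid here, whereas the abstract also mentions parameterised tanh and ReLU functions. The function names and the layer sizes are illustrative assumptions.

```python
import math

def semi_tied_units(shared_weights, shared_bias, scales, x):
    """Compute several gates from ONE shared weight matrix.

    scales maps a gate name to its own vector of scaling factors,
    which are the only gate-specific parameters in this sketch.
    """
    z = [sum(w * v for w, v in zip(row, x)) + b
         for row, b in zip(shared_weights, shared_bias)]
    return {name: [1.0 / (1.0 + math.exp(-a * zi)) for a, zi in zip(alpha, z)]
            for name, alpha in scales.items()}

def parameter_counts(n_units, n_inputs, n_gates):
    """Parameters with separate matrices versus one shared matrix plus scales."""
    per_matrix = n_units * n_inputs + n_units  # weights plus bias
    untied = n_gates * per_matrix
    semi_tied = per_matrix + n_gates * n_units
    return untied, semi_tied

gates = semi_tied_units([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                        {"input": [1.0, 1.0], "forget": [2.0, 0.5]},
                        [0.4, -0.4])
print(gates)

untied, semi_tied = parameter_counts(n_units=1000, n_inputs=1000, n_gates=4)
print(untied / semi_tied)
```

Under these assumptions the parameter ratio approaches the number of tied units as the layer grows, which is consistent with the reported factors of roughly three for highway networks and four for LSTMs.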
Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling
Recurrent neural networks have shown remarkable success in modeling
sequences. However, low-resource situations still adversely affect the
generalizability of these models. We introduce a new family of models, called
Lattice Recurrent Units (LRU), to address the challenge of learning deep
multi-layer recurrent models with limited resources. LRU models achieve this
goal by creating distinct (but coupled) flows of information inside the units: a
first flow along the time dimension and a second flow along the depth dimension.
This design also offers symmetry in how information can flow horizontally and vertically.
We analyze the effects of decoupling three different components of our LRU
model: Reset Gate, Update Gate and Projected State. We evaluate this family of
new LRU models on computational convergence rates and statistical efficiency.
Our experiments are performed on four publicly-available datasets, comparing
with Grid-LSTM and Recurrent Highway networks. Our results show that LRU has
better empirical computational convergence rates and statistical efficiency
values, along with learning more accurate language models.
Comment: 8 pages, 7 figure
Multi-Cast Attention Networks for Retrieval-based Question Answering and Response Prediction
Attention is typically used to select informative sub-phrases that are used
for prediction. This paper investigates the novel use of attention as a form of
feature augmentation, i.e., casted attention. We propose Multi-Cast Attention
Networks (MCAN), a new attention mechanism and general model architecture for a
potpourri of ranking tasks in the conversational modeling and question
answering domains. Our approach performs a series of soft attention operations,
each time casting a scalar feature upon the inner word embeddings. The key idea
is to provide a real-valued hint (feature) to a subsequent encoder layer and is
targeted at improving the representation learning process. There are several
advantages to this design, e.g., it allows an arbitrary number of attention
mechanisms to be casted, allowing for multiple attention types (e.g.,
co-attention, intra-attention) and attention variants (e.g., alignment-pooling,
max-pooling, mean-pooling) to be executed simultaneously. This not only
eliminates the costly need to tune the nature of the co-attention layer, but
also provides greater extents of explainability to practitioners. Via extensive
experiments on four well-known benchmark datasets, we show that MCAN achieves
state-of-the-art performance. On the Ubuntu Dialogue Corpus, MCAN outperforms
existing state-of-the-art models by . MCAN also achieves the best
performing score to date on the well-studied TrecQA dataset.
Comment: Accepted to KDD 2018 (Paper titled only "Multi-Cast Attention
Networks" in KDD version
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
In this work we propose a novel interpretation of residual networks showing
that they can be seen as a collection of many paths of differing length.
Moreover, residual networks seem to enable very deep networks by leveraging
only the short paths during training. To support this observation, we rewrite
residual networks as an explicit collection of paths. Unlike traditional
models, paths through residual networks vary in length. Further, a lesion study
reveals that these paths show ensemble-like behavior in the sense that they do
not strongly depend on each other. Finally, and most surprising, most paths are
shorter than one might expect, and only the short paths are needed during
training, as longer paths do not contribute any gradient. For example, most of
the gradient in a residual network with 110 layers comes from paths that are
only 10-34 layers deep. Our results reveal one of the key characteristics that
seem to enable the training of very deep networks: Residual networks avoid the
vanishing gradient problem by introducing short paths which can carry gradient
throughout the extent of very deep networks.
Comment: NIPS 201
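The "collection of paths" view can be illustrated by counting paths. If every residual block can be either traversed or skipped, a network with n blocks has 2^n paths, and the number of paths that traverse exactly k blocks is the binomial coefficient C(n, k). The sketch below assumes a 110-layer network built from 54 residual blocks; that block count is an assumption for illustration. It counts paths only and says nothing about how much gradient each path carries.

```python
from math import comb

def path_length_counts(num_blocks):
    """counts[k] = number of paths that pass through exactly k blocks."""
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]

def fraction_within(counts, shortest, longest):
    """Fraction of all paths whose length lies in [shortest, longest]."""
    return sum(counts[shortest:longest + 1]) / sum(counts)

counts = path_length_counts(54)          # assumed block count
print(sum(counts) == 2 ** 54)            # every block is either used or skipped
print(counts.index(max(counts)))         # most common path length, in blocks
print(fraction_within(counts, 19, 35))   # share of paths of moderate length
```

Because the counts follow a binomial distribution, very short and very long paths are rare, and most paths are about half as long as the full network.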
Unsupervised Domain Adaptation for Robust Speech Recognition via Variational Autoencoder-Based Data Augmentation
Domain mismatch between training and testing can lead to significant
degradation in performance in many machine learning scenarios. Unfortunately,
this is not a rare situation for automatic speech recognition deployments in
real-world applications. Research on robust speech recognition can be regarded
as trying to overcome this domain mismatch issue. In this paper, we address the
unsupervised domain adaptation problem for robust speech recognition, where
both source and target domain speech are presented, but word transcripts are
only available for the source domain speech. We present novel
augmentation-based methods that transform speech in a way that does not change
the transcripts. Specifically, we first train a variational autoencoder on both
source and target domain data (without supervision) to learn a latent
representation of speech. We then transform nuisance attributes of speech that
are irrelevant to recognition by modifying the latent representations, in order
to augment labeled training data with additional data whose distribution is
more similar to the target domain. The proposed method is evaluated on the
CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as
35% compared to the non-adapted baseline.
Comment: Accepted to IEEE ASRU 201
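For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The following is a generic sketch of that standard definition, not the scoring tool used in the paper. The function name and the example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                            # deletion
                            curr[j - 1] + 1,                        # insertion
                            prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

An absolute reduction is the difference between two such rates in percentage points, whereas a relative reduction divides that difference by the baseline rate.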
Character-Aware Neural Language Models
We describe a simple neural language model that relies only on
character-level inputs. Predictions are still made at the word-level. Our model
employs a convolutional neural network (CNN) and a highway network over
characters, whose output is given to a long short-term memory (LSTM) recurrent
neural network language model (RNN-LM). On the English Penn Treebank the model
is on par with the existing state-of-the-art despite having 60% fewer
parameters. On languages with rich morphology (Arabic, Czech, French, German,
Spanish, Russian), the model outperforms word-level/morpheme-level LSTM
baselines, again with fewer parameters. The results suggest that on many
languages, character inputs are sufficient for language modeling. Analysis of
word representations obtained from the character composition part of the model
reveals that the model is able to encode, from characters only, both semantic
and orthographic information.
Comment: AAAI 201
ISA: Intelligent Speed Adaptation from Appearance
In this work we introduce a new problem named Intelligent Speed Adaptation
from Appearance (ISA). Technically, the goal of an ISA model is to
predict the proper speed of the vehicle for a given image of a driving
scenario. Note that this problem is different from predicting the actual speed of the
vehicle. It defines a novel regression problem where the appearance information
has to be directly mapped to get a prediction for the speed at which the
vehicle should go, taking into account the traffic situation. First, we release
a novel dataset for the new problem, where multiple driving video sequences,
with the annotated adequate speed per frame, are provided. We then introduce
two deep learning based ISA models, which are trained to perform the final
regression of the proper speed given a test image. We end with a thorough
experimental validation where the results show the level of difficulty of the
proposed task. The dataset and the proposed models will all be made publicly
available to encourage much needed further research on this problem.
Comment: IROS 2018 Workshop: 10th Planning, Perception and Navigation for
Intelligent Vehicles (PPNIV'18