A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models
Training large vocabulary Neural Network Language Models (NNLMs) is a
difficult task due to the explicit requirement of the output layer
normalization, which typically involves the evaluation of the full softmax
function over the complete vocabulary. This paper proposes a Batch Noise
Contrastive Estimation (B-NCE) approach to alleviate this problem. This is
achieved by reducing the vocabulary, at each time step, to the target words in
the batch and then replacing the softmax by the noise contrastive estimation
approach, where these words play the role of targets and noise samples at the
same time. In doing so, the proposed approach can be fully formulated and
implemented using optimal dense matrix operations. Applying B-NCE to train
different NNLMs on the Large Text Compression Benchmark (LTCB) and the One
Billion Word Benchmark (OBWB) shows a significant reduction of the training
time with no noticeable degradation of the models' performance. This paper also
presents a new baseline comparative study of different standard NNLMs on the
large OBWB on a single Titan-X GPU.
Comment: Accepted for publication at INTERSPEECH'1
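The core idea — restricting the output layer, at each step, to the words that appear as targets in the batch, so they serve as targets and noise samples simultaneously and everything reduces to one dense matrix product — can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation; all sizes, the uniform-noise assumption, and the exact objective are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real NNLMs are far larger).
vocab_size, hidden_dim, batch_size = 10_000, 64, 32

hidden = rng.normal(size=(batch_size, hidden_dim))           # network outputs
out_embed = rng.normal(size=(vocab_size, hidden_dim)) * 0.1  # output embeddings
out_bias = np.zeros(vocab_size)
targets = rng.integers(0, vocab_size, size=batch_size)       # target word ids

# Reduce the vocabulary to the (unique) target words of this batch;
# pos[i] is the column of example i's own target.
batch_vocab, pos = np.unique(targets, return_inverse=True)

# One dense matmul scores every batch word against every context.
logits = hidden @ out_embed[batch_vocab].T + out_bias[batch_vocab]

# NCE-style binary objective where the other batch words act as noise:
# diagonal entries (true targets) are positives, all others negatives.
k = len(batch_vocab) - 1                  # noise samples per example
log_pn = -np.log(vocab_size)              # uniform noise distribution (assumption)
s = logits - np.log(k) - log_pn
p_true = 1.0 / (1.0 + np.exp(-s))         # P(word came from data | context)

pos_mask = np.zeros_like(logits, dtype=bool)
pos_mask[np.arange(batch_size), pos] = True
loss = -(np.log(p_true[pos_mask] + 1e-12).sum()
         + np.log(1.0 - p_true[~pos_mask] + 1e-12).sum()) / batch_size
```

Because the score matrix is dense and small (batch-by-batch rather than batch-by-vocabulary), the whole step maps onto optimized GEMM kernels, which is where the reported speedup comes from.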
Exploring the Limits of Language Modeling
In this work we explore recent advances in Recurrent Neural Networks for
large scale Language Modeling, a task central to language understanding. We
extend current models to deal with two key challenges present in this task:
corpora and vocabulary sizes, and complex, long term structure of language. We
perform an exhaustive study on techniques such as character Convolutional
Neural Networks or Long-Short Term Memory, on the One Billion Word Benchmark.
Our best single model significantly improves state-of-the-art perplexity from
51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20),
while an ensemble of models sets a new record by improving perplexity from 41.0
down to 23.7. We also release these models for the NLP and ML community to
study and improve upon.
Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction
This paper presents a method for constructing human-robot interaction
policies in settings where multimodality, i.e., the possibility of multiple
highly distinct futures, plays a critical role in decision making. We are
motivated in this work by the example of traffic weaving, e.g., at highway
on-ramps/off-ramps, where entering and exiting cars must swap lanes in a short
distance, a challenging negotiation even for experienced drivers due to the
inherent multimodal uncertainty of who will pass whom. Our approach is to learn
multimodal probability distributions over future human actions from a dataset
of human-human exemplars and perform real-time robot policy construction in the
resulting environment model through massively parallel sampling of human
responses to candidate robot action sequences. Direct learning of these
distributions is made possible by recent advances in the theory of conditional
variational autoencoders (CVAEs), whereby we learn action distributions
simultaneously conditioned on the present interaction history, as well as
candidate future robot actions in order to take into account response dynamics.
We demonstrate the efficacy of this approach with a human-in-the-loop
simulation of a traffic weaving scenario.
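The policy-construction loop described above — score each candidate robot action sequence against many sampled human responses and keep the best — can be sketched as below. The learned CVAE is replaced here by a hypothetical stand-in sampler with placeholder dynamics; the sampler, cost function, and all sizes are our assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the learned CVAE: given the interaction history
# and a candidate robot action sequence, sample n human response trajectories.
def sample_human_responses(history, robot_seq, n_samples):
    # Placeholder dynamics: human positions drift loosely with the robot's motion.
    base = history[-1] + 0.1 * np.cumsum(robot_seq, axis=0)   # (T, 2) positions
    return base[None] + rng.normal(scale=0.5, size=(n_samples, *base.shape))

def expected_cost(robot_seq, human_trajs):
    # Penalize proximity between robot and sampled human positions (assumption).
    robot_pos = np.cumsum(robot_seq, axis=0)                  # (T, 2)
    gaps = np.linalg.norm(robot_pos[None] - human_trajs, axis=-1)
    return float(np.mean(np.exp(-gaps)))                      # higher = riskier

history = np.zeros((5, 2))                                    # past joint states
candidates = [rng.normal(scale=0.2, size=(10, 2)) for _ in range(8)]

# "Massively parallel" in spirit: each candidate is evaluated against a whole
# batch of sampled human futures at once, and the lowest expected cost wins.
scores = [expected_cost(seq, sample_human_responses(history, seq, 256))
          for seq in candidates]
best = candidates[int(np.argmin(scores))]
```

In the paper this sampling runs on GPU in real time; the structure of the loop (condition on history and a candidate, sample responses in bulk, rank candidates) is the part the sketch preserves.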
Sequential Gaussian Processes for Online Learning of Nonstationary Functions
Many machine learning problems can be framed in the context of estimating
functions, and often these are time-dependent functions that are estimated in
real-time as observations arrive. Gaussian processes (GPs) are an attractive
choice for modeling real-valued nonlinear functions due to their flexibility
and uncertainty quantification. However, the typical GP regression model
suffers from several drawbacks: i) Conventional GP inference scales cubically
with the number of observations; ii) updating a GP model
sequentially is not trivial; and iii) covariance kernels often enforce
stationarity constraints on the function, while GPs with non-stationary
covariance kernels are often intractable to use in practice. To overcome these
issues, we propose an online sequential Monte Carlo algorithm to fit mixtures
of GPs that capture non-stationary behavior while allowing for fast,
distributed inference. By formulating hyperparameter optimization as a
multi-armed bandit problem, we accelerate mixing for real time inference. Our
approach empirically improves performance over state-of-the-art methods for
online GP estimation in the context of prediction for simulated non-stationary
data and hospital time series data.
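The bandit framing of hyperparameter optimization mentioned above can be illustrated with a minimal UCB1 loop: each candidate hyperparameter setting is an arm, and pulls return a noisy quality signal. Everything below (the candidate length-scales, the synthetic reward) is a constructed example, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Candidate kernel length-scales treated as bandit arms (values are assumptions).
arms = [0.1, 0.5, 1.0, 2.0, 5.0]

# Hypothetical reward: a noisy proxy for predictive log-likelihood under each
# length-scale; arm 2 is best by construction in this toy setup.
true_quality = np.array([0.2, 0.6, 1.0, 0.5, 0.1])
def pull(i):
    return true_quality[i] + rng.normal(scale=0.3)

counts = np.zeros(len(arms))
sums = np.zeros(len(arms))

for t in range(1, 501):
    if t <= len(arms):                       # play each arm once first
        i = t - 1
    else:                                    # UCB1 upper confidence bound
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        i = int(np.argmax(ucb))
    counts[i] += 1
    sums[i] += pull(i)

best_arm = int(np.argmax(counts))            # most-pulled arm after 500 rounds
```

The appeal in the online setting is that exploration cost is budgeted explicitly: arms (hyperparameter settings) that look poor are pulled less and less, which is what accelerates mixing relative to exhaustive re-fitting.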
Deep Learning on Graphs: A Survey
Deep learning has been shown to be successful in a number of domains, ranging
from acoustics and images to natural language processing. However, applying deep
learning to the ubiquitous graph data is non-trivial because of the unique
characteristics of graphs. Recently, substantial research efforts have been
devoted to applying deep learning methods to graphs, resulting in beneficial
advances in graph analysis techniques. In this survey, we comprehensively
review the different types of deep learning methods on graphs. We divide the
existing methods into five categories based on their model architectures and
training strategies: graph recurrent neural networks, graph convolutional
networks, graph autoencoders, graph reinforcement learning, and graph
adversarial methods. We then provide a comprehensive overview of these methods
in a systematic manner mainly by following their development history. We also
analyze the differences and compositions of different methods. Finally, we
briefly outline the applications in which they have been used and discuss
potential future research directions.
Comment: Accepted by Transactions on Knowledge and Data Engineering. 24 pages, 11 figures.
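As a minimal illustration of one of the five categories above — graph convolutional networks — a single layer of the widely used GCN propagation rule H' = ReLU(D̂^(-1/2) Â D̂^(-1/2) H W), with Â = A + I, can be written in a few lines. The toy graph and untrained weights below are ours; only the propagation rule itself is standard.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy undirected graph: 4 nodes, 3-dim input features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))          # node features
W = rng.normal(size=(3, 2))          # learnable weights (untrained here)

A_hat = A + np.eye(4)                # add self-loops
d = A_hat.sum(axis=1)                # degrees of the self-looped graph
D_inv_sqrt = np.diag(d ** -0.5)      # symmetric normalization

# One GCN layer: normalized neighborhood aggregation, then a linear map + ReLU.
H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```

Each row of H_next is a node representation that mixes the node's own features with those of its neighbors, which is the basic mechanism the graph-convolutional family builds on.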
Investigating Linguistic Pattern Ordering in Hierarchical Natural Language Generation
Natural language generation (NLG) is a critical component of a spoken dialogue
system; it can be divided into two phases: (1) sentence planning: deciding
the overall sentence structure, (2) surface realization: determining specific
word forms and flattening the sentence structure into a string. With the rise
of deep learning, most modern NLG models are based on a sequence-to-sequence
(seq2seq) model, which basically contains an encoder-decoder structure; these
NLG models generate sentences from scratch by jointly optimizing sentence
planning and surface realization. However, such a simple encoder-decoder
architecture usually fails to generate complex and long sentences, because the
decoder has difficulty learning all grammar and diction knowledge well. This
paper introduces an NLG model with a hierarchical attentional decoder, where
the hierarchy focuses on leveraging linguistic knowledge in a specific order.
The experiments show that the proposed method significantly outperforms the
traditional seq2seq model with a smaller model size, and the design of the
hierarchical attentional decoder can be applied to various NLG systems.
Furthermore, different generation strategies based on linguistic patterns are
investigated and analyzed in order to guide future NLG research work.
Comment: Accepted by the 7th IEEE Workshop on Spoken Language Technology (SLT 2018). arXiv admin note: text overlap with arXiv:1808.0274
A Primal-Dual Method for Training Recurrent Neural Networks Constrained by the Echo-State Property
We present an architecture of a recurrent neural network (RNN) with a
fully-connected deep neural network (DNN) as its feature extractor. The RNN is
equipped with both causal temporal prediction and non-causal look-ahead, via
auto-regression (AR) and moving-average (MA), respectively. The focus of this
paper is a primal-dual training method that formulates the learning of the RNN
as a formal optimization problem with an inequality constraint that provides a
sufficient condition for the stability of the network dynamics. Experimental
results demonstrate the effectiveness of this new method, which achieves 18.86%
phone recognition error on the TIMIT benchmark for the core test set. The
result approaches the best result of 17.7%, which was obtained using an RNN
with long short-term memory (LSTM). The results also show that the proposed
primal-dual training method produces lower recognition errors than the popular
RNN methods developed earlier based on the carefully tuned threshold parameter
that heuristically prevents the gradient from exploding.
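A classic sufficient condition for the echo-state property — the kind of stability constraint the abstract refers to — is that the largest singular value of the recurrent weight matrix stays below 1. The paper enforces its constraint through a primal-dual formulation; the sketch below instead shows the simpler alternative of projecting the weights after each update, purely to make the constraint concrete. The margin value is an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

GAMMA = 0.95   # stability margin (assumption): sigma_max(W) <= GAMMA < 1
               # is a classic sufficient condition for the echo-state property

def project_spectral_norm(W, gamma=GAMMA):
    """Clip W's singular values so the largest does not exceed gamma."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, gamma)) @ Vt

# Recurrent weights right after an (unconstrained) gradient step.
W_rec = rng.normal(scale=1.0, size=(50, 50))
W_rec = project_spectral_norm(W_rec)

sigma_max = np.linalg.norm(W_rec, ord=2)   # spectral norm of the result
```

The appeal of folding the constraint into the optimization itself, as the paper does, is that training never leaves the stable region, rather than being pulled back into it after each step.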
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech, two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
Sockeye: A Toolkit for Neural Machine Translation
We describe Sockeye (version 1.12), an open-source sequence-to-sequence
toolkit for Neural Machine Translation (NMT). Sockeye is a production-ready
framework for training and applying models as well as an experimental platform
for researchers. Written in Python and built on MXNet, the toolkit offers
scalable training and inference for the three most prominent encoder-decoder
architectures: attentional recurrent neural networks, self-attentional
transformers, and fully convolutional networks. Sockeye also supports a wide
range of optimizers, normalization and regularization techniques, and inference
improvements from current NMT literature. Users can easily run standard
training recipes, explore different model settings, and incorporate new ideas.
In this paper, we highlight Sockeye's features and benchmark it against other
NMT toolkits on two language arcs from the 2017 Conference on Machine
Translation (WMT): English-German and Latvian-English. We report competitive
BLEU scores across all three architectures, including an overall best score for
Sockeye's transformer implementation. To facilitate further comparison, we
release all system outputs and training scripts used in our experiments. The
Sockeye toolkit is free software released under the Apache 2.0 license.
A Survey on Neural Network Language Models
As a core component of Natural Language Processing (NLP) systems, a Language
Model (LM) provides word representations and probability estimates for word
sequences. Neural Network Language Models (NNLMs) overcome the curse of
dimensionality and improve the performance of traditional LMs. A survey on
NNLMs is performed in this paper. The structure of classic NNLMs is described
first, and then some major improvements are introduced and analyzed. We
summarize and compare corpora and toolkits for NNLMs. Further, some research
directions for NNLMs are discussed.
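The "classic NNLM" structure the survey starts from — the feed-forward model in which a fixed window of context words is embedded, concatenated, passed through a hidden layer, and normalized over the full vocabulary — can be sketched as below. All sizes are toy values and the weights are untrained; this is a shape-level illustration, not a faithful reproduction of any one model.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy sizes for a classic feed-forward NNLM.
vocab, emb_dim, ctx, hidden = 1000, 16, 3, 32

C = rng.normal(size=(vocab, emb_dim)) * 0.1   # shared word embedding table
H = rng.normal(size=(ctx * emb_dim, hidden)) * 0.1
U = rng.normal(size=(hidden, vocab)) * 0.1

def next_word_probs(context_ids):
    """P(w_t | previous ctx words): embed, concatenate, hidden layer, softmax."""
    x = C[context_ids].reshape(-1)            # concatenated context embeddings
    h = np.tanh(x @ H)                        # hidden representation
    logits = h @ U
    e = np.exp(logits - logits.max())         # softmax over the full vocabulary
    return e / e.sum()

p = next_word_probs(np.array([3, 17, 42]))    # distribution over the next word
```

The final full-vocabulary softmax is exactly the bottleneck that improvements surveyed here (hierarchical softmax, sampling-based objectives such as NCE) are designed to avoid.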