40 research outputs found
Attention-based Extraction of Structured Information from Street View Imagery
We present a neural network model - based on CNNs, RNNs and a novel attention
mechanism - which achieves 84.2% accuracy on the challenging French Street Name
Signs (FSNS) dataset, significantly outperforming the previous state of the art
(Smith'16), which achieved 72.46%. Furthermore, our new method is much simpler
and more general than the previous approach. To demonstrate the generality of
our model, we show that it also performs well on an even more challenging
dataset derived from Google Street View, in which the goal is to extract
business names from store fronts. Finally, we study the speed/accuracy tradeoff
that results from using CNN feature extractors of different depths.
Surprisingly, we find that deeper is not always better (in terms of accuracy,
as well as speed). Our resulting model is simple, accurate and fast, allowing
it to be used at scale on a variety of challenging real-world text extraction
problems.
Comment: Updated references, added link to the source code.
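To make the pipeline concrete, here is a minimal sketch of the core idea (a CNN feature grid read out by an RNN decoder through additive attention). The layer sizes, module layout, and decoding loop are hypothetical stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AttnOCRSketch(nn.Module):
    """Illustrative sketch: CNN feature grid + RNN decoder with additive
    attention. All hyperparameters are hypothetical, not the paper's."""
    def __init__(self, vocab_size, hid=256, feat=64):
        super().__init__()
        self.cnn = nn.Sequential(  # stand-in for a deep feature extractor
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())
        self.rnn = nn.GRUCell(vocab_size + feat, hid)
        self.att_w = nn.Linear(feat + hid, 1)   # additive attention score
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, img, max_len=10):
        b = img.size(0)
        f = self.cnn(img).flatten(2).transpose(1, 2)   # (b, locations, feat)
        h = img.new_zeros(b, self.rnn.hidden_size)
        y = img.new_zeros(b, self.out.out_features)    # blank start token
        logits = []
        for _ in range(max_len):
            # score every spatial location against the current decoder state
            scores = self.att_w(torch.cat(
                [f, h.unsqueeze(1).expand(-1, f.size(1), -1)], dim=-1))
            ctx = (scores.softmax(dim=1) * f).sum(1)   # attention-weighted context
            h = self.rnn(torch.cat([y, ctx], dim=-1), h)
            step = self.out(h)
            y = step.softmax(-1)                       # feed back the prediction
            logits.append(step)
        return torch.stack(logits, dim=1)              # (b, max_len, vocab)
```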
Analysing Dropout and Compounding Errors in Neural Language Models
This paper carries out an empirical analysis of various dropout techniques
for language modelling, such as Bernoulli dropout, Gaussian dropout, Curriculum
Dropout, Variational Dropout and Concrete Dropout. Moreover, we propose an
extension of variational dropout to concrete dropout and curriculum dropout
with varying schedules. We find these extensions to perform well when compared
to standard dropout approaches, particularly variational curriculum dropout
with a linear schedule. The largest performance increases come from applying
dropout to the decoder layer. Lastly, as a post-analysis step we examine where
most errors occur at test time, to determine whether the well-known problem of
compounding errors is apparent and to what extent the proposed methods
mitigate it on each dataset. We report results on 2-hidden-layer LSTM, GRU and
Highway networks with embedding dropout, dropout on the gated hidden layers
and dropout on the output projection layer for each model. We evaluate on the
Penn Treebank and WikiText-2 word-level language modelling datasets, where the
former reduces the long-tail distribution through preprocessing while the
latter preserves rare words in the training and test sets.
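As an illustration of two of the ingredients above, here is a minimal sketch of variational (locked) dropout combined with a linear curriculum schedule. The schedule form and constants are assumptions for illustration, not the paper's exact configuration.

```python
import torch

def variational_mask(batch, dim, p, device=None):
    """One dropout mask shared across all timesteps (variational dropout)."""
    keep = 1.0 - p
    return torch.bernoulli(
        torch.full((batch, 1, dim), keep, device=device)) / keep

def linear_schedule(step, total_steps, p_final=0.5):
    """Curriculum: anneal the dropout rate from 0 toward p_final
    (assumed linear form; the paper compares several schedules)."""
    return p_final * min(1.0, step / total_steps)

# usage: sample one mask per sequence, reuse it at every timestep
x = torch.randn(32, 20, 512)                       # (batch, time, features)
p = linear_schedule(step=1000, total_steps=10000)  # p grows during training
x = x * variational_mask(32, 512, p)
```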
On Controlled DeEntanglement for Natural Language Processing
The latest addition to the human toolbox is Artificial Intelligence (AI).
Thus far, AI has made significant progress in low-stakes, low-risk scenarios
such as playing Go, and we are currently in a transition toward medium-stakes
scenarios such as Visual Dialog. In my thesis, I argue that we need to
incorporate controlled de-entanglement as a first-class object to succeed in
this transition. I present mathematical analysis from information theory to
show that employing stochasticity leads to controlled de-entanglement of
relevant factors of variation at various levels. Based on this, I highlight
results from initial experiments that demonstrate the efficacy of the proposed
framework. I conclude this writeup with a roadmap of experiments that show the
applicability of this framework to scalability, flexibility and
interpretability.
Connecting the Dots Between MLE and RL for Sequence Prediction
Sequence prediction models can be learned from example sequences with a
variety of training algorithms. Maximum likelihood learning is simple and
efficient, yet can suffer from compounding error at test time. Reinforcement
learning such as policy gradient addresses the issue but can have prohibitively
poor exploration efficiency. A rich set of other algorithms such as RAML, SPG,
and data noising, have also been developed from different perspectives. This
paper establishes a formal connection between these algorithms. We present a
generalized entropy regularized policy optimization formulation, and show that
the apparently distinct algorithms can all be reformulated as special instances
of the framework, with the only difference being the configurations of a reward
function and a couple of hyperparameters. The unified interpretation offers a
systematic view of the varying properties of exploration and learning
efficiency. Moreover, inspired by the framework, we present a new algorithm
that dynamically interpolates among the family of algorithms for scheduled
sequence model learning. Experiments on machine translation, text
summarization, and game imitation learning demonstrate the superiority of the
proposed algorithm.
Comment: Major revision. The first two authors contributed equally.
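As a rough sketch of the kind of unified objective described (written in a generic form that is my paraphrase, not necessarily the paper's exact notation), different settings of the reward R and the coefficients recover MLE-like or RL-like training:

```latex
% Generic entropy-regularized policy optimization objective (sketch):
% q is a variational distribution over sequences, p_theta the model.
\mathcal{L}(q, \theta) =
    \mathbb{E}_{y \sim q}\!\left[ R(y \mid y^{*}) \right]
    - \alpha \, \mathrm{KL}\!\left( q(y) \,\|\, p_{\theta}(y) \right)
    + \beta \, \mathcal{H}(q)
```

Roughly, a delta reward concentrated on the ground truth recovers MLE-style training, while a task reward with a nonzero entropy weight recovers policy-gradient-style exploration.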
A Deep Reinforced Model for Abstractive Summarization
Attentional, RNN-based encoder-decoder models for abstractive summarization
have achieved good performance on short input and output sequences. For longer
documents and summaries, however, these models often include repetitive and
incoherent phrases. We introduce a neural network model with a novel
intra-attention that attends over the input and continuously generated output
separately, and a new training method that combines standard supervised word
prediction and reinforcement learning (RL). Models trained only with supervised
learning often exhibit "exposure bias" - they assume ground truth is provided
at each step during training. However, when standard word prediction is
combined with the global sequence prediction training of RL the resulting
summaries become more readable. We evaluate this model on the CNN/Daily Mail
and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the
CNN/Daily Mail dataset, an improvement over previous state-of-the-art models.
Human evaluation also shows that our model produces higher-quality summaries.
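The combination of supervised word prediction and RL can be sketched as a mixed loss with a self-critical baseline. This is a generic formulation for illustration; the weighting value and reward function below are stand-ins.

```python
import torch

def mixed_loss(logp_sample, reward_sample, reward_greedy, nll_ml, gamma=0.98):
    """Sketch of a mixed ML + RL objective with a self-critical baseline.
    logp_sample:   summed log-prob of a sampled summary, shape (batch,)
    reward_*:      e.g. ROUGE of the sampled vs. the greedy-decoded summary
    nll_ml:        teacher-forced negative log-likelihood, shape (batch,)
    gamma:         mixing weight (this value is illustrative, not the paper's)
    """
    loss_rl = -(reward_sample - reward_greedy) * logp_sample  # self-critical PG
    return (gamma * loss_rl + (1.0 - gamma) * nll_ml).mean()

# dummy usage; in a real model the log-probs carry gradients into the network
logp = torch.randn(4, requires_grad=True)
loss = mixed_loss(logp, torch.rand(4), torch.rand(4), torch.rand(4))
loss.backward()
```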
Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer
In this work, we develop a novel regularizer to improve the learning of
long-range dependency of sequence data. Applied on language modelling, our
regularizer expresses the inductive bias that sequence variables should have
high mutual information even though the model might not see abundant
observations for complex long-range dependency. We show how the "next sentence
prediction (classification)" heuristic can be derived in a principled way from
our mutual information estimation framework, and be further extended to
maximize the mutual information of sequence variables. The proposed approach
not only is effective at increasing the mutual information of segments under
the learned model but more importantly, leads to a higher likelihood on holdout
data, and improved generation quality. Code is released at
https://github.com/BorealisAI/BMI.
Comment: Camera-ready for AISTATS 2020.
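For intuition, a contrastive (InfoNCE-style) bound is one standard way to push up mutual information between nearby segments; the sketch below is illustrative and is not the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def infonce_mi_bound(z_a, z_b):
    """Contrastive (InfoNCE-style) lower bound on I(A; B) between paired
    segment representations. Illustrative only, not the paper's estimator.
    z_a, z_b: (batch, dim) encodings of adjacent text segments."""
    logits = z_a @ z_b.t()              # pair segment i with all candidates j
    labels = torch.arange(z_a.size(0))  # the true next segment is j == i
    # better classification of the true pairing tightens the MI lower bound
    return -F.cross_entropy(logits, labels)

# usage: add the negated bound to the LM loss as a regularizer
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
reg = -infonce_mi_bound(z_a, z_b)       # minimize alongside the LM loss
```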
Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation
We introduce Texar, an open-source toolkit aiming to support the broad set of
text generation tasks that transform any inputs into natural language, such as
machine translation, summarization, dialog, content manipulation, and so forth.
With the design goals of modularity, versatility, and extensibility in mind,
Texar extracts common patterns underlying the diverse tasks and methodologies,
creates a library of highly reusable modules, and allows arbitrary model
architectures and algorithmic paradigms. In Texar, model architecture,
inference, and learning processes are properly decomposed. Modules at a high
concept level can be freely assembled and plugged in/swapped out. The toolkit
also supports a rich set of large-scale pretrained models. Texar is thus
particularly suitable for researchers and practitioners to do fast prototyping
and experimentation. The versatile toolkit also fosters technique sharing
across different text generation tasks. Texar supports both TensorFlow and
PyTorch, and is released under Apache License 2.0 at https://www.texar.io.
Comment: ACL 2019 demo, expanded version.
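To illustrate the kind of plug-and-play assembly described (with an interface invented purely for illustration; this is not Texar's actual API):

```python
# Hypothetical interface, NOT Texar's real API: it only illustrates swapping
# one decoding module for another without touching the rest of the pipeline.
class GreedyDecoder:
    def decode(self, state, step_fn, max_len):
        out = []
        for _ in range(max_len):
            token, state = step_fn(state, greedy=True)
            out.append(token)
        return out

class SamplingDecoder:
    def decode(self, state, step_fn, max_len):
        out = []
        for _ in range(max_len):
            token, state = step_fn(state, greedy=False)
            out.append(token)
        return out

def build_pipeline(decoder):
    # architecture, inference, and learning kept as separate, swappable parts
    return {"decoder": decoder}  # other fixed modules elided

pipeline = build_pipeline(SamplingDecoder())  # plug in / swap out freely
```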
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and
undesirable overhead procedures, namely warm-start training and sample variance
reduction. In this paper, we describe a reinforcement learning method based on
a softmax value function that requires neither of these procedures. Our method
combines the advantages of policy-gradient methods with the efficiency and
simplicity of maximum-likelihood approaches. We apply this new cold-start
reinforcement learning method in training sequence generation models for
structured output prediction problems. Empirical evidence validates this method
on automatic summarization and image captioning tasks.
Comment: Conference on Neural Information Processing Systems 2017. Main paper
and supplementary material.
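One way to read "softmax value function" is to fold reward into the softmax target and train by cross-entropy, avoiding both warm-start training and high-variance sampled gradients. The sketch below is my interpretation under that reading, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def softmax_pg_step_loss(logits, token_rewards):
    """Sketch: reward folded into a softmax target distribution.
    logits:        (batch, vocab) model scores at one decoding step
    token_rewards: (batch, vocab) per-token reward increments (stand-ins)
    """
    # target q ~ softmax(log p + reward): high-reward tokens get more mass
    q = F.softmax(logits.detach() + token_rewards, dim=-1)
    # cross-entropy toward q; no sampling variance, no warm start needed
    return -(q * F.log_softmax(logits, dim=-1)).sum(-1).mean()

loss = softmax_pg_step_loss(torch.randn(4, 100, requires_grad=True),
                            torch.rand(4, 100))
loss.backward()
```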
Context-Dependent Semantic Parsing over Temporally Structured Data
We describe a new semantic parsing setting that allows users to query the
system using both natural language questions and actions within a graphical
user interface. Multiple time series belonging to an entity of interest are
stored in a database and the user interacts with the system to obtain a better
understanding of the entity's state and behavior, entailing sequences of
actions and questions whose answers may depend on previous factual or
navigational interactions. We design an LSTM-based encoder-decoder architecture
that models context dependency through copying mechanisms and multiple levels
of attention over inputs and previous outputs. When trained to predict tokens
using supervised learning, the proposed architecture substantially outperforms
standard sequence generation baselines. Training the architecture using policy
gradient leads to further improvements in performance, reaching a
sequence-level accuracy of 88.7% on artificial data and 74.8% on real data.
Comment: Accepted by NAACL 2019 (Oral presentation).
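The copying idea can be sketched as a pointer-generator-style mixture (a generic formulation; the paper's architecture uses multiple attention levels and its own gating):

```python
import torch

def copy_mixture(p_vocab, attn, src_ids, p_gen):
    """Sketch of a copy mechanism: mix generating from the vocabulary with
    copying tokens pointed to by attention over the input.
    p_vocab: (batch, vocab)   generation distribution
    attn:    (batch, src_len) attention over input tokens
    src_ids: (batch, src_len) vocabulary ids of the input tokens
    p_gen:   (batch, 1)       gate deciding generate vs. copy
    """
    p_copy = torch.zeros_like(p_vocab)
    p_copy.scatter_add_(1, src_ids, attn)  # route attention mass to token ids
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

out = copy_mixture(torch.softmax(torch.randn(2, 50), -1),
                   torch.softmax(torch.randn(2, 7), -1),
                   torch.randint(0, 50, (2, 7)),
                   torch.rand(2, 1))
```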
Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
Trust region methods, such as TRPO, are often used to stabilize policy
optimization algorithms in reinforcement learning (RL). While current trust
region strategies are effective for continuous control, they typically require
a prohibitively large amount of on-policy interaction with the environment. To
address this problem, we propose an off-policy trust region method, Trust-PCL.
The algorithm is the result of observing that the optimal policy and state
values of a maximum reward objective with a relative-entropy regularizer
satisfy a set of multi-step pathwise consistencies along any path. Thus,
Trust-PCL is able to maintain optimization stability while exploiting
off-policy data to improve sample efficiency. When evaluated on a number of
continuous control tasks, Trust-PCL improves the solution quality and sample
efficiency of TRPO.
Comment: ICLR 2018.
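For reference, the multi-step pathwise consistency underlying PCL has roughly the following shape (the softmax-consistency form from the PCL line of work; Trust-PCL's additional relative-entropy term against a prior policy is elided here as a simplification):

```latex
% d-step path consistency for entropy-regularized optimality (sketch):
% minimized as a squared error over both on- and off-policy paths.
C(s_{t:t+d}) = -V(s_t) + \gamma^{d} V(s_{t+d})
  + \sum_{i=0}^{d-1} \gamma^{i} \left[ r(s_{t+i}, a_{t+i})
      - \tau \log \pi(a_{t+i} \mid s_{t+i}) \right]
```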