Towards Neural Machine Translation with Latent Tree Attention
Building models that take advantage of the hierarchical structure of language
without a priori annotation is a longstanding goal in natural language
processing. We introduce such a model for the task of machine translation,
pairing a recurrent neural network grammar encoder with a novel attentional
RNNG decoder and applying policy gradient reinforcement learning to induce
unsupervised tree structures on both the source and target. When trained on
character-level datasets with no explicit segmentation or parse annotation, the
model learns a plausible segmentation and shallow parse, obtaining performance
close to an attentional baseline.
Comment: Presented at SPNLP 201
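The policy-gradient training the abstract mentions can be illustrated with a minimal REINFORCE sketch (a toy, not the paper's actual model): a categorical policy over discrete structure choices is updated with the gradient of the log-probability of a sampled action, scaled by a non-differentiable reward. All names here (`reinforce_step`, `reward_fn`) are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward_fn, lr=0.1):
    """One REINFORCE update on a categorical policy.

    For a softmax policy, d/d logit_k of log pi(a) is
    (1 if k == a else 0) - pi(k); REINFORCE scales this by the
    (possibly non-differentiable) reward of the sampled action."""
    probs = softmax(logits)
    a = random.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(a)
    new_logits = [lg + lr * r * ((1.0 if k == a else 0.0) - probs[k])
                  for k, lg in enumerate(logits)]
    return new_logits, a, r
```

Repeated application concentrates probability mass on high-reward choices, which is the mechanism used to induce tree structures without parse supervision.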
Learning when to skim and when to read
Many recent advances in deep learning for natural language processing have
come at increasing computational cost, but the power of these state-of-the-art
models is not needed for every example in a dataset. We demonstrate two
approaches to reducing unnecessary computation in cases where a fast but weak
baseline classifier and a stronger, slower model are both available. Applying an
AUC-based metric to the task of sentiment classification, we find significant
efficiency gains with both a probability-threshold method for reducing
computational cost and one that uses a secondary decision network.
Comment: 8 pages (4 article, 1 references, 3 appendix), 11 figures, 3 tables,
published at ACL2017 workshop Repl4NL
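The probability-threshold method described above can be sketched as a simple cascade (an illustrative assumption about the routing rule, not the paper's exact implementation; `fast_model` and `slow_model` are hypothetical callables returning class probabilities and a label, respectively):

```python
def cascade_predict(x, fast_model, slow_model, threshold=0.9):
    """Route an example through a two-model cascade.

    If the fast model's top-class probability clears the threshold,
    accept its prediction; otherwise fall back to the slower,
    stronger model. Returns (label, which_model)."""
    probs = fast_model(x)
    p_max = max(probs)
    if p_max >= threshold:
        return probs.index(p_max), "fast"
    return slow_model(x), "slow"
```

Raising the threshold trades computation for accuracy: more examples are escalated to the slow model, which is exactly the efficiency/accuracy curve an AUC-style metric summarizes.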
Improving End-to-End Speech Recognition with Policy Learning
Connectionist temporal classification (CTC) is widely used for maximum
likelihood learning in end-to-end speech recognition models. However, there is
usually a disparity between the negative maximum likelihood and the performance
metric used in speech recognition, e.g., word error rate (WER). This results in
a mismatch between the objective function and metric during training. We show
that the above problem can be mitigated by jointly training with maximum
likelihood and policy gradient. In particular, with policy learning we are able
to directly optimize on the (otherwise non-differentiable) performance metric.
We show that joint training improves relative performance by 4% to 13% for our
end-to-end model as compared to the same model learned through maximum
likelihood. The model achieves 5.53% WER on the Wall Street Journal dataset, and
5.42% and 14.70% on the Librispeech test-clean and test-other sets, respectively.
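The joint objective can be sketched as a weighted sum of the maximum-likelihood loss and a REINFORCE-style surrogate whose reward is negative word error rate (a minimal illustration under assumed names; `joint_loss` and its interpolation weight `lam` are hypothetical, not the paper's exact formulation):

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via the standard one-row DP."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def joint_loss(nll, hyp_logprob, ref_words, hyp_words, lam=0.5):
    """Mix the ML loss (nll) with a policy-gradient surrogate.

    The surrogate is -reward * log p(sampled hypothesis), where the
    reward is negative WER, so minimizing it pushes probability mass
    toward low-error transcripts even though WER itself is
    non-differentiable."""
    wer = edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
    reward = -wer
    pg_surrogate = -reward * hyp_logprob
    return lam * nll + (1 - lam) * pg_surrogate
```

When the sampled hypothesis matches the reference, the reward is zero and the objective reduces to the pure ML term; mismatches add a penalty proportional to WER, directly optimizing the evaluation metric alongside likelihood.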