Controlling Output Length in Neural Encoder-Decoders
Neural encoder-decoder models have shown great success in many sequence
generation tasks. However, previous work has not investigated situations in
which we would like to control the length of encoder-decoder outputs. This
capability is crucial for applications such as text summarization, in which we
have to generate concise summaries with a desired length. In this paper, we
propose methods for controlling the output sequence length for neural
encoder-decoder models: two decoding-based methods and two learning-based
methods. Results show that our learning-based methods have the capability to
control length without degrading summary quality in a summarization task.
Comment: 11 pages. To appear in EMNLP 201
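The decoding-based side of this idea can be illustrated with a minimal sketch: a greedy decoder that forbids the end-of-sequence token before a minimum length and stops once a maximum length is reached. The `step_logits` interface, token ids, and length bounds here are hypothetical illustrations, not the paper's actual methods (which also include learning-based variants).

```python
# Sketch of decoding-based length control: suppress EOS before min_len
# and stop at max_len. The model interface is a hypothetical stand-in.
def greedy_decode_with_length(step_logits, eos_id, min_len, max_len):
    """step_logits(prefix) -> list of log-probs over the vocabulary."""
    prefix = []
    while True:
        logits = list(step_logits(prefix))
        if len(prefix) < min_len:
            logits[eos_id] = float("-inf")   # too short: forbid EOS
        if len(prefix) >= max_len:
            return prefix                    # reached the cap: stop
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            return prefix
        prefix.append(next_id)
```

The same masking trick transfers directly to beam search by applying it to every hypothesis at each step.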
Language as a Latent Variable: Discrete Generative Models for Sentence Compression
In this work we explore deep generative models of text in which the latent
representation of a document is itself drawn from a discrete language model
distribution. We formulate a variational auto-encoder for inference in this
model and apply it to the task of compressing sentences. In this application
the generative model first draws a latent summary sentence from a background
language model, and then subsequently draws the observed sentence conditioned
on this latent summary. In our empirical evaluation we show that generative
formulations of both abstractive and extractive compression yield
state-of-the-art results when trained on a large amount of supervised data.
Further, we explore semi-supervised compression scenarios where we show that it
is possible to achieve performance competitive with previously proposed
supervised models while training on a fraction of the supervised data.
Comment: EMNLP 201
Evaluating two methods for Treebank grammar compaction
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad-coverage grammars. In the simplest case, rules can be ‘read off’ the parse annotations of the corpus, producing either a simple or a probabilistic context-free grammar. Such grammars can, however, be very large, which raises the computational cost of subsequent parsing under the grammar.
In this paper, we explore ways in which a treebank grammar can be reduced in size, or ‘compacted’, using two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that, by combining these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% with a gain in recall but some loss in precision.
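The first technique, occurrence thresholding, can be sketched as follows. The rule representation, the per-left-hand-side renormalization, and the threshold value are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of grammar compaction by occurrence thresholding: count rule
# occurrences, drop rare rules, and renormalize the surviving rules'
# probabilities per left-hand side. Rule format is an assumption.
from collections import Counter

def compact_grammar(rule_occurrences, threshold):
    """rule_occurrences: iterable of (lhs, rhs) tuples, one per
    occurrence in the treebank. Returns {rule: probability}."""
    counts = Counter(rule_occurrences)
    kept = {rule: n for rule, n in counts.items() if n >= threshold}
    # renormalize per left-hand side so probabilities sum to 1 again
    lhs_totals = Counter()
    for (lhs, _rhs), n in kept.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in kept.items()}
```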
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.
Comment: 27-page technical report
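The MDL criterion behind this algorithm can be caricatured as a description-length score: the bits needed to spell out the lexicon, plus the bits needed to encode the corpus in terms of lexicon entries. The greedy longest-match segmentation and uniform word code below are simplifying assumptions, not the paper's optimal encoding.

```python
# Sketch of an MDL score for a candidate lexicon: lexicon cost plus
# corpus-encoding cost. Both costs are deliberately crude stand-ins.
import math

def description_length(corpus, lexicon):
    """corpus: list of unsegmented strings; lexicon: list of words."""
    # lexicon cost: spell out each word at ~log2(alphabet) bits per char
    alphabet = {c for w in lexicon for c in w}
    lex_cost = sum(len(w) for w in lexicon) * math.log2(max(len(alphabet), 2))
    # corpus cost: one uniform-length code per word token, counted by
    # greedy longest-match segmentation of each utterance
    word_cost = math.log2(len(lexicon)) if len(lexicon) > 1 else 0.0
    tokens = 0
    for utt in corpus:
        i = 0
        while i < len(utt):
            match = max((w for w in lexicon if utt.startswith(w, i)),
                        key=len, default=None)
            if match is None:
                return float("inf")  # lexicon cannot encode this utterance
            i += len(match)
            tokens += 1
    return lex_cost + tokens * word_cost
```

Under such a score, a lexicon of whole words beats a character-level lexicon on repetitive text, which is the pressure that drives word discovery in MDL-based segmentation.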