New Alignment Methods for Discriminative Book Summarization
We consider the unsupervised alignment of the full text of a book with a
human-written summary. This presents challenges not seen in other text
alignment problems, including a disparity in length and, consequent to this, a
violation of the expectation that individual words and phrases should align,
since large passages and chapters can be distilled into a single summary
phrase. We present two new methods, based on hidden Markov models, specifically
targeted to this problem, and demonstrate gains on an extractive book
summarization task. While there is still much room for improvement,
unsupervised alignment holds intrinsic value in offering insight into what
features of a book are deemed worthy of summarization.
Comment: This paper reflects work in progress.
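A minimal sketch of the general idea, assuming numpy and toy data: hidden states stand for book passages, emission scores come from lexical overlap with summary sentences, and transitions favor monotone forward movement through the book. This illustrates HMM-based alignment in general, not the paper's two specific models; the names (passages, summary, jump_penalty) and the Jaccard emission are assumptions.

```python
import numpy as np

def overlap(a, b):
    """Jaccard overlap between two token sets; a crude stand-in emission score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def align(passages, summary, jump_penalty=1.0):
    """Viterbi-align each summary sentence to one book passage."""
    n, m = len(passages), len(summary)
    # Log emissions: e[i, t] ~ how well passage i explains summary sentence t.
    e = np.log(np.array([[overlap(p, s) for s in summary] for p in passages]) + 1e-6)
    # Log transitions: penalize long jumps, and backward moves more heavily.
    idx = np.arange(n)
    trans = -jump_penalty * np.abs(idx[:, None] - idx[None, :]).astype(float)
    trans[idx[:, None] > idx[None, :]] -= 5.0  # discourage moving backward
    # Standard Viterbi recursion over summary positions.
    delta = e[:, 0].copy()
    back = np.zeros((n, m), dtype=int)
    for t in range(1, m):
        scores = delta[:, None] + trans          # scores[from, to]
        back[:, t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + e[:, t]
    path = [int(delta.argmax())]
    for t in range(m - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]  # passage index aligned to each summary sentence

passages = ["the knight rode into the dark forest",
            "a dragon guarded the hidden treasure",
            "the knight defeated the dragon at dawn"]
summary = ["a knight enters a forest", "he defeats a dragon"]
print(align(passages, summary))
```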
An Empirical Comparison of Parsing Methods for Stanford Dependencies
Stanford typed dependencies are a widely desired representation of natural
language sentences, but parsing is one of the major computational bottlenecks
in text analysis systems. In light of the evolving definition of the Stanford
dependencies and developments in statistical dependency parsing algorithms,
this paper revisits the question of Cer et al. (2010): what is the tradeoff
between accuracy and speed in obtaining Stanford dependencies in particular? We
also explore the effects of input representations on this tradeoff:
part-of-speech tags, the novel use of an alternative dependency representation
as input, and distributional representations of words. We find that direct
dependency parsing is a more viable solution than it was found to be in the
past. An accompanying software release can be found at:
http://www.ark.cs.cmu.edu/TBSD
Comment: 13 pages, 2 figures
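A minimal sketch of the kind of accuracy-vs-speed comparison described above, assuming toy data; parse_fast and parse_accurate are hypothetical stand-ins, not the systems the paper benchmarks, and accuracy here is unlabeled attachment score (the fraction of tokens whose predicted head matches the gold head).

```python
import time

def uas(pred_heads, gold_heads):
    """Unlabeled attachment score over a list of sentences."""
    correct = total = 0
    for p, g in zip(pred_heads, gold_heads):
        correct += sum(ph == gh for ph, gh in zip(p, g))
        total += len(g)
    return correct / total

def benchmark(parser, sentences, gold_heads):
    """Time a parser over a corpus and report (tokens/sec, UAS)."""
    start = time.perf_counter()
    pred = [parser(s) for s in sentences]
    elapsed = time.perf_counter() - start
    tokens = sum(len(s) for s in sentences)
    return tokens / elapsed, uas(pred, gold_heads)

# Toy stand-ins, using 1-indexed heads with 0 = root: a right-branching
# baseline where each token attaches to the token before it.
def parse_fast(tokens):
    return [i for i in range(len(tokens))]

def parse_accurate(tokens):
    time.sleep(0.001)  # simulate heavier inference; same toy output
    return [i for i in range(len(tokens))]

sentences = [["I", "saw", "her"], ["dogs", "bark"]]
gold = [[2, 0, 2], [2, 0]]
for name, parser in [("fast", parse_fast), ("accurate", parse_accurate)]:
    speed, acc = benchmark(parser, sentences, gold)
    print(f"{name}: {speed:.0f} tokens/sec, UAS={acc:.2f}")
```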
Conditional Random Field Autoencoders for Unsupervised Structured Prediction
We introduce a framework for unsupervised learning of structured predictors
with overlapping, global features. Each input's latent representation is
predicted conditional on the observable data using a feature-rich conditional
random field. Then a reconstruction of the input is (re)generated, conditional
on the latent structure, using models for which maximum likelihood estimation
has a closed-form. Our autoencoder formulation enables efficient learning
without making unrealistic independence assumptions or restricting the kinds of
features that can be used. We illustrate insightful connections to traditional
autoencoders, posterior regularization and multi-view learning. We show
competitive results with instantiations of the model for two canonical NLP
tasks: part-of-speech induction and bitext word alignment, and show that
training our model can be substantially more efficient than comparable
feature-rich baselines.
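A drastically simplified sketch of the encode-then-reconstruct structure, assuming numpy: a latent label is predicted for each token, the token is reconstructed conditional on the label, and both distributions here have closed-form M-steps. This toy replaces the paper's feature-rich, structured CRF encoder with a per-word categorical table, so it shows only the autoencoder-style EM loop, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the dog runs the cat sleeps the dog sleeps".split()
vocab = sorted(set(corpus))
V, K = len(vocab), 2                      # vocabulary size, latent label count
x = np.array([vocab.index(w) for w in corpus])

enc = rng.dirichlet(np.ones(K), size=V)   # q(y | word): the "encoder"
rec = rng.dirichlet(np.ones(V), size=K)   # p(word | y): the reconstruction model

for step in range(50):
    # E-step: posterior over labels given both encoder and reconstruction.
    post = enc[x] * rec[:, x].T           # shape (n_tokens, K)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: both categorical tables have closed-form ML updates from counts.
    for w in range(V):
        mask = x == w
        enc[w] = post[mask].sum(axis=0) + 1e-3
        enc[w] /= enc[w].sum()
    counts = np.zeros((K, V))
    for t, w in enumerate(x):
        counts[:, w] += post[t]
    rec = (counts + 1e-3) / (counts + 1e-3).sum(axis=1, keepdims=True)

for w, word in enumerate(vocab):
    print(word, "->", int(enc[w].argmax()))  # induced POS-like cluster per word
```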
Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts
Understanding how ideas relate to each other is a fundamental question in
many domains, ranging from intellectual history to public communication.
Because ideas are naturally embedded in texts, we propose the first framework
to systematically characterize the relations between ideas based on their
occurrence in a corpus of documents, independent of how these ideas are
represented. Combining two statistics (cooccurrence within documents and
prevalence correlation over time), our approach reveals a number of different
ways in which ideas can cooperate and compete. For instance, two ideas can
closely track each other's prevalence over time, and yet rarely cooccur, almost
like a "cold war" scenario. We observe that pairwise cooccurrence and
prevalence correlation exhibit different distributions. We further demonstrate
that our approach is able to uncover intriguing relations between ideas through
in-depth case studies on news articles and research papers.
Comment: 11 pages, 9 figures, to appear in Proceedings of ACL 2017; code and
data available at https://chenhaot.com/pages/idea-relations.html (fixed a
typo)
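A minimal sketch of the two statistics, assuming keyword matching as a stand-in for idea detection (the framework itself is independent of how ideas are represented): pointwise mutual information for within-document cooccurrence, and Pearson correlation of per-year prevalence.

```python
from itertools import combinations
import math

docs = [
    {"year": 2015, "text": "deep learning for parsing"},
    {"year": 2015, "text": "neural networks and deep learning"},
    {"year": 2016, "text": "parsing with neural networks"},
    {"year": 2016, "text": "deep learning everywhere"},
]
ideas = ["deep learning", "neural networks", "parsing"]

present = [{i for i in ideas if i in d["text"]} for d in docs]
n = len(docs)

def pmi(a, b):
    """Document-level cooccurrence: PMI of seeing both ideas in one document."""
    pa = sum(a in s for s in present) / n
    pb = sum(b in s for s in present) / n
    pab = sum(a in s and b in s for s in present) / n
    return math.log(pab / (pa * pb)) if pab > 0 else float("-inf")

def prevalence_corr(a, b):
    """Pearson correlation of yearly prevalence (fraction of docs per year)."""
    years = sorted({d["year"] for d in docs})
    def series(idea):
        return [sum(idea in present[j] for j, d in enumerate(docs)
                    if d["year"] == y) /
                sum(d["year"] == y for d in docs) for y in years]
    xa, xb = series(a), series(b)
    ma, mb = sum(xa) / len(xa), sum(xb) / len(xb)
    cov = sum((p - ma) * (q - mb) for p, q in zip(xa, xb))
    va = math.sqrt(sum((p - ma) ** 2 for p in xa))
    vb = math.sqrt(sum((q - mb) ** 2 for q in xb))
    return cov / (va * vb) if va and vb else 0.0

for a, b in combinations(ideas, 2):
    print(f"{a} / {b}: PMI={pmi(a, b):.2f}, corr={prevalence_corr(a, b):.2f}")
```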
- …