42 research outputs found
Principled Approaches to Automatic Text Summarization
Automatic text summarization is a particularly challenging Natural Language Processing (NLP) task involving natural language understanding, content selection, and natural language generation. In this thesis, we concentrate on content selection, the core problem of summarization, which is governed by the notion of information Importance.
We present a simple and intuitive formulation of the summarization task as two components: a summary scoring function θ measuring how good a text is as a summary of the given sources, and an optimization technique O extracting a summary with a high score according to θ. This perspective offers interesting insights into previous summarization efforts and allows us to pinpoint promising research directions. In particular, we observe that previous works heavily constrained the summary scoring function in order to solve convenient optimization problems (e.g., Integer Linear Programming (ILP)). We question this assumption and demonstrate that General Purpose Optimization (GPO) techniques, such as genetic algorithms, are practical. These GPOs require no mathematical properties from the objective function, so the summary scoring function can be relieved of its previously imposed constraints.
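To make the GPO claim concrete, here is a minimal genetic-algorithm sketch for extractive summarization (not the thesis's exact implementation); the scorer `theta` is a black-box stand-in, which is precisely the point: no linearity or submodularity is assumed.

```python
import random

def genetic_summarizer(sentences, theta, k=3, pop_size=20, generations=50):
    """Evolve a k-sentence extractive summary maximizing a black-box scorer
    theta; theta needs no mathematical properties, which is what GPO buys us."""
    n = len(sentences)

    def score(idx):
        return theta([sentences[i] for i in sorted(idx)])

    # Each individual is a set of k sentence indices.
    pop = [frozenset(random.sample(range(n), k)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[:pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = set(random.sample(sorted(a | b), k))   # crossover
            if random.random() < 0.3:                      # mutation
                child.discard(random.choice(sorted(child)))
                while len(child) < k:
                    child.add(random.randrange(n))
            children.append(frozenset(child))
        pop = survivors + children
    best = max(pop, key=score)
    return [sentences[i] for i in sorted(best)]
```

Because survivors are carried over unchanged, the best summary found so far is never lost between generations.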
Additionally, the summary scoring function can be evaluated on its own, based on its ability to correlate with human judgments. This offers a principled way of examining the inner workings of summarization systems and complements the traditional evaluation of extracted summaries. In fact, evaluation metrics are themselves summary scoring functions that should correlate well with humans. Thus, the two main challenges of summarization, evaluation and the development of summarizers, are unified within the same setup: discovering strong summary scoring functions. Hence, we investigated ways of uncovering such functions.
First, we conducted an empirical study of learning the summary scoring function from data. The results show that an unconstrained summary scoring function is better able to correlate with humans. Furthermore, an unconstrained summary scoring function optimized approximately with GPO extracts better summaries than a constrained summary scoring function optimized exactly with, e.g., ILP. Along the way, we proposed techniques to leverage the small and biased human judgment datasets. Additionally, we released a new evaluation metric explicitly trained to maximize its correlation with humans.
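The "correlate with humans" criterion can be sketched directly: given human scores and θ scores for the same candidate summaries, a rank correlation such as Kendall's τ measures agreement. The data below is hypothetical.

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation between two score lists (ties ignored)."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Concordant pair: both scorers order summaries i and j the same way.
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical human and theta scores for the same four candidate summaries.
human = [0.9, 0.4, 0.7, 0.2]
theta_scores = [0.8, 0.5, 0.6, 0.1]
tau = kendall_tau(human, theta_scores)  # 1.0: theta ranks them like humans
```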
Second, we developed a theoretical formulation of the notion of Importance. In a framework rooted in information theory, we defined the quantities: Redundancy, Relevance and Informativeness. Importance arises as the notion unifying these concepts. More generally, Importance is the measure that guides which choices to make when information must be discarded.
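As an illustration only, these quantities can be instantiated over unigram distributions; the entropy- and cross-entropy-based formulas below are a simplified sketch in the spirit of the framework, not its exact definitions.

```python
import math
from collections import Counter

def unigram_dist(text, vocab, eps=0.01):
    """Smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values())

def cross_entropy(p, q):
    return -sum(p[w] * math.log2(q[w]) for w in p)

source = "the cat sat on the mat while the dog slept"
summary = "the cat sat on the mat"
vocab = set(source.split())
pS, pD = unigram_dist(summary, vocab), unigram_dist(source, vocab)

# Redundancy: a low-entropy (repetitive) summary scores high.
redundancy = math.log2(len(vocab)) - entropy(pS)
# Relevance: the summary distribution should be close to the source's.
relevance = -cross_entropy(pS, pD)
```

Importance would then trade these quantities off against each other when deciding which information to discard.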
Finally, evaluation remains an open problem with a massive impact on summarization progress. Thus, we conducted experiments on the human judgment datasets commonly used to compare evaluation metrics. We discovered that these datasets do not cover the high-quality range in which summarization systems and evaluation metrics operate. This motivates collecting human judgments for high-scoring summaries, which would be necessary to settle the debate over which metric to use and would also greatly benefit summarization systems and metrics alike.
Descartes: Generating Short Descriptions of Wikipedia Articles
Wikipedia is one of the richest knowledge sources on the Web today. In order
to facilitate navigating, searching, and maintaining its content, Wikipedia's
guidelines state that all articles should be annotated with a so-called short
description indicating the article's topic (e.g., the short description of beer
is "Alcoholic drink made from fermented cereal grains"). Nonetheless, a large
fraction of articles (ranging from 10.2% in Dutch to 99.7% in Kazakh) have no
short description yet, with detrimental effects for millions of Wikipedia
users. Motivated by this problem, we introduce the novel task of automatically
generating short descriptions for Wikipedia articles and propose Descartes, a
multilingual model for tackling it. Descartes integrates three sources of
information to generate an article description in a target language: the text
of the article in all its language versions, the already-existing descriptions
(if any) of the article in other languages, and semantic type information
obtained from a knowledge graph. We evaluate a Descartes model trained for
handling 25 languages simultaneously, showing that it beats baselines
(including a strong translation-based baseline) and performs on par with
monolingual models tailored for specific languages. A human evaluation on three
languages further shows that the quality of Descartes's descriptions is largely
indistinguishable from that of human-written descriptions; e.g., 91.3% of our
English descriptions (vs. 92.1% of human-written descriptions) pass the bar for
inclusion in Wikipedia, suggesting that Descartes is ready for production, with
the potential to support human editors in filling a major gap in today's
Wikipedia across languages.
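A sketch of how the three input sources might be serialized into a single encoder input; the separator tokens, field order, and function name are illustrative assumptions, not the paper's exact scheme.

```python
def build_descartes_input(article_texts, existing_descriptions,
                          semantic_types, target_lang):
    """Serialize the three information sources into one model input string.

    article_texts: {lang: leading article text}
    existing_descriptions: {lang: short description}, possibly empty
    semantic_types: list of types from a knowledge graph (e.g., Wikidata)
    """
    parts = [f"<target:{target_lang}>"]
    # Descriptions already written in other languages, if any.
    for lang, desc in sorted(existing_descriptions.items()):
        parts.append(f"<desc:{lang}> {desc}")
    # Semantic type information from the knowledge graph.
    if semantic_types:
        parts.append("<types> " + ", ".join(semantic_types))
    # The article text in all available language versions.
    for lang, text in sorted(article_texts.items()):
        parts.append(f"<article:{lang}> {text}")
    return " ".join(parts)

example = build_descartes_input(
    {"en": "Beer is an alcoholic drink made from fermented cereal grains."},
    {"de": "Alkoholisches Getränk"},
    ["beverage"],
    "en",
)
```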
Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations
Average word embeddings are a common baseline for more sophisticated sentence
embedding techniques. However, they typically fall short of the performances of
more complex models such as InferSent. Here, we generalize the concept of
average word embeddings to power mean word embeddings. We show that the
concatenation of different types of power mean word embeddings considerably
closes the gap to state-of-the-art methods monolingually and substantially
outperforms these more complex techniques cross-lingually. In addition, our
proposed method outperforms different recently proposed baselines such as SIF
and Sent2Vec by a solid margin, thus constituting a much harder-to-beat
monolingual baseline. Our data and code are publicly available.
Comment: Experiments/plots added: Normalization + Figure 1 (dimensionality vs. performance)
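The generalization is straightforward to sketch: the component-wise power mean reduces to the plain average at p = 1 and to component-wise max/min at p = ±∞, and concatenating several choices of p yields the sentence embedding. The toy vectors and the signed-root handling of negative components below are implementation assumptions.

```python
import math

def power_mean(vectors, p):
    """Component-wise power mean of equal-length word vectors."""
    n, dim = len(vectors), len(vectors[0])
    if p == math.inf:
        return [max(v[j] for v in vectors) for j in range(dim)]
    if p == -math.inf:
        return [min(v[j] for v in vectors) for j in range(dim)]
    out = []
    for j in range(dim):
        m = sum(v[j] ** p for v in vectors) / n
        # Signed root keeps odd integer p well-defined for negative components.
        out.append(math.copysign(abs(m) ** (1 / p), m))
    return out

def sentence_embedding(word_vectors, powers=(1, math.inf, -math.inf)):
    """Concatenate power means for several p; p=1 alone is plain averaging."""
    emb = []
    for p in powers:
        emb.extend(power_mean(word_vectors, p))
    return emb

words = [[0.2, -0.5], [0.4, 0.1], [0.6, 0.3]]
emb = sentence_embedding(words)  # 3 powers x 2 dims = 6-dimensional
```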
Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction
Large language models (LLMs) have great potential for synthetic data
generation. This work shows that useful data can be synthetically generated
even for tasks that cannot be solved directly by LLMs: for problems with
structured outputs, it is possible to prompt an LLM to perform the task in the
reverse direction, by generating plausible input text for a target output
structure. Leveraging this asymmetry in task difficulty makes it possible to
produce large-scale, high-quality data for complex tasks. We demonstrate the
effectiveness of this approach on closed information extraction, where
collecting ground-truth data is challenging, and no satisfactory dataset exists
to date. We synthetically generate a dataset of 1.8M data points, establish its
superior quality compared to existing datasets in a human evaluation, and use
it to finetune small models (220M and 770M parameters), termed SynthIE, that
outperform the prior state of the art (with equal model size) by a substantial
margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data,
and models are available at https://github.com/epfl-dlab/SynthIE.
Comment: Accepted at EMNLP 202
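The reverse-direction idea can be sketched as follows; `llm_generate` is a hypothetical stand-in for any text-generation call, and the sampling scheme and prompt wording are illustrative assumptions.

```python
import random

def make_reverse_prompt(triplets):
    """Ask for text EXPRESSING the triplets (the easy direction) instead of
    extracting triplets from text (the hard direction)."""
    facts = "; ".join(f"({s}, {r}, {o})" for s, r, o in triplets)
    return f"Write a short, fluent paragraph stating these facts: {facts}"

def synthesize_example(kg_triplets, llm_generate, k=2):
    """Sample the target output structure first, then generate its input text.
    llm_generate is a hypothetical text-generation callable (an assumption)."""
    target = random.sample(kg_triplets, k)            # y: the output structure
    text = llm_generate(make_reverse_prompt(target))  # x: the generated input
    return {"input_text": text, "target_triplets": target}
```

Repeating this loop over a knowledge graph yields (text, triplets) training pairs without any human annotation.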
Live Blog Corpus for Summarization
Live blogs are an increasingly popular news format to cover breaking news and
live events in online journalism. Online news websites around the world are
using this medium to give their readers a minute by minute update on an event.
Good summaries enhance the value of the live blogs for a reader but are often
not available. In this paper, we study a way of collecting corpora for
automatic live blog summarization. In an empirical evaluation using well-known
state-of-the-art summarization systems, we show that the live blog corpus poses
new challenges in the field of summarization. We make our tools for
reconstructing the corpus publicly available, to encourage the research
community to replicate our results.
Comment: To appear in the Proceedings of LREC 201
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Automatic evaluation metrics capable of replacing human judgments are
critical to allowing fast development of new methods. Thus, numerous research
efforts have focused on crafting such metrics. In this work, we take a step
back and analyze recent progress by comparing the body of existing automatic
metrics and human metrics altogether. As metrics are used based on how they
rank systems, we compare metrics in the space of system rankings. Our extensive
statistical analysis reveals surprising findings: automatic metrics -- old and
new -- are much more similar to each other than to humans. Automatic metrics
are not complementary and rank systems similarly. Strikingly, human metrics
predict each other much better than the combination of all automatic metrics
predicts a human metric. This is surprising because human metrics are often
designed to be independent, capturing different aspects of quality, e.g.,
content fidelity or readability. We provide a discussion of these findings and
recommendations for future work in the field of evaluation.
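The "compare metrics in the space of system rankings" setup can be sketched with a rank correlation computed over the system rankings each metric induces; the metric names and scores below are hypothetical.

```python
def ranks(scores):
    """Rank positions (0 = best system) induced by a list of system scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman correlation between two rankings (no ties assumed)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores that three metrics assign to the same five systems.
metric_scores = {
    "rouge_like": [0.40, 0.35, 0.50, 0.20, 0.45],
    "bert_like":  [0.42, 0.36, 0.51, 0.22, 0.44],
    "human_A":    [0.30, 0.60, 0.50, 0.10, 0.40],
}
r = {m: ranks(s) for m, s in metric_scores.items()}
# Pairwise similarity of metrics in ranking space.
sim = {(a, b): spearman(r[a], r[b])
       for a in metric_scores for b in metric_scores if a < b}
```

In this toy example the two automatic metrics rank systems identically while both disagree with the human metric, mirroring the paper's finding at a small scale.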