NEWTS: A Corpus for News Topic-Focused Summarization
Text summarization models are approaching human levels of fidelity. Existing
benchmarking corpora provide concordant pairs of full and abridged versions of
Web, news, or professional content. To date, all summarization datasets operate
under a one-size-fits-all paradigm that may not reflect the full range of
organic summarization needs. Several recently proposed models (e.g., plug-and-play
language models) have the capacity to condition the generated summaries on
a desired range of themes. These capacities remain largely unused and
unevaluated, as there is no dedicated dataset that would support the task of
topic-focused summarization.
This paper introduces the first topical summarization corpus, NEWTS, based on
the well-known CNN/Dailymail dataset and annotated via online crowd-sourcing.
Each source article is paired with two reference summaries, each focusing on a
different theme of the source document. We evaluate a representative range of
existing techniques and analyze the effectiveness of different prompting
methods.
Recommended from our members
The Anatomy of Discourse: Linguistic Predictors of Narrative and Argument Quality
Narratives (sequences of purposively related concrete situations) and arguments (reasoning and conclusions in an attempt to persuade) are distinct cornerstones of human discourse. While theories of their linguistic structures exist, it is unclear which theorized features influence the perception of narrative and argument quality. Furthermore, differences in their usage over time and across formal versus informal mediums remain unexplored. Thus, we use an original dataset of news and Reddit discourse (consisting of >10,000 clauses), annotated for clause-level discourse elements (e.g., generic statements vs. events; Smith, 2003) and their coherence relations (e.g., cause/effect; Wolf & Gibson, 2005). We identify the features that correspond to differing perceptions of narrative and argument quality across multiple dimensions. Since the documents cover marijuana legalization discourse during a period of massive attitude shift in the U.S. (2008-2019), we also examine changes over time in discourse structure within this rapidly evolving sociopolitical context.
Can computers tell a story? Discourse Structure in Computer-generated Text and Humans
Text-generation algorithms like GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) produce documents that resemble coherent human writing. But no study has compared the discourse-linguistic features of the artificial text with those of comparable human content. We used a sample of Reddit and news discourse as prompts to generate artificial text using a fine-tuned GPT-2 (Grover; Zellers et al., 2019). Blind annotators identified clause-level discourse features (e.g., states and events; Smith, 2003) and coherence relations (e.g., contrast; Wolf and Gibson, 2005) in the prompts and generated text. Comparing the >20,000 clauses, Grover recreates human word co-occurrence patterns and clause types across discourse modes. However, its coherence relations are shorter and of lower quality, with many nonsensical instances. As a result, annotators could perfectly guess the human or algorithmic source of documents. Using a corresponding GPT-3 sample, we discuss aspects of generation that have and have not improved since Grover.