SumREN: Summarizing Reported Speech about Events in News
A primary objective of news articles is to establish the factual record for an event, frequently achieved by conveying both the details of the specified event (i.e., the 5 Ws: Who, What, Where, When, and Why) and how people reacted to it (i.e., reported statements). However, existing work on news summarization focuses almost exclusively on the event details. In this work, we propose the novel task of summarizing the reactions of different speakers, as expressed by their reported statements, to a given event. To this end, we create a new multi-document summarization benchmark, SumREN, comprising 745 summaries of reported statements from various public figures, obtained from 633 news articles discussing 132 events. We propose an automatic silver-training-data generation approach for our task, which helps smaller models like BART achieve GPT-3-level performance. Finally, we introduce a pipeline-based framework for summarizing reported speech, which we empirically show generates summaries that are more abstractive and factual than those of baseline query-focused summarization approaches.
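To make the task shape concrete, here is a minimal sketch of a pipeline-style reported-speech summarizer: group reported statements by speaker for a given event, then summarize each speaker's statements with a seq2seq model such as BART. The stage layout, the `summarize_reactions` helper, and the choice of `facebook/bart-large-cnn` are illustrative assumptions, not the paper's actual system.

```python
# Minimal sketch of a pipeline for summarizing reported speech about an event.
# Stage names, data layout, and model choice are assumptions for exposition,
# not the SumREN paper's actual implementation.
from collections import defaultdict
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_reactions(statements):
    """statements: list of (speaker, reported_statement) pairs for one event.

    Returns a dict mapping each speaker to a summary of their statements.
    """
    by_speaker = defaultdict(list)
    for speaker, text in statements:
        by_speaker[speaker].append(text)

    summaries = {}
    for speaker, texts in by_speaker.items():
        joined = " ".join(texts)
        out = summarizer(joined, max_length=60, min_length=10, do_sample=False)
        summaries[speaker] = out[0]["summary_text"]
    return summaries
```

In a pipeline of this shape, quote extraction and speaker attribution would run upstream to produce the `(speaker, statement)` pairs; only the final summarization step is shown here.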
PLAtE: A Large-scale Dataset for List Page Web Extraction
Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier to continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items. PLAtE encompasses two tasks: (1) finding product-list segmentation boundaries and (2) extracting attributes for each product. It is composed of 53,905 items from 6,810 pages, making it the first large-scale list-page web extraction dataset. We construct PLAtE by collecting list pages from Common Crawl and annotating them on Mechanical Turk. Quantitative and qualitative analyses demonstrate that PLAtE has high-quality annotations. We establish strong baseline performance, with a SOTA model achieving an F1-score of 0.750 for attribute classification and 0.915 for segmentation, indicating opportunities for future research in web extraction.
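As a rough illustration of how the two tasks compose, the sketch below models a list page as a flat sequence of DOM-node records, uses boundary labels to group nodes into per-product segments (task 1), and carries an attribute label on each node (task 2). The `Node` fields and the attribute label set are hypothetical, chosen only to illustrate the task structure; consult the dataset release for the actual schema.

```python
# Illustrative sketch of the two PLAtE tasks as node-level labeling over a
# flattened DOM. Field names and the label set are assumptions for exposition.
from dataclasses import dataclass
from typing import List

ATTRIBUTES = ["product-name", "price", "rating", "none"]  # hypothetical labels

@dataclass
class Node:
    text: str
    is_boundary: bool  # task 1: does a new product segment start at this node?
    attribute: str     # task 2: which attribute (if any) this node carries

def segment(nodes: List[Node]) -> List[List[Node]]:
    """Group a page's nodes into per-product segments using boundary labels."""
    segments, current = [], []
    for node in nodes:
        if node.is_boundary and current:
            segments.append(current)  # close the previous product's segment
            current = []
        current.append(node)
    if current:
        segments.append(current)
    return segments
```

Under this framing, the reported baselines correspond to predicting `is_boundary` per node (segmentation F1 of 0.915) and `attribute` per node (classification F1 of 0.750), with gold labels structured as above.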