MegaWika: Millions of reports and their sources across 50 diverse languages
To foster the development of new models for collaborative AI-assisted report
generation, we introduce MegaWika, consisting of 13 million Wikipedia articles
in 50 diverse languages, along with their 71 million referenced source
materials. We process this dataset for a myriad of applications, going beyond
the initial Wikipedia citation extraction and web scraping of content,
including translating non-English articles for cross-lingual applications and
providing FrameNet parses for automated semantic analysis. MegaWika is the
largest resource for sentence-level report generation and the only report
generation dataset that is multilingual. We manually analyze the quality of
this resource through a semantically stratified sample. Finally, we provide
baseline results and trained models for crucial steps in automated report
generation: cross-lingual question answering and citation retrieval.
Comment: Submitted to ACL, 202
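Citation retrieval of the kind the abstract describes, matching a report sentence to its supporting source passage, can be illustrated with a minimal lexical baseline. The sketch below is not MegaWika's baseline system; it is a generic TF-IDF cosine ranker over hypothetical source passages, shown only to make the task concrete.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_citation(sentence, sources):
    """Return the index of the source passage most similar to the sentence."""
    docs = [s.lower().split() for s in sources]
    vecs, idf = tfidf_vectors(docs)
    q_tf = Counter(sentence.lower().split())
    query = {t: q_tf[t] * idf.get(t, 1.0) for t in q_tf}
    scores = [cosine(query, v) for v in vecs]
    return max(range(len(sources)), key=scores.__getitem__)

# Toy example: pick the passage that best supports the report sentence.
sources = [
    "the eiffel tower was completed in 1889 for the world fair",
    "paris is the capital and most populous city of france",
    "the tower is 330 metres tall and made of wrought iron",
]
best = retrieve_citation("the eiffel tower opened in 1889", sources)  # 0
```

A real system would work at MegaWika's scale with learned multilingual encoders rather than word overlap, but the retrieval interface, scoring a sentence against candidate sources, is the same.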
Improving Semantic Parsing Using Statistical Word Sense Disambiguation (Student Abstract)
A semantic parser generates a logical form graph from an utterance, where the edges are semantic roles and the nodes are word senses in an ontology that supports reasoning. The generated representation attempts to capture the full meaning of the utterance. While the parsing process works to resolve lexical ambiguity, a number of errors in the logical forms arise from incorrectly assigned word senses. This is especially true in logical and rule-based semantic parsers. Although statistical word sense disambiguation methods outperform the word sense output of semantic parsers, these systems do not produce the rich role structure or detailed semantic representation of the sentence content. In this work, we use decisions from a statistical WSD system to inform a logical semantic parser and greatly improve semantic type assignments in the resulting logical forms.
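The core WSD step the abstract relies on, choosing among candidate senses given sentence context, can be sketched with a Lesk-style gloss-overlap heuristic. This is an illustrative toy, not the paper's statistical system; the sense inventory and glosses below are invented for the example.

```python
# Hypothetical two-sense inventory for "bank"; real systems draw on an
# ontology such as WordNet and a trained statistical disambiguator.
SENSES = {
    "bank": {
        "bank.n.01": "financial institution that accepts deposits and lends money",
        "bank.n.02": "sloping land beside a body of water such as a river",
    }
}

def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most tokens with the context."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = disambiguate("bank", "she sat on the bank of the river and watched the water")
# -> "bank.n.02": the gloss shares "of", "water", and "river" with the context
```

In the paper's setting, a sense decision like this would override or constrain the parser's own lexical choice, leaving the parser's role structure intact while correcting the node labels.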
On Event Individuation for Document-Level Information Extraction
As information extraction (IE) systems have grown more adept at processing
whole documents, the classic task of template filling has seen renewed interest
as a benchmark for document-level IE. In this position paper, we call into
question the suitability of template filling for this purpose. We argue that
the task demands definitive answers to thorny questions of event individuation
-- the problem of distinguishing distinct events -- about which even human
experts disagree. Through an annotation study and error analysis, we show that
this raises concerns about the usefulness of template filling metrics, the
quality of datasets for the task, and the ability of models to learn it.
Finally, we consider possible solutions.