AltGen: 1.3M Plausible Alternatives From Neural Text Generators

Fernández, Raquel; Giulianelli, Mario; Wallbridge, Sarenne

Authors: Raquel Fernández, Mario Giulianelli, Sarenne Wallbridge
Publication date: 20 October 2023
Publisher: Zenodo

Abstract

The AltGen dataset contains 1.3 million English texts generated by neural language generators conditioned on contexts from three corpora of acceptability judgements and two corpora of reading times. For each corpus, each text generator, and each sampling algorithm, 100 generations are sampled, for a total of 1,257,300 generations. Details about the language generators and the corpora are presented in a paper published at EMNLP 2023 (in particular, Section 4). Please cite this paper if you use any version of the dataset in your work:

    Mario Giulianelli, Sarenne Wallbridge, and Raquel Fernández. 2023. Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

The files are in JSONL format. Each entry includes a context_id field, which allows retrieving the relevant entry from the original corpus, and an alternatives field, which contains the language model generations. Please note that the alternatives are not post-processed (see code and footnote 2 in the paper for further details). Filenames are built as follows: DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl.
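The file layout described above can be read with a short script. The following is a minimal Python sketch, not code from the dataset authors. It assumes that each line is a JSON object, that alternatives holds a list of strings, and that NumAlternatives and MaxGenerationLength are integers. The helper names parse_filename and read_alternatives, the regular expression, and the filename in the usage comment are hypothetical and only illustrate the naming scheme.

```python
import json
import re

# Assumed reading of the naming scheme
# DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl
# The character set allowed in each component is a guess, not part of the record.
FILENAME_PATTERN = re.compile(
    r"^(?P<algorithm>[^_]+)_(?P<parameter>.+?)"
    r"-n(?P<num_alternatives>\d+)"
    r"-maxlen_(?P<max_length>\d+)"
    r"-sep_(?P<separator>.+)\.jsonl$"
)


def parse_filename(name):
    """Split an AltGen-style filename into its components (hypothetical helper)."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the expected scheme: {name}")
    info = match.groupdict()
    # Assumption: both counts are plain integers.
    info["num_alternatives"] = int(info["num_alternatives"])
    info["max_length"] = int(info["max_length"])
    return info


def read_alternatives(path):
    """Yield (context_id, alternatives) pairs from one JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # The generations are not post-processed, so they are returned as stored.
            yield record["context_id"], record["alternatives"]


# Usage with a made-up filename:
# info = parse_filename("nucleus_0.9-n100-maxlen_50-sep_eos.jsonl")
# for context_id, alternatives in read_alternatives("path/to/file.jsonl"):
#     ...
```

Reading line by line keeps memory use low, which matters if a single file holds many contexts with 100 generations each.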

Available Versions

ZENODO

oai:zenodo.org:10006413

Last time updated on 10/02/2024