AltGen: 1.3M Plausible Alternatives From Neural Text Generators

Abstract

<p>The AltGen dataset contains 1.3 million English texts generated by neural language generators conditioned on contexts from three corpora of acceptability judgements and two corpora of reading times.</p><p>For each corpus, each text generator, and each sampling algorithm, 100 generations are sampled, for a total of 1,257,300 generations. Details about the language generators and the corpora are presented in a paper published at EMNLP 2023 (see Section 4 in particular). Please cite this paper if you use any version of the dataset in your work:</p><blockquote><p>Mario Giulianelli, Sarenne Wallbridge, and Raquel Fernández. 2023. <strong>Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives</strong>. In <i>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</i>. Association for Computational Linguistics.</p></blockquote><p>The files are in jsonl format. Each entry includes a <i>context_id</i> field, which allows retrieving the relevant entry from the original corpus, and an <i>alternatives</i> field, which contains the language model generations. Please note that the alternatives are not post-processed (see the code and footnote 2 in the paper for further details). Filenames are built as follows: <i>DecodingAlgorithm</i>_<i>DecodingParameter</i>-n<i>NumAlternatives</i>-maxlen_<i>MaxGenerationLength</i>-sep_<i>Separator</i>.jsonl.</p>
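As a rough illustration of the file layout described above, the following Python sketch parses a filename into its components and iterates over the entries of one jsonl file. Several things here are assumptions rather than facts from the dataset documentation: the helper names parse_filename and iter_alternatives are made up for this example, the regular expression assumes the decoding algorithm name contains no underscore, and the alternatives field is assumed to hold a list of strings. The example filename in the comment is hypothetical.

```python
import json
import re
from pathlib import Path

# Pattern for names of the form
#   DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl
# Assumption: the decoding algorithm name itself contains no underscore.
FILENAME_RE = re.compile(
    r"^(?P<algorithm>[^_]+)_(?P<parameter>.+?)"
    r"-n(?P<num_alternatives>\d+)"
    r"-maxlen_(?P<max_length>\d+)"
    r"-sep_(?P<separator>.+)\.jsonl$"
)


def parse_filename(name):
    """Split a dataset filename into its components (all returned as strings)."""
    match = FILENAME_RE.match(Path(name).name)
    if match is None:
        raise ValueError(f"Filename does not follow the expected pattern: {name}")
    return match.groupdict()


def iter_alternatives(path):
    """Yield (context_id, alternatives) pairs from one jsonl file.

    Assumption: each non-empty line is a JSON object with the two fields
    named in the dataset description.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield record["context_id"], record["alternatives"]


# Hypothetical usage; this filename is illustrative, not taken from the dataset:
#   info = parse_filename("nucleus_0.9-n100-maxlen_50-sep_eos.jsonl")
#   for context_id, alternatives in iter_alternatives("path/to/file.jsonl"):
#       ...
```

Because the alternatives are not post-processed, any trimming or truncation of the generated strings would have to be applied by the caller after loading.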


    Last time updated on 10/02/2024