
    Incorporating Language Models into Non-autoregressive Neural Machine Translation

    In order to improve the fluency of a non-autoregressive model for neural machine translation, we propose an extension of the scoring model used during beam search decoding. We compute the score as a linear combination of feature values, including the score from an n-gram language model and other auxiliary features. We determine the feature weights using the structured perceptron algorithm. We train models for three language pairs and evaluate their decoding speed and translation quality. The results show that our proposed models remain efficient in terms of decoding speed while achieving scores competitive with autoregressive models.
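    The abstract includes no code; the following is a minimal sketch of the scoring idea it describes: a candidate's score during beam search is a weighted sum of feature values (an n-gram LM score plus auxiliary features), with the weights learned by structured perceptron updates. The feature names, values, and learning rate below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def candidate_score(features: np.ndarray, weights: np.ndarray) -> float:
    """Linear scoring function: s(y) = w . f(y)."""
    return float(np.dot(weights, features))

def perceptron_update(weights, gold_features, predicted_features, lr=1.0):
    """One structured-perceptron step: move the weights toward the features of the
    reference-best candidate and away from the current top-scoring candidate."""
    return weights + lr * (gold_features - predicted_features)

# Toy usage: rescore a 3-candidate beam; each candidate is described by the feature
# vector [n-gram LM log-prob, length penalty, base model score] (illustrative only).
weights = np.zeros(3)
beam = {
    "translation A": np.array([-12.3, 0.8, -5.1]),
    "translation B": np.array([-10.7, 1.0, -6.0]),
    "translation C": np.array([-15.2, 0.6, -4.9]),
}
best = max(beam, key=lambda c: candidate_score(beam[c], weights))
# If the top-scoring candidate is not the reference-best one (here assumed to be B),
# apply a perceptron update so that B scores higher next time.
if best != "translation B":
    weights = perceptron_update(weights, beam["translation B"], beam[best])
```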

    Mind the Labels: Describing Relations in Knowledge Graphs With Pretrained Models

    Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels such as column headings, keys, or relation names to generalize to out-of-domain examples. However, the models are known to produce semantically inaccurate outputs if these labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of describing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that although PLMs for D2T generation expectedly fail on unclear cases, models trained with a large variety of relation labels are surprisingly robust in verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains. (Long paper at EACL '23. Code and data: https://github.com/kasnerz/rel2tex)
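    As a concrete illustration of the relation-verbalization task above, the sketch below linearizes a (subject, relation label, object) triple into a prompt for a seq2seq PLM via the Hugging Face transformers API. The prompt format and the t5-small checkpoint are assumptions chosen for the example; the paper's actual input encoding and finetuned models are available in the linked repository.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Small publicly available checkpoint used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def verbalize(subject: str, relation: str, obj: str) -> str:
    """Linearize a knowledge-graph triple and ask the model to describe it in a sentence."""
    prompt = f"describe: subject: {subject} relation: {relation} object: {obj}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Without finetuning on relation-verbalization data the output will be unreliable;
# the call only demonstrates the input/output interface.
print(verbalize("Douglas Adams", "notable work", "The Hitchhiker's Guide to the Galaxy"))
```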

    Expand and Filter: CUNI and LMU Systems for the WNGT 2020 Duolingo Shared Task

    We present our submission to the Simultaneous Translation And Paraphrase for Language Education (STAPLE) challenge. We used a standard Transformer model for translation, with a cross-lingual classifier predicting correct translations on the output n-best list. To increase the diversity of the outputs, we used additional data to train the translation model, and we trained a paraphrasing model based on the Levenshtein Transformer architecture to generate further synonymous translations. The paraphrasing results were again filtered using our classifier. While the use of additional data and our classifier filter improved the results, the paraphrasing model produced too many invalid outputs to further improve the output quality. Our model without the paraphrasing component finished in the middle of the field for the shared task, improving over the best baseline by a margin of 10-22% absolute weighted F1.
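    A minimal sketch of the "expand and filter" step described above: the translation model proposes an n-best list, and a classifier keeps only the hypotheses it judges to be correct translations. The interface below (a probability-returning callable and a 0.5 threshold) is an assumption for illustration, not the submission's actual implementation.

```python
from typing import Callable, List, Tuple

def filter_nbest(
    source: str,
    nbest: List[Tuple[str, float]],            # (hypothesis, translation model score)
    is_correct: Callable[[str, str], float],   # classifier: P(correct | source, hypothesis)
    threshold: float = 0.5,
) -> List[str]:
    """Keep only hypotheses the classifier accepts, ordered by translation score."""
    kept = [(hyp, score) for hyp, score in nbest if is_correct(source, hyp) >= threshold]
    return [hyp for hyp, _ in sorted(kept, key=lambda item: item[1], reverse=True)]

# Toy usage with a stand-in classifier (a real one would be a trained cross-lingual model).
dummy_classifier = lambda src, hyp: 0.9 if hyp.endswith(".") else 0.2
print(filter_nbest(
    "Je mange une pomme.",
    [("I eat an apple.", -0.3), ("I am eating apple", -0.9), ("I am eating an apple.", -0.5)],
    dummy_classifier,
))
```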

    TabGenie: A Toolkit for Table-to-Text Generation

    Heterogeneity of data-to-text generation datasets limits research on data-to-text generation systems. We present TabGenie, a toolkit which enables researchers to explore, preprocess, and analyze a variety of data-to-text generation datasets through the unified framework of table-to-text generation. In TabGenie, all inputs are represented as tables with associated metadata. The tables can be explored through the web interface, which also provides an interactive mode for debugging table-to-text generation, facilitates side-by-side comparison of generated system outputs, and allows easy exports for manual analysis. Furthermore, TabGenie is equipped with command-line processing tools and Python bindings for unified dataset loading and processing. We release TabGenie as a PyPI package and provide its open-source code and a live demo at https://github.com/kasnerz/tabgenie. (Submitted to the ACL 2023 System Demonstration Track)
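    To make the unified table representation concrete, here is a small, library-agnostic sketch of modeling a data-to-text input as a table with associated metadata and linearizing it for a table-to-text model. The class and method names are illustrative only and are not TabGenie's actual API; see the repository above for the real interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Table:
    """An input item represented uniformly as a table plus metadata (illustrative)."""
    header: List[str]
    rows: List[List[str]]
    metadata: Dict[str, str] = field(default_factory=dict)  # e.g. title, source dataset

    def linearize(self) -> str:
        """Flatten the table into a simple textual form for a table-to-text model."""
        cells = " | ".join(
            f"{col}: {val}" for row in self.rows for col, val in zip(self.header, row)
        )
        title = self.metadata.get("title", "")
        return f"{title} || {cells}".strip(" |")

# Toy usage
table = Table(header=["team", "wins"], rows=[["Sparta", "21"]],
              metadata={"title": "League standings"})
print(table.linearize())  # -> "League standings || team: Sparta | wins: 21"
```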

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built through a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
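    The abstract does not prescribe any usage code; assuming the publicly released checkpoints on the Hugging Face Hub (the full model is published as bigscience/bloom), the sketch below loads the much smaller bloom-560m variant from the same family so it can run on commodity hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# bloom-560m is a small sibling of the 176B model; swap in "bigscience/bloom"
# only if you have the hardware to host the full checkpoint.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The ROOTS corpus covers 46 natural languages and", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```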

    Neural Pipeline for Zero-Shot Data-to-Text Generation

    In data-to-text (D2T) generation, training on in-domain data leads to overfitting to the data representation and to repeating noise from the training data. We examine how to avoid finetuning pretrained language models (PLMs) on D2T generation datasets while still taking advantage of the surface realization capabilities of PLMs. Inspired by pipeline approaches, we propose to generate text by transforming single-item descriptions with a sequence of modules trained on general-domain text-based operations: ordering, aggregation, and paragraph compression. We train PLMs for performing these operations on WikiFluent, a synthetic corpus we build from English Wikipedia. Our experiments on two major triple-to-text datasets, WebNLG and E2E, show that our approach enables D2T generation from RDF triples in zero-shot settings.
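    A minimal sketch of the pipeline structure described above: single-item descriptions are passed through three text-to-text modules (ordering, aggregation, and paragraph compression) applied in sequence. The placeholder modules below stand in for the PLMs trained on the WikiFluent corpus; the function signatures are an assumption for illustration.

```python
from typing import Callable, List

# Each module maps a list of sentences to a (possibly shorter) list of sentences.
TextModule = Callable[[List[str]], List[str]]

def d2t_pipeline(facts: List[str],
                 ordering: TextModule,
                 aggregation: TextModule,
                 compression: TextModule) -> str:
    """Turn single-fact sentences into a paragraph without in-domain finetuning."""
    ordered = ordering(facts)             # put the single-fact sentences in a natural order
    aggregated = aggregation(ordered)     # fuse sentences that talk about the same entity
    paragraph = compression(aggregated)   # compress the result into fluent text
    return " ".join(paragraph)

# Toy usage with identity placeholders standing in for the trained modules.
identity: TextModule = lambda sentences: sentences
print(d2t_pipeline(
    ["Alan Bean was born in Wheeler, Texas.", "Alan Bean was a crew member of Apollo 12."],
    identity, identity, identity,
))
```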
