HPLT’s First Release of Data and Models
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages, totalling 5.6T tokens, and parallel data in 18 language pairs, totalling 96M sentence pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design QUINTD – a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with QUINTD. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.
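As a rough illustration of the zero-shot setup described above, the following sketch prompts an open instruction-tuned LLM to verbalize a structured record. The weather record, the prompt wording, and the choice of Mistral-7B-Instruct are illustrative assumptions, not the QUINTD codebase.

```python
# Minimal sketch (not the QUINTD implementation): zero-shot data-to-text
# generation with an open instruction-tuned LLM from a structured record.
import json
from transformers import pipeline  # assumes the `transformers` library is installed

# Hypothetical structured record, e.g. as collected from a public API.
record = {
    "city": "Prague",
    "date": "2024-03-01",
    "temperature_c": {"min": 2, "max": 9},
    "conditions": "partly cloudy",
}

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one of the open LLMs of this kind
)

prompt = (
    "Write a short, factually accurate weather report based only on the "
    "following JSON data. Do not add information that is not in the data.\n\n"
    f"{json.dumps(record, indent=2)}\n\nReport:"
)

# Generate and print the verbalization of the record.
print(generator(prompt, max_new_tokens=150)[0]["generated_text"])
```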
UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2
The corpus contains recordings of native speakers of North Levantine Arabic (apc), acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
In this paper, we describe several reproductions of a human evaluation experiment measuring the quality of automatic dialogue summarization (Feng et al., 2021). We investigate the impact of the annotators' highest level of education, field of study, and native language on the evaluation of the informativeness of the summary. We find that the evaluation is relatively consistent regardless of these factors; the factor with the largest impact appears to be a prior specific background in natural language processing (as opposed to, e.g., a background in computer science). We also find that the experiment setup (asking for a single criterion vs. multiple criteria) may have an impact on the results.
Large Language Models: How they work and what they are good for
A brief explanation of how LLMs work and what they should and shouldn't be used for, including a showcase of failure cases and potential risks.
Large Language Models in Chatbot Applications
A short description of the use of LLMs in task-oriented dialogue, detailing potential problems and a proposed solution. The presentation included a short demonstration of our recent LLM-based dialogue system.
Similarity-Based Cluster Merging for Semantic Change Modeling
This paper describes our contribution to Subtask 1 of the AXOLOTL-24 Shared Task on unsupervised lexical semantic change modeling. In a joint task of word sense disambiguation and word sense induction on diachronic corpora, we significantly outperform the baseline by merging clusters of modern usage examples based on their similarities with the same historical word sense as well as their mutual similarities. We observe that multilingual sentence embeddings outperform language-specific ones in this task.
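A minimal sketch of the merging idea under assumed pre-computed embeddings: clusters of modern usage examples are merged when their centroids map to the same historical sense with high similarity, or when they are mutually similar enough. The function names, thresholds, and union-find bookkeeping are illustrative, not the authors' implementation.

```python
# Illustrative sketch (not the shared-task submission): similarity-based
# merging of clusters of modern usage examples.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_clusters(cluster_centroids, sense_embeddings, sense_thr=0.6, mutual_thr=0.8):
    """Map each cluster id to a merged group id (thresholds are hypothetical)."""
    n = len(cluster_centroids)
    parent = list(range(n))  # union-find structure over clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Nearest historical sense for each cluster centroid.
    nearest = [
        max(range(len(sense_embeddings)), key=lambda s: cosine(c, sense_embeddings[s]))
        for c in cluster_centroids
    ]
    for i in range(n):
        for j in range(i + 1, n):
            # Merge if both clusters are close to the same historical sense...
            same_sense = (
                nearest[i] == nearest[j]
                and cosine(cluster_centroids[i], sense_embeddings[nearest[i]]) > sense_thr
                and cosine(cluster_centroids[j], sense_embeddings[nearest[j]]) > sense_thr
            )
            # ...or if they are mutually similar enough.
            if same_sense or cosine(cluster_centroids[i], cluster_centroids[j]) > mutual_thr:
                union(i, j)
    return {i: find(i) for i in range(n)}
```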
Looking for LLMs' Limits in Dialogue & Data-to-text
An overview of our recent experiments aiming to find LLMs' limits in the tasks of dialogue modelling and data-to-text generation, including our survey of data leakage in LLMs.
Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
We introduce a simple approach that uses a large language model (LLM) to automatically implement a fully interpretable rule-based data-to-text system in pure Python. Experimental evaluation on the WebNLG dataset shows that such a system produces text of better quality (according to the BLEU and BLEURT metrics) than the same LLM prompted to produce outputs directly, and produces fewer hallucinations than a BART language model fine-tuned on the same data. Furthermore, at runtime, the approach generates text in a fraction of the processing time required by neural approaches, using only a single CPU.
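To make the idea concrete, here is an illustrative sketch (not the paper's released code) of what such an LLM-authored, purely rule-based generator could look like for a hypothetical WebNLG-style triple set; the LLM is only involved offline when writing the rules, and at runtime only this plain Python runs, with no model calls.

```python
# Illustrative example of an LLM-authored rule-based data-to-text generator.
# The predicates and templates below are hypothetical.
def triples_to_text(triples):
    """triples: list of (subject, predicate, object) tuples."""
    sentences = []
    for subj, pred, obj in triples:
        if pred == "birthPlace":
            sentences.append(f"{subj} was born in {obj}.")
        elif pred == "occupation":
            sentences.append(f"{subj} works as a {obj}.")
        else:
            # Fallback rule: verbalize unknown predicates generically.
            sentences.append(f"The {pred} of {subj} is {obj}.")
    return " ".join(sentences)

# Example usage on a single CPU, without any neural model:
print(triples_to_text([("Ada Lovelace", "birthPlace", "London"),
                       ("Ada Lovelace", "occupation", "mathematician")]))
```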
Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices
Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we conduct a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used, why they are chosen, and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details, and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field