HPLT’s First Release of Data and Models
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages, totalling 5.6T tokens, and parallel data in 18 language pairs, totalling 96M sentence pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design QUINTD – a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with QUINTD. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.
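As a rough illustration of the zero-shot setup described above, the following sketch prompts an open instruction-tuned LLM to verbalize a structured record. The weather record, the prompt wording, and the choice of Mistral-7B-Instruct are illustrative assumptions, not the QUINTD codebase.

```python
# Minimal sketch (not the QUINTD implementation): zero-shot data-to-text
# generation with an open instruction-tuned LLM from a structured record.
import json
from transformers import pipeline  # assumes the `transformers` library is installed

# Hypothetical structured record, e.g. as collected from a public API.
record = {
    "city": "Prague",
    "date": "2024-03-01",
    "temperature_c": {"min": 2, "max": 9},
    "conditions": "partly cloudy",
}

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one of the open LLMs of this kind
)

prompt = (
    "Write a short, factually accurate weather report based only on the "
    "following JSON data. Do not add information that is not in the data.\n\n"
    f"{json.dumps(record, indent=2)}\n\nReport:"
)

# Generate and print the verbalization of the record.
print(generator(prompt, max_new_tokens=150)[0]["generated_text"])
```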
UFAL Speech Corpus of North Levantine Arabic 1.0 - Part 2
The corpus contains recordings of native speakers of North Levantine Arabic (apc), acquired during 2020, 2021, and 2023 in Prague, Paris, Kabardia, and St. Petersburg. The data provided in this repository corresponds to the test split of the dialectal Arabic to English shared task hosted at the 21st edition of the International Conference on Spoken Language Translation, i.e., IWSLT 2024.
ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
In this paper, we describe several reproductions of a human evaluation experiment measuring the quality of automatic dialogue summarization (Feng et al., 2021). We investigate the impact of the annotators' highest level of education, field of study, and native language on the evaluation of the informativeness of the summary. We find that the evaluation is relatively consistent regardless of these factors; the factor with the largest impact appears to be a prior specific background in natural language processing (as opposed to, e.g., a background in computer science). We also find that the experiment setup (asking for a single criterion vs. multiple criteria) may have an impact on the results.
Large Language Models: How they work and what they are good for
A brief explanation of how LLMs work and what they should and shouldn't be used for, including a showcase of failure cases and potential risks.
Large Language Models in Chatbot Applications
A short description of the use of LLMs in task-oriented dialogue, detailing potential problems and a proposed solution. The presentation included a short demonstration of our recent LLM-based dialogue system.
Similarity-Based Cluster Merging for Semantic Change Modeling
This paper describes our contribution to Subtask 1 of the AXOLOTL-24 Shared Task on unsupervised lexical semantic change modeling. In a joint task of word sense disambiguation and word sense induction on diachronic corpora, we significantly outperform the baseline by merging clusters of modern usage examples based on their similarities with the same historical word sense as well as their mutual similarities. We observe that multilingual sentence embeddings outperform language-specific ones in this task.
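A minimal sketch of the merging idea under assumed pre-computed embeddings: clusters of modern usage examples are merged when their centroids map to the same historical sense with high similarity, or when they are mutually similar enough. The function names, thresholds, and union-find bookkeeping are illustrative, not the authors' implementation.

```python
# Illustrative sketch (not the shared-task submission): similarity-based
# merging of clusters of modern usage examples.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_clusters(cluster_centroids, sense_embeddings, sense_thr=0.6, mutual_thr=0.8):
    """Map each cluster id to a merged group id (thresholds are hypothetical)."""
    n = len(cluster_centroids)
    parent = list(range(n))  # union-find structure over clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Nearest historical sense for each cluster centroid.
    nearest = [
        max(range(len(sense_embeddings)), key=lambda s: cosine(c, sense_embeddings[s]))
        for c in cluster_centroids
    ]
    for i in range(n):
        for j in range(i + 1, n):
            # Merge if both clusters are close to the same historical sense...
            same_sense = (
                nearest[i] == nearest[j]
                and cosine(cluster_centroids[i], sense_embeddings[nearest[i]]) > sense_thr
                and cosine(cluster_centroids[j], sense_embeddings[nearest[j]]) > sense_thr
            )
            # ...or if they are mutually similar enough.
            if same_sense or cosine(cluster_centroids[i], cluster_centroids[j]) > mutual_thr:
                union(i, j)
    return {i: find(i) for i in range(n)}
```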
Looking for LLMs' Limits in Dialogue & Data-to-text
An overview of our recent experiments aiming to find LLMs' limits in the tasks of dialogue modelling and data-to-text generation, including our survey of data leakage in LLMs.
Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
We introduce a simple approach that uses a large language model (LLM) to automatically implement a fully interpretable rule-based data-to-text system in pure Python. Experimental evaluation on the WebNLG dataset shows that such a system produces text of better quality (according to the BLEU and BLEURT metrics) than the same LLM prompted to produce outputs directly, and produces fewer hallucinations than a BART language model fine-tuned on the same data. Furthermore, at runtime, the approach generates text in a fraction of the processing time required by neural approaches, using only a single CPU.
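To make the idea concrete, here is an illustrative sketch (not the paper's released code) of what such an LLM-authored, purely rule-based generator could look like for a hypothetical WebNLG-style triple set; the LLM is only involved offline when writing the rules, and at runtime only this plain Python runs, with no model calls.

```python
# Illustrative example of an LLM-authored rule-based data-to-text generator.
# The predicates and templates below are hypothetical.
def triples_to_text(triples):
    """triples: list of (subject, predicate, object) tuples."""
    sentences = []
    for subj, pred, obj in triples:
        if pred == "birthPlace":
            sentences.append(f"{subj} was born in {obj}.")
        elif pred == "occupation":
            sentences.append(f"{subj} works as a {obj}.")
        else:
            # Fallback rule: verbalize unknown predicates generically.
            sentences.append(f"The {pred} of {subj} is {obj}.")
    return " ".join(sentences)

# Example usage on a single CPU, without any neural model:
print(triples_to_text([("Ada Lovelace", "birthPlace", "London"),
                       ("Ada Lovelace", "occupation", "mathematician")]))
```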
Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices
Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we conduct a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used, why they are chosen, and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details, and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field.
Automatic metrics are extensively used to evaluate natural language processing systems. However, there has been increasing focus on how they are used and reported by practitioners within the field. In this paper, we have conducted a survey on the use of automatic metrics, focusing particularly on natural language generation (NLG) tasks. We inspect which metrics are used as well as why they are chosen and how their use is reported. Our findings from this survey reveal significant shortcomings, including inappropriate metric usage, lack of implementation details and missing correlations with human judgements. We conclude with recommendations that we believe authors should follow to enable more rigour within the field