Charles University

Biblio at Institute of Formal and Applied Linguistics
    506 research outputs found

    EMMT: A simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios

    No full text
    We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio, and 4-electrode electroencephalogram (EEG) data of 43 participants. The objective was to collect cognitive signals from participants engaged in a number of language-intensive tasks involving different text-image stimuli settings when translating from English to Czech. Each participant was exposed to 32 text-image stimuli pairs and asked to (1) read the English sentence, (2) translate it into Czech, (3) consult the image, and (4) translate again, either updating or repeating the previous translation. The text stimuli consisted of 200 unique sentences with 616 unique words, coupled with 200 unique images as the visual stimuli. The recordings were collected over a two-week period, and all participants included in the study were Czech natives with strong English skills. Due to the nature of the tasks and the relatively large number of participants, the corpus is well suited for research in Translation Process Studies and Cognitive Sciences, among other disciplines.

    Towards Semantic Tagging of Segmented Holocaust Narratives

    No full text
    With the increasing loss of Holocaust witnesses, it is becoming more and more important to preserve their memories. Items of cultural heritage, including textual data such as diaries or transcripts of video interviews, are abundant. However, large amounts of this data are not annotated, which poses a significant obstacle for domain experts curating digitized information regarding the Holocaust. A solution to this problem is a natural language processing model that links text segments to a rich domain-specific ontology of subject terms to automatically tag documents for further processing. While we have not yet achieved a comprehensive solution, we show that even a simple model fine-tuned on a small dataset of spoken narratives is a promising first step and transfers its capabilities to written testimonies reasonably well.

    What do the eyes really see? An eye-tracking account of language processing

    No full text
    This experimental study investigates the translation process from English to Czech in a multimodal scenario using an eye tracker. We examine specific aspects of translating ambiguous and unambiguous sentences and, simultaneously, focus on the possible impact of visual information on the translation process. We show how mechanisms of visual search, as well as the presence and attention mechanisms involved in such translation processes, can be explored based on various eye-movement data; i.e., the cognitive mechanisms involved in reading original sentences and producing the corresponding translation are studied using a range of eye-tracking-specific metrics. Among other things, the paper demonstrates how the Stroop effect is visible in the experimental setup.

    When Multilingual Models Compete with Monolingual Domain-Specific Models in Clinical Question Answering

    No full text
    This paper explores the performance of general-domain multilingual models on the clinical Question Answering (QA) task, in order to gauge their potential for medical support in languages that do not benefit from the existence of clinically trained models. To improve model performance, we exploit multilingual data augmentation by translating an English clinical QA dataset into six other languages. We propose a translation pipeline that includes projecting the evidence (answers) into the target languages, and we thoroughly evaluate several multilingual models fine-tuned on the augmented data in both mono- and multilingual settings. We find that the translation itself and the subsequent QA experiments present challenges of varying difficulty across the languages. Finally, we compare the performance of multilingual models with pretrained medical domain-specific English models on the original clinical English test set. Contrary to expectations, we find that monolingual domain-specific pretraining is not always superior to general-domain multilingual pretraining. The source code is available at https://github.com/lanzv/Multilingual-emrQ
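The evidence-projection step can be sketched with a common marker-based technique: wrap the answer span in sentinel tokens the MT system is expected to preserve, translate, then read the span back out of the output. The `translate()` stub and the marker format below are illustrative assumptions, not the paper's exact pipeline.

```python
def mark_answer(context: str, start: int, end: int) -> str:
    """Wrap the answer span [start, end) in markers MT should preserve."""
    return context[:start] + "<a> " + context[start:end] + " </a>" + context[end:]

def recover_answer(translated: str) -> tuple[str, str]:
    """Strip the markers, returning (clean_context, projected_answer)."""
    pre, rest = translated.split("<a>", 1)
    answer, post = rest.split("</a>", 1)
    answer = answer.strip()
    return pre + answer + post, answer

def translate(text: str) -> str:
    """Placeholder MT system (identity); plug in any English-to-X model."""
    return text

context = "The patient was given aspirin for chest pain."
marked = mark_answer(context, 22, 29)  # span covering "aspirin"
clean, projected = recover_answer(translate(marked))
```

With a real MT model in place of the stub, the recovered span gives the answer's position in the target language without any separate word alignment step.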

    Detection of Meaningful Quotes in Holocaust Testimonies

    No full text
    Quotes from novels, dramas, or speeches can pique readers’ interest in engaging with the source medium and make their experience with this medium more memorable. The same concept might apply to narratives of Nazi persecution and incentivize readers to learn more about victims by reading the diaries and testimonies from which quotes have been extracted. In this paper, we train a sequence-to-sequence model on the extraction of book and movie quotes and evaluate how well its knowledge transfers to Holocaust testimonies. We manually annotate hundreds of automatically extracted text excerpts and assess their characteristics to answer the question "What makes a quote in this domain meaningful?" and whether this can be decided computationally.

    Hierarchical Classification of Propaganda Techniques in Slavic Texts in Hyperbolic Space

    No full text
    Classification problems can often be tackled by modeling label hierarchies with broader categories in a graph and solving the task via node classification. While recent advances have shown that hyperbolic space is more suitable than Euclidean space for learning graph representations, this concept has yet to be applied to text classification, where node features first need to be extracted from text embeddings. This contribution to the Slavic NLP 2025 shared task on the multi-label classification of persuasion techniques in parliamentary debates and social media posts is a prototype of such an architecture. We do not achieve state-of-the-art performance, but we outline the benefits of this hierarchical node classification approach and the advantages of hyperbolic graph embeddings.
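The appeal of hyperbolic space for label hierarchies comes from its geometry: distances in the Poincaré ball grow rapidly toward the boundary, giving tree-like structures room to spread. A minimal sketch of the distance function underlying such embeddings (this is the standard Poincaré-ball formula, not code from the shared-task system):

```python
import math

def poincare_distance(x, y):
    """Geodesic distance in the Poincaré ball model of hyperbolic space:
    d(x, y) = arcosh(1 + 2*||x-y||^2 / ((1-||x||^2) * (1-||y||^2)))."""
    sq_norm = lambda v: sum(c * c for c in v)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1 - sq_norm(x)) * (1 - sq_norm(y))
    return math.acosh(1 + 2 * diff / denom)
```

Points near the boundary end up far from everything else, which is why broad categories are typically embedded near the origin and fine-grained labels toward the edge.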

    CUNI-a at ArchEHR-QA 2025: Do we need Giant LLMs for Clinical QA?

    No full text
    In this paper, we present our submission to the ArchEHR-QA 2025 shared task, which focuses on answering patient questions based on excerpts from electronic health record (EHR) discharge summaries. Our approach identifies essential sentences relevant to a patient’s question using a combination of few-shot inference with the Med42-8B model, cosine similarity over clinical term embeddings, and the MedCPT cross-encoder relevance model. Then, concise answers are generated on the basis of these selected sentences. Despite not relying on large language models (LLMs) with tens of billions of parameters, our method achieves competitive results, demonstrating the potential of resource-efficient solutions for clinical NLP applications.
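The cosine-similarity component of such a sentence selector can be sketched as follows; the toy letter-count `embed()` stands in for the clinical term embeddings the system actually uses, and all names here are illustrative:

```python
import math

def embed(text: str) -> list[float]:
    """Toy letter-count embedding; a stand-in for real clinical embeddings."""
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_sentences(question: str, sentences: list[str], k: int = 2) -> list[str]:
    """Rank candidate note sentences by similarity to the patient question."""
    q = embed(question)
    return sorted(sentences, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]
```

In the submitted system this score would be one signal among several, combined with few-shot LLM judgments and a cross-encoder reranker.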

    Paragraph Retrieval for Enhanced Question Answering in Clinical Documents

    No full text
    Healthcare professionals often manually extract information from large clinical documents to address patient-related questions. The use of Natural Language Processing (NLP) techniques, particularly Question Answering (QA) models, is a promising direction for improving the efficiency of this process. However, document-level QA over large documents is often impractical or even infeasible for model training and inference. In this work, we address document-level QA on clinical reports with a two-step approach: first, the entire report is split into segments, and for a given question the most relevant segment is predicted by an NLP model; second, a QA model is applied to the question with the retrieved segment as context. We investigate the effectiveness of heading-based and naive paragraph segmentation approaches for various paragraph lengths on two subsets of the emrQA dataset. Our experiments reveal that the average paragraph length used as a segmentation parameter has no significant effect on performance in the overall document-level QA process: experiments that segment reports into shorter paragraphs perform similarly to those that use entire unsegmented reports. Surprisingly, naive uniform segmentation is sufficient, even though it is not based on prior knowledge of the clinical document's characteristics.
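The two-step approach described above can be sketched as retrieve-then-read. The word-overlap scorer below stands in for the trained relevance model, and the segmentation mirrors the naive uniform strategy; all names are illustrative:

```python
def uniform_segments(report: str, seg_len: int = 50) -> list[str]:
    """Naive uniform segmentation by word count, with no knowledge of
    headings or document structure."""
    words = report.split()
    return [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]

def overlap_score(question: str, segment: str) -> int:
    """Word-overlap scorer standing in for the neural retriever."""
    return len(set(question.lower().split()) & set(segment.lower().split()))

def retrieve_then_read(question: str, report: str, qa_model, seg_len: int = 50):
    """Step 1: pick the most relevant segment. Step 2: run QA on it."""
    segments = uniform_segments(report, seg_len)
    best = max(segments, key=lambda s: overlap_score(question, s))
    return qa_model(question, best)
```

The key design point tested in the paper is `seg_len`: the finding is that varying it (short paragraphs vs. whole reports) makes little difference to end-to-end QA quality.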

    Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

    No full text
    Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where models are iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI’s GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document the amount of data leaked to these models during the first year after the models’ release. We report that these models have been globally exposed to ∼4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project at https://leak-llm.github.io/, where other researchers can contribute to our efforts.

    Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems

    No full text
    We introduce a simple approach that uses a large language model (LLM) to automatically implement a fully interpretable, rule-based data-to-text system in pure Python. Experimental evaluation on the WebNLG dataset showed that such a constructed system produces text of better quality (according to the BLEU and BLEURT metrics) than the same LLM prompted to produce outputs directly, and produces fewer hallucinations than a BART language model fine-tuned on the same data. Furthermore, at runtime, the approach generates text in a fraction of the processing time required by neural approaches, using only a single CPU.
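A minimal instance of the kind of interpretable rule-based program the LLM is asked to produce might look like this; the templates are illustrative, not the rules actually generated in the paper's experiments:

```python
# One readable template rule per predicate, applied by plain string
# substitution; every output sentence is traceable to a single rule.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "occupation": "{s} worked as a {o}.",
}

def render(triples: list[tuple[str, str, str]]) -> str:
    """Verbalise (subject, predicate, object) triples rule by rule."""
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p, "{s} {p} {o}.")  # transparent fallback rule
        sentences.append(template.format(s=s, p=p, o=o))
    return " ".join(sentences)

text = render([("Alan Turing", "birthPlace", "London"),
               ("Alan Turing", "occupation", "mathematician")])
```

Because generation is ordinary template lookup rather than neural decoding, it runs on a single CPU and cannot hallucinate content absent from the input triples.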

    58 full texts
    506 metadata records
    Updated in last 30 days.
    Biblio at Institute of Formal and Applied Linguistics is based in Czechia