DocMarker
DocMarker is an annotation tool for creating training data for the text-to-form information retrieval NLP task. Say you have a free-form (possibly rich-text) document that contains information that should be filled out into some structured form. This tool lets you record and annotate this form-filling process.
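DocMarker's actual data model is not documented here; purely as an illustration of what "recording a form-filling process" might produce, a minimal annotation record could link each form field to the text span that supports its value (all names below are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class FieldAnnotation:
    """Links a span of the source text to the form field it fills."""
    field_id: str    # identifier of the target form field
    value: str       # value entered into the field
    char_start: int  # start offset of the supporting span in the text
    char_end: int    # end offset (exclusive)

@dataclass
class FormFillingRecord:
    """One annotated form-filling session over a single document."""
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, field_id, start, end):
        # record that text[start:end] was used to fill `field_id`
        self.annotations.append(
            FieldAnnotation(field_id, self.text[start:end], start, end))

doc = FormFillingRecord("Patient John Doe, age 42, was admitted on Monday.")
doc.annotate("patient_name", 8, 16)
doc.annotate("age", 22, 24)
```

A dataset of such records pairs each free-form text with the structured form values and their textual evidence, which is exactly the supervision a text-to-form model needs.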
Learning capabilities in Transformer Neural Networks
Although contemporary neural networks, inspired by biological neurons, have reached human-like performance on many tasks in recent years, their optimization (learning) process is still very far from the one observed in humans.
This thesis investigates various aspects of learning in current state-of-the-art Transformer neural networks, the dominant architecture in natural language processing.
Firstly, we measure the level of generalization in Transformers using several probing experiments based on the idea of adversarial evaluation. Secondly, we explore their potential for incremental learning when combined with regularization using the elastic weight consolidation approach.
Lastly, we propose a modular extension of the existing Transformer architecture enabling subnetwork selection conditioned on the intermediate hidden layer outputs and analyze the attributes of this network modularization.
We investigate our hypotheses mainly within the scope of neural machine translation and multilingual translation, showing the limitations of the original Transformer and of elastic weight consolidation regularization, while presenting promising results for the novel modular Transformer architecture.
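The elastic weight consolidation (EWC) regularization mentioned above adds a quadratic penalty that anchors parameters important for a previous task. The thesis's exact formulation is not reproduced here; a minimal NumPy sketch of the standard diagonal-Fisher EWC penalty:

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic weight consolidation: a quadratic penalty anchoring the
    current parameters to those learned on the previous task, weighted
    by the (diagonal) Fisher information of each parameter."""
    return 0.5 * lam * np.sum(fisher * (params - old_params) ** 2)

# toy check: the same drift costs more on a high-Fisher (important)
# dimension than on a low-Fisher (unimportant) one
theta_old = np.zeros(3)
fisher = np.array([10.0, 1.0, 0.1])
p_important = ewc_penalty(np.array([1.0, 0.0, 0.0]), theta_old, fisher)
p_unimportant = ewc_penalty(np.array([0.0, 0.0, 1.0]), theta_old, fisher)
```

In incremental learning, this penalty is added to the new task's loss so that parameters the old task relied on are changed least.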
Large Language Models: What Does "Large" Mean, and What Does "Language" Mean? (Velké jazykové modely: Co znamená velké a co jazykové?)
The talk introduced large language models: their training and application, and related research conducted at ÚFAL MFF UK.
Tackling Hallucinations in Neural Chart Summarization
Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we tackle the problem of hallucinations in neural chart summarization. Our analysis shows that the target side of chart summarization training datasets often contains additional information, leading to hallucinations. We propose a natural language inference (NLI) based method to preprocess the training data and show through human evaluation that our method significantly reduces hallucinations. We also found that shortening long-distance dependencies in the input sequence and adding chart-related information like title and legends improve the overall performance.
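The paper's specific NLI model and threshold are not reproduced here; a minimal sketch of the filtering idea, with a toy token-overlap scorer standing in for a real NLI entailment model (all names are illustrative):

```python
def filter_ungrounded(pairs, entail_score, threshold=0.5):
    """Drop (chart_input, summary) training pairs whose summary is not
    entailed by the linearised chart input, as judged by an NLI scorer."""
    kept = []
    for premise, hypothesis in pairs:
        if entail_score(premise, hypothesis) >= threshold:
            kept.append((premise, hypothesis))
    return kept

def toy_entails(premise, hypothesis):
    """Toy stand-in: fraction of hypothesis tokens present in the premise.
    A real system would call an NLI model here."""
    prem = set(premise.lower().split())
    hyp = set(hypothesis.lower().split())
    return len(hyp & prem) / max(len(hyp), 1)

pairs = [
    ("sales 2020 100 2021 150", "profits doubled"),  # not grounded in the input
    ("sales 2020 100 2021 150", "sales 2021 150"),   # grounded
]
kept = filter_ungrounded(pairs, toy_entails, threshold=0.9)
```

Training only on the `kept` pairs removes target-side content the input cannot support, which is the source of hallucinations the analysis identifies.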
MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
We present MooseNet, a trainable speech metric that predicts the listeners’ Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained only on 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
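Full PLDA is more involved than can be shown here; as a rough, hedged illustration of the idea of a generative model over SSL embeddings for MOS prediction, the sketch below fits one Gaussian per MOS bin with a shared diagonal covariance (an LDA-style simplification, not the paper's actual PLDA module) and predicts the expected MOS under the posterior:

```python
import numpy as np

def fit_gaussian_bins(embeddings, mos, n_bins=5):
    """Fit one Gaussian per rounded MOS bin over (SSL) embeddings, with a
    shared diagonal covariance -- a simplified stand-in for full PLDA."""
    bins = np.clip(np.round(mos).astype(int), 1, n_bins)
    means = {b: embeddings[bins == b].mean(axis=0) for b in np.unique(bins)}
    var = embeddings.var(axis=0) + 1e-6
    return means, var

def predict_mos(x, means, var):
    """Expected MOS under the posterior over bins (uniform prior)."""
    logp = {b: -0.5 * np.sum((x - m) ** 2 / var) for b, m in means.items()}
    mx = max(logp.values())
    w = {b: np.exp(l - mx) for b, l in logp.items()}
    z = sum(w.values())
    return sum(b * wi for b, wi in w.items()) / z

# toy data: embeddings of low-MOS and high-MOS utterances form two clusters
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
mos = np.array([1.0] * 20 + [5.0] * 20)
means, var = fit_gaussian_bins(emb, mos)
pred = predict_mos(np.full(4, 5.0), means, var)
```

The point of such a generative head is that it can be fitted closed-form from very few labelled utterances, which matches the paper's low-resource finding.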
TabGenie: A Toolkit for Table-to-Text Generation
Heterogeneity of data-to-text generation datasets limits the research on data-to-text generation systems. We present TabGenie, a toolkit which enables researchers to explore, preprocess, and analyze a variety of data-to-text generation datasets through the unified framework of table-to-text generation. In TabGenie, all inputs are represented as tables with associated metadata. The tables can be explored through a web interface, which also provides an interactive mode for debugging table-to-text generation, facilitates side-by-side comparison of generated system outputs, and allows easy exports for manual analysis. Furthermore, TabGenie is equipped with command line processing tools and Python bindings for unified dataset loading and processing. We release TabGenie as a PyPI package and provide its open-source code and a live demo at https://github.com/kasnerz/tabgenie.
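TabGenie's own Python API is not shown here; as a generic illustration of the unified "table plus metadata" framing, the sketch below linearizes a table into a single model input string, a common preprocessing step in table-to-text generation (the tag scheme is illustrative, not TabGenie's):

```python
def linearize_table(title, header, rows):
    """Flatten a table and its title metadata into one input string for a
    table-to-text model, using simple cell/row separator tags."""
    parts = [f"[TITLE] {title}"]
    for row in rows:
        for col, cell in zip(header, row):
            parts.append(f"[CELL] {col} : {cell}")
        parts.append("[ROW]")
    return " ".join(parts)

text = linearize_table(
    "Medal count",
    ["country", "gold"],
    [["CZE", 3], ["SVK", 1]],
)
```

Representing every dataset in one such tabular form is what lets a single toolkit load, preprocess, and compare otherwise heterogeneous data-to-text resources.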
Sustaining the European Language Grid: Towards the ELG Legal Entity
When preparing the European Language Grid EU project proposal and designing the overall concept of the platform, the need for drawing up a long-term sustainability plan was abundantly evident. Already in the phase of developing the proposal, the centrepiece of the sustainability plan was what we called the “ELG legal entity”, i.e., an independent organisation that would be able to take over operations, maintenance, extension and governance of the European Language Grid platform as well as managing and helping to coordinate its community. This chapter describes our current state of planning with regard to this legal entity. It explains the different options discussed and presents the different products specified, which can be offered by the legal entity in the medium to long run. We also describe which legal form the organisation will take and how it will ensure the sustainability of ELG.
Are Large Language Models All You Need for Task-Oriented Dialogue?
Instruction-finetuned large language models (LLMs) have recently gained huge popularity, thanks to their ability to interact with users through conversation. In this work, we aim to evaluate their ability to complete multi-turn tasks and interact with external databases in the context of established task-oriented dialogue benchmarks. We show that in explicit belief state tracking, LLMs underperform compared to specialized task-specific models. Nevertheless, they show some ability to guide the dialogue to a successful ending through their generated responses if they are provided with correct slot values. Furthermore, this ability improves with few-shot in-domain examples.
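The paper's actual prompts and models are not reproduced here; a minimal sketch of what explicit belief state tracking with an LLM looks like, with an injected stub standing in for a real instruction-tuned model (prompt wording and slot names are illustrative):

```python
import json

def track_belief_state(history, slots, llm):
    """Ask an LLM to extract the current belief state as JSON.
    `llm` is any callable mapping a prompt string to a completion string;
    a real system would wrap an instruction-tuned model here."""
    prompt = (
        "Extract the values of these slots from the dialogue, "
        f"as a JSON object (use null if unknown): {slots}\n"
        "Dialogue:\n" + "\n".join(history) + "\nJSON:"
    )
    return json.loads(llm(prompt))

def stub_llm(prompt):
    # stub for illustration only -- returns a fixed, well-formed answer
    return '{"area": "centre", "food": "italian", "pricerange": null}'

state = track_belief_state(
    ["User: I want an italian place in the centre."],
    ["area", "food", "pricerange"],
    stub_llm,
)
```

The extracted state is what a task-oriented system would pass to the external database query; the paper's finding is that specialized trackers still do this step more reliably than LLMs.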
Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity
Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility.
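Two of the operations mentioned above, outlier dimension removal and ZCA whitening, along with a standard anisotropy measure (mean pairwise cosine similarity), can be sketched in a few lines of NumPy; this is a generic illustration, not the paper's released code:

```python
import numpy as np

def avg_cosine(X):
    """Mean pairwise cosine similarity; values near 1 mean the space is
    highly anisotropic (all vectors point in similar directions)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    s = Xn @ Xn.T
    n = len(X)
    return (s.sum() - n) / (n * (n - 1))

def remove_outlier_dims(X, k=1):
    """Zero out the k dimensions with the largest mean magnitude."""
    out = np.argsort(-np.abs(X.mean(axis=0)))[:k]
    X = X.copy()
    X[:, out] = 0.0
    return X

def zca_whiten(X, eps=1e-5):
    """ZCA whitening: decorrelate and rescale dimensions while staying as
    close as possible to the original basis."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

# toy embeddings where one "outlier" dimension dominates every vector,
# making the space anisotropic
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
X[:, 0] += 20.0
before = avg_cosine(X)
after = avg_cosine(zca_whiten(X))
```

On this toy data, `before` is close to 1 (every pair of vectors shares the dominant dimension) while `after` is near 0, i.e. the whitened space is far more isotropic; zeroing the outlier dimension with `remove_outlier_dims` has a similar effect.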
UFAL-ULD at BLP-2023 Task 2 Sentiment Classification in Bangla Text
In this paper, we present the UFAL-ULD team's system for the BLP Shared Task 2: Sentiment Analysis of Bangla Social Media Posts. Task 2 involves classifying text into Positive, Negative, or Neutral sentiments. As a part of this task, we conducted a series of experiments with several pre-trained sequence classification models -- XLM-RoBERTa, BanglaBERT, Bangla BERT Base and Multilingual BERT. Among these, our best-performing model was based on the XLM-RoBERTa-base architecture, which outperforms baseline models. Our system was ranked 19th among the 30 teams that participated in the task.
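The team's fine-tuning setup is not reproduced here; as a small, runnable sketch of the final decoding step in such a three-way sentiment classifier, the code below maps classifier logits to the task's labels (the toy logits stand in for real XLM-RoBERTa outputs):

```python
import numpy as np

LABELS = ["Negative", "Neutral", "Positive"]

def predict_sentiment(logits):
    """Map a batch of 3-way classification logits to sentiment labels."""
    return [LABELS[i] for i in np.argmax(logits, axis=-1)]

# toy logits such as a fine-tuned 3-way classification head would produce
logits = np.array([[2.1, 0.3, -1.0],
                   [-0.5, 0.1, 1.9]])
preds = predict_sentiment(logits)
```

In a full system, the logits would come from a pre-trained sequence classification model (e.g. an XLM-RoBERTa-base checkpoint with a 3-label head) fine-tuned on the shared-task training data.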