Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling
We present GEST -- a new dataset for measuring gender-stereotypical reasoning
in masked LMs and English-to-X machine translation systems. GEST contains
samples that are compatible with 9 Slavic languages and English for 16 gender
stereotypes about men and women (e.g., Women are beautiful, Men are leaders).
The definition of said stereotypes was informed by gender experts. We used GEST
to evaluate 11 masked LMs and 4 machine translation systems. We discovered
significant and consistent amounts of stereotypical reasoning in almost all the
evaluated models and languages.
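To give a concrete sense of how masked-LM probing of this kind works, the sketch below compares the scores a masked LM assigns to gendered pronouns in a stereotype-laden template. This is a generic illustration, not the exact GEST protocol; the model name and the template are assumptions:

from transformers import pipeline

# Generic fill-mask probe: compare the scores a masked LM assigns to
# gendered pronouns in a stereotype-laden template. The model and the
# template are illustrative assumptions, not the GEST setup.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

template = "[MASK] is a natural leader."
for prediction in unmasker(template, targets=["he", "she"]):
    print(prediction["token_str"], prediction["score"])

A consistent skew toward one pronoun across many such templates is the kind of signal a stereotype benchmark aggregates.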
Disinformation Capabilities of Large Language Models
Automated disinformation generation is often listed as one of the risks of
large language models (LLMs). The theoretical ability to flood the information
space with disinformation content might have dramatic consequences for
democratic societies around the world. This paper presents a comprehensive
study of the disinformation capabilities of the current generation of LLMs to
generate false news articles in English. In our study, we evaluated the capabilities of 10 LLMs using 20 disinformation narratives. We evaluated several aspects of the LLMs: how good they are at generating news articles, how strongly they tend to agree or disagree with the disinformation narratives, how often they generate safety warnings, etc. We also evaluated the abilities of detection models to detect these articles as LLM-generated. We conclude that LLMs are able to generate convincing news articles that agree with dangerous disinformation narratives.
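One common family of machine-generated-text detectors scores a text by how predictable it is under a reference language model, since LLM output tends to have lower perplexity than human writing. The sketch below shows such a baseline with GPT-2 as an assumed reference model; it is a generic illustration, not necessarily one of the detection models evaluated in the paper:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Perplexity under a reference LM as a crude machine-generated-text signal.
# GPT-2 as the reference model is an illustrative assumption.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Lower perplexity means the text is more predictable to the reference LM,
# which is often (though not always) a hint of machine generation.
print(perplexity("The quick brown fox jumps over the lazy dog."))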
Cross-Lingual Learning With Distributed Representations
Cross-lingual learning can help bring state-of-the-art deep learning solutions to smaller languages. These languages generally lack the resources to train advanced neural networks. By transferring knowledge across languages, we can improve results on various NLP tasks.
Transfer Learning Between Natural Languages (Učenie s prenosom medzi prirodzenými jazykmi)
Deep learning currently appears to be a very promising approach to many natural language processing tasks. So far, however, this success has been seen mainly in languages that have enough resources to train complex neural networks. Smaller languages with fewer resources struggle to take advantage of these new techniques, and the gap between them and resource-rich languages keeps widening. In our work, we explore how to narrow this gap by transferring learned information from one language to another. The main idea is to train deep neural networks in a multilingual mode so that the model learns to use knowledge from one language for inputs in other languages. We designed and carried out an experiment transferring word-level sentiment information through a shared space of distributed vectors. In this experiment, we achieved results comparable to manually created German sentiment lexicons without using any German sentiment data.
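The core mechanism, training on one language and predicting in another through a shared embedding space, can be sketched as follows. The toy vectors below stand in for pre-aligned cross-lingual word embeddings; the vector source, dimensionality, and lexicon entries are all hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Sentiment transfer through a shared cross-lingual vector space.
# The random vectors are placeholders for real pre-aligned embeddings,
# so the predictions here are meaningless; the point is the data flow:
# train on English labels, predict on German words.
rng = np.random.default_rng(0)
dim = 50
en_vectors = {w: rng.normal(size=dim) for w in ["good", "great", "bad", "awful"]}
de_vectors = {w: rng.normal(size=dim) for w in ["gut", "schlecht"]}

# An English sentiment lexicon provides the only labeled data.
en_lexicon = {"good": 1, "great": 1, "bad": 0, "awful": 0}
X = np.stack([en_vectors[w] for w in en_lexicon])
y = np.array([en_lexicon[w] for w in en_lexicon])

clf = LogisticRegression().fit(X, y)

# Because both languages live in one space, the classifier applies
# directly to German words with no German sentiment data.
for word, vec in de_vectors.items():
    print(word, clf.predict(vec.reshape(1, -1))[0])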
Report on the Current State of Societal Biases in Slovak AI
This report serves as the concluding report of our project Societal Biases in Slovak AI, which ran from November 2022 to October 2023. The goal of our project was to understand how gender biases in particular affect Slovak AI systems, but also to raise awareness of this issue in both the expert community and the general public, as it affects virtually everyone using modern communication technologies. We aimed to approach the issue from an interdisciplinary point of view: the team included people with technical expertise in AI and natural language processing, as well as AI ethics experts, social scientists, translators, and gender experts.

The main contribution of our project is the evaluation of multiple types of AI systems, namely English-to-Slovak machine translation systems, Slovak masked language models, and Slovak speech-to-text systems. We proposed and implemented an evaluation methodology and observed whether these systems exhibit biased behavior with respect to gender. Each experiment targeted a specific type of biased behavior, such as male-as-norm behavior, stereotypical thinking, or non-equitable performance. Worryingly, we were able to discover some sort of problematic behavior in all the models taken into consideration.

Contents of the report
- Chapter 1 introduces the topic, our project, and the role of this report.
- Chapter 2 provides an executive summary of the report and presents the main findings and conclusions of the project. It can be read to ascertain the main outcomes of our project at a glance.
- Chapter 3 introduces the issue of fairness in AI and provides a brief theoretical background for our research. It introduces and discusses terms such as fairness, bias, equality, and algorithmic harms.
- Chapter 4 is an invited chapter written by gender experts Mariana Szapuová and Janka Kottulová. It discusses what gender stereotypes are and how they impact our society and language.
- Chapter 5 describes the experimental research we conducted. It documents how we evaluated the presence of gender biases in various Slovak AI systems, presents the results, and discusses the implications of our findings.
- Chapter 6 documents a data ethics assessment performed during the project to ensure that our research on sensitive issues is as ethical as possible, taking into consideration the needs and points of view of all stakeholders.

The project is co-funded by the U.S. Embassy Bratislava.
Multilingual Previously Fact-Checked Claim Retrieval
Fact-checkers are often hampered by the sheer amount of online content that
needs to be fact-checked. NLP can help them by retrieving already existing
fact-checks relevant to the content being investigated. This paper introduces a
new multilingual dataset -- MultiClaim -- for previously fact-checked claim
retrieval. We collected 28k posts in 27 languages from social media, 206k
fact-checks in 39 languages written by professional fact-checkers, as well as
31k connections between these two groups. This is the most extensive and the
most linguistically diverse dataset of this kind to date. We evaluated how
different unsupervised methods fare on this dataset and its various dimensions.
We show that evaluating such a diverse dataset has its complexities and proper
care needs to be taken before interpreting the results. We also evaluated a
supervised fine-tuning approach, which improves significantly upon the unsupervised methods.
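A typical unsupervised baseline for this task embeds both posts and fact-checked claims with a multilingual sentence encoder and ranks claims by cosine similarity. A minimal sketch, where the encoder choice and the example texts are assumptions rather than the exact setup from the paper:

from sentence_transformers import SentenceTransformer, util

# Unsupervised previously-fact-checked-claim retrieval: embed both sides
# with a multilingual encoder and rank claims by cosine similarity.
# Model name and texts are illustrative assumptions.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

posts = ["Hospitals are empty right now, the pandemic is fake."]
claims = [
    "Hospitals are overwhelmed with COVID-19 patients.",
    "Video shows an empty hospital, proving the pandemic is a hoax.",
]

post_emb = model.encode(posts, convert_to_tensor=True)
claim_emb = model.encode(claims, convert_to_tensor=True)

scores = util.cos_sim(post_emb, claim_emb)  # shape: (num_posts, num_claims)
best = scores[0].argmax().item()
print(claims[best], scores[0][best].item())

The supervised variant mentioned above would fine-tune such an encoder on the annotated post-claim pairs.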
Hate Speech Operationalization: A Preliminary Examination of Hate Speech Indicators and Their Structure
Hate speech should be tackled and prosecuted based on how it is operationalized. However, the existing theoretical definitions of hate speech are not sufficiently fleshed out or easily operable. To overcome this inadequacy, and with the help of interdisciplinary experts, we propose an empirical definition of hate speech by providing a list of 10 hate speech indicators and the rationale behind them (the indicators refer to specific, observable, and measurable characteristics that offer a practical definition of hate speech). A preliminary exploratory examination of the structure of hate speech, focused on comments related to migrants (one of the most reported grounds of hate speech), revealed that two indicators in particular, denial of human rights and promoting violent behavior, occupy a central role in the network of indicators. Furthermore, we discuss the practical implications of the proposed hate speech indicators, especially (semi-)automatic detection using the latest natural language processing (NLP) and machine learning (ML) methods. Having a set of quantifiable indicators could benefit researchers, human rights activists, educators, analysts, and regulators by providing them with a pragmatic approach to hate speech assessment and detection.
MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval
MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval is a dataset that can be used to train and test models for combating disinformation. The dataset consists of 206k claims fact-checked by professional fact-checkers and 28k social media posts gathered from the wild. Each social media post has at least one claim assigned. The main idea is to develop information retrieval models that will assign appropriate claims to all the posts.
GitHub repository: https://github.com/kinit-sk/multiclaim
Contents
fact_check_post_mapping.csv - Mapping between fact-checks and social media posts:
- fact_check_id
- post_id

fact_checks.csv - Data about fact-checks:
- fact_check_id
- claim - the translated text (see below) of the fact-check claim
- instances - instances of the fact-check: a list of timestamps and URLs
- title - the translated text (see below) of the fact-check title

posts.csv - Data about social media posts:
- post_id
- instances - instances of the post: a list of timestamps and the social media platforms they appeared on
- ocr - a list of translated texts (see below) of the OCR transcripts of the images attached to the post
- verdicts - a list of verdicts attached by Meta (e.g., False information)
- text - the translated text (see below) of the text written by the user
What is a translated text?
A tuple of the original text, its English translation, and the detected languages. For example, the sample below contains an original Croatian text, its English translation, and the predicted language composition (hbs = Serbo-Croatian):
(
    '"...bolnice su pune? tišina, muk...upravo sada, bolnica Rebro..tragično smešno',
    '"...hospitals are full? silence, silence... right now, Rebro hospital... tragically funny',
    [('hbs', 1.0)]
)
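Since these tuples are stored as strings in the CSV files, they can be parsed back with ast.literal_eval. A minimal sketch, under the assumption that the fields are serialized as Python literals, as the sample above suggests:

import ast
import pandas as pd

# Parse the "translated text" tuples, which are stored as Python-literal
# strings. Column names follow the field list above; the serialization
# format is an assumption based on the sample shown.
fact_checks = pd.read_csv("fact_checks.csv")

def parse_translated(field: str):
    original, english, languages = ast.literal_eval(field)
    return original, english, languages

original, english, languages = parse_translated(fact_checks.loc[0, "claim"])
print(english, languages)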
More details TBD.
MULTITuDE
MULTITuDE is a benchmark dataset for multilingual machine-generated text detection, described in the EMNLP 2023 conference paper (https://arxiv.org/abs/2310.13606). It consists of 7992 human-written news texts in 11 languages subsampled from MassiveSumm (https://github.com/danielvarab/massive-summ), accompanied by 66089 texts generated by 8 large language models (prompted with the headlines of the news articles). The creation process and scripts for replication/extension are located in a GitHub repository: https://github.com/kinit-sk/mgt-detection-benchmark

If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

Fields
The dataset has the following fields:
- text - a text sample
- label - 0 for human-written text, 1 for machine-generated text
- multi_label - a string identifying the large language model that generated the text, or the string "human" for a human-written text
- split - a string identifying the train or test split of the dataset, for training and evaluation respectively
- language - the ISO 639-1 language code identifying the language of the given text
- length - word count of the given text
- source - a string identifying the source dataset / news medium of the given text

Statistics (number of samples)

Splits:
- train - 44786
- test - 29295

Binary labels:
- 0 - 7992
- 1 - 66089

Multiclass labels:
- gpt-3.5-turbo - 8300
- gpt-4 - 8300
- text-davinci-003 - 8297
- alpaca-lora-30b - 8290
- vicuna-13b - 8287
- opt-66b - 8229
- llama-65b - 8229
- opt-iml-max-1.3b - 8157
- human - 7992

Languages:
- English (en) - 29460 (train + test)
- Spanish (es) - 11586 (train + test)
- Russian (ru) - 11578 (train + test)
- Dutch (nl) - 2695 (test)
- Catalan (ca) - 2691 (test)
- Czech (cs) - 2689 (test)
- German (de) - 2685 (test)
- Chinese (zh) - 2683 (test)
- Portuguese (pt) - 2673 (test)
- Arabic (ar) - 2673 (test)
- Ukrainian (uk) - 2668 (test)
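Given these fields, a simple supervised detector can be trained on the provided split. The sketch below uses TF-IDF features and logistic regression; the file name and the classifier choice are illustrative assumptions, not the benchmark's reference detectors:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Baseline binary detector using the 'split', 'text', and 'label' fields.
# "multitude.csv" is a placeholder for however the dataset is stored locally.
df = pd.read_csv("multitude.csv")
train, test = df[df["split"] == "train"], df[df["split"] == "test"]

vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print("test accuracy:", accuracy_score(test["label"], clf.predict(X_test)))

Note that several test languages never appear in the train split, so this setup also probes how well a detector generalizes across languages.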