6 research outputs found

    The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages

    Instruction-tuned large language models (LLMs), such as ChatGPT, demonstrate remarkable performance in a wide range of tasks. Despite numerous recent studies that examine the performance of instruction-tuned LLMs on various NLP benchmarks, there remains a lack of comprehensive investigation into their ability to understand cross-lingual sociopragmatic meaning (SM), i.e., meaning embedded within social and interactive contexts. This deficiency arises partly from SM not being adequately represented in any of the existing benchmarks. To address this gap, we present SPARROW, an extensive multilingual benchmark specifically designed for SM understanding. SPARROW comprises 169 datasets covering 13 task types across six primary categories (e.g., anti-social language detection, emotion recognition). SPARROW datasets encompass 64 different languages originating from 12 language families representing 16 writing scripts. We evaluate the performance of various multilingual pretrained language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT) on SPARROW through fine-tuning, zero-shot, and/or few-shot learning. Our comprehensive analysis reveals that existing open-source instruction-tuned LLMs still struggle to understand SM across various languages, performing close to a random baseline in some cases. We also find that although ChatGPT outperforms many LLMs, it still falls behind task-specific finetuned models with a gap of 12.19 SPARROW score. Our benchmark is available at: https://github.com/UBC-NLP/SPARROW. Comment: Accepted by EMNLP 2023 Main conference

    On the Detection of False Information: From Rumors to Fake News

    Thesis by compendium. In recent years, the development of social media and online news agencies has brought several challenges and threats to the Web. These threats have drawn the attention of the Natural Language Processing (NLP) research community, as they are polluting online social media platforms. One example of these threats is false information, in which false, inaccurate, or deceptive information is spread and shared by online users. False information is not limited to verifiable information; it also includes information that is used for harmful purposes. In addition, one of the challenges that researchers face is the massive number of users on social media platforms, where detecting false information spreaders is not an easy job. Previous work on limiting or studying the detection of false information has focused on understanding the language of false information from a linguistic perspective. In the case of verifiable information, approaches have been proposed in a monolingual setting. Moreover, detecting the sources or the spreaders of false information in social media has not been investigated much. In this thesis we study false information from several aspects. First, since previous work focused on studying false information in a monolingual setting, in this thesis we study false information in a cross-lingual one. We propose different cross-lingual approaches and compare them to a set of monolingual baselines. We also provide systematic studies of the evaluation results of our approaches for better understanding. Second, we noticed that the role of affective information had not been investigated in depth. Therefore, the second part of our research work studies the role of affective information in false information and shows how the authors of false content use it to manipulate the reader. Here, we investigate several types of false information to understand the correlation between affective information and each type (Propaganda, Hoax, Clickbait, Rumor, and Satire). Last but not least, in an attempt to limit its spread, we also address the problem of detecting false information spreaders in social media. In this research direction, we focus on exploiting several text-based features extracted from the online profile messages of those spreaders. We study different feature sets that have the potential to help distinguish false information spreaders from fact checkers. Ghanem, BHH. (2020). On the Detection of False Information: From Rumors to Fake News [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/158570

    Arabic Dialect Texts Classification

    This study investigates how to classify Arabic dialects in text by extracting features which show the differences between dialects. There has been a lack of research on the classification of Arabic dialect texts, in comparison to English and some other languages, due to the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and some other languages. What is more, there is an increasing use of Arabic dialects in social media, so this text is now considered quite appropriate as a medium of communication and as a source for a corpus. We collected tweets from Twitter, comments from Facebook and online newspapers from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. The research sought to: 1) create a dataset of Arabic dialect texts to use in training and testing the classification system, 2) find appropriate features to classify Arabic dialects: lexical (word and multi-word-unit) and grammatical variation across dialects, 3) build a more sophisticated filter to extract features from Arabic-character written dialect text files. In this thesis, the first part describes the research motivation to show the reason for choosing Arabic dialects as a research topic. The second part presents some background information about the Arabic language and its dialects, and the literature review shows previous research on this subject. The research methodology part shows the initial experiment to classify Arabic dialects. The results of this experiment showed the need to create an Arabic dialect text corpus by exploring Twitter and online newspapers. The corpus was used to train the ensemble classifier, and to improve the accuracy of classification the corpus was extended by collecting tweets from Twitter based on spatial coordinate points and comments from Facebook posts. The corpus was annotated with dialect labels and used in automatic dialect classification experiments. The last part of this thesis presents the results of classification, conclusions and future work.

    Development of the Arabic Loria Automatic Speech Recognition system (ALASR) and its evaluation for Algerian dialect

    This paper addresses the development of an Automatic Speech Recognition system for Modern Standard Arabic (MSA) and its extension to Algerian dialect. Algerian dialect is very different from Arabic dialects of the Middle-East, since it is highly influenced by the French language. In this article, we start by presenting the new automatic speech recognition system named ALASR (Arabic Loria Automatic Speech Recognition). The acoustic model of ALASR is based on a DNN approach and the language model is a classical n-gram. Several options are investigated in this paper to find the best combination of models and parameters. ALASR achieves good results for MSA in terms of WER (14.02%), but it completely collapses on an Algerian dialect data set of 70 minutes (a WER of 89%). In order to take into account the impact of the French language on the Algerian dialect, we combine in ALASR two acoustic models, the original one (MSA) and a French one trained on the ESTER corpus. This solution has been adopted because no transcribed speech data for Algerian dialect are available. This combination leads to a substantial absolute reduction of the word error rate of 24%. © 2017 The Authors. Published by Elsevier B.V. Peer-review under responsibility of the scientific committee of the 3rd International Conference on Arabic Computational Linguistics.