LLM-powered Data Augmentation for Enhanced Cross-lingual Performance
This paper explores the potential of leveraging Large Language Models (LLMs)
for data augmentation in multilingual commonsense reasoning datasets where the
available training data is extremely limited. To achieve this, we utilise
several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment
three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate
the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR,
using the synthesised data. We compare the performance of training with data
generated in English and target languages, as well as translated
English-generated data, revealing the overall advantages of incorporating data
generated by LLMs, for example a notable improvement of 13.4 accuracy points in
the best case. Furthermore, we conduct a human evaluation by asking native speakers to
assess the naturalness and logical coherence of the generated examples across
different languages. The results of the evaluation indicate that LLMs such as
ChatGPT and GPT-4 excel at producing natural and coherent text in most
languages; however, they struggle to generate meaningful text in certain
languages like Tamil. We also observe that ChatGPT falls short in generating
plausible alternatives compared to the original dataset, whereas examples from
GPT-4 exhibit competitive logical consistency.
Comment: EMNLP 2023 Main Conference
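As a concrete illustration of the augmentation step, below is a minimal sketch that prompts an LLM for new XCOPA-style examples through the OpenAI API; the prompt wording, the JSON schema, and the choice of Swahili are illustrative assumptions, not the prompts or settings used in the paper.

```python
# Minimal sketch of LLM-based augmentation for XCOPA-style examples.
# The prompt wording and JSON schema are illustrative assumptions,
# not the prompts used in the paper.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Generate 5 new commonsense causal reasoning examples in Swahili, "
    "in the style of XCOPA. Return a JSON list of objects with keys "
    "'premise', 'choice1', 'choice2', 'question' ('cause' or 'effect') "
    "and 'label' (0 or 1)."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0.7,
)

# Assumes the model returns valid JSON; production code would validate this.
augmented = json.loads(response.choices[0].message.content)
print(f"Generated {len(augmented)} synthetic examples")
# The synthetic examples can then be mixed with the original training split
# before fine-tuning mBERT or XLM-R.
```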
Evaluation of Fake News Detection with Knowledge-Enhanced Language Models
Recent advances in fake news detection have exploited the success of
large-scale pre-trained language models (PLMs). The predominant
state-of-the-art approaches are based on fine-tuning PLMs on labelled fake news
datasets. However, large-scale PLMs are generally not trained on structured
factual data and hence may not possess priors that are grounded in factually
accurate knowledge. The use of existing knowledge bases (KBs) with rich
human-curated factual information has thus the potential to make fake news
detection more effective and robust. In this paper, we investigate the impact
of knowledge integration into PLMs for fake news detection. We study several
state-of-the-art approaches for knowledge integration, mostly using Wikidata as
KB, on two popular fake news datasets - LIAR, a politics-based dataset, and
COVID-19, a dataset of messages posted on social media relating to the COVID-19
pandemic. Our experiments show that knowledge-enhanced models can significantly
improve fake news detection on LIAR where the KB is relevant and up-to-date.
The mixed results on COVID-19 highlight the reliance on stylistic features and
the importance of domain-specific and current KBs.
Comment: To appear in Proceedings of the 16th International AAAI Conference on Web and Social Media (AAAI ICWSM-2022)
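For intuition, the sketch below shows one very simple form of knowledge integration: retrieving short entity descriptions from Wikidata and appending them to a claim before feeding a PLM-based classifier. The claim, entity mentions, and base model are placeholders, and this is an illustration of the general idea rather than a specific integration approach evaluated in the paper.

```python
# Minimal sketch: append Wikidata entity descriptions to a claim before
# classification with a PLM. Claim, mentions, and model are placeholders.
import requests
from transformers import AutoTokenizer, AutoModelForSequenceClassification

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def entity_description(mention: str) -> str:
    """Look up a surface mention on Wikidata and return its short description."""
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": "en",
        "format": "json",
        "limit": 1,
    }
    hits = requests.get(WIKIDATA_API, params=params, timeout=10).json()["search"]
    return hits[0].get("description", "") if hits else ""

claim = "The COVID-19 vaccine alters human DNA."
mentions = ["COVID-19 vaccine", "DNA"]  # in practice produced by an entity linker
knowledge = " ".join(entity_description(m) for m in mentions)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Encode the claim and retrieved knowledge as a sentence pair.
inputs = tokenizer(claim, knowledge, truncation=True, return_tensors="pt")
logits = model(**inputs).logits  # fine-tuning on LIAR / COVID-19 labels still required
```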
Parameter-Efficient Multilingual Summarisation: An Empirical Study
With the increasing prevalence of Large Language Models, traditional full
fine-tuning approaches face growing challenges, especially in memory-intensive
tasks. This paper investigates the potential of Parameter-Efficient
Fine-Tuning, focusing on Low-Rank Adaptation (LoRA), for complex and
under-explored multilingual summarisation tasks. We conduct an extensive study
across different data availability scenarios, including full-data, low-data,
and cross-lingual transfer, leveraging models of different sizes. Our findings
reveal that LoRA lags behind full fine-tuning when trained with full data;
however, it excels in low-data scenarios and cross-lingual transfer.
Interestingly, as models scale up, the performance gap between LoRA and full
fine-tuning diminishes. Additionally, we investigate effective strategies for
few-shot cross-lingual transfer, finding that continued LoRA tuning outperforms
both full fine-tuning and dynamic composition of language-specific LoRA modules.
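The sketch below shows what LoRA-based parameter-efficient fine-tuning of a multilingual summarisation model can look like with the Hugging Face peft library; the base checkpoint and hyperparameters are illustrative assumptions, not the configuration reported in the study.

```python
# Minimal sketch of LoRA for multilingual summarisation with peft.
# Base model and hyperparameters are illustrative, not the paper's setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base_model = "google/mt5-small"  # assumption: any multilingual seq2seq model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in mT5
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# Training then proceeds as usual (e.g. with transformers.Seq2SeqTrainer);
# for few-shot cross-lingual transfer, the same adapters can be tuned further
# on a handful of target-language examples ("continued LoRA tuning").
```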
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Large language models (LLMs) have demonstrated remarkable capability to
generate fluent responses to a wide variety of user queries, but this has also
resulted in concerns regarding the potential misuse of such texts in
journalism, education, and academia. In this work, we aim to develop
automatic systems to identify machine-generated text and to detect potential
misuse. We first introduce a large-scale benchmark, M4, which is a
multi-generator, multi-domain, and multi-lingual corpus for machine-generated
text detection. Using the dataset, we experiment with a number of methods and
we show that it is challenging for detectors to generalize well on unseen
examples if they are either from different domains or are generated by
different large language models. In such cases, detectors tend to misclassify
machine-generated text as human-written. These results show that the problem is
far from solved and there is a lot of room for improvement. We believe that our
dataset M4, which covers different generators, domains and languages, will
enable future research towards more robust approaches for this pressing
societal problem. The M4 dataset is available at
https://github.com/mbzuai-nlp/M4.
Comment: 11 pages
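For context, a minimal black-box detection baseline can be as simple as a TF-IDF plus logistic regression classifier over human versus machine-generated text, as sketched below; the toy texts and labels are placeholders, and real experiments would train on the M4 corpus from the repository above rather than this toy data.

```python
# Minimal sketch of a black-box detection baseline: TF-IDF + logistic
# regression over human vs. machine-generated text. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I grabbed a coffee on the way to the lab and we argued about the plots.",
    "The results demonstrate a significant improvement across all evaluated settings.",
]
labels = [0, 1]  # 0 = human-written, 1 = machine-generated (placeholder labels)

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

print(detector.predict(["This essay was produced by a large language model."]))
# The generalisation gap described above appears when such detectors are
# evaluated on unseen domains or on text from unseen generator LLMs.
```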