A Dataset and Strong Baselines for Classification of Czech News Texts
Pre-trained models for Czech Natural Language Processing are often evaluated
on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple
classification tasks such as sentiment classification or article classification
from a single news source. As an alternative, we present
CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech
classification datasets, composed of news articles from various sources
spanning over twenty years, which allows a more rigorous evaluation of such
models. We define four classification tasks: news source, news category,
inferred author's gender, and day of the week. To verify the task difficulty,
we conducted a human evaluation, which revealed that human performance lags
behind strong machine-learning baselines built upon pre-trained transformer
models. Furthermore, we show that a language-specific pre-trained encoder
outperforms selected commercially available large-scale generative language
models.
Comment: 12 pages, Accepted to Text, Speech and Dialogue (TSD) 2023
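To make the encoder-baseline setup concrete, here is a minimal sketch of fine-tuning a Czech pre-trained transformer on one CZE-NEC task (news-source classification). The dataset identifier and the RobeCzech checkpoint are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch only: fine-tune a Czech encoder on one CZE-NEC task.
# "CZE-NEC" is a placeholder dataset id; substitute the official release.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("CZE-NEC")  # hypothetical id; assumes a ClassLabel "label"
tok = AutoTokenizer.from_pretrained("ufal/robeczech-base")  # assumed encoder

def encode(batch):
    return tok(batch["text"], truncation=True, max_length=512)

ds = ds.map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "ufal/robeczech-base",
    num_labels=ds["train"].features["label"].num_classes,
)
args = TrainingArguments("cze-nec-news-source", num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["validation"], tokenizer=tok).train()
```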
An Ensemble Approach to Question Classification: Integrating Electra Transformer, GloVe, and LSTM
This paper introduces a novel ensemble approach for question classification
using state-of-the-art models -- Electra, GloVe, and LSTM. The proposed model
is trained and evaluated on the TREC dataset, a well-established benchmark for
question classification tasks. The ensemble model combines the strengths of
Electra, a transformer-based model for language understanding; GloVe, global
vectors for word representation; and LSTM, a recurrent neural network variant,
providing a robust and efficient solution for question classification.
Extensive experiments were carried out to compare the performance of the
proposed ensemble approach with other cutting-edge models, such as BERT,
RoBERTa, and DistilBERT. Our results demonstrate that the ensemble model
outperforms these models across all evaluation metrics, achieving an accuracy
of 0.8 on the test set. These findings underscore the effectiveness of the
ensemble approach in enhancing the performance of question classification
tasks, and invite further exploration of ensemble methods in natural language
processing.
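One plausible reading of this architecture is sketched below: an Electra encoder and a GloVe+LSTM branch whose features are concatenated before a shared classification head. The fusion strategy, dimensions, and checkpoint name are assumptions; the paper may instead combine per-model predictions, for example by voting.

```python
# Sketch of a feature-level ensemble: Electra [CLS] features concatenated
# with the final states of a GloVe-initialized BiLSTM. Sizes and the
# checkpoint are assumptions; num_classes=6 matches TREC's coarse labels.
import torch
import torch.nn as nn
from transformers import ElectraModel

class ElectraGloveLstm(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, num_classes: int = 6,
                 lstm_hidden: int = 128):
        super().__init__()
        self.electra = ElectraModel.from_pretrained(
            "google/electra-base-discriminator")
        self.glove = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.lstm = nn.LSTM(glove_weights.size(1), lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(
            self.electra.config.hidden_size + 2 * lstm_hidden, num_classes)

    def forward(self, electra_ids, electra_mask, glove_ids):
        # First token of the Electra sequence as a pooled representation.
        h_e = self.electra(electra_ids,
                           attention_mask=electra_mask).last_hidden_state[:, 0]
        # Final forward/backward hidden states of the BiLSTM branch.
        _, (h_n, _) = self.lstm(self.glove(glove_ids))
        h_l = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.head(torch.cat([h_e, h_l], dim=-1))  # class logits
```

Feature-level fusion of this kind lets the classification head weigh the contextual Electra representation against the sequential GloVe+LSTM one during training.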
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
Despite the progress we have recorded in the last few years in multilingual
natural language processing, evaluation is typically limited to a small set of
languages with available datasets, which excludes a large number of low-resource
languages. In this paper, we created SIB-200 -- a large-scale open-sourced
benchmark dataset for topic classification in 200 languages and dialects to
address the lack of evaluation datasets for Natural Language Understanding
(NLU). For many of the languages covered in SIB-200, this is the first publicly
available evaluation dataset for NLU. The dataset is based on the Flores-200
machine translation corpus. We annotated the English portion of the dataset and
extended the sentence-level annotation to the remaining 203 languages covered
in the corpus. Despite the simplicity of this task, our evaluations in the
fully supervised, cross-lingual transfer, and large-language-model prompting
settings show that there is still a large gap between the
performance of high-resource and low-resource languages when multilingual
evaluation is scaled to numerous world languages. We found that languages
unseen during the pre-training of multilingual language models,
under-represented language families (like Nilotic and Atlantic-Congo), and
languages from the regions of Africa, Americas, Oceania and South East Asia,
often have the lowest performance on our topic classification dataset. We hope
our dataset will encourage a more inclusive evaluation of multilingual language
models on a more diverse set of languages. https://github.com/dadelani/sib-200Comment: under submissio
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
Large Language Models (LLMs) have emerged as one of the most important
breakthroughs in natural language processing (NLP) for their impressive skills
in language generation and other language-specific tasks. Though LLMs have been
evaluated in various tasks, mostly in English, they have not yet undergone
thorough evaluation in under-resourced languages such as Bengali (Bangla). In
this paper, we evaluate the performance of LLMs for the low-resourced Bangla
language. We select various important and diverse Bangla NLP tasks, such as
abstractive summarization, question answering, paraphrasing, natural language
inference, text classification, and sentiment analysis for zero-shot evaluation
with ChatGPT, LLaMA-2, and Claude-2, and compare their performance with
state-of-the-art fine-tuned models. Our experimental results demonstrate the
inferior performance of LLMs across different Bangla NLP tasks, calling for
further effort to develop a better understanding of LLMs in low-resource
languages like Bangla.
Comment: First two authors contributed equally
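As a hedged illustration of the zero-shot protocol, the snippet below prompts an OpenAI chat model for Bangla sentiment classification; the prompt wording and model name are assumptions, not the paper's exact setup.

```python
# Hypothetical zero-shot prompt for Bangla sentiment, in the style of the
# paper's ChatGPT evaluation; prompt text and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def bangla_sentiment(text: str) -> str:
    prompt = ("Classify the sentiment of the following Bangla sentence as "
              "Positive, Negative, or Neutral. Answer with one word.\n\n"
              f"Sentence: {text}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```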
Impact of Position Bias on Language Models in Token Classification
Language Models (LMs) have shown state-of-the-art performance in Natural
Language Processing (NLP) tasks. Downstream tasks such as Named Entity
Recognition (NER) or Part-of-Speech (POS) tagging are known to suffer from data
imbalance issues, specifically in terms of the ratio of positive to negative
examples, and class imbalance. In this paper, we investigate an additional
issue specific to language models, namely the position bias of positive
examples in token classification tasks. To this end, we conduct an in-depth
evaluation of the impact of position bias on the performance of LMs when
fine-tuned on token classification benchmarks. Our study includes CoNLL03 and
OntoNotes 5.0 for NER, and the English Universal Dependencies treebank (UD_en)
and TweeBank for POS tagging. We
propose an evaluation approach to investigate position bias in Transformer
models. We show that encoders like BERT, ERNIE, and ELECTRA, and decoders such
as GPT2 and BLOOM, can suffer from this bias, with average performance drops of
3% and 9%, respectively. To mitigate this effect, we propose two methods,
Random Position Shifting and Context Perturbation, which we apply to batches
during training. The results show a 2% improvement in model performance on
CoNLL03, UD_en, and TweeBank.
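A minimal sketch of how Random Position Shifting could work, reading the abstract as: offset each training example's position ids by a random amount so that positive tokens are not tied to fixed absolute positions. This is an interpretation of the method, not the authors' released code.

```python
# Interpretation of Random Position Shifting: give every training example
# a random position offset so the model cannot tie labels to absolute
# positions. This follows the abstract, not the authors' released code.
import torch

def shifted_position_ids(input_ids: torch.Tensor,
                         max_positions: int = 512) -> torch.Tensor:
    batch, seq_len = input_ids.shape
    # One random offset per example, small enough that shifted ids stay
    # within the model's learned position range.
    offsets = torch.randint(0, max_positions - seq_len + 1, (batch, 1))
    return torch.arange(seq_len).unsqueeze(0) + offsets  # (batch, seq_len)

# Usage with a BERT-style token classifier:
#   logits = model(input_ids, attention_mask=mask,
#                  position_ids=shifted_position_ids(input_ids)).logits
```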
Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection
Pre-training large neural language models, such as BERT, has led to
impressive gains on many natural language processing (NLP) tasks. Although this
method has proven to be effective for many domains, it might not always provide
desirable benefits. In this paper, we study the effects of hateful pre-training
on low-resource hate speech classification tasks. While previous studies on the
English language have emphasized its importance, we aim to augment their
observations with some non-obvious insights. We evaluate different variations
of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed
subsets of a 40M tweet dataset. This evaluation is carried out for the Indian
languages Hindi and Marathi. This paper provides empirical evidence that hateful
pre-training is not the best pre-training option for hate speech detection. We
show that pre-training on non-hateful text from the target domain provides
similar or better results. Further, we introduce HindTweetBERT and
MahaTweetBERT, the first publicly available BERT models pre-trained on Hindi
and Marathi tweets, respectively. We show that they provide state-of-the-art
performance on hate speech classification tasks. We also release hateful BERT
models for the two languages, along with gold hate speech evaluation benchmarks,
HateEval-Hi and HateEval-Mr, each consisting of 2,000 manually labeled tweets.
The models and
data are available at https://github.com/l3cube-pune/MarathiNLP
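Loading the released models would look roughly like the following; the hub model id is an assumption based on the naming in the abstract and should be taken from the linked MarathiNLP repository.

```python
# Assumed hub id for MahaTweetBERT; check the MarathiNLP repo for the
# exact names of the released models and benchmarks.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "l3cube-pune/marathi-tweets-bert"  # assumption, not verified
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
```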