A Dataset and Strong Baselines for Classification of Czech News Texts
Pre-trained models for Czech Natural Language Processing are often evaluated
on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple
classification tasks such as sentiment classification or article classification
from a single news source. As an alternative, we present
CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech
classification datasets, composed of news articles from various sources
spanning over twenty years, which allows a more rigorous evaluation of such
models. We define four classification tasks: news source, news category,
inferred author's gender, and day of the week. To verify the task difficulty,
we conducted a human evaluation, which revealed that human performance lags
behind strong machine-learning baselines built upon pre-trained transformer
models. Furthermore, we show that a language-specific pre-trained encoder
outperforms selected commercially available large-scale generative language
models.
Comment: 12 pages, Accepted to Text, Speech and Dialogue (TSD) 2023
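To make the encoder-baseline setup concrete, here is a minimal sketch of fine-tuning a Czech pre-trained transformer on one CZE-NEC task (news-source classification). The dataset identifier and the RobeCzech checkpoint are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch only: fine-tune a Czech encoder on one CZE-NEC task.
# "CZE-NEC" is a placeholder dataset id; substitute the official release.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("CZE-NEC")  # hypothetical id; assumes a ClassLabel "label"
tok = AutoTokenizer.from_pretrained("ufal/robeczech-base")  # assumed encoder

def encode(batch):
    return tok(batch["text"], truncation=True, max_length=512)

ds = ds.map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "ufal/robeczech-base",
    num_labels=ds["train"].features["label"].num_classes,
)
args = TrainingArguments("cze-nec-news-source", num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
Trainer(model=model, args=args, train_dataset=ds["train"],
        eval_dataset=ds["validation"], tokenizer=tok).train()
```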
An Ensemble Approach to Question Classification: Integrating Electra Transformer, GloVe, and LSTM
This paper introduces a novel ensemble approach for question classification
using state-of-the-art models -- Electra, GloVe, and LSTM. The proposed model
is trained and evaluated on the TREC dataset, a well-established benchmark for
question classification tasks. The ensemble model combines the strengths of
Electra, a transformer-based model for language understanding; GloVe, global
vectors for word representation; and LSTM, a recurrent neural network variant,
providing a robust and efficient solution for question classification.
Extensive experiments were carried out to compare the performance of the
proposed ensemble approach with other cutting-edge models, such as BERT,
RoBERTa, and DistilBERT. Our results demonstrate that the ensemble model
outperforms these models across all evaluation metrics, achieving an accuracy
of 0.8 on the test set. These findings underscore the effectiveness of the
ensemble approach in enhancing the performance of question classification
tasks, and invite further exploration of ensemble methods in natural language
processing.
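One plausible reading of this architecture is sketched below: an Electra encoder and a GloVe+LSTM branch whose features are concatenated before a shared classification head. The fusion strategy, dimensions, and checkpoint name are assumptions; the paper may instead combine per-model predictions, for example by voting.

```python
# Sketch of a feature-level ensemble: Electra [CLS] features concatenated
# with the final states of a GloVe-initialized BiLSTM. Sizes and the
# checkpoint are assumptions; num_classes=6 matches TREC's coarse labels.
import torch
import torch.nn as nn
from transformers import ElectraModel

class ElectraGloveLstm(nn.Module):
    def __init__(self, glove_weights: torch.Tensor, num_classes: int = 6,
                 lstm_hidden: int = 128):
        super().__init__()
        self.electra = ElectraModel.from_pretrained(
            "google/electra-base-discriminator")
        self.glove = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.lstm = nn.LSTM(glove_weights.size(1), lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(
            self.electra.config.hidden_size + 2 * lstm_hidden, num_classes)

    def forward(self, electra_ids, electra_mask, glove_ids):
        # First token of the Electra sequence as a pooled representation.
        h_e = self.electra(electra_ids,
                           attention_mask=electra_mask).last_hidden_state[:, 0]
        # Final forward/backward hidden states of the BiLSTM branch.
        _, (h_n, _) = self.lstm(self.glove(glove_ids))
        h_l = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.head(torch.cat([h_e, h_l], dim=-1))  # class logits
```

Feature-level fusion of this kind lets the classification head weigh the contextual Electra representation against the sequential GloVe+LSTM one during training.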
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
Despite the progress we have recorded in the last few years in multilingual
natural language processing, evaluation is typically limited to a small set of
languages with available datasets, which excludes a large number of low-resource
languages. In this paper, we created SIB-200 -- a large-scale open-sourced
benchmark dataset for topic classification in 200 languages and dialects to
address the lack of evaluation datasets for Natural Language Understanding
(NLU). For many of the languages covered in SIB-200, this is the first publicly
available evaluation dataset for NLU. The dataset is based on the Flores-200
machine translation corpus. We annotated the English portion of the dataset and
extended the sentence-level annotation to the remaining 203 languages covered
in the corpus. Despite the simplicity of this task, our evaluations in the
fully supervised, cross-lingual transfer, and large-language-model prompting
settings show that there is still a large gap between the
performance of high-resource and low-resource languages when multilingual
evaluation is scaled to numerous world languages. We found that languages
unseen during the pre-training of multilingual language models,
under-represented language families (like Nilotic and Atlantic-Congo), and
languages from the regions of Africa, Americas, Oceania and South East Asia,
often have the lowest performance on our topic classification dataset. We hope
our dataset will encourage a more inclusive evaluation of multilingual language
models on a more diverse set of languages. https://github.com/dadelani/sib-200Comment: under submissio
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
Large Language Models (LLMs) have emerged as one of the most important
breakthroughs in natural language processing (NLP) for their impressive skills
in language generation and other language-specific tasks. Though LLMs have been
evaluated in various tasks, mostly in English, they have not yet undergone
thorough evaluation in under-resourced languages such as Bengali (Bangla). In
this paper, we evaluate the performance of LLMs for the low-resourced Bangla
language. We select various important and diverse Bangla NLP tasks, such as
abstractive summarization, question answering, paraphrasing, natural language
inference, text classification, and sentiment analysis for zero-shot evaluation
with ChatGPT, LLaMA-2, and Claude-2, and compare their performance with
state-of-the-art fine-tuned models. Our experimental results demonstrate the
inferior performance of LLMs across different Bangla NLP tasks, calling for
further effort to develop a better understanding of LLMs in low-resource
languages like Bangla.
Comment: First two authors contributed equally
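As a hedged illustration of the zero-shot protocol, the snippet below prompts an OpenAI chat model for Bangla sentiment classification; the prompt wording and model name are assumptions, not the paper's exact setup.

```python
# Hypothetical zero-shot prompt for Bangla sentiment, in the style of the
# paper's ChatGPT evaluation; prompt text and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def bangla_sentiment(text: str) -> str:
    prompt = ("Classify the sentiment of the following Bangla sentence as "
              "Positive, Negative, or Neutral. Answer with one word.\n\n"
              f"Sentence: {text}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```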
Impact of Position Bias on Language Models in Token Classification
Language Models (LMs) have shown state-of-the-art performance in Natural
Language Processing (NLP) tasks. Downstream tasks such as Named Entity
Recognition (NER) or Part-of-Speech (POS) tagging are known to suffer from data
imbalance issues, specifically in terms of the ratio of positive to negative
examples, and class imbalance. In this paper, we investigate an additional
issue specific to language models, namely the position bias of positive
examples in token classification tasks. To this end, we conduct an in-depth
evaluation of the impact of position bias on the performance of LMs when
fine-tuned on token classification benchmarks. Our study includes CoNLL03 and
OntoNotes 5.0 for NER, and the English Universal Dependencies treebank (UD_en)
and TweeBank for POS tagging. We
propose an evaluation approach to investigate position bias in Transformer
models. We show that encoders like BERT, ERNIE, and ELECTRA, and decoders such
as GPT2 and BLOOM, can suffer from this bias, with average performance drops of
3% and 9%, respectively. To mitigate this effect, we propose two methods,
Random Position Shifting and Context Perturbation, which we apply to batches
during training. The results show a 2% improvement in model performance on
CoNLL03, UD_en, and TweeBank.
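A minimal sketch of how Random Position Shifting could work, reading the abstract as: offset each training example's position ids by a random amount so that positive tokens are not tied to fixed absolute positions. This is an interpretation of the method, not the authors' released code.

```python
# Interpretation of Random Position Shifting: give every training example
# a random position offset so the model cannot tie labels to absolute
# positions. This follows the abstract, not the authors' released code.
import torch

def shifted_position_ids(input_ids: torch.Tensor,
                         max_positions: int = 512) -> torch.Tensor:
    batch, seq_len = input_ids.shape
    # One random offset per example, small enough that shifted ids stay
    # within the model's learned position range.
    offsets = torch.randint(0, max_positions - seq_len + 1, (batch, 1))
    return torch.arange(seq_len).unsqueeze(0) + offsets  # (batch, seq_len)

# Usage with a BERT-style token classifier:
#   logits = model(input_ids, attention_mask=mask,
#                  position_ids=shifted_position_ids(input_ids)).logits
```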
Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection
Pre-training large neural language models, such as BERT, has led to
impressive gains on many natural language processing (NLP) tasks. Although this
method has proven to be effective for many domains, it might not always provide
desirable benefits. In this paper, we study the effects of hateful pre-training
on low-resource hate speech classification tasks. While previous studies on the
English language have emphasized its importance, we aim to augment their
observations with some non-obvious insights. We evaluate different variations
of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed
subsets of a 40M tweet dataset. This evaluation is carried out for the Indian
languages Hindi and Marathi. This paper provides empirical evidence that hateful
pre-training is not the best pre-training option for hate speech detection. We
show that pre-training on non-hateful text from the target domain provides
similar or better results. Further, we introduce HindTweetBERT and
MahaTweetBERT, the first publicly available BERT models pre-trained on Hindi
and Marathi tweets, respectively. We show that they provide state-of-the-art
performance on hate speech classification tasks. We also release hateful BERT
models for the two languages, along with gold hate speech evaluation benchmarks,
HateEval-Hi and HateEval-Mr, each consisting of 2,000 manually labeled tweets.
The models and
data are available at https://github.com/l3cube-pune/MarathiNLP
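Loading the released models would look roughly like the following; the hub model id is an assumption based on the naming in the abstract and should be taken from the linked MarathiNLP repository.

```python
# Assumed hub id for MahaTweetBERT; check the MarathiNLP repo for the
# exact names of the released models and benchmarks.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "l3cube-pune/marathi-tweets-bert"  # assumption, not verified
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
```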