BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages
Current research on automatic readability assessment (ARA) has focused on
improving the performance of models in high-resource languages such as English.
In this work, we introduce and release BasahaCorpus as part of an initiative
aimed at expanding available corpora and baseline models for readability
assessment in lower resource languages in the Philippines. We compiled a corpus
of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and
Rinconada -- languages belonging to the Central Philippine family tree subgroup
-- to train ARA models using surface-level, syllable-pattern, and n-gram
overlap features. We also propose a new hierarchical cross-lingual modeling
approach that takes advantage of a language's placement in the family tree to
increase the amount of available training data. Our study yields encouraging
results that support previous work showcasing the efficacy of cross-lingual
models in low-resource settings, as well as similarities in highly informative
linguistic features for mutually intelligible languages.
Comment: Final camera-ready paper for EMNLP 2023 (Main
Age Recommendation from Texts and Sentences for Children
Children have a lower capacity for text understanding than adults. Moreover, this
capacity differs among children of different ages. Hence, automatically
predicting a recommended age from texts or sentences would be of great
benefit, both for proposing suitable texts to children and for helping authors
write in the most appropriate way. This paper presents our recent advances on the age
recommendation task. We treat age recommendation as a regression task,
discuss the need for appropriate evaluation metrics, study the use of a
state-of-the-art machine learning model, namely the Transformer, and compare it to
other models from the literature. Our results are also compared with
recommendations made by experts. Further, this paper deals with preliminary
explainability of the age prediction model by analyzing various linguistic
features. We conduct the experiments on a dataset of 3,673 French texts (132K
sentences, 2.5M words). To recommend age at the text level and sentence level,
our best models achieve MAE scores of 0.98 and 1.83 respectively on the test
set. Also, compared to the recommendations made by experts, our sentence-level
recommendation model gets a similar score to the experts, while the text-level
recommendation model outperforms the experts by an MAE score of 1.48.
Comment: 26 pages (incl. 4 pages for appendices), 4 figures, 20 table
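The abstract above reports results as mean absolute error (MAE), which averages the absolute differences between predicted and reference ages. The following is a minimal sketch of that metric only; the function name and the sample ages are hypothetical and are not taken from the paper or its dataset.

```python
def mean_absolute_error(predicted, reference):
    """Return the average absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("inputs must be non-empty")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


# Hypothetical ages in years, for illustration only (not from the paper).
predicted_ages = [6.0, 9.5, 12.0]
reference_ages = [7.0, 9.0, 10.0]
mae = mean_absolute_error(predicted_ages, reference_ages)
print(f"MAE: {mae:.2f} years")
```

Under this definition, a lower MAE means the recommended ages are closer to the reference ages on average.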
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
Large Language Models (LLMs) have emerged as one of the most important
breakthroughs in natural language processing (NLP) for their impressive skills
in language generation and other language-specific tasks. Though LLMs have been
evaluated in various tasks, mostly in English, they have not yet undergone
thorough evaluation in under-resourced languages such as Bengali (Bangla). In
this paper, we evaluate the performance of LLMs for the low-resourced Bangla
language. We select various important and diverse Bangla NLP tasks, such as
abstractive summarization, question answering, paraphrasing, natural language
inference, text classification, and sentiment analysis for zero-shot evaluation
with ChatGPT, LLaMA-2, and Claude-2 and compare the performance with
state-of-the-art fine-tuned models. Our experimental results demonstrate
inferior performance of LLMs on different Bangla NLP tasks, calling for
further efforts to develop a better understanding of LLMs in low-resource
languages like Bangla.
Comment: First two authors contributed equall
Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks
Authorship classification is a method of automatically determining the appropriate author of an unknown linguistic text. Although research on authorship classification has progressed significantly in high-resource languages, it is at a primitive stage for resource-constrained languages like Bengali. This paper presents an authorship classification approach based on a Convolutional Neural Network (CNN) and comprising four modules: embedding model generation, feature representation, classifier training and classifier testing. For this purpose, this work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of authors’ classes and unique words. Using three text embedding techniques (Word2Vec, GloVe and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All the embedding models are assessed by intrinsic evaluators, and the 9 best-performing models out of the 90 are selected for authorship classification. In total, 36 classification models, covering four classifiers (CNN, LSTM, SVM, SGD) and three embedding techniques with 100, 200 and 250 embedding dimensions, are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16 and LD). Among the models, the optimized CNN with GloVe achieved the highest classification accuracies of 93.45%, 95.02%, and 98.67% on the datasets BACC-18, BAAD16, and LD, respectively.
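One way to read the count of 36 classification models is as the full combination of four classifiers, three embedding techniques and three embedding dimensions (4 x 3 x 3). The sketch below enumerates that grid. It assumes every combination is used, which the abstract implies but does not state outright, and the variable names are hypothetical.

```python
from itertools import product

# Names taken from the abstract; the full-grid pairing is an assumption.
classifiers = ["CNN", "LSTM", "SVM", "SGD"]
embeddings = ["Word2Vec", "GloVe", "FastText"]
dimensions = [100, 200, 250]

# Every (classifier, embedding, dimension) combination.
configurations = list(product(classifiers, embeddings, dimensions))

print(len(configurations))
for classifier, embedding, dimension in configurations[:3]:
    print(f"{classifier} + {embedding} ({dimension}d)")
```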