139 research outputs found
BERT Embeddings for Automatic Readability Assessment
Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. One of the many open problems in the field is making models trained for the task effective even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models together with handcrafted linguistic features through a combined method for readability assessment. Results show that the proposed method outperforms classical approaches on English and Filipino datasets, obtaining up to a 12.4% increase in F1 performance. We also show that the general information encoded in BERT embeddings can serve as a substitute feature set for low-resource languages like Filipino, which lack the semantic and syntactic NLP tools needed to explicitly extract feature values for the task.
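The combined method described above can be sketched as a concatenate-and-classify pipeline. Everything below is synthetic: random placeholder vectors stand in for real BERT embeddings and handcrafted linguistic features, and the readability labels are invented, so only the shape of the pipeline is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_docs, bert_dim, n_handcrafted = 200, 32, 5
# Placeholders: real BERT [CLS] embeddings and handcrafted linguistic
# features (e.g. sentence length, word frequency) would go here.
bert_emb = rng.normal(size=(n_docs, bert_dim))
handcrafted = rng.normal(size=(n_docs, n_handcrafted))
labels = rng.integers(0, 3, size=n_docs)  # three readability levels

# Combined method: concatenate both feature sets into one matrix.
X = np.concatenate([bert_emb, handcrafted], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("macro-F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```

With real inputs, the BERT half of the matrix would come from a frozen encoder and the handcrafted half from language-specific feature extractors.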
BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages
Current research on automatic readability assessment (ARA) has focused on
improving the performance of models in high-resource languages such as English.
In this work, we introduce and release BasahaCorpus as part of an initiative
aimed at expanding available corpora and baseline models for readability
assessment in lower resource languages in the Philippines. We compiled a corpus
of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and
Rinconada -- languages belonging to the Central Philippine family tree subgroup
-- to train ARA models using surface-level, syllable-pattern, and n-gram
overlap features. We also propose a new hierarchical cross-lingual modeling
approach that takes advantage of a language's placement in the family tree to
increase the amount of available training data. Our study yields encouraging
results that support previous work showcasing the efficacy of cross-lingual
models in low-resource settings, as well as similarities in highly informative
linguistic features for mutually intelligible languages.
Comment: Final camera-ready paper for EMNLP 2023 (Main Conference).
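Two of the feature families mentioned above are easy to illustrate. The `cv_pattern` and `ngram_overlap` helpers below are hypothetical simplifications, not the paper's actual extractors: one maps a word to its consonant/vowel syllable-like pattern, the other measures character n-gram overlap between two texts.

```python
from collections import Counter

VOWELS = set("aeiou")

def cv_pattern(word):
    """Map a word to its consonant/vowel pattern, e.g. 'bata' -> 'CVCV'."""
    return "".join("V" if ch in VOWELS else "C"
                   for ch in word.lower() if ch.isalpha())

def char_ngrams(text, n=3):
    """Multiset of character n-grams in a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the character n-gram sets of two texts."""
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(cv_pattern("bata"))                    # CVCV
print(ngram_overlap("kabataan", "kabade"))   # overlap in [0, 1]
```

In a cross-lingual setting, high n-gram overlap between mutually intelligible languages is one signal that features transfer across them.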
Age Recommendation from Texts and Sentences for Children
Children have less text understanding capability than adults. Moreover, this
capability differs among the children of different ages. Hence, automatically
predicting a recommended age from texts or sentences would be of great benefit
for proposing suitable texts to children and for helping authors write in the
most appropriate way. This paper presents our recent advances on the age
recommendation task. We consider age recommendation as a regression task, and
discuss the need for appropriate evaluation metrics, study the use of
state-of-the-art machine learning models, namely Transformers, and compare them
to different models from the literature. Our results are also compared with
recommendations made by experts. Further, this paper deals with preliminary
explainability of the age prediction model by analyzing various linguistic
features. We conduct the experiments on a dataset of 3,673 French texts (132K
sentences, 2.5M words). To recommend age at the text level and sentence level,
our best models achieve MAE scores of 0.98 and 1.83 respectively on the test
set. Also, compared to the recommendations made by experts, our sentence-level
recommendation model gets a similar score to the experts, while the text-level
recommendation model outperforms the experts by an MAE score of 1.48.
Comment: 26 pages (incl. 4 pages for appendices), 4 figures, 20 tables.
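The MAE metric used to score the regression models above is simple to reproduce; the ages below are invented for illustration, not taken from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute gap between predicted and true ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

true_ages = [6, 8, 10, 12]
pred_ages = [7, 8, 9, 14]   # hypothetical model outputs
print(mean_absolute_error(true_ages, pred_ages))  # 1.0
```

An MAE of 0.98 at the text level thus means the model's recommended age is off by about one year on average.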
Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language
The Bangla linguistic variety is a fascinating mix of regional dialects that
adds to the cultural diversity of the Bangla-speaking community. Despite
extensive study into translating Bangla to English, English to Bangla, and
Banglish to Bangla in the past, there has been a noticeable gap in translating
Bangla regional dialects into standard Bangla. In this study, we set out to
fill this gap by creating a collection of 32,500 sentences, encompassing
Bangla, Banglish, and English, representing five regional Bangla dialects. Our
aim is to translate these regional dialects into standard Bangla and detect
regions accurately. To achieve this, we applied the mT5 and BanglaT5 models
to translate the regional dialects into standard Bangla. Additionally,
we employed mBERT and Bangla-bert-base to determine the specific regions from
where these dialects originated. Our experimental results showed the highest
BLEU score of 69.06 for Mymensingh regional dialects and the lowest BLEU score
of 36.75 for Chittagong regional dialects. We also observed the lowest average
word error rate of 0.1548 for Mymensingh regional dialects and the highest of
0.3385 for Chittagong regional dialects. For region detection, we achieved an
accuracy of 85.86% for Bangla-bert-base and 84.36% for mBERT. This is the first
large-scale investigation of Bangla regional dialects to Bangla machine
translation. We believe our findings will not only pave the way for future work
on Bangla regional dialects to Bangla machine translation, but will also be
useful in solving similar language-related challenges in low-resource language
conditions.
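The word error rate reported above can be computed with a standard word-level Levenshtein distance. The sketch below is a generic implementation, and the example sentences are invented romanized stand-ins, not drawn from the Vashantor dataset.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row Levenshtein dynamic program over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (0 if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

# One substitution out of three reference words -> WER of 1/3.
print(word_error_rate("ami valo achi", "ami bhalo achi"))
```

Averaging this score over a test set gives figures like the 0.1548 reported for the Mymensingh dialect.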
Authorship Classification in a Resource Constraint Language Using Convolutional Neural Networks
Authorship classification is the task of automatically determining the author of a linguistic text of unknown origin. Although research on authorship classification has progressed significantly in high-resource languages, it is still at a primitive stage for resource-constrained languages like Bengali. This paper presents an authorship classification approach based on Convolutional Neural Networks (CNNs) comprising four modules: embedding model generation, feature representation, classifier training, and classifier testing. For this purpose, this work develops a new embedding corpus (named WEC) and a Bengali authorship classification corpus (called BACC-18), which are more robust in terms of author classes and unique words. Using three text embedding techniques (Word2Vec, GloVe, and FastText) and combinations of different hyperparameters, 90 embedding models are created in this study. All embedding models are assessed by intrinsic evaluators, and the 9 best-performing of the 90 models are selected for authorship classification. In total, 36 classification models, combining four classifiers (CNN, LSTM, SVM, SGD) and three embedding techniques with embedding dimensions of 100, 200, and 250, are trained with optimized hyperparameters and tested on three benchmark datasets (BACC-18, BAAD16, and LD). Among these, the optimized CNN with the GloVe model achieved the highest classification accuracies of 93.45%, 95.02%, and 98.67% on the BACC-18, BAAD16, and LD datasets, respectively.
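A minimal sketch of the embed-then-classify pipeline described above, using SGD (one of the four listed classifiers). The five-word vocabulary, random embedding table, and author labels are all toy placeholders standing in for a trained Word2Vec/GloVe/FastText model and the real corpora.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
# Toy embedding table standing in for a trained embedding model;
# 100 dimensions, matching the smallest setting in the paper.
vocab = {"ami": 0, "tumi": 1, "boi": 2, "lekha": 3, "golpo": 4}
emb = rng.normal(size=(len(vocab), 100))

def doc_vector(text):
    """Average the word vectors of in-vocabulary tokens (zero if none)."""
    idx = [vocab[w] for w in text.split() if w in vocab]
    return emb[idx].mean(axis=0) if idx else np.zeros(emb.shape[1])

docs = ["ami boi lekha", "tumi golpo lekha", "ami golpo", "tumi boi"]
authors = [0, 1, 0, 1]  # toy author labels
X = np.stack([doc_vector(d) for d in docs])

clf = SGDClassifier(random_state=0).fit(X, authors)
print(clf.predict(X))
```

The paper's CNN variant would consume per-token embedding sequences rather than averaged document vectors, but the corpus-to-features-to-classifier flow is the same.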
Character Recognition
Character recognition is one of the pattern recognition technologies most widely used in practical applications. This book presents recent advances relevant to character recognition, from technical topics such as image processing, feature extraction, and classification to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, instead using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2 - Global security and crisis management
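The phrase-expansion idea can be sketched as follows, with `difflib`'s character-level ratio as a crude stand-in for the paper's morphosyntactic similarity score. The one-entry phrase table and its morphological variants are hypothetical examples, not data from the paper.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Surface-form similarity; a stand-in for a morphosyntactic score."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical entry from an existing PBSMT phrase table.
phrase_table = {("red car", "voiture rouge")}

# Morphological variants supplied by an external resource (illustrative).
variants = {"red car": ["red cars"], "voiture rouge": ["voitures rouges"]}

expanded = set(phrase_table)
for src, tgt in phrase_table:
    for v_src in variants.get(src, []):
        # Keep a new association only if the source variant stays close
        # to the original phrase; pair it with the closest target variant.
        if similarity(src, v_src) > 0.8:
            best_tgt = max(variants.get(tgt, [tgt]),
                           key=lambda v: similarity(tgt, v))
            expanded.add((v_src, best_tgt))

print(sorted(expanded))
```

A real system would also carry over (and possibly rescale) the translation and reordering model scores of the original association to each generated variant.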