BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages
Current research on automatic readability assessment (ARA) has focused on
improving the performance of models in high-resource languages such as English.
In this work, we introduce and release BasahaCorpus as part of an initiative
aimed at expanding available corpora and baseline models for readability
assessment in lower resource languages in the Philippines. We compiled a corpus
of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and
Rinconada -- languages belonging to the Central Philippine family tree subgroup
-- to train ARA models using surface-level, syllable-pattern, and n-gram
overlap features. We also propose a new hierarchical cross-lingual modeling
approach that takes advantage of a language's placement in the family tree to
increase the amount of available training data. Our study yields encouraging
results that support previous work showcasing the efficacy of cross-lingual
models in low-resource settings, as well as similarities in highly informative
linguistic features for mutually intelligible languages.
Comment: Final camera-ready paper for EMNLP 2023 (Main Conference)
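The surface-level and syllable-pattern features mentioned above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the feature names, the regex-based tokenization, and the vowel inventory are assumptions for demonstration only.

```python
import re

def surface_features(text):
    """Toy surface-level readability features: counts and average lengths."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
    }

def syllable_pattern(word, vowels="aeiou"):
    """Collapse a word into a consonant/vowel pattern, e.g. 'bata' -> 'CVCV'."""
    return "".join("V" if c.lower() in vowels else "C" for c in word)
```

Features of this kind are cheap to extract, which is exactly why they remain attractive for languages without mature NLP toolchains.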
Age Recommendation from Texts and Sentences for Children
Children have less text understanding capability than adults. Moreover, this
capability differs among the children of different ages. Hence, automatically
predicting a recommended age based on texts or sentences would be a great
benefit to propose adequate texts to children and to help authors writing in
the most appropriate way. This paper presents our recent advances on the age
recommendation task. We consider age recommendation as a regression task, and
discuss the need for appropriate evaluation metrics, study the use of
state-of-the-art machine learning model, namely Transformers, and compare it to
different models coming from the literature. Our results are also compared with
recommendations made by experts. Further, this paper deals with preliminary
explainability of the age prediction model by analyzing various linguistic
features. We conduct the experiments on a dataset of 3,673 French texts (132K
sentences, 2.5M words). To recommend age at the text level and sentence level,
our best models achieve MAE scores of 0.98 and 1.83 respectively on the test
set. Also, compared to the recommendations made by experts, our sentence-level
recommendation model gets a similar score to the experts, while the text-level
recommendation model outperforms the experts by 1.48 MAE.
Comment: 26 pages (incl. 4 pages for appendices), 4 figures, 20 tables
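The MAE scores quoted above are mean absolute error between the predicted and reference ages, which for a regression task can be computed directly:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute gap between predicted and reference ages."""
    assert len(y_true) == len(y_pred) and y_true
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g. reference ages [7, 9, 11] vs. predictions [8, 9, 10] -> MAE of 2/3
```

An MAE of 0.98 at text level thus means the recommended age is off by about one year on average.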
Character Recognition
Character recognition is one of the pattern recognition technologies that are most widely used in practical applications. This book presents recent advances that are relevant to character recognition, from technical topics such as image processing, feature extraction or classification, to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, instead using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source, target, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2 - Global security and crisis management
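The expansion step described above can be sketched as follows. This is a minimal sketch, not the paper's method: the threshold, the use of `difflib.SequenceMatcher` as the similarity score, and the shape of the phrase table are assumptions standing in for the paper's morphosyntactic score and model formats.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1]; a crude stand-in for the
    paper's morphosyntactically informed score."""
    return SequenceMatcher(None, a, b).ratio()

def expand_phrase_table(phrase_table, variants, threshold=0.8):
    """For each known (src, tgt) pair, add entries whose source side is a
    close morphological variant of a known source phrase."""
    new_entries = []
    for src, tgt in phrase_table:
        for var in variants.get(src, []):
            if similarity(src, var) >= threshold:
                new_entries.append((var, tgt))
    return new_entries
```

The point is that variants such as inflected forms inherit the translation of their base phrase, reducing OOV words without any new parallel data.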
Mukhyansh: A Headline Generation Dataset for Indic Languages
The task of headline generation within the realm of Natural Language
Processing (NLP) holds immense significance, as it strives to distill the true
essence of textual content into concise and attention-grabbing summaries. While
noteworthy progress has been made in headline generation for widely spoken
languages like English, there persist numerous challenges when it comes to
generating headlines in low-resource languages, such as the rich and diverse
Indian languages. A prominent obstacle that specifically hinders headline
generation in Indian languages is the scarcity of high-quality annotated data.
To address this crucial gap, we proudly present Mukhyansh, an extensive
multilingual dataset, tailored for Indian language headline generation.
Comprising an impressive collection of over 3.39 million article-headline
pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu,
Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a
comprehensive evaluation of several state-of-the-art baseline models.
Additionally, through an empirical analysis of existing works, we demonstrate
that models trained on Mukhyansh outperform all others, achieving an impressive
average ROUGE-L score of 31.43 across all 8 languages.
Comment: Accepted at PACLIC 2023
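The ROUGE-L score reported above is based on the longest common subsequence (LCS) between the generated and reference headlines. A minimal sketch of the F1 variant, assuming whitespace tokenization (production evaluations typically use a dedicated ROUGE package):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference headline and a generated one."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```

Unlike n-gram ROUGE variants, LCS rewards in-order word overlap without requiring contiguous matches, which suits headlines that paraphrase the reference.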
Automatic Readability Assessment for Closely Related Languages
In recent years, the main focus of research on automatic readability
assessment (ARA) has shifted towards using expensive deep learning-based
methods with the primary goal of increasing models' accuracy. This, however, is
rarely applicable for low-resource languages where traditional handcrafted
features are still widely used due to the lack of existing NLP tools to extract
deeper linguistic representations. In this work, we take a step back from the
technical component and focus on how linguistic aspects such as mutual
intelligibility or degree of language relatedness can improve ARA in a
low-resource setting. We collect short stories written in three languages of
the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment
models and explore the interaction of data and features in various
cross-lingual setups. Our results show that the inclusion of CrossNGO, a novel
specialized feature exploiting n-gram overlap applied to languages with high
mutual intelligibility, significantly improves the performance of ARA models
compared to the use of off-the-shelf large multilingual language models alone.
Consequently, when both linguistic representations are combined, we achieve
state-of-the-art results for Tagalog and Cebuano, and baseline scores for ARA
in Bikol.
Comment: Camera-ready version for ACL 2023
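CrossNGO itself is defined in the paper; a minimal n-gram overlap measure in the same spirit might look like the following. The character-trigram choice, the Jaccard formulation, and the underscore word-boundary marker are illustrative assumptions, not the paper's exact feature.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, with '_' marking word boundaries."""
    s = text.lower().replace(" ", "_")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_overlap(text_a, text_b, n=3):
    """Jaccard overlap of character n-grams: a rough proxy for lexical
    similarity between mutually intelligible languages."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0
```

High overlap between two closely related languages is exactly the signal that lets annotated data in one language inform readability models in the other.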