Pratique de la lecture en thaï et hindi en L2 : classification automatique de textes par progression lexicale
This article looks at the automatic creation of teaching and learning resources for less commonly taught languages, which often lack pedagogical materials, from authentic, unsimplified texts. The study is inspired by Ghadirian (2002) and the associated computer program TextLadder, which computes the lexical similarity between the texts of a corpus and orders them so that target vocabulary is introduced incrementally, making reading easier for the learner. This kind of automated text sequencing can be used to select sequences of texts appropriate to the level of lexical competence of the L2 reader, whether for independent readers or for creating teaching material for classroom use. The method is particularly suitable for classifying texts with a similar topic or theme.
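The idea of sequencing texts by lexical progression can be illustrated with a minimal sketch. This is not TextLadder's actual algorithm, which the abstract does not specify; it assumes a simple greedy strategy in which the next text is always the one that introduces the fewest word types the reader has not yet met. The function names and the toy English sentences are hypothetical, and the regex tokenizer is a stand-in suited only to this toy example: Thai is written without spaces between words, so real Thai or Hindi data would need a proper word segmenter.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of word types (a deliberately crude tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def order_by_lexical_progression(texts):
    """Greedily order texts so that each next text adds the fewest unseen word types.

    Returns a list of (text_index, new_word_count) pairs in reading order.
    This greedy rule is an assumption made for illustration only.
    """
    vocab = [tokenize(t) for t in texts]
    remaining = list(range(len(texts)))
    known = set()  # word types the reader has met so far
    order = []
    while remaining:
        # Pick the text introducing the fewest new word types; ties go to the earliest text.
        best = min(remaining, key=lambda i: (len(vocab[i] - known), i))
        order.append((best, len(vocab[best] - known)))
        known |= vocab[best]
        remaining.remove(best)
    return order


texts = [
    "the market sells rice and fish and fruit every morning",
    "the market sells rice",
    "the market sells rice and fish",
]
sequence = order_by_lexical_progression(texts)
print(sequence)  # [(1, 4), (2, 2), (0, 3)]
```

In this toy run the shortest text comes first, and each later text adds only two or three new word types, which is the kind of incremental vocabulary load the abstract describes.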
Age Recommendation from Texts and Sentences for Children
Children have a lower capacity for text understanding than adults, and this
capacity differs among children of different ages. Hence, automatically
predicting a recommended age based on texts or sentences would be a great
benefit, both for proposing suitable texts to children and for helping authors
write in the most appropriate way. This paper presents our recent advances on the age
recommendation task. We consider age recommendation as a regression task, and
discuss the need for appropriate evaluation metrics, study the use of
a state-of-the-art machine learning model, namely Transformers, and compare it to
different models coming from the literature. Our results are also compared with
recommendations made by experts. Further, this paper deals with preliminary
explainability of the age prediction model by analyzing various linguistic
features. We conduct the experiments on a dataset of 3,673 French texts (132K
sentences, 2.5M words). To recommend age at the text level and sentence level,
our best models achieve MAE scores of 0.98 and 1.83 respectively on the test
set. Also, compared to the recommendations made by experts, our sentence-level
recommendation model gets a similar score to the experts, while the text-level
recommendation model outperforms the experts by an MAE score of 1.48.
Comment: 26 pages (incl. 4 pages for appendices), 4 figures, 20 tables
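Since the abstract treats age recommendation as a regression task scored with MAE (mean absolute error), a small worked example may help in reading the reported figures. The ages below are invented for illustration and do not come from the paper's dataset; the unit is presumably years, which the abstract does not state explicitly.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between reference and predicted ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical reference ages and model predictions for four texts.
true_ages = [6.0, 8.0, 10.0, 12.0]
predicted_ages = [7.0, 8.0, 9.0, 14.0]

mae = mean_absolute_error(true_ages, predicted_ages)
print(mae)  # 1.0
```

Under this reading, an MAE of 0.98 at the text level would mean that predictions are off by just under one unit of age on average.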
An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach
This thesis brings together an overview of Text Readability (TR) and Text Simplification (TS) with an application of both to Modern Standard Arabic (MSA). It presents our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations for enhancing the TR and TS models.
Reading is one of the most vital tasks, providing language input for communication and comprehension skills. It has been shown that long sentences, connected sentences, embedded phrases, the passive voice, non-standard word orders, and infrequent words can increase text difficulty for people with low literacy levels as well as for second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. On the 3-way CEFR (Common European Framework of Reference for Languages) proficiency-level classification, Arabic-BERT and XLM-R reach F-1 scores of 0.80 and 0.75, respectively, and the regression task reaches a Spearman correlation of 0.71. The binary difficulty classifier reaches an F-1 of 0.94, and the sentence-pair semantic similarity classifier an F-1 of 0.98.
TS is an NLP task aiming to reduce the linguistic complexity of a text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented with two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F-1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieved P 0.97, R 0.97 and F-1 0.97.
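As a quick sanity check on the reported figures, F-1 is conventionally the harmonic mean of precision (P) and recall (R). The sketch below applies that formula to the two P/R pairs quoted in the abstract. It assumes the reported values can be treated as a single P/R pair; BERTScore figures are often averaged over sentences, in which case the identity need not hold exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P/R pairs as quoted in the abstract.
mt5_f1 = f1_score(0.72, 0.68)
combined_f1 = f1_score(0.97, 0.97)

print(round(mt5_f1, 2), round(combined_f1, 2))  # 0.7 0.97
```

Both values agree with the reported F-1 scores of 0.70 and 0.97 after rounding to two decimals.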
To reiterate, this research demonstrated the effectiveness of a corpus-based method combined with the extraction of extensive linguistic features via the latest NLP techniques. It provided insights that can be of use in various Arabic corpus studies and NLP tasks, such as translation for educational purposes.