
    An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach

    This thesis brings together an overview of Text Readability (TR) in relation to Text Simplification (TS), with an application of both to Modern Standard Arabic (MSA). It presents our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations for enhancing the TR and TS models. Reading is one of the most vital tasks that provide language input for communication and comprehension skills. Research has shown that the use of long sentences, connected sentences, embedded phrases, passive voice, non-standard word orders, and infrequent words increases text difficulty for people with low literacy levels, as well as for second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. The 3-way CEFR (Common European Framework of Reference for Languages proficiency levels) classification achieves F1 scores of 0.80 and 0.75 with Arabic-BERT and XLM-R, respectively, and a Spearman correlation of 0.71 on the regression task. The binary difficulty classifier reaches an F1 of 0.94, and the sentence-pair semantic similarity classifier an F1 of 0.98. TS is an NLP task that aims to reduce the linguistic complexity of a text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented with two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieved P 0.97, R 0.97 and F1 0.97.
    To reiterate, this research demonstrated the effectiveness of a corpus-based method combined with the extraction of extensive linguistic features via the latest NLP techniques. It provides insights that can be of use in various Arabic corpus studies and NLP tasks, such as translation for educational purposes.
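    The BERTScore metric used for evaluating the generated simplifications can be sketched as greedy cosine matching over token embeddings: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. The snippet below is an illustrative, self-contained sketch using toy 2-d vectors in place of real contextual BERT embeddings and omitting idf weighting; it is not the thesis's actual evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore(cand_vecs, ref_vecs):
    """Greedy-matching BERTScore (Zhang et al., 2020), without idf weighting.

    Precision averages, over candidate tokens, the best match against the
    reference; recall does the same in the other direction.
    """
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-d "token embeddings" standing in for contextual vectors.
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (1.0, 1.0)]
p, r, f1 = bertscore(candidate, reference)
```

    In the real metric the vectors come from a pretrained encoder (the thesis uses models such as Arabic-BERT), but the matching step is exactly this greedy max-similarity alignment.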

    A Monolingual Parallel Corpus of Arabic

    No full text

    A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking

    No full text
    Grammar error correction can be considered a “translation” problem, in which an erroneous sentence is “translated” into a correct version of the sentence in the same language. This can be accomplished by employing techniques such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Producing SMT or NMT models for the goal of grammar correction requires monolingual parallel corpora of the language in question. This data article presents a monolingual parallel corpus of Arabic text called A7׳ta. It contains 470 erroneous sentences and their 470 error-free counterparts. This is an Arabic parallel corpus that can be used as a linguistic resource for Arabic natural language processing (NLP), mainly to train sequence-to-sequence models for grammar checking. Sentences were manually collected from a book prepared as a guide for correctly writing and using Arabic grammar and other linguistic features. Although a number of Arabic corpora of errors and corrections are available [2], such as QALB [10] and the Arabic Learner Corpus [11], the data we present in this article is an effort to increase the number of freely available Arabic corpora of errors and corrections by providing a detailed error specification and leveraging the work of language experts. Keywords: Error checking, Arabic language, NLP, Parallel corpus
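    In data terms, the “grammar correction as translation” framing described above reduces to aligned source/target files: line i of the source file holds an erroneous sentence and line i of the target file its correction, the input format typical SMT/NMT-style sequence-to-sequence toolkits expect. The sketch below is purely illustrative; the sentence pairs are invented English placeholders for readability, not examples from A7׳ta.

```python
import os
import tempfile

# Hypothetical (erroneous, corrected) pairs standing in for a monolingual
# parallel corpus; A7'ta itself contains 470 such Arabic sentence pairs.
pairs = [
    ("he go to school", "he goes to school"),
    ("she have two book", "she has two books"),
]

def write_seq2seq_files(pairs, src_path, tgt_path):
    """Write aligned source/target files for seq2seq training:
    the erroneous side is the 'translation input', the corrected
    side the 'translation output'."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for erroneous, corrected in pairs:
            src.write(erroneous + "\n")
            tgt.write(corrected + "\n")

tmpdir = tempfile.mkdtemp()
src_file = os.path.join(tmpdir, "train.src")
tgt_file = os.path.join(tmpdir, "train.tgt")
write_seq2seq_files(pairs, src_file, tgt_file)
```

    Once the corpus is in this shape, any standard translation pipeline can be trained on it unchanged, which is what makes the "translation" framing of grammar correction attractive.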