Survey of Arabic Checker Techniques
Spell checking has become increasingly important with the spread of technology, the growing use of the Internet and of local dialects, and limited awareness of the rules of the language. This is especially true for Arabic, which has many complexities and specificities that distinguish it from other languages. This paper explains these specificities, reviews existing work grouped by the category of technique used, and examines those techniques. It also gives directions for future work
From Arabic user-generated content to machine translation: integrating automatic error correction
With the widespread adoption of social media and online forums,
individual users have been able to actively participate in the generation
of online content in different languages and dialects. Arabic is one of the
fastest growing languages used on the Internet, but dialects (like Egyptian
and Saudi Arabian) account for a large share of Arabic online content. There
are many differences between Dialectal Arabic and Modern Standard
Arabic, which pose many challenges for Machine Translation of informal
Arabic. In this paper, we investigate the use of an Automatic Error Correction method to improve the quality of Arabic User-Generated
texts and their automatic translation. Our experiments show that the new
system with the automatic correction module outperforms the baseline system, with a relative improvement of nearly 22.59%
Automatic Correction of Arabic Dyslexic Text
This paper proposes an automatic correction system that detects and corrects dyslexic errors in Arabic text. The system uses a language model based on the Prediction by Partial Matching (PPM) text compression scheme that generates possible alternatives for each misspelled word. Furthermore, the generated candidate list is based on edit operations (insertion, deletion, substitution and transposition), and the correct alternative for each misspelled word is chosen on the basis of the compression codelength of the trigram. The system is compared with widely-used Arabic word processing software and the Farasa tool. The system provided good results compared with the other tools, with a recall of 43%, precision 89%, F1 58% and accuracy 81%
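The candidate-generation and selection steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' system: an add-one smoothed word trigram stands in for the PPM compression model, the toy English corpus replaces Arabic data, and the names `edit_candidates`, `TrigramModel`, and `correct` are hypothetical.

```python
import math
from collections import Counter


def edit_candidates(word, alphabet):
    """Return all strings exactly one edit operation away from `word`
    (deletion, transposition, substitution, insertion)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


class TrigramModel:
    """Add-one smoothed word trigram model (an assumed stand-in for PPM)."""

    def __init__(self, sentences, vocab):
        self.tri = Counter()
        self.bi = Counter()
        self.vocab_size = len(vocab)
        for sentence in sentences:
            toks = ["<s>", "<s>"] + sentence.split()
            for i in range(2, len(toks)):
                self.tri[tuple(toks[i - 2:i + 1])] += 1
                self.bi[tuple(toks[i - 2:i])] += 1

    def codelength(self, w1, w2, w3):
        """Bits needed to encode w3 after the context (w1, w2)."""
        p = (self.tri[(w1, w2, w3)] + 1) / (self.bi[(w1, w2)] + self.vocab_size)
        return -math.log2(p)


def correct(prev2, prev1, word, model, vocab, alphabet):
    """Keep in-vocabulary words; otherwise pick the one-edit candidate
    with the shortest trigram codelength (ties broken alphabetically)."""
    if word in vocab:
        return word
    candidates = edit_candidates(word, alphabet) & vocab
    if not candidates:
        return word
    return min(candidates, key=lambda c: (model.codelength(prev2, prev1, c), c))


# Toy data, for illustration only.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
vocab = {w for s in corpus for w in s.split()}
alphabet = "abcdefghijklmnopqrstuvwxyz"
model = TrigramModel(corpus, vocab)

fixed_start = correct("<s>", "the", "fat", model, vocab, alphabet)
fixed_after_on = correct("on", "the", "fat", model, vocab, alphabet)
```

In this sketch the same misspelling is resolved differently depending on the two preceding words, which is the role the trigram codelength plays in the selection step. The reported F1 is also consistent with the stated precision and recall: 2 × 0.89 × 0.43 / (0.89 + 0.43) ≈ 0.58.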
Grammatical Error Correction: A Survey of the State of the Art
Grammatical Error Correction (GEC) is the task of automatically detecting and
correcting errors in text. The task not only includes the correction of
grammatical errors, such as missing prepositions and mismatched subject-verb
agreement, but also orthographic and semantic errors, such as misspellings and
word choice errors respectively. The field has seen significant progress in the
last decade, motivated in part by a series of five shared tasks, which drove
the development of rule-based methods, statistical classifiers, statistical
machine translation, and finally neural machine translation systems which
represent the current dominant state of the art. In this survey paper, we
condense the field into a single article and first outline some of the
linguistic challenges of the task, introduce the most popular datasets that are
available to researchers (for both English and other languages), and summarise
the various methods and techniques that have been developed with a particular
focus on artificial error generation. We next describe the many different
approaches to evaluation as well as concerns surrounding metric reliability,
especially in relation to subjective human judgements, before concluding with
an overview of recent progress and suggestions for future work and remaining
challenges. We hope that this survey will serve as a comprehensive resource for
researchers who are new to the field or who want to be kept apprised of recent
developments
Dolphin: A Challenging and Diverse Benchmark for Arabic NLG
We present Dolphin, a novel benchmark that addresses the need for an
evaluation framework for the wide collection of Arabic languages and varieties.
The proposed benchmark encompasses a broad range of 13 different NLG tasks,
including text summarization, machine translation, question answering, and
dialogue generation, among others. Dolphin comprises a substantial corpus of 40
diverse and representative public datasets across 50 test splits, carefully
curated to reflect real-world scenarios and the linguistic richness of Arabic.
It sets a new standard for evaluating the performance and generalization
capabilities of Arabic and multilingual models, promising to enable researchers
to push the boundaries of current methodologies. We provide an extensive
analysis of Dolphin, highlighting its diversity and identifying gaps in current
Arabic NLG research. We also evaluate several Arabic and multilingual models on
our benchmark, allowing us to set strong baselines against which researchers
can compare
Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation
Understanding Arabic text and generating human-like responses is a
challenging endeavor. While many researchers have proposed models and solutions
for individual problems, there is an acute shortage of a comprehensive Arabic
natural language generation toolkit that is capable of handling a wide range of
tasks. In this work, we present a novel Arabic text-to-text Transformer model,
namely AraT5v2. Our new model is methodically trained on extensive and diverse
data, utilizing an extended sequence length of 2,048 tokens. We explore various
pretraining strategies including unsupervised, supervised, and joint
pretraining, under both single and multitask settings. Our models outperform
competitive baselines by large margins. We take our work one step further by
developing and publicly releasing Octopus, a Python-based package and
command-line toolkit tailored for eight Arabic generation tasks all exploiting
a single model. We release the models and the toolkit on our public repository
- …