217 research outputs found
Recommended from our members
Automatic Identification of Errors in Arabic Handwriting Recognition
Arabic handwriting recognition (HR) is a challenging problem due to Arabic's connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. We also consider a learning curve varying in two dimensions: number of segments and number of n-best hypotheses to train on. We additionally evaluate the performance on different test sets with different degrees of errors in them. Our best approach achieves a roughly ~20% absolute increase in F-score over a simple but reasonable baseline. A detailed error analysis shows that linguistic features, such as lemma models, help improve HR-error detection precisely where we expect them to: semantically inconsistent error words
Recommended from our members
CATiB: The Columbia Arabic Treebank
The Columbia Arabic Treebank (CATiB) is a resource for Arabic parsing. CATiB contrasts with previous efforts on Arabic treebanking and treebanking of morphologically rich languages in that it encodes less linguistic information in the interest of speedier annotation of large amounts of text. This paper describes CATiB's representation and annotation procedure, and reports on achieved inter-annotator agreement and annotation speed
Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation
We present four techniques for online handling of Out-of-Vocabulary words in Phrasebased Statistical Machine Translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis
LDC Arabic Treebanks and Associated Corpora: Data Divisions Manual
The Linguistic Data Consortium (LDC) has developed hundreds of data corpora
for natural language processing (NLP) research. Among these are a number of
annotated treebank corpora for Arabic. Typically, these corpora consist of a
single collection of annotated documents. NLP research, however, usually
requires multiple data sets for the purposes of training models, developing
techniques, and final evaluation. Therefore it becomes necessary to divide the
corpora used into the required data sets (divisions). This document details a
set of rules that have been defined to enable consistent divisions for old and
new Arabic treebanks (ATB) and related corpora.Comment: 14 pages; one cove
Arabic Preprocessing Schemes for Statistical Machine Translation
In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data
Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic.
Abstract Modern Standard Arabic (MSA) has a wealth of natural language processing (NLP) tools and resources. In comparison, resources for dialectal Arabic (DA), the unstandardized spoken varieties of Arabic, are still lacking. We present ELISSA, a machine translation (MT) system for DA to MSA. ELISSA employs a rule-based approach that relies on morphological analysis, transfer rules and dictionaries in addition to language models to produce MSA paraphrases of DA sentences. ELISSA can be employed as a general preprocessor for DA when using MSA NLP tools. A manual error analysis of ELISSA's output shows that it produces correct MSA translations over 93% of the time. Using ELISSA to produce MSA versions of DA sentences as part of an MSA-pivoting DA-to-English MT solution, improves BLEU scores on multiple blind test sets between 0.6% and 1.4%
- …