
    Building the Arabic Learner Corpus and a System for Arabic Error Annotation

    Recent developments in learner corpora have highlighted their growing role in linguistic and computational research areas such as language teaching and natural language processing. However, there is no well-designed Arabic learner corpus that can be used for studies in these areas. This thesis introduces a detailed and original methodology for developing a new learner corpus. This methodology, which represents the major contribution of the thesis, combines resources, proposed standards and tools developed for the Arabic Learner Corpus project. The resources include the Arabic Learner Corpus (ALC), the largest learner corpus of Arabic based on systematic design criteria, and the Error Tagset of Arabic, designed for annotating errors in Arabic and covering 29 error types under five broad categories. The Guide on Design Criteria for Learner Corpus, an example of the proposed standards, was created on the basis of a review of previous work and focuses on 11 aspects of corpus design criteria. The tools include the Computer-aided Error Annotation Tool for Arabic, which provides functions that facilitate error annotation, such as smart selection and auto-tagging, and the ALC Search Tool, developed to enable searching the ALC and downloading the source files according to a number of criteria. The project successfully recruited 992 people, including language learners, data collectors, evaluators, annotators and collaborators, from more than 30 educational institutions in Saudi Arabia and the UK. The data of the ALC has been used in a number of projects for different purposes, including error detection and correction, native language identification, evaluation of Arabic analysers, applied linguistics studies and data-driven Arabic learning. These uses of the ALC underline the importance of continuing to develop the project.
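To make the idea of a two-level error annotation concrete, here is a minimal sketch of what an annotated ALC fragment might look like. The element and attribute names (`error`, `class`, `type`, `correction`) and the tag codes (`O`/`OH` for an orthographic hamza error, `G`/`GN` for a grammatical one) are illustrative assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated fragment: each <error> carries a broad class,
# a specific error type, and the annotator's correction.
sample = """
<text learner_level="intermediate">
  <error class="O" type="OH" correction="إلى">الى</error>
  ذهبت
  <error class="G" type="GN" correction="المدرسة">مدرسة</error>
</text>
"""

root = ET.fromstring(sample)
errors = root.findall("error")
for e in errors:
    print(e.attrib["class"], e.attrib["type"], e.text, "->", e.attrib["correction"])
```

A search tool over such annotations can then filter texts by error class or type, which is the kind of functionality the ALC Search Tool exposes.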

    Annotating an Arabic Learner Corpus for Error

    This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database (FRIDA) tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate- and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.
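The core of CEA on tagged data is comparing error-frequency distributions across groups, here by proficiency level. The sketch below shows the idea with invented placeholder tags (`ORT`, `AGR`, `LEX`), not the FRIDA-derived tagset itself.

```python
from collections import Counter

# Error tags extracted from annotated texts, grouped by proficiency level
# (toy data; a real run would read these out of the annotated corpus).
tagged = {
    "intermediate": ["ORT", "ORT", "AGR", "LEX", "ORT", "AGR"],
    "advanced": ["LEX", "AGR", "LEX"],
}

for level, tags in tagged.items():
    dist = Counter(tags)
    total = sum(dist.values())
    for tag, n in dist.most_common():
        # relative frequency lets levels of different sizes be compared
        print(f"{level}\t{tag}\t{n}\t{n / total:.2f}")
```

Normalising by the total number of errors per level is what makes the two distributions comparable despite the level subcorpora differing in size.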

    ARIDA: An Arabic Interlanguage Database and Its Applications: A Pilot Study

    This paper describes a pilot study in which we collected a small learner corpus of Arabic, developed a tagset for error annotation of Arabic learner data, tagged the data for errors, and performed simple Computer-aided Error Analysis (CEA).

    Towards error annotation in a learner corpus of Portuguese

    In this article, we present COPLE2, a new corpus of Portuguese that encompasses written and spoken data produced by foreign learners of Portuguese as a foreign or second language (FL/L2). Following the trend towards learner corpus research applied to less commonly taught languages, our aim is to enhance the learning data of Portuguese L2. These data may be useful not only for educational purposes (design of learning materials, curricula, etc.) but also for the development of NLP tools to support students in their learning process. The corpus is available online in the TEITOK environment, a web-based framework for corpus treatment that provides several built-in NLP tools and a rich set of functionalities (multiple orthographic transcription layers, lemmatization and POS, normalization of the tokens, error annotation) to automatically process and annotate texts in XML format. A CQP-based search interface allows searching the corpus by different fields, such as words, lemmas, POS tags or error tags. We describe the work in progress on the constitution and linguistic annotation of this corpus, focusing particularly on error annotation.
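Field-based search of this kind treats every token as a record of attributes and a query as a set of field constraints. The toy sketch below mimics that behaviour in plain Python; the field names (`word`, `lemma`, `pos`, `err`) are assumptions for illustration, not TEITOK's actual token model.

```python
# Each token carries the annotation layers a query can constrain.
corpus = [
    {"word": "casa", "lemma": "casa", "pos": "NOUN", "err": None},
    {"word": "bonita", "lemma": "bonito", "pos": "ADJ", "err": "AGR"},
    {"word": "fala", "lemma": "falar", "pos": "VERB", "err": None},
]

def search(tokens, **constraints):
    """Return tokens matching every given field=value constraint."""
    return [t for t in tokens if all(t.get(f) == v for f, v in constraints.items())]

# e.g. find adjectives carrying an agreement-error tag
hits = search(corpus, pos="ADJ", err="AGR")
print(hits)
```

A CQP interface generalises this to sequences of such constraints over token positions; the single-token filter above is the simplest case.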

    Orthographic Enrichment for Arabic Grammatical Analysis

    Thesis (Ph.D.), Indiana University, Linguistics, 2010.
    The Arabic orthography is problematic in two ways: (1) it lacks the short vowels, which leads to ambiguity, as the same orthographic form can be pronounced in many different ways, each with its own grammatical category, and (2) an Arabic word may contain several units, such as pronouns, conjunctions, articles and prepositions, without an intervening white space. These two problems complicate the automatic processing of Arabic. The thesis proposes a pre-processing scheme that applies word segmentation and word vocalization for the purpose of grammatical analysis: part-of-speech tagging and parsing. The thesis examines the impact of human-produced vocalization and segmentation on the grammatical analysis of Arabic, then applies a pipeline of automatic vocalization and segmentation for Arabic part-of-speech tagging. The pipeline is then used, along with the POS tags produced, for dependency parsing, which produces grammatical relations between the words in a sentence. The study uses a memory-based algorithm for vocalization, segmentation and part-of-speech tagging, and the natural language parser MaltParser for dependency parsing. The thesis represents the first approach to the processing of real-world Arabic, and finds that through the correct choice of features and algorithms, the need for pre-processing for grammatical analysis can be minimized.
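The segmentation problem the abstract describes is splitting attached clitics off an Arabic word before tagging. The sketch below does this with tiny hand-picked clitic lists and greedy stripping; it is only illustrative, since the thesis itself uses a learned, memory-based segmenter, and a naive rule like this will over-segment real words.

```python
# Tiny illustrative clitic inventories (far from complete).
PREFIXES = ["و", "ف", "ب", "ل", "ال"]   # wa-, fa-, bi-, li-, al-
SUFFIXES = ["هم", "ها", "ه", "ك"]       # -hum, -haa, -hu, -ka

def segment(word):
    """Greedily strip prefix clitics, then one suffix clitic."""
    prefixes, suffixes = [], []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            # keep at least two characters of stem to avoid emptying the word
            if word.startswith(p) and len(word) >= len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) >= len(s) + 2:
            suffixes.append(s)
            word = word[: -len(s)]
            break
    return prefixes + [word] + suffixes

# wa+kitaab+hum "and their book" -> conjunction, stem, pronoun
print(segment("وكتابهم"))
```

The hard part, which the rules above ignore, is disambiguation: a word-initial letter that looks like a clitic may simply be part of the stem, which is why the thesis learns segmentation from annotated data rather than applying fixed rules.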

    An evaluation of the Arabic error tagset v2

    A survey of the literature shows that annotating the errors of Arabic learners has not received much attention, and there is a need for a practical error tagset that can be used for Arabic learner corpora. This type of tagset serves several purposes in such corpora, e.g., Contrastive Interlanguage Analysis (CIA), learner dictionary making, Second Language Acquisition research, designing pedagogical materials, etc. This paper evaluates the second version of a two-level error tagset developed for annotating the Arabic Learner Corpus (ALC). It includes six broad classes, subdivided into more specific error types. The paper presents the tagset and an example of the annotation method used for tagging the ALC. The inter-annotator agreement using the current revised version of the error tagset was higher than with the first version (Alfaifi et al., 2013). Four factors contributed to this level of accuracy: (1) the tagset was reviewed by two experts in the Arabic language, (2) the annotators were given texts with errors already identified, so their task was to classify each error and mark it with the appropriate tag, (3) the annotators were trained during the experiment, and (4) an error-tagging manual was created explaining all error types in the tagset, with rules and examples of how to tag learners' errors. Two lists of varied sentences, 100 in each, were tagged for errors by three annotators; after tagging the first list they discussed their work, which provided them with suitable training and allowed us to distinguish the contribution of the training from the other factors.
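Inter-annotator agreement on a classification task like error tagging is commonly measured pairwise with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The paper's exact agreement measure is not stated here, so treat the following as an illustrative sketch with invented tag labels.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' tag sequences over the same items."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[t] * cb[t] for t in set(a) | set(b)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)

ann1 = ["ORT", "GRM", "ORT", "LEX", "GRM", "ORT"]
ann2 = ["ORT", "GRM", "LEX", "LEX", "GRM", "ORT"]
print(round(cohen_kappa(ann1, ann2), 3))  # → 0.75
```

Comparing kappa before and after the training round (the two 100-sentence lists) is one way to isolate the effect of training, as the paper's design allows.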

    Calculating the error percentage of an automated part-of-speech tagger when analyzing Estonian learner English: an empirical analysis

    Dividing text into parts of speech is as old as linguistics itself, but automating the process has become possible only in recent decades, thanks to the growth of computing power, and text-processing algorithms have improved year by year since then. In this master's thesis, one of the field's flagship tools is put to the test on a corpus of texts by native Estonian speakers learning English (the TCELE corpus). The corpus currently comprises about 25,000 words (127 written essays) and 11 transcribed interviews (~100 minutes). The aim is to estimate the error percentage for TCELE and similar corpora. The first part of the thesis introduces the reader to corpus compilation, annotation and retrieval, and gives an overview of part-of-speech tagging and the error percentage. It then surveys earlier studies, answering, among others, the following questions: What has been done before? What were the studies' findings? Which automatic taggers and tagsets were used?
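The evaluation the thesis performs boils down to comparing a tagger's output against a manually corrected gold standard and reporting the share of mismatched tags. A minimal sketch, with invented tokens and tags:

```python
# Gold-standard (manually corrected) and automatic (token, tag) pairs.
gold = [("I", "PRP"), ("goed", "VBD"), ("home", "NN")]
auto = [("I", "PRP"), ("goed", "NN"), ("home", "NN")]

# Count positions where the automatic tag disagrees with the gold tag.
errors = sum(g != a for (_, g), (_, a) in zip(gold, auto))
error_pct = 100 * errors / len(gold)
print(f"error percentage: {error_pct:.1f}%")
```

On learner text, many "errors" of this kind stem from non-standard forms (like the invented "goed" above), which is precisely why taggers trained on native text need separate evaluation on learner corpora.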

    Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

    This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have grown in interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publicly available. Furthermore, before a part-of-speech tagging system can be developed, a suitable tagset is required for the language. This thesis makes the following contributions to bridge this gap. Firstly, we propose a method for diacritic restoration based on naive Bayes classifiers that act at word level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed. Secondly, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system for Māori: a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and is the result of an in-depth analysis of Māori grammar.
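Word-level diacritic restoration treats each undiacritized form as a classification problem over its attested diacritically marked variants. The sketch below is only the unigram baseline (pick the most frequent marked variant seen in training text); the thesis's naive Bayes classifier additionally weighs a rich automatically extracted feature set, which this sketch omits.

```python
from collections import Counter, defaultdict

# Toy "diacritically marked training text" for Māori vowel macrons.
train = "Māori māori Māori kākā kaka Māori".split()

def strip_diacritics(w):
    """Remove macrons, mapping a marked form to its undiacritized key."""
    return w.translate(str.maketrans("āēīōūĀĒĪŌŪ", "aeiouAEIOU"))

# Count marked variants per undiacritized form.
variants = defaultdict(Counter)
for w in train:
    variants[strip_diacritics(w)][w] += 1

def restore(word):
    """Return the most frequent marked variant, or the word unchanged."""
    seen = variants.get(strip_diacritics(word))
    return seen.most_common(1)[0][0] if seen else word

print(restore("Maori"))  # most frequent marked variant in the toy data
```

Such a baseline already resolves unambiguous forms; the classifier's context features matter for the residue of forms whose correct marking depends on the surrounding words.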