
    Building the Arabic Learner Corpus and a System for Arabic Error Annotation

    Recent developments in learner corpora have highlighted their growing role in linguistic and computational research areas such as language teaching and natural language processing. However, there is no well-designed Arabic learner corpus that can be used for studies in these areas. This thesis introduces a detailed and original methodology for developing a new learner corpus. This methodology, which represents the major contribution of the thesis, combines resources, proposed standards and tools developed for the Arabic Learner Corpus project. The resources include the Arabic Learner Corpus (ALC), the largest learner corpus of Arabic based on systematic design criteria, and the Error Tagset of Arabic, designed for annotating errors in Arabic and covering 29 error types under five broad categories. The Guide on Design Criteria for Learner Corpus, an example of the proposed standards, was created on the basis of a review of previous work and focuses on 11 aspects of corpus design criteria. The tools include the Computer-aided Error Annotation Tool for Arabic, which provides functions that facilitate error annotation, such as smart selection and auto-tagging, and the ALC Search Tool, developed to enable searching the ALC and downloading the source files according to a number of criteria. The project successfully recruited 992 people, including language learners, data collectors, evaluators, annotators and collaborators, from more than 30 educational institutions in Saudi Arabia and the UK. The data of the ALC has been used in a number of projects for different purposes, including error detection and correction, native language identification, evaluation of Arabic analysers, applied linguistics studies and data-driven Arabic learning. These uses of the ALC underline the importance of continuing to develop the project.
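To make the idea of a two-level error annotation concrete, here is a minimal sketch of what an annotated ALC fragment might look like. The element and attribute names (`error`, `class`, `type`, `correction`) and the tag codes (`O`/`OH` for an orthographic hamza error, `G`/`GN` for a grammatical one) are illustrative assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated fragment: each <error> carries a broad class,
# a specific error type, and the annotator's correction.
sample = """
<text learner_level="intermediate">
  <error class="O" type="OH" correction="إلى">الى</error>
  ذهبت
  <error class="G" type="GN" correction="المدرسة">مدرسة</error>
</text>
"""

root = ET.fromstring(sample)
errors = root.findall("error")
for e in errors:
    print(e.attrib["class"], e.attrib["type"], e.text, "->", e.attrib["correction"])
```

A search tool over such annotations can then filter texts by error class or type, which is the kind of functionality the ALC Search Tool exposes.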

    Annotating an Arabic Learner Corpus for Error

    This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database (FRIDA) tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate- and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.
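The core of CEA on tagged data is comparing error-frequency distributions across groups, here by proficiency level. The sketch below shows the idea with invented placeholder tags (`ORT`, `AGR`, `LEX`), not the FRIDA-derived tagset itself.

```python
from collections import Counter

# Error tags extracted from annotated texts, grouped by proficiency level
# (toy data; a real run would read these out of the annotated corpus).
tagged = {
    "intermediate": ["ORT", "ORT", "AGR", "LEX", "ORT", "AGR"],
    "advanced": ["LEX", "AGR", "LEX"],
}

for level, tags in tagged.items():
    dist = Counter(tags)
    total = sum(dist.values())
    for tag, n in dist.most_common():
        # relative frequency lets levels of different sizes be compared
        print(f"{level}\t{tag}\t{n}\t{n / total:.2f}")
```

Normalising by the total number of errors per level is what makes the two distributions comparable despite the level subcorpora differing in size.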

    ARIDA: An Arabic Interlanguage Database and Its Applications: A Pilot Study

    This paper describes a pilot study in which we collected a small learner corpus of Arabic, developed a tagset for error annotation of Arabic learner data, tagged the data for errors, and performed simple Computer-aided Error Analysis (CEA).

    Towards error annotation in a learner corpus of Portuguese

    In this article, we present COPLE2, a new corpus of Portuguese that encompasses written and spoken data produced by foreign learners of Portuguese as a foreign or second language (FL/L2). Following the trend towards learner corpus research applied to less commonly taught languages, our aim is to enhance the learning data of Portuguese L2. These data may be useful not only for educational purposes (design of learning materials, curricula, etc.) but also for the development of NLP tools to support students in their learning process. The corpus is available online in the TEITOK environment, a web-based framework for corpus treatment that provides several built-in NLP tools and a rich set of functionalities (multiple orthographic transcription layers, lemmatization and POS, normalization of the tokens, error annotation) to automatically process and annotate texts in XML format. A CQP-based search interface allows searching the corpus by different fields, such as words, lemmas, POS tags or error tags. We describe the work in progress on the constitution and linguistic annotation of this corpus, focusing particularly on error annotation.
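Field-based search of this kind treats every token as a record of attributes and a query as a set of field constraints. The toy sketch below mimics that behaviour in plain Python; the field names (`word`, `lemma`, `pos`, `err`) are assumptions for illustration, not TEITOK's actual token model.

```python
# Each token carries the annotation layers a query can constrain.
corpus = [
    {"word": "casa", "lemma": "casa", "pos": "NOUN", "err": None},
    {"word": "bonita", "lemma": "bonito", "pos": "ADJ", "err": "AGR"},
    {"word": "fala", "lemma": "falar", "pos": "VERB", "err": None},
]

def search(tokens, **constraints):
    """Return tokens matching every given field=value constraint."""
    return [t for t in tokens if all(t.get(f) == v for f, v in constraints.items())]

# e.g. find adjectives carrying an agreement-error tag
hits = search(corpus, pos="ADJ", err="AGR")
print(hits)
```

A CQP interface generalises this to sequences of such constraints over token positions; the single-token filter above is the simplest case.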

    Orthographic Enrichment for Arabic Grammatical Analysis

    Thesis (Ph.D.), Indiana University, Linguistics, 2010.
    The Arabic orthography is problematic in two ways: (1) it lacks the short vowels, which leads to ambiguity, as the same orthographic form can be pronounced in many different ways, each with its own grammatical category, and (2) an Arabic word may contain several units, such as pronouns, conjunctions, articles and prepositions, without an intervening white space. These two problems complicate the automatic processing of Arabic. The thesis proposes a pre-processing scheme that applies word segmentation and word vocalization for the purpose of grammatical analysis: part-of-speech tagging and parsing. The thesis examines the impact of human-produced vocalization and segmentation on the grammatical analysis of Arabic, then applies a pipeline of automatic vocalization and segmentation for Arabic part-of-speech tagging. The pipeline is then used, along with the POS tags produced, for dependency parsing, which produces grammatical relations between the words in a sentence. The study uses a memory-based algorithm for vocalization, segmentation and part-of-speech tagging, and the natural language parser MaltParser for dependency parsing. The thesis represents the first approach to the processing of real-world Arabic, and finds that through the correct choice of features and algorithms, the need for pre-processing for grammatical analysis can be minimized.
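The segmentation problem the abstract describes is splitting attached clitics off an Arabic word before tagging. The sketch below does this with tiny hand-picked clitic lists and greedy stripping; it is only illustrative, since the thesis itself uses a learned, memory-based segmenter, and a naive rule like this will over-segment real words.

```python
# Tiny illustrative clitic inventories (far from complete).
PREFIXES = ["و", "ف", "ب", "ل", "ال"]   # wa-, fa-, bi-, li-, al-
SUFFIXES = ["هم", "ها", "ه", "ك"]       # -hum, -haa, -hu, -ka

def segment(word):
    """Greedily strip prefix clitics, then one suffix clitic."""
    prefixes, suffixes = [], []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            # keep at least two characters of stem to avoid emptying the word
            if word.startswith(p) and len(word) >= len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) >= len(s) + 2:
            suffixes.append(s)
            word = word[: -len(s)]
            break
    return prefixes + [word] + suffixes

# wa+kitaab+hum "and their book" -> conjunction, stem, pronoun
print(segment("وكتابهم"))
```

The hard part, which the rules above ignore, is disambiguation: a word-initial letter that looks like a clitic may simply be part of the stem, which is why the thesis learns segmentation from annotated data rather than applying fixed rules.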

    An evaluation of the Arabic error tagset v2

    A survey of the literature shows that annotating the errors of Arabic learners has not received much attention, and there is a need for a practical error tagset that can be used for Arabic learner corpora. This type of tagset serves several purposes in such corpora, e.g., Contrastive Interlanguage Analysis (CIA), learner dictionary making, Second Language Acquisition research, designing pedagogical materials, etc. This paper evaluates the second version of a two-level error tagset developed for annotating the Arabic Learner Corpus (ALC). It includes six broad classes, subdivided into more specific error types. The paper presents the tagset and an example of the annotation method used for tagging the ALC. The inter-annotator agreement using the current revised version of the error tagset was higher than with the first version (Alfaifi et al., 2013). Four factors contributed to this level of accuracy: (1) the tagset was reviewed by two experts in the Arabic language, (2) the annotators were given texts with errors already identified, so their task was to classify each error and mark it with the appropriate tag, (3) the annotators were trained during the experiment, and (4) an error-tagging manual was created explaining all error types in the tagset, with rules and examples of how to tag learners' errors. Two lists of varied sentences, 100 in each, were tagged for errors by three annotators; after tagging the first list they discussed their work, which provided them with suitable training and allowed us to distinguish the contribution of the training from the other factors.
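Inter-annotator agreement on a classification task like error tagging is commonly measured pairwise with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The paper's exact agreement measure is not stated here, so treat the following as an illustrative sketch with invented tag labels.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' tag sequences over the same items."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[t] * cb[t] for t in set(a) | set(b)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)

ann1 = ["ORT", "GRM", "ORT", "LEX", "GRM", "ORT"]
ann2 = ["ORT", "GRM", "LEX", "LEX", "GRM", "ORT"]
print(round(cohen_kappa(ann1, ann2), 3))  # → 0.75
```

Comparing kappa before and after the training round (the two 100-sentence lists) is one way to isolate the effect of training, as the paper's design allows.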

    Calculating the error percentage of an automated part-of-speech tagger when analyzing Estonian learner English: an empirical analysis

    Dividing text into parts of speech is as old as linguistics itself, but automating the process has become possible only in recent decades, thanks to the growth of computing power, and text-processing algorithms have improved year by year since then. In this master's thesis, one of the field's flagship tools is put to the test on a corpus of texts by native Estonian speakers learning English (the TCELE corpus). The corpus currently comprises about 25,000 words (127 written essays) and 11 transcribed interviews (~100 minutes). The aim is to estimate the error percentage for TCELE and similar corpora. The first part of the thesis introduces the reader to corpus compilation, annotation and retrieval, and gives an overview of part-of-speech tagging and the error percentage. It then surveys earlier studies, answering, among others, the following questions: What has been done before? What were the studies' findings? Which automatic taggers and tagsets were used?
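The evaluation the thesis performs boils down to comparing a tagger's output against a manually corrected gold standard and reporting the share of mismatched tags. A minimal sketch, with invented tokens and tags:

```python
# Gold-standard (manually corrected) and automatic (token, tag) pairs.
gold = [("I", "PRP"), ("goed", "VBD"), ("home", "NN")]
auto = [("I", "PRP"), ("goed", "NN"), ("home", "NN")]

# Count positions where the automatic tag disagrees with the gold tag.
errors = sum(g != a for (_, g), (_, a) in zip(gold, auto))
error_pct = 100 * errors / len(gold)
print(f"error percentage: {error_pct:.1f}%")
```

On learner text, many "errors" of this kind stem from non-standard forms (like the invented "goed" above), which is precisely why taggers trained on native text need separate evaluation on learner corpora.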

    Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

    This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have grown in interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publicly available. Furthermore, before a part-of-speech tagging system can be developed, a suitable tagset is required for the language. This thesis makes the following contributions to bridge this gap. Firstly, we propose a method for diacritic restoration based on naive Bayes classifiers that act at word level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed. Secondly, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system for Māori: a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and is the result of an in-depth analysis of Māori grammar.
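Word-level diacritic restoration treats each undiacritized form as a classification problem over its attested diacritically marked variants. The sketch below is only the unigram baseline (pick the most frequent marked variant seen in training text); the thesis's naive Bayes classifier additionally weighs a rich automatically extracted feature set, which this sketch omits.

```python
from collections import Counter, defaultdict

# Toy "diacritically marked training text" for Māori vowel macrons.
train = "Māori māori Māori kākā kaka Māori".split()

def strip_diacritics(w):
    """Remove macrons, mapping a marked form to its undiacritized key."""
    return w.translate(str.maketrans("āēīōūĀĒĪŌŪ", "aeiouAEIOU"))

# Count marked variants per undiacritized form.
variants = defaultdict(Counter)
for w in train:
    variants[strip_diacritics(w)][w] += 1

def restore(word):
    """Return the most frequent marked variant, or the word unchanged."""
    seen = variants.get(strip_diacritics(word))
    return seen.most_common(1)[0][0] if seen else word

print(restore("Maori"))  # most frequent marked variant in the toy data
```

Such a baseline already resolves unambiguous forms; the classifier's context features matter for the residue of forms whose correct marking depends on the surrounding words.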