
    Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

    Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English and Chinese. We attribute this to the lack of a carefully designed evaluation benchmark for Korean. In this work, we first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide range of error types and annotate them using our newly proposed tool, the Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is a carefully designed edit alignment and classification tool that takes the nature of Korean into account when generating an alignment between a source sentence and a target sentence, and identifies the error type of each aligned edit. We also present baseline models fine-tuned on our datasets. We show that a model trained with our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.
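    The minimal Python sketch below illustrates the general idea behind alignment-based GEC annotation of the kind KAGAS performs: align a source and a target sentence, then collect each differing span as an edit to be classified. It is only an illustration of the technique, not the tool itself; the real system is morpheme-aware and uses Korean-specific error categories, and the whitespace tokenisation and example pair here are assumptions.

```python
# Sketch of token-level edit alignment between a source (erroneous) and a
# target (corrected) sentence, in the spirit of alignment-based GEC
# annotation tools such as KAGAS. Illustrative only: the real tool aligns
# at the morpheme level and assigns Korean-specific error categories.
from difflib import SequenceMatcher

def extract_edits(source_tokens, target_tokens):
    """Return (operation, source_span, target_span) tuples for each edit."""
    matcher = SequenceMatcher(a=source_tokens, b=target_tokens)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # identical spans carry no edit
        edits.append((op, source_tokens[i1:i2], target_tokens[j1:j2]))
    return edits

# Hypothetical learner/corrected pair, whitespace-tokenised for simplicity.
src = "나 는 학교 에 갔 어 요".split()
tgt = "나 는 학교 에 갔 어요".split()
for op, s_span, t_span in extract_edits(src, tgt):
    print(op, s_span, "->", t_span)
```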

    Errors in inflectional morphemes as an index of linguistic competence of Korean Heritage language learners and American learners of Korean

    This study examined the linguistic competence in Korean of Korean heritage language learners (HLLs), compared to that of English-speaking non-heritage language learners (NHLLs) of Korean. It remains unclear and controversial whether heritage languages, to which learners are exposed early but whose acquisition is interrupted, manifest as L1 competence or share more characteristics with L2/FL development. A common misconception, however, is that HLLs outperform NHLLs in overall language skills, even though Korean HLLs in Korean as a Foreign Language (KFL) classes do not make better progress than NHLLs despite their comparatively stronger aural interpretive abilities. This study was designed to investigate whether HLLs have an advantage over NHLLs in learning distinctive parametric values in the Korean language, by comparing the occurrences and sources of grammatical errors exhibited by the two groups in university-level KFL classes. The study addresses Korean inflectional morphemes, with a focus on case and postposition markers and affixal connectives. Data were collected through error analysis (EA) of inflectional morpheme errors and their sources in semi-guided and self-generated writing samples, and through a grammaticality judgment in word completion (GJWC) test using the same inflectional morphemes as the EA. Schlyter's Weak Language (WL) as L2, Montrul's WL as L1, and the Missing Surface Inflection Hypothesis (MSIH) provided the theoretical frameworks. The EA data were coded using the Systematic Analysis of Language Transcripts program. The EA and GJWC data were analyzed using a 2-way ANOVA and, when there was a significant interaction effect between heritage status and language proficiency level, a 1-way ANOVA. The results confirmed Schlyter's hypothesis, but did not support Montrul's hypothesis in either the EA or the GJWC. The MSIH failed to explain the underlying linguistic competence of HLLs. Significantly higher error rates caused by omitting necessary subject and object markers among HLLs imply that their Korean morphology remains at the level of a Korean child's morphology. Significantly higher error rates for the instrument marker in the GJWC test among advanced-level HLLs imply impaired Korean morphology in HLLs. Linguistic variation is more prominent within the HLL group. Findings are further discussed in relation to their theoretical, methodological, and pedagogical implications. Differentiated instructional and curricular approaches for HLL and NHLL groups are suggested.
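    As a purely illustrative sketch of the kind of 2-way ANOVA the study reports, the snippet below models an invented per-learner error rate by heritage status and proficiency level and inspects the interaction term. The data, column names, and factor levels are assumptions made up for the example, not the study's data.

```python
# Illustrative 2-way ANOVA with invented data: error rate by heritage
# status and proficiency level, with the interaction row guiding whether
# follow-up 1-way ANOVAs would be warranted.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-learner error rates; the real data come from coded
# writing samples and a grammaticality-judgment test.
data = pd.DataFrame({
    "error_rate":  [0.12, 0.15, 0.09, 0.22, 0.18, 0.05, 0.07, 0.03, 0.11, 0.02,
                    0.14, 0.16, 0.10, 0.20, 0.17, 0.06, 0.08, 0.04, 0.12, 0.03],
    "heritage":    ["HLL"] * 5 + ["NHLL"] * 5 + ["HLL"] * 5 + ["NHLL"] * 5,
    "proficiency": ["intermediate"] * 10 + ["advanced"] * 10,
})

model = smf.ols("error_rate ~ C(heritage) * C(proficiency)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))  # interaction term appears as its own row
```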

    Problems in Evaluating Grammatical Error Detection Systems

    Many evaluation issues for grammatical error detection have previously been overlooked, making it hard to draw meaningful comparisons between different approaches, even when they are evaluated on the same corpus. To begin with, the three-way contingency between a writer's sentence, the annotator's correction, and the system's output makes evaluation more complex than in some other NLP tasks; we address this by presenting an intuitive evaluation scheme. Of particular importance to error detection is the skew of the data (the low frequency of errors compared to non-errors), which distorts some traditional measures of performance and limits their usefulness, leading us to recommend the reporting of raw measurements (true positives, false negatives, false positives, true negatives). Other issues that are particularly vexing for error detection concern the definition of these raw measurements: specifying the size or scope of an error, properly treating errors as graded rather than discrete phenomena, and counting non-errors. We discuss recommendations for best practices in reporting the results of system evaluation for these cases, recommendations which depend upon making clear one's assumptions and applications for error detection. By highlighting the problems with current error detection evaluation, the field will be better able to move forward.
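    A small sketch of the paper's central recommendation: report the raw confusion counts rather than only skew-sensitive summary scores. The data below are invented solely to show how a do-nothing detector still reaches high accuracy under heavy class skew, which is the distortion the authors warn about.

```python
# Report raw counts (TP, FP, FN, TN) alongside derived scores, since heavy
# class skew (few errors, many non-errors) makes single summary measures
# misleading. All numbers below are invented for illustration.
def confusion_counts(gold, predicted):
    """gold/predicted are parallel lists of booleans: True = error flagged."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    tn = sum((not g) and (not p) for g, p in zip(gold, predicted))
    return tp, fp, fn, tn

# A system that flags nothing still scores 99% accuracy when only 1% of
# instances are errors, which is why accuracy alone hides detection failures.
gold = [True] * 10 + [False] * 990
silent_system = [False] * 1000
tp, fp, fn, tn = confusion_counts(gold, silent_system)
accuracy = (tp + tn) / len(gold)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy:.3f}")
```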

    The written production of Ecuadorian EFL high school students: grammatical transfer errors and teachers' and learners' perceptions of feedback

    The aim of this doctoral thesis is to investigate grammatical transfer errors in the English-as-a-foreign-language writing of Ecuadorian secondary school students (n=180) and their prevalence relative to lexical transfer errors. It also compares how grammatical transfer errors vary across three groups of students classified by English proficiency level according to the Common European Framework (A1, A2, B1), and how these errors vary between two essay types: narrative and argumentative. Finally, it examines students' and teachers' perceptions of the feedback on English-as-a-foreign-language writing provided in class. All of this is intended to contribute to part of the goals of the Ecuadorian Ministry of Education concerning the improvement of English proficiency among secondary school students.

    Corrective Feedback in the EFL Classroom: Grammar Checker vs. Teacher’s Feedback.

    The aim of this doctoral thesis is to compare the feedback provided by the teacher with that obtained from the software called Grammar Checker on grammatical errors in the written production of English as a foreign language students. Traditionally, feedback has been considered one of the three theoretical conditions for language learning (along with input and output) and, for this reason, extensive research has been carried out on who should provide it, when, and with what level of explicitness. However, there are far fewer studies that analyse the use of e-feedback programs as a complement or alternative to feedback offered by the teacher. Participants in our study were divided into two experimental groups and one control group, and three grammatical aspects that are usually error-prone for English students at B2 level were examined: prepositions, articles, and the dichotomy between the simple past and the present/past perfect. All participants had to write four essays. The first experimental group received feedback from the teacher and the second received it through the Grammar Checker program. The control group did not get feedback on the grammatical aspects under analysis but on other linguistic forms not studied. The results obtained point, first of all, to the fact that the software did not mark grammatical errors in some cases. This means that students were unable to improve their written output in terms of linguistic accuracy after receiving feedback from the program. In contrast, students who received feedback from the teacher did improve, although the difference was not significant. Second, the two experimental groups outperformed the control group in the use of the grammatical forms under analysis. Third, regardless of the feedback offered, the two groups showed long-term improvement in the use of the grammatical aspects, and finally, no differences in attitude towards the feedback received or its impact on the results were found in either of the experimental groups. Our results open up new lines for investigating corrective feedback in the English as a foreign language classroom. More studies are needed that, on the one hand, contribute to improving electronic feedback programs by making them more accurate and effective in detecting errors. On the other hand, software such as Grammar Checker can complement the daily practice of the foreign language teacher, helping in the first instance to correct common and recurring mistakes, all the more so given that our research has shown that attitudes towards this type of electronic feedback are positive and that it is not felt to be an intrusion into the classroom, thus helping in the acquisition of the English language.

    AN INVESTIGATION OF STUDENTS' EXPERIENCES WITH A WEB-BASED, DATA-DRIVEN WRITING ASSISTANCE ENVIRONMENT FOR IMPROVING KOREAN EFL WRITERS' ACCURACY WITH ENGLISH GRAMMAR AND VOCABULARY

    Computer-assisted language learning (CALL) has played an increasingly important role in writing instruction and research. While research has been conducted on English as a second language (ESL) learners and the benefits of using web-based writing assistance programs in writing instruction, insufficient research has been done on English as a foreign language (EFL) students. This study is an empirical investigation of students' experiences with a web-based, data-driven writing assistance environment (e4writing) designed by the researcher to help Korean EFL writers with their grammar and vocabulary. The study investigated Korean university students' perceived difficulties with English grammar and vocabulary as they wrote in English. It also explored their perceptions of e4writing as used in a writing course to enhance English grammar and vocabulary. The study investigated 12 participants' perceptions and "academic profiles" (learning styles, confidence, motivation, and other factors) while they were enrolled in a 16-week course called Teaching Methods for English Composition. To gain a more specific and personal view, the study also included detailed case studies of four of the participants. The major sources of data for the analyses include interviews, reflective journals, questionnaires, samples of the students' writing before and after their use of e4writing, and the researcher's reflective notes. The study revealed that most of the students had difficulty with grammar and vocabulary in English writing. They perceived e4writing positively, as it provided individualized help with their problems with grammar and lexis. Overall, the students showed improvement in accuracy from the pretest to the posttest, and observations suggested that e4writing was probably related to this improvement; however, strong claims about e4writing as a cause of improvement cannot be made without a control group. The students felt e4writing was more beneficial for improving grammatical accuracy than for vocabulary accuracy. The students recommended that some features of e4writing be written in Korean to help students understand the grammar and vocabulary explanations.

    Detecting grammatical errors with treebank-induced, probabilistic parsers

    Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem unsuitable for grammar checking, as they massively over-generate and, due to their high robustness, fail to reject ungrammatical input. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches: one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
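    The second approach can be sketched roughly as follows: estimate what parse probability a grammatical sentence of a given length would receive, and flag the input when its actual parse probability falls short of that estimate by more than a margin. The estimator, the margin, and the numbers below are placeholders for illustration, not the thesis's fitted model or parser.

```python
# Rough sketch of thresholding on the gap between an estimated and an
# actual parse probability (all values in log space). The linear estimator
# and the margin are toy stand-ins for a model fitted on parsed
# grammatical training data.
def expected_log_prob(sentence_length, slope=-2.5, intercept=-1.0):
    """Toy estimate of the log parse probability a grammatical sentence
    of this length would receive."""
    return intercept + slope * sentence_length

def is_ungrammatical(actual_log_prob, sentence_length, margin=5.0):
    """Flag the sentence when its actual parse probability falls short of
    the estimate by more than the margin."""
    return actual_log_prob < expected_log_prob(sentence_length) - margin

# Hypothetical parser outputs for two 10-token sentences.
print(is_ungrammatical(actual_log_prob=-38.0, sentence_length=10))  # True
print(is_ungrammatical(actual_log_prob=-27.0, sentence_length=10))  # False
```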

    English speakers' common orthographic errors in Arabic as L2 writing system: an analytical case study

    Research involving the Arabic Writing System (WS) is quite limited, and research on the writing errors of L2WS Arabic against a particular L1WS seems to be relatively neglected. This study attempts to identify, describe, and explain common orthographic errors in Arabic writing amongst English-speaking learners. First, it outlines the Arabic Writing System's (AWS) characteristics and the available empirical studies of L2WS Arabic. The study embraced the Error Analysis approach, utilising a mixed-method design that deployed quantitative and qualitative tools (writing tests, a questionnaire, and interviews). The data were collected from several institutions around the UK, which collectively accounted for 82 questionnaire responses, 120 different writing samples from 44 intermediate learners, and six teacher interviews. The hypotheses for this research were: a) English-speaking learners of Arabic make common orthographic errors similar to those of Arabic native speakers; b) English-speaking learners share several common orthographic errors with other learners of Arabic as a second/foreign language (AFL); and c) English-speaking learners of Arabic produce their own common orthographic errors which are specifically related to the differences between the two WSs. The results confirmed all three hypotheses. Specifically, English-speaking learners of L2WS Arabic commonly made six error types: letter ductus (letter shape), orthography (spelling), phonology, letter dots, allographemes (i.e. letterform), and direction. Gemination and L1WS transfer error rates were not found to be major. Another important result showed that five letter groups, in addition to two letters, are particularly challenging for English-speaking learners. The results indicated that the error causes were likely to stem from one of four factors: script confusion, orthographic difficulties, phonological realisation, and teaching/learning strategies. These results are generalizable, as the data were collected from several institutions in different parts of the UK. Suggestions and implications, as well as recommendations for further research, are outlined in the conclusion chapter.

    VALICO-UD: annotating an Italian learner corpus

    Previous work on learner language has highlighted the importance of having annotated resources to describe the development of interlanguage. Despite this, few learner resources, mainly for English L2, feature error and syntactic annotation. This thesis describes the development of a novel parallel learner Italian treebank, VALICO-UD. Its name points to two things: where the data come from (the corpus VALICO, a collection of non-native Italian texts elicited by comic strips) and which formalism is used for linguistic annotation (the Universal Dependencies, UD, formalism). It is a parallel treebank because the resource provides, for each learner sentence (LS), a target hypothesis (TH), i.e. a parallel corrected version written by an Italian native speaker, which is in turn annotated in UD. We developed this treebank to be exploitable for interlanguage research and comparable with the resources employed in Natural Language Processing tasks such as Native Language Identification or Grammatical Error Identification and Correction. VALICO-UD is composed of 237 texts written by English, French, German and Spanish native speakers, which correspond to 2,234 LSs, each associated with a single TH. While all LSs and THs were automatically annotated using UDPipe, only a portion of the treebank made up of 398 LSs plus their corresponding THs has been manually corrected and released in May 2021 in the UD repository. This core section also features an explicit XML-based annotation of the errors occurring in each sentence. Thus, the treebank is currently organized in two sections: the core gold standard, comprising 398 LSs and their corresponding THs, and the silver standard, consisting of 1,836 LSs and their corresponding THs. In order to contribute to the computational investigation of the peculiar type of texts included in VALICO-UD, this thesis describes the annotation schema of the resource, provides some preliminary tests of the performance of UDPipe models on this treebank, reports inter-annotator agreement results for both error and linguistic annotation, and suggests some possible applications.
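    For readers unfamiliar with the format, the sketch below shows one plausible way to read a pair of CoNLL-U files (learner sentences and target hypotheses) and count surface-form differences between aligned sentences. The file names and the assumption that LS and TH sentences appear in matching order are illustrative; they are not VALICO-UD's documented release layout.

```python
# Minimal CoNLL-U reader plus a crude comparison of a learner-sentence file
# against a target-hypothesis file. File names and one-to-one alignment by
# file order are assumptions made for this sketch.
def read_conllu(path):
    """Yield sentences as lists of token rows (the 10 CoNLL-U columns)."""
    sentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:                    # a blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
            elif not line.startswith("#"):  # skip sent_id / text comments
                sentence.append(line.split("\t"))
    if sentence:
        yield sentence

# Pair learner sentences with their corrected counterparts and count
# positions whose surface form differs (a rough proxy for edited tokens).
for ls, th in zip(read_conllu("valico_ls.conllu"), read_conllu("valico_th.conllu")):
    ls_forms = [row[1] for row in ls]   # column 2 of CoNLL-U is FORM
    th_forms = [row[1] for row in th]
    changed = sum(a != b for a, b in zip(ls_forms, th_forms))
    print(len(ls_forms), len(th_forms), changed)
```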