
    A Review Study of Error Analysis Theory

    Until the late 1960s, the prominent theory in the field of second language acquisition and learning was largely behaviouristic, claiming that learning was the result of acquiring a set of new language patterns. Hence, second language errors were considered merely the product of learners' mother-tongue habits carried into the target language. Errors that could not be explained under this assumption were inevitably underestimated. There was therefore a need for another approach that could clearly describe second language learners' errors. Given this, the current study aims to review and discuss Error Analysis theory in terms of its theoretical foundations, theoretical assumptions, limitations and significance. The review reveals that, despite the criticism this theory has received, it still plays a fundamental role in investigating, identifying and describing second language learners' errors and their causes. Most importantly, Error Analysis can enable second language teachers to identify the different sources of second language errors and take pedagogical precautions against them. Moreover, Error Analysis provides a sound methodology for investigating second language learners' errors. Once the causes or sources of errors are discovered, it becomes possible to decide on a suitable remedy.

    Non-native text analysis with Syntactic Diff, a general comparative text mining framework

    Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising [1]. Online education in the form of MOOCs (massive open online courses) is also primarily in English, even when the course being taught is English itself. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from MOOCs, the number of English learners in Asia alone is in the tens of millions. In response to this powerful motivation, we describe SYNTACTIC DIFF, a novel edit-based method for transforming sequences of words given a reference corpus. These transformations can be used directly or can be employed as features to represent text data in a wide variety of text mining scenarios. As case studies, we apply SYNTACTIC DIFF to four quite different tasks in non-native text analysis and show its benefit in each case. In the first task, we use weighted word edits with likelihood scoring for grammatical error correction. Our method is compared against systems in a grammar correction shared task, and we find that SYNTACTIC DIFF edits perform comparably while being much more general than the other methods. The second task is native language identification: a classification problem predicting the native language of a student writer based on English essays. We represent documents as vectors of edits, and show that a combination of unigram words and SYNTACTIC DIFF edits outperforms each representation individually. The third task is fluency scoring, in which we examine whether the manually assigned fluency levels of English students can be modeled with SYNTACTIC DIFF features. In the fourth task, we create clusters of student essays with similar errors via topic modeling, and find that their interpretability is significantly higher than that of an n-gram words approach. SYNTACTIC DIFF is highly customizable and able to capture syntactic differences from a reference corpus at the sentence, document, and subcorpus levels. This enables both a rich translation method and a feature representation for many text mining tasks that deal with word usage and syntax beyond bag-of-words. In particular, this thesis focuses on non-native text analysis applications, though SYNTACTIC DIFF is not at all limited to that domain.
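    The core idea of edit-based features can be illustrated with a toy example. The sketch below is not the SYNTACTIC DIFF implementation itself; it simply represents an observed sentence as counts of word-level edit operations against a single reference sentence, computed here with Python's difflib.

```python
# Illustrative sketch only: word-edit features relative to a reference
# sentence, using difflib rather than the thesis's own edit model.
from collections import Counter
from difflib import SequenceMatcher

def edit_features(observed, reference):
    """Count word edits (insert/remove/substitute) that turn `observed`
    into `reference`; the counts can serve as feature values."""
    obs, ref = observed.split(), reference.split()
    feats = Counter()
    for op, i1, i2, j1, j2 in SequenceMatcher(a=obs, b=ref).get_opcodes():
        if op == "delete":
            feats.update(f"remove:{w}" for w in obs[i1:i2])
        elif op == "insert":
            feats.update(f"insert:{w}" for w in ref[j1:j2])
        elif op == "replace":
            feats.update(f"substitute:{o}->{r}"
                         for o, r in zip(obs[i1:i2], ref[j1:j2]))
    return feats

print(edit_features("he go to school yesterday",
                    "he went to school yesterday"))
# Counter({'substitute:go->went': 1})
```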

    Monolingual Sentence Rewriting as Machine Translation: Generation and Evaluation

    In this thesis, we investigate approaches to paraphrasing entire sentences within the constraints of a given task, which we call monolingual sentence rewriting. We introduce a unified framework for monolingual sentence rewriting, and apply it to three representative tasks: sentence compression, text simplification, and grammatical error correction. We also perform a detailed analysis of the evaluation methodologies for each task, identify bias in common evaluation techniques, and propose more reliable practices. Monolingual rewriting can be thought of as translating between two types of English (such as from complex to simple), and therefore our approach is inspired by statistical machine translation. In machine translation, a large quantity of parallel data is necessary to model the transformations from input to output text. Parallel bilingual data naturally occurs between common language pairs (such as English and French), but for monolingual sentence rewriting, there is little existing parallel data and annotation is costly. We modify the statistical machine translation pipeline to harness monolingual resources and insights into task constraints in order to drastically diminish the amount of annotated data necessary to train a robust system. Our method generates more meaning-preserving and grammatical sentences than earlier approaches and requires less task-specific data. Once candidate sentences are generated, it is crucial to have reliable evaluation methods. Sentential paraphrases must fulfill a variety of requirements: preserve the meaning of the original sentence, be grammatical, and meet any stylistic or task-specific constraints. We analyze common evaluation practices and propose better methods that more accurately measure the quality of output. Often overlooked, robust automatic evaluation methodology is necessary for improving systems, and this work presents new metrics and outlines important considerations for reliably measuring the quality of the generated text.
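    As a concrete illustration of edit-level evaluation for one of these tasks, grammatical error correction is commonly scored with precision, recall, and F0.5 over edit sets. The sketch below shows only that standard computation; it is not the specific evaluation methodology proposed in this thesis.

```python
# Standard edit-level precision/recall/F0.5 for grammatical error
# correction; illustrative, not the thesis's proposed metrics.
def f_beta(system_edits, gold_edits, beta=0.5):
    """Score a set of system edits against gold-standard edits."""
    system_edits, gold_edits = set(system_edits), set(gold_edits)
    tp = len(system_edits & gold_edits)
    p = tp / len(system_edits) if system_edits else 1.0
    r = tp / len(gold_edits) if gold_edits else 1.0
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Hypothetical edits, written as (original word, correction) pairs.
gold = {("go", "went"), ("a", "the")}
system = {("go", "went")}
print(round(f_beta(system, gold), 3))  # -> 0.833 (precision-weighted)
```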

    Beyond topic-based representations for text mining

    A massive amount of online information is natural language text: newspapers, blog articles, forum posts and comments, tweets, scientific literature, government documents, and more. While online information of all kinds is useful, textual information is especially important: it is the most natural, most common, and most expressive form of information. Text representation plays a critical role in application tasks like classification or information retrieval, since the quality of the underlying feature space directly impacts each task's performance. Because of this importance, many different approaches have been developed for generating text representations. By far the most common way to generate features is to segment text into words and record their n-grams. While simple term features perform relatively well in topic-based tasks, not all downstream applications are of a topical nature, and not all can be captured by words alone. For example, determining the native language of an English essay's writer depends on more than just word choice. Methods that compete with topic-based representations (such as neural networks) are often not interpretable or rely on massive amounts of training data. This thesis proposes three novel contributions for generating and analyzing a large space of non-topical features. First, structural parse tree features are based solely on the structural properties of a parse tree, ignoring all of the syntactic categories in the tree. An important advantage of these "skeletons" over regular syntactic features is that they can capture global tree structures without causing problems of data sparseness or overfitting. Second, SyntacticDiff explicitly captures differences in a text document with respect to a reference corpus, creating features that are easily explained as weighted word edit differences. These edit features are especially useful since they are derived from information not present in the current document, capturing a type of comparative feature. Third, Cross-Context Lexical Analysis is a general framework for analyzing similarities and differences in both term meaning and representation with respect to different, potentially overlapping partitions of a text collection. The representations analyzed by CCLA are not limited to topic-based features.
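    To make the first contribution concrete, the following toy sketch (my illustration, not the thesis code) strips every syntactic category label from a constituency parse and keeps only its bracketed shape, assuming an nltk.Tree as input.

```python
# Illustrative "skeleton" extraction: keep the tree's bracketed shape,
# drop all syntactic category labels and the words themselves.
from nltk import Tree

def skeleton(tree):
    """Return the unlabeled bracketing ("skeleton") of a parse tree."""
    if isinstance(tree, str):          # leaf token: contributes no structure
        return ""
    return "(" + "".join(skeleton(child) for child in tree) + ")"

parse = Tree.fromstring("(S (NP (DT The) (NN cat)) (VP (VBD sat)))")
print(skeleton(parse))  # -> "((()())(()))"
```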

    Automatic case acquisition from texts for process-oriented case-based reasoning

    This paper introduces a method for the automatic acquisition of a rich case representation from free text for process-oriented case-based reasoning. Case engineering is among the most complicated and costly tasks in implementing a case-based reasoning system. This is especially so for process-oriented case-based reasoning, where more expressive case representations are generally used and, in our opinion, actually required for satisfactory case adaptation. In this context, the ability to acquire cases automatically from procedural texts is a major step forward for reasoning about processes. We therefore detail a methodology that makes case acquisition from processes described as free text possible, with special attention given to assembly instruction texts. This methodology extends the techniques we used to extract actions from cooking recipes. We argue that techniques taken from natural language processing are required for this task, and that they give satisfactory results. An evaluation based on our implemented prototype, which extracts workflows from recipe texts, is provided.
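    As a rough illustration of the kind of natural language processing involved (an assumed pipeline, not the paper's actual system), the sketch below pulls (verb, object) action pairs out of instruction sentences using spaCy's dependency parse.

```python
# Assumed, illustrative action extraction from procedural text with spaCy;
# not the extraction method described in the paper.
import spacy

# Requires the small English model to be downloaded beforehand
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

def extract_actions(text):
    """Return (verb lemma, direct-object lemma) pairs from instruction text."""
    actions = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                for child in token.children:
                    if child.dep_ == "dobj":
                        actions.append((token.lemma_, child.lemma_))
    return actions

print(extract_actions("Chop the onions. Heat the oil in a pan."))
# e.g. [('chop', 'onion'), ('heat', 'oil')]
```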