Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora

Author: Ingólfsdóttir Svanhvít Lilja
Jónsson Haukur Páll
Ragnarsson Pétur Orri
Snæbjarnarson Vésteinn
Símonarson Haukur Barri
Þorsteinsson Vilhjálmur
Publication date: 29/05/2023

Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text. Approaching the problem as a sequence-to-sequence task, we compare the use of a common subword unit vocabulary and byte-level encoding. Initial synthetic training data is created using an error-generating pipeline, and used for finetuning two subword-level models and one byte-level model. Models are then finetuned further on hand-corrected error corpora, including texts written by children, university students, dyslexic and second-language writers, and evaluated over different error types and origins. We show that a byte-level model enables higher correction quality than a subword approach, not only for simple spelling errors, but also for more complex semantic, stylistic and grammatical issues. In particular, initial training on synthetic corpora followed by finetuning on a relatively small parallel corpus of real-world errors helps the byte-level model correct a wide range of commonly occurring errors. Our experiments are run for the Icelandic language but should hold for other similar languages, particularly morphologically rich ones.

arXiv.org e-Print Archive

The NOMAD system : expectation-based detection and correction of errors during understanding of syntactically and semantically ill-formed text

Author: Granger Richard H.
Publication venue: eScholarship, University of California
Publication date: 07/12/1983

Most large text-understanding systems have been designed under the assumption that the input text will be in reasonably "neat" form (for example, newspaper stories and other edited texts). However, a great deal of natural language text (for example, memos, messages, rough drafts, conversation transcripts, etc.) has features that differ significantly from "neat" texts, posing special problems for readers, such as misspelled words, missing words, poor syntactic construction, unclear or ambiguous interpretation, missing crucial punctuation, etc. Our solution to these problems is to make use of expectations, based both on knowledge of surface English and on world knowledge of the situation being described. These syntactic and semantic expectations can be used to figure out unknown words from context, constrain the possible word senses of words with multiple meanings (ambiguity), fill in missing words (ellipsis), and resolve referents (anaphora). This method of using expectations to aid the understanding of "scruffy" texts has been incorporated into a working computer program called NOMAD, which understands scruffy texts in the domain of Navy ship-to-shore messages.

eScholarship - University of California

Enhanced Integrated Scoring for Cleaning Dirty Texts

Author: Bennamoun Mohammed
Liu Wei
Wong Wilson
Publication date: 06/02/2008

An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of an Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of a text preprocessing phase in an ontology engineering system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98%, compared with 96.5% for basic ISSAC alone and 71% for Aspell alone. Comment: More information is available at http://explorer.csse.uwa.edu.au/reference

arXiv.org e-Print Archive

CiteSeerX

Supporting collocation learning with a digital library

Author: Franken Margaret
Witten Ian H.
Wu Shaoqun
Publication venue: Informa UK Limited
Publication date: 01/01/2010

Extensive knowledge of collocations is a key factor that distinguishes learners from fluent native speakers. Such knowledge is difficult to acquire simply because there is so much of it. This paper describes a system that exploits the facilities offered by digital libraries to provide a rich collocation-learning environment. The design is based on three processes that have been identified as leading to lexical acquisition: noticing, retrieval and generation. Collocations are automatically identified in input documents using natural language processing techniques and used to enhance the presentation of the documents and also as the basis of exercises, produced under teacher control, that amplify students' collocation knowledge. The system uses a corpus of 1.3 billion short phrases drawn from the web, from which 29 million collocations have been automatically identified. It also connects to examples garnered from the live web and the British National Corpus.

Research Commons@Waikato

Explicit vs. Implicit L2 grammar knowledge in written error correction

Author: Cinciała Marcin
Scheffler Pawel
Publication venue: Wydawnictwo Uniwersytetu Łódzkiego
Publication date: 01/01/2011

Error correction is undoubtedly an important part of the process of drafting and producing written texts. The aim of the paper is to analyse the learners’ ability to correct grammatical errors in relation to the type of knowledge they employ in this task. Green and Hecht (1992), in an often-quoted study, found a low correlation between L2 learners’ knowledge of explicit grammar rules and their ability to correct errors. They interpret this as suggesting that in error correction, learners rely primarily on their implicit knowledge. However, certain design features of their study might have caused the subjects to simply guess the correct forms, which, in turn, as DeKeyser (2003) suggests, may have led to the overestimation of implicit knowledge. This paper reports the results of an experiment in which 150 Polish learners of English were administered a corpus-based error correction task whose design differed from that of Green and Hecht (1992). These alterations resulted in finding a much closer link between the subjects’ knowledge of rules and their ability to correct grammatical errors.

Adam Mickiewicz University Repository

Repozytorium Uniwersytetu im. Adama Mickiewicza (AMUR)

Repozytorium Uniwersytetu Łódzkiego (University of Lodz Repository)

Effect of screen presentation on text reading and revising

Author: Piolat A
Roussey JY
Thunin O
Publication venue: International Journal of Human-Computer Studies
Publication date: 01/01/1997

Two studies using the methods of experimental psychology assessed the effects of two types of text presentation (page-by-page vs. scrolling) on participants' performance while reading and revising texts. Greater facilitative effects of the page-by-page presentation were observed in both tasks. The participants' reading task performance indicated that they built a better mental representation of the text as a whole and were better at locating relevant information and remembering the main ideas. Their revising task performance indicated a larger number of global corrections (which are the most difficult to make).

CogPrints Cognitive Sciences Eprint Archive

Linguistic variation in Greek papyri: towards a new tool for quantitative study

Author: Depauw Mark
Stolk Joanne Vera
Publication venue: Durham, NC : Duke University, Department of Classical Studies ; Durham, NC : Duke University Library
Publication date: 01/01/2015

Ghent University Academic Bibliography

Directory of Open Access Journals

Archivsystem Ask23

Revising strategies for different text types

Author: Guercin F
Piolat A
Roussey JY
Publication date: 01/01/1990

Forty-eight children and forty-eight adults of contrasting degrees of expertise made a series of corrections in order to improve a text (narrative or description) in which three within-statement errors and three between-statement errors had been inserted. Subjects used a simplified word processor (SCRIPREV) which recorded all movements of linguistic units. The purpose of this research was to study revising strategies by examining the correction-sequencing procedures implemented by these subjects. The procedures, which were coded in the form of time series, were compared to the time series of model revising procedures (i.e. effective ones) representing three strategies based on certain predefined functional principles (linguistic level, execution order). The adults used two of these strategies: the Simultaneous Strategy for the narrative, and the Local-then-Global Strategy for the description. The children used the Local-then-Global Strategy for the narrative, but did not use any identifiable procedure to revise the description, which they did not manage to totally improve in the expected manner.

CogPrints Cognitive Sciences Eprint Archive

Semi-automatic annotation process for procedural texts: An application on cooking recipes

Author: Le Ber Florence
Dufour-Lussier Valmi
Lieber Jean
Meilender Thomas
Nauer Emmanuel
Publication date: 27/08/2012

Taaable is a case-based reasoning system that adapts cooking recipes to user constraints. Within it, the preparation part of recipes is formalised as a graph. This graph is a semantic representation of the sequence of instructions composing the cooking process and is used to compute the procedure adaptation, conjointly with the textual adaptation. It is composed of cooking actions and ingredients, among others, represented as vertices, and semantic relations between those, shown as arcs, and is built automatically using natural language processing. The result of the automatic annotation process is often a disconnected graph, representing an incomplete annotation, or may contain errors. Therefore, a validating and correcting step is required. In this paper, we present an existing graphic tool named Kcatos, conceived for representing and editing decision trees, and show how it has been adapted and integrated in WikiTaaable, the semantic wiki in which the knowledge used by Taaable is stored. This interface provides the wiki users with a way to correct the case representation of the cooking process, improving at the same time the quality of the knowledge about cooking procedures stored in WikiTaaable.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-INSU