Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora
Grammatical error correction (GEC) is the task of correcting typos, spelling,
punctuation and grammatical issues in text. Approaching the problem as a
sequence-to-sequence task, we compare the use of a common subword unit
vocabulary and byte-level encoding. Initial synthetic training data is created
using an error-generating pipeline, and used for finetuning two subword-level
models and one byte-level model. Models are then finetuned further on
hand-corrected error corpora, including texts written by children, university
students, dyslexic and second-language writers, and evaluated over different
error types and origins. We show that a byte-level model enables higher
correction quality than a subword approach, not only for simple spelling
errors, but also for more complex semantic, stylistic and grammatical issues.
In particular, initial training on synthetic corpora followed by finetuning on
a relatively small parallel corpus of real-world errors helps the byte-level
model correct a wide range of commonly occurring errors. Our experiments are
run for the Icelandic language but should hold for other similar languages,
particularly morphologically rich ones
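
The byte-level input representation compared in this abstract can be illustrated with a minimal sketch. It only shows how text maps to UTF-8 byte values and back. The example words and the helper names `to_byte_ids` and `from_byte_ids` are illustrative assumptions, not taken from the paper, and no actual subword tokenizer or model is shown.

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode a sequence of UTF-8 byte values back into text."""
    return bytes(ids).decode("utf-8")


# Hypothetical example pair: a correctly spelled Icelandic word and a
# misspelling with one letter dropped.
correct = "þetta"
misspelled = "þeta"

correct_ids = to_byte_ids(correct)
misspelled_ids = to_byte_ids(misspelled)

# The non-ASCII letter "þ" occupies two bytes, so the byte sequence is
# longer than the character count.
print(correct_ids)
print(misspelled_ids)

# Every possible input is covered by the same 256 byte values, so a
# misspelled word never falls outside the vocabulary.
assert all(0 <= b <= 255 for b in misspelled_ids)
assert from_byte_ids(misspelled_ids) == misspelled
```

Under this representation the misspelling differs from the correct form by a single missing byte, whereas a fixed word or subword vocabulary may split an unseen misspelling into unrelated pieces.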
The NOMAD system : expectation-based detection and correction of errors during understanding of syntactically and semantically ill-formed text
Most large text-understanding systems have been designed under the assumption that the input text will be in reasonably "neat" form (for example, newspaper stories and other edited texts). However, a great deal of natural language text (for example, memos, messages, rough drafts, conversation transcripts, etc.) has features that differ significantly from "neat" texts, posing special problems for readers, such as misspelled words, missing words, poor syntactic construction, unclear or ambiguous interpretation, missing crucial punctuation, etc. Our solution to these problems is to make use of expectations, based both on knowledge of surface English and on world knowledge of the situation being described. These syntactic and semantic expectations can be used to figure out unknown words from context, constrain the possible word senses of words with multiple meanings (ambiguity), fill in missing words (ellipsis), and resolve referents (anaphora). This method of using expectations to aid the understanding of "scruffy" texts has been incorporated into a working computer program called NOMAD, which understands scruffy texts in the domain of Navy ship-to-shore messages
Enhanced Integrated Scoring for Cleaning Dirty Texts
An increasing number of approaches for ontology engineering from text are
geared towards the use of online sources such as company intranets and the
World Wide Web. Despite this rise, little work can be found on
preprocessing and cleaning dirty texts from online sources. This paper presents
an enhancement of an Integrated Scoring for Spelling error correction,
Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as
part of a text preprocessing phase in an ontology engineering system. New
evaluations performed on the enhanced ISSAC using 700 chat records reveal an
improved accuracy of 98% as compared to 96.5% and 71% based on the use of only
basic ISSAC and of Aspell, respectively. Comment: More information is available at
http://explorer.csse.uwa.edu.au/reference
Supporting collocation learning with a digital library
Extensive knowledge of collocations is a key factor that distinguishes learners from fluent native speakers. Such knowledge is difficult to acquire simply because there is so much of it. This paper describes a system that exploits the facilities offered by digital libraries to provide a rich collocation-learning environment. The design is based on three processes that have been identified as leading to lexical acquisition: noticing, retrieval and generation. Collocations are automatically identified in input documents using natural language processing techniques and used to enhance the presentation of the documents and also as the basis of exercises, produced under teacher control, that amplify students' collocation knowledge. The system uses a corpus of 1.3 B short phrases drawn from the web, from which 29 M collocations have been automatically identified. It also connects to examples garnered from the live web and the British National Corpus
Explicit vs. Implicit L2 grammar knowledge in written error correction
Error correction is undoubtedly an important part of the process of drafting and producing written texts. The aim of the paper is to analyse the learners’ ability to correct grammatical errors in relation to the type of knowledge they employ in this task. Green and Hecht (1992), in an often quoted study, found a low correlation between L2 learners’ knowledge of explicit grammar rules and their ability to correct errors. They interpret this as suggesting that in error correction, learners rely primarily on their implicit knowledge. However, certain design features of their study might have caused the subjects to simply guess the correct forms, which, in turn, as DeKeyser (2003) suggests, may have led to the overestimation of implicit knowledge. This paper reports the results of an experiment where 150 Polish learners of English were administered a corpus-based error correction task, the design of which, however, differed from that of Green and Hecht (1992). These alterations resulted in finding a much closer link between the subjects’ knowledge of rules and their ability to correct grammatical errors
Effect of screen presentation on text reading and revising. International Journal of Human-Computer Studies
Two studies using the methods of experimental psychology assessed the effects of two types of text presentation (page-by-page vs. scrolling) on participants' performance while reading and revising texts. Greater facilitative effects of the page-by-page presentation were observed in both tasks. The participants' reading task performance indicated that they built a better mental representation of the text as a whole and were better at locating relevant information and remembering the main ideas. Their revising task performance indicated a larger number of global corrections (which are the most difficult to make)
Revising strategies for different text types
Forty-eight children and forty-eight adults of contrasting degrees of expertise made a series of corrections in order to improve a text (narrative or description) in which three within-statement errors and three between-statement errors had been inserted. Subjects used a simplified word processor (SCRIPREV) which recorded all movements of linguistic units. The purpose of this research was to study revising strategies by examining the correction-sequencing procedures implemented by these subjects. The procedures, which were coded in the form of time series, were compared to the time series of model revising procedures (i.e. effective ones) representing three strategies based on certain predefined functional principles (linguistic level, execution order). The adults used two of these strategies: the Simultaneous Strategy for the narrative, and the Local-then-Global Strategy for the description. The children used the Local-then-Global Strategy for the narrative, but did not use any identifiable procedure to revise the description, which they did not manage to totally improve in the expected manner
Semi-automatic annotation process for procedural texts: An application on cooking recipes
Taaable is a case-based reasoning system that adapts cooking recipes to user
constraints. Within it, the preparation part of recipes is formalised as a
graph. This graph is a semantic representation of the sequence of instructions
composing the cooking process and is used to compute the procedure adaptation,
conjointly with the textual adaptation. It is composed of cooking actions and
ingredients, among others, represented as vertices, and semantic relations
between those, shown as arcs, and is built automatically thanks to natural
language processing. The result of the automatic annotation process is often a
disconnected graph, representing an incomplete annotation, or may contain
errors. Therefore, a validating and correcting step is required. In this paper,
we present an existing graphic tool named Kcatos, conceived for representing
and editing decision trees, and show how it has been adapted and integrated in
WikiTaaable, the semantic wiki in which the knowledge used by Taaable is
stored. This interface provides the wiki users with a way to correct the case
representation of the cooking process, improving at the same time the quality
of the knowledge about cooking procedures stored in WikiTaaable
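
The incomplete-annotation case mentioned in this abstract, a disconnected graph, can be detected by counting connected components. The following is a minimal sketch under stated assumptions: the vertices, arcs, and the function name `connected_components` are made up for illustration and do not reflect how Taaable or WikiTaaable is implemented.

```python
from collections import defaultdict


def connected_components(vertices, arcs):
    """Return the weakly connected components of a graph as a list of sets."""
    # Treat arcs as undirected for the purpose of connectivity.
    neighbours = defaultdict(set)
    for src, dst in arcs:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited vertex.
        stack = [start]
        component = set()
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(neighbours[v] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical annotation result: actions and ingredients as vertices,
# semantic relations as arcs. "bake" and "oven" were left unlinked.
vertices = ["peel", "apple", "slice", "bake", "oven"]
arcs = [("peel", "apple"), ("slice", "apple")]

components = connected_components(vertices, arcs)
# More than one component signals an annotation that needs manual review.
needs_review = len(components) > 1
print(len(components), needs_review)
```

A check of this kind could flag which recipes to send to the validating and correcting step; the correction itself remains a manual task for wiki users.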