
    An automated Chinese text processing system (ACCESS): user-friendly interface and feature enhancement.

    Suen Tow Sunny. Thesis (M.Phil.)--Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 65-67).

    Introduction
    Chapter 1. ACCESS with an Extendible User-friendly X/Chinese Interface
        1.1. System requirement
            1.1.1. User interface issue
            1.1.2. Development issue
        1.2. Development decision
            1.2.1. X window system
            1.2.2. X/Chinese toolkit
            1.2.3. C language
            1.2.4. Source code control system
        1.3. System architecture
        1.4. User interface
        1.5. Sample screen
        1.6. System extension
        1.7. System portability
    Chapter 2. Study on Algorithms for Automatically Correcting Characters in Chinese Cangjie-typed Text
        2.1. Chinese character input
            2.1.1. Chinese keyboards
            2.1.2. Keyboard redefinition scheme
        2.2. Cangjie input method
        2.3. Review of existing techniques for automatically correcting words in English text
            2.3.1. Nonword error detection
            2.3.2. Isolated-word error correction
                2.3.2.1. Spelling error patterns
                2.3.2.2. Correction techniques
            2.3.3. Context-dependent word correction research
                2.3.3.1. Natural language processing approach
                2.3.3.2. Statistical language model
        2.4. Research on error rates and patterns in the Cangjie input method
        2.5. Similarities and differences between Chinese and English typed text
            2.5.1. Similarities
            2.5.2. Differences
        2.6. Proposed algorithm for automatic Chinese text correction
            2.6.1. Sentence level
            2.6.2. Part-of-speech level
            2.6.3. Character level
    Conclusion
    Appendix A. Cangjie Radix Table
    Appendix B. Sample Text (Articles 1-4)
    Appendix C. Error Statistics
    References

    Supporting collocation learning with a digital library

    Extensive knowledge of collocations is a key factor that distinguishes learners from fluent native speakers. Such knowledge is difficult to acquire simply because there is so much of it. This paper describes a system that exploits the facilities offered by digital libraries to provide a rich collocation-learning environment. The design is based on three processes that have been identified as leading to lexical acquisition: noticing, retrieval, and generation. Collocations are automatically identified in input documents using natural language processing techniques and are used both to enhance the presentation of the documents and as the basis of exercises, produced under teacher control, that amplify students' collocation knowledge. The system uses a corpus of 1.3 billion short phrases drawn from the web, from which 29 million collocations have been automatically identified. It also connects to examples garnered from the live web and the British National Corpus.
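The abstract does not specify the identification technique; a common baseline for automatic collocation identification is scoring adjacent word pairs by pointwise mutual information (PMI). The following is a minimal sketch of that approach, not the authors' actual pipeline:

```python
from collections import Counter
from math import log2

def find_collocations(sentences, min_count=2):
    """Score adjacent word pairs by PMI: pairs that co-occur far more
    often than their individual frequencies predict are collocation
    candidates ("strong coffee" rather than "powerful coffee")."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in sentences:
        words = sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        total += len(words)
    scored = {}
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = n / total
        p_indep = (unigrams[w1] / total) * (unigrams[w2] / total)
        scored[(w1, w2)] = log2(p_pair / p_indep)
    return scored

sents = ["strong coffee please", "strong coffee is good",
         "powerful computer", "powerful computer lab"]
scores = find_collocations(sents)
```

A real system would also filter by part-of-speech patterns and use a much larger corpus, as the paper's 1.3-billion-phrase collection suggests.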

    Enhanced Integrated Scoring for Cleaning Dirty Texts

    An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of the text preprocessing phase in an ontology engineering system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98%, compared to 96.5% with basic ISSAC and 71% with Aspell. Comment: More information is available at http://explorer.csse.uwa.edu.au/reference
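ISSAC's integrated scoring is not reproduced in the abstract; as a minimal sketch of the spelling-correction component it improves upon, here is an edit-distance corrector with frequency tie-breaking (the vocabulary and weights are hypothetical, not from the paper):

```python
def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein (insert, delete,
    substitute) plus adjacent transpositions, so 'teh' -> 'the' costs 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def correct(token, vocab_freq):
    """Return the in-vocabulary word closest to token; break distance
    ties by corpus frequency (a crude stand-in for integrated scoring)."""
    if token in vocab_freq:
        return token
    return min(vocab_freq, key=lambda w: (osa_distance(token, w), -vocab_freq[w]))
```

ISSAC additionally folds abbreviation expansion and case restoration into one score; this sketch covers only the nonword-correction step.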

    Development of a text reading system on video images

    Since the early days of computer science, researchers have sought to devise a machine which could automatically read text to help people with visual impairments. The problem of extracting and recognising text on document images has been largely resolved, but reading text from images of natural scenes remains a challenge. Scene text can present uneven lighting, complex backgrounds or perspective and lens distortion; it usually appears as short sentences or isolated words and shows a very diverse set of typefaces. However, video sequences of natural scenes provide a temporal redundancy that can be exploited to compensate for some of these deficiencies. Here we present a complete end-to-end, real-time scene text reading system on video images based on perspective-aware text tracking. The main contribution of this work is a system that automatically detects, recognises and tracks text in videos of natural scenes in real-time. The focus of our method is on large text found in outdoor environments, such as shop signs, street names and billboards. We introduce novel efficient techniques for text detection, text aggregation and text perspective estimation. Furthermore, we propose using a set of Unscented Kalman Filters (UKF) to maintain each text region's identity and to continuously track the homography transformation of the text into a fronto-parallel view, thereby being resilient to erratic camera motion and wide baseline changes in orientation. The orientation of each text line is estimated using a method that relies on the geometry of the characters themselves to estimate a rectifying homography. This is done irrespective of the view of the text over a large range of orientations. We also demonstrate a wearable head-mounted device for text reading that encases a camera for image acquisition and a pair of headphones for synthesized speech output. Our system is designed for continuous and unsupervised operation over long periods of time.
    It is completely automatic and features quick failure recovery and interactive text reading. It is also highly parallelised in order to maximize the usage of available processing power and to achieve real-time operation. We show comparative results that improve on the current state of the art in correcting perspective deformation of scene text. The end-to-end system performance is demonstrated on sequences recorded in outdoor scenarios. Finally, we also release a dataset of text tracking videos along with annotated ground-truth text regions.
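The rectifying homography mentioned above maps a skewed text quadrilateral into a fronto-parallel view. A minimal sketch of that mapping, using the standard direct linear transform (DLT) on four corner correspondences (the corner coordinates below are made up; the authors' character-geometry estimation is not reproduced here):

```python
import numpy as np

def rectifying_homography(quad, width, height):
    """Estimate the 3x3 homography H mapping the four corners of a
    skewed text quadrilateral onto a width x height upright rectangle,
    via the DLT: stack two linear constraints per correspondence and
    take the null vector of the resulting 8x9 system (last row of V^T)."""
    dst = [(0, 0), (width, 0), (width, height), (0, height)]
    A = []
    for (x, y), (u, v) in zip(quad, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(A, dtype=float))
    return vt[-1].reshape(3, 3)

def apply_h(H, pt):
    """Apply homography H to a 2D point in homogeneous coordinates."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)

# hypothetical corners of a perspective-distorted sign, clockwise from top-left
quad = [(10, 20), (110, 35), (105, 80), (8, 70)]
H = rectifying_homography(quad, 100, 40)
```

Warping every frame's pixels through such an H yields the fronto-parallel view that the OCR stage reads; the UKF in the paper tracks this transformation over time rather than re-estimating it from scratch.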

    A tool for facilitating OCR postediting in historical documents

    Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistent typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced in the post-edition by a presumably correct alternative, chosen according to the scores of a language model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool successfully corrects a number of common errors. Though sometimes unreliable, it is also transparent and open to human intervention.
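The general shape of such a postediting step can be sketched as follows: generate in-vocabulary alternatives for unknown word forms via common OCR confusions, then rank them with an LM score (here a plain unigram frequency, and a hypothetical confusion table; neither is taken from the paper):

```python
# Hypothetical character confusions common in OCR of historical type,
# e.g. the long s rendered as 'f'. A real table would be learned from data.
CONFUSIONS = {"f": "s", "c": "e", "1": "l", "0": "o"}

def candidates(word):
    """Generate alternatives by substituting commonly confused characters,
    one position at a time; the original word is always a candidate."""
    out = {word}
    for i, ch in enumerate(word):
        if ch in CONFUSIONS:
            out.add(word[:i] + CONFUSIONS[ch] + word[i + 1:])
    return out

def post_edit(word, vocab, lm_score):
    """Leave in-vocabulary words alone; otherwise replace with the
    highest-scoring in-vocabulary candidate, or keep the word if none."""
    if word in vocab:
        return word
    in_vocab = [c for c in candidates(word) if c in vocab]
    return max(in_vocab, key=lm_score) if in_vocab else word
```

The paper's tool scores candidates with a proper LM over context rather than unigram counts, and surfaces its choices for human review rather than applying them silently.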

    CloudScan - A configuration-free invoice analysis system using recurrent neural networks

    We present CloudScan, an invoice analysis system that requires zero configuration or upfront annotation. In contrast to previous work, CloudScan does not rely on templates of invoice layout; instead, it learns a single global model of invoices that naturally generalizes to unseen invoice layouts. The model is trained using data automatically extracted from end-user provided feedback. This automatic training data extraction removes the requirement for users to annotate the data precisely. We describe a recurrent neural network model that can capture long range context and compare it to a baseline logistic regression model corresponding to the current CloudScan production system. We train and evaluate the system on 8 important fields using a dataset of 326,471 invoices. The recurrent neural network and baseline model achieve 0.891 and 0.887 average F1 scores respectively on seen invoice layouts. For the harder task of unseen invoice layouts, the recurrent neural network model outperforms the baseline with 0.840 average F1 compared to 0.788. Comment: Presented at ICDAR 201
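The abstract reports average F1 over 8 fields. A minimal sketch of macro-averaged F1 from per-field true-positive, false-positive, and false-negative counts (an assumption about the averaging scheme; the paper's exact evaluation code is not reproduced here):

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def average_f1(field_counts):
    """Macro-average: compute F1 per field, then take the unweighted
    mean, so rare fields count as much as common ones."""
    return sum(f1(*c) for c in field_counts.values()) / len(field_counts)

# hypothetical per-field (tp, fp, fn) counts for two of the eight fields
counts = {"invoice_number": (90, 5, 5), "total_amount": (80, 10, 10)}
avg = average_f1(counts)
```

Macro-averaging is one natural reading of "average F1 scores" across a fixed set of fields; a micro-average (pooling counts before computing F1) would weight fields by their frequency instead.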