An automated Chinese text processing system (ACCESS): user-friendly interface and feature enhancement.
Suen Tow Sunny. Thesis (M.Phil.) -- Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 65-67).

Contents:
Introduction --- p.1
Chapter 1. ACCESS with an Extendible User-friendly X/Chinese Interface --- p.4
1.1. System requirement --- p.4
1.1.1. User interface issue --- p.4
1.1.2. Development issue --- p.5
1.2. Development decision --- p.6
1.2.1. X window system --- p.6
1.2.2. X/Chinese toolkit --- p.7
1.2.3. C language --- p.8
1.2.4. Source code control system --- p.8
1.3. System architecture --- p.9
1.4. User interface --- p.10
1.5. Sample screen --- p.13
1.6. System extension --- p.14
1.7. System portability --- p.18
Chapter 2. Study on Algorithms for Automatically Correcting Characters in Chinese Cangjie-typed Text --- p.19
2.1. Chinese character input --- p.19
2.1.1. Chinese keyboards --- p.20
2.1.2. Keyboard redefinition scheme --- p.21
2.2. Cangjie input method --- p.24
2.3. Review on existing techniques for automatically correcting words in English text --- p.26
2.3.1. Nonword error detection --- p.27
2.3.2. Isolated-word error correction --- p.28
2.3.2.1. Spelling error patterns --- p.29
2.3.2.2. Correction techniques --- p.31
2.3.3. Context-dependent word correction research --- p.32
2.3.3.1. Natural language processing approach --- p.33
2.3.3.2. Statistical language model --- p.35
2.4. Research on error rates and patterns in Cangjie input method --- p.37
2.5. Similarities and differences between Chinese and English typed text --- p.41
2.5.1. Similarities --- p.41
2.5.2. Differences --- p.42
2.6. Proposed algorithm for automatic Chinese text correction --- p.44
2.6.1. Sentence level --- p.44
2.6.2. Part-of-speech level --- p.45
2.6.3. Character level --- p.47
Conclusion --- p.50
Appendix A. Cangjie Radix Table --- p.51
Appendix B. Sample Text --- p.52
Article 1 --- p.52
Article 2 --- p.53
Article 3 --- p.56
Article 4 --- p.58
Appendix C. Error Statistics --- p.61
References --- p.6
Supporting collocation learning with a digital library
Extensive knowledge of collocations is a key factor that distinguishes learners from fluent native speakers. Such knowledge is difficult to acquire simply because there is so much of it. This paper describes a system that exploits the facilities offered by digital libraries to provide a rich collocation-learning environment. The design is based on three processes that have been identified as leading to lexical acquisition: noticing, retrieval and generation. Collocations are automatically identified in input documents using natural language processing techniques and are used both to enhance the presentation of the documents and as the basis of exercises, produced under teacher control, that amplify students' collocation knowledge. The system uses a corpus of 1.3 billion short phrases drawn from the web, from which 29 million collocations have been automatically identified. It also connects to examples garnered from the live web and the British National Corpus.
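The abstract does not say which association measure the system uses; one standard way to identify collocations automatically is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is illustrative only, with a toy corpus and threshold:

```python
import math
from collections import Counter

def pmi_collocations(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ); high-scoring pairs
    co-occur far more often than chance, a common collocation signal.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy corpus: "strong tea" recurs, so it gets a high PMI score.
corpus = ["she drank strong tea", "strong tea is popular",
          "he made strong tea", "tea is hot", "she is popular"]
top = pmi_collocations(corpus)
```

Production systems over web-scale corpora typically add part-of-speech filtering and frequency cut-offs on top of the raw association score.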
Enhanced Integrated Scoring for Cleaning Dirty Texts
An increasing number of approaches for ontology engineering from text are gearing towards the use of online sources such as company intranets and the World Wide Web. Despite this rise, little work can be found on preprocessing and cleaning dirty texts from online sources. This paper presents an enhancement of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of the text preprocessing phase in an ontology engineering system. New evaluations performed on the enhanced ISSAC using 700 chat records reveal an improved accuracy of 98%, compared to 96.5% with basic ISSAC alone and 71% with Aspell.

Comment: More information is available at http://explorer.csse.uwa.edu.au/reference
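The general idea of integrated scoring — gathering candidate replacements for a dirty token from several cleaning operations and ranking them by a combined score — can be sketched as follows. The lexicon, abbreviation table, and weights are invented for illustration; ISSAC's actual features and weighting are not given in the abstract:

```python
# Illustrative only: toy lexicon, abbreviation table, and weights; the
# real ISSAC scoring integrates more evidence than this sketch does.
LEXICON = {"please", "review", "the", "document", "tomorrow"}
ABBREVIATIONS = {"pls": "please", "tmrw": "tomorrow", "doc": "document"}

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def clean_token(token, w_spell=1.0, w_abbr=1.5):
    """Return the best replacement for a token and its integrated score."""
    if token in LEXICON:
        return token, 0.0
    candidates = []
    # Spelling-correction candidates: nearby lexicon words, scored by distance.
    for word in LEXICON:
        d = edit_distance(token, word)
        if d <= 2:
            candidates.append((word, w_spell / (1 + d)))
    # Abbreviation-expansion candidate, scored with its own weight.
    if token in ABBREVIATIONS:
        candidates.append((ABBREVIATIONS[token], w_abbr))
    if not candidates:
        return token, 0.0
    return max(candidates, key=lambda c: c[1])

cleaned = [clean_token(t)[0] for t in "pls reviw the doc tmrw".split()]
```

Here "pls" and "doc" are repaired by the abbreviation path while "reviw" is repaired by the spelling path, showing why a single integrated ranking across operations helps on chat-style text.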
Development of a text reading system on video images
Since the early days of computer science, researchers have sought to devise a machine that could automatically read text to help people with visual impairments. The problem of extracting and recognising text on document images has been largely resolved, but reading text from images of natural scenes remains a challenge. Scene text can present uneven lighting, complex backgrounds or perspective and lens distortion; it usually appears as short sentences or isolated words and shows a very diverse set of typefaces. However, video sequences of natural scenes provide a temporal redundancy that can be exploited to compensate for some of these deficiencies. Here we present a complete end-to-end, real-time scene text reading system on video images based on perspective-aware text tracking.
The main contribution of this work is a system that automatically detects, recognises and tracks text in videos of natural scenes in real-time. The focus of our method is on large text found in outdoor environments, such as shop signs, street names and billboards. We introduce novel efficient techniques for text detection, text aggregation and text perspective estimation. Furthermore, we propose using a set of Unscented Kalman Filters (UKF) to maintain each text region's identity and to continuously track the homography transformation of the text into a fronto-parallel view, thereby being resilient to erratic camera motion and wide baseline changes in orientation. The orientation of each text line is estimated using a method that relies on the geometry of the characters themselves to estimate a rectifying homography. This is done irrespective of the view of the text over a large range of orientations. We also demonstrate a wearable head-mounted device for text reading that encases a camera for image acquisition and a pair of headphones for synthesized speech output.
Our system is designed for continuous and unsupervised operation over long periods of time. It is completely automatic and features quick failure recovery and interactive text reading. It is also highly parallelised in order to maximize the usage of available processing power and to achieve real-time operation. We show comparative results that improve on the current state-of-the-art when correcting perspective deformation of scene text. The end-to-end system performance is demonstrated on sequences recorded in outdoor scenarios. Finally, we also release a dataset of text tracking videos along with the annotated ground-truth of text regions.
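The rectification step rests on estimating a planar homography. As a minimal, self-contained illustration — not the authors' character-geometry method, whose details are not given here — the standard four-point direct linear transform (DLT) maps a perspective-distorted text quadrilateral to a fronto-parallel rectangle:

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src from four
    point correspondences, via the direct linear transform (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography vector is the null vector of A, found via SVD.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pt):
    """Map a 2-D point through H using homogeneous coordinates."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]

# Rectify a perspective-distorted quadrilateral (e.g. a tracked text
# region's corners) into a fronto-parallel unit square.
quad = [(10.0, 12.0), (92.0, 18.0), (88.0, 55.0), (8.0, 48.0)]
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
H = homography_dlt(quad, square)
```

In a tracking setting such as the one described, a filter would then smooth the entries of H over time rather than re-estimating it independently per frame.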
A tool for facilitating OCR postediting in historical documents
Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced in the post-edition by a presumably correct alternative, based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom. As demonstrated below, the tool is successful in correcting a number of common errors. Though sometimes unreliable, it is also transparent and open to human intervention.
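The tool's approach — propose in-vocabulary alternatives for unknown word forms and pick the one the language model prefers — can be sketched roughly as below. The vocabulary, the unigram "LM", and the edit-distance candidate generation are all simplifications invented for this sketch, not the tool's actual components:

```python
# Rough sketch of LM-guided OCR postediting: out-of-vocabulary tokens
# are replaced by the close vocabulary word the (toy unigram) LM scores
# highest. The vocabulary and frequencies below are invented.
VOCAB_FREQ = {"trade": 120, "poor": 95, "kingdom": 40, "employing": 15,
              "essay": 30, "towards": 60, "regulating": 10, "the": 900}

def candidates(token, vocab, max_dist=2):
    """Vocabulary words within max_dist single-character edits."""
    def dist(a, b):
        if abs(len(a) - len(b)) > max_dist:
            return max_dist + 1
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    return [w for w in vocab if dist(token, w) <= max_dist]

def postedit(token):
    """Keep in-vocabulary tokens; otherwise take the LM-preferred candidate."""
    if token in VOCAB_FREQ:
        return token
    cands = candidates(token, VOCAB_FREQ)
    return max(cands, key=VOCAB_FREQ.get) if cands else token

fixed = [postedit(t) for t in ["tlie", "trabe", "kingdom", "qx7z"]]
```

"tlie" for "the" is a classic long-s-era OCR confusion, while unresolvable tokens like "qx7z" pass through untouched — the transparency that makes such a tool amenable to human review.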
CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero configuration or upfront annotation. In contrast to previous work, CloudScan does not rely on templates of invoice layout; instead it learns a single global model of invoices that naturally generalizes to unseen invoice layouts. The model is trained using data automatically extracted from end-user-provided feedback. This automatic training-data extraction removes the requirement for users to annotate the data precisely. We describe a recurrent neural network model that can capture long-range context and compare it to a baseline logistic regression model corresponding to the current CloudScan production system. We train and evaluate the system on 8 important fields using a dataset of 326,471 invoices. The recurrent neural network and the baseline model achieve average F1 scores of 0.891 and 0.887, respectively, on seen invoice layouts. For the harder task of unseen invoice layouts, the recurrent neural network model outperforms the baseline with an average F1 of 0.840 compared to 0.788.

Comment: Presented at ICDAR 201
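The shape of the baseline — a logistic regression classifying invoice tokens into fields from surface features — might look roughly like this. The features, labels, and training tokens are invented for a two-class sketch (amount vs. other); CloudScan's real feature set and its 8 fields are far richer:

```python
import re
import numpy as np

# Invented surface features; real systems add position, context, and more.
def features(token):
    return np.array([
        1.0,                                                # bias term
        float(bool(re.fullmatch(r"\d+[.,]\d{2}", token))),  # looks like money
        float(any(c.isdigit() for c in token)),             # contains a digit
        float(token.isalpha()),                             # purely alphabetic
    ])

# Toy labelled tokens: 1 = amount field, 0 = other.
tokens = ["1250.00", "Invoice", "42,50", "Total", "2017", "9.99"]
labels = np.array([1, 0, 1, 0, 0, 1])

X = np.stack([features(t) for t in tokens])
w = np.zeros(X.shape[1])
for _ in range(500):  # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - labels) / len(labels)

def predict(token):
    """Classify a token as amount (1) or other (0)."""
    return int(1.0 / (1.0 + np.exp(-features(token) @ w)) > 0.5)
```

The contrast drawn in the abstract is that such a per-token model sees no context, whereas the recurrent model conditions each decision on the surrounding token sequence — which is what pays off on unseen layouts.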