13 research outputs found

    A post processing system for global correction of OCR generated errors

    This thesis discusses the design and implementation of an OCR post processing system. The system is used to perform automatic spelling detection and correction on noisy, OCR generated text. Unlike previous post processing systems, this system works in conjunction with an inverted file database system. The initial results obtained from post processing 10,000 pages of OCR'ed text are encouraging. These results indicate that the use of global and local document information extracted from the inverted file system can be effectively used to correct OCR generated spelling errors

    Robust extraction of text from camera images using colour and spatial information simultaneously

    The importance and use of text extraction from camera based coloured scene images is rapidly increasing with time. Text within a camera grabbed image can contain a huge amount of metadata about that scene. Such metadata can be useful for identification, indexing and retrieval purposes. While the segmentation and recognition of text from document images is quite successful, detection of coloured scene text is a new challenge for all camera based images. Common problems for text extraction from camera based images are the lack of prior knowledge of any kind of text features such as colour, font, size and orientation, as well as the location of the probable text regions. In this paper, we document the development of a fully automatic and extremely robust text segmentation technique that can be used for any type of camera grabbed frame, be it a single image or video. A new algorithm is proposed which can overcome the current problems of text segmentation. The algorithm exploits text appearance in terms of colour and spatial distribution. When the new text extraction technique was tested on a variety of camera based images, it was found to outperform existing techniques. The proposed technique also overcomes any problems that can arise due to an unconstrained complex background. The novelty of the work arises from the fact that this is the first time that colour and spatial information are used simultaneously for the purpose of text extraction

    Ottoman archives explorer: A retrieval system for digital Ottoman archives

    This article presents Ottoman Archives Explorer, a Content-Based Retrieval (CBR) system based on character recognition for printed and handwritten historical documents. Several methods for character segmentation and recognition stages are investigated. In particular, sliding-window and histogram segmentation methods are coupled with recognition approaches using spatial features, neural networks, and a graph-based model. The prototype system provides CBR of document images using both example-based queries and a virtual keyboard to construct query words. © 2009 ACM
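
The abstract above mentions histogram segmentation without giving details. As a rough illustration only, the sketch below assumes the common vertical projection-profile form of that idea: count ink pixels per column of a binarized text line and cut wherever the count drops to zero. The function name `segment_columns` and the list-of-lists image format are assumptions made for this example, not the article's implementation.

```python
def segment_columns(image):
    """Split a binary text-line image into candidate character spans.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    Returns a list of (start, end) column ranges, with end exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection histogram: number of ink pixels in each column.
    histogram = [sum(row[x] for row in image) for x in range(width)]

    spans = []
    start = None  # column where the current ink run began, if any
    for x, count in enumerate(histogram):
        if count > 0 and start is None:
            start = x                  # an ink run begins
        elif count == 0 and start is not None:
            spans.append((start, x))   # an empty column closes the run
            start = None
    if start is not None:
        spans.append((start, width))   # run reaches the right edge
    return spans


line = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]
print(segment_columns(line))
```

Cutting only at empty columns fails for touching or connected characters, which is one reason a system for handwritten or cursive script would likely need additional methods such as the sliding-window approach the abstract also names.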

    Chinese information processing

    A survey of the field of Chinese information processing is provided. It covers the following areas: the Chinese writing system, several popular Chinese encoding schemes and code conversions, Chinese keyboard entry methods, Chinese fonts, Chinese operating systems, basic Chinese computing techniques and applications

    Document preprocessing and fuzzy unsupervised character classification

    This dissertation presents document preprocessing and fuzzy unsupervised character classification for automatically reading daily-received office documents that have complex layout structures, such as multiple columns and mixed-mode contents of texts, graphics and half-tone pictures. First, the block segmentation algorithm is performed based on a simple two-step run-length smoothing to decompose a document into single-mode blocks. Next, the block classification is performed based on the clustering rules to classify each block into one of the types such as text, horizontal or vertical lines, graphics, and pictures. The mean white-to-black transition is shown to be an invariant for textual blocks, and is useful for block discrimination. A fuzzy model for unsupervised character classification is designed to improve the robustness, correctness, and speed of the character recognition system. The classification procedures are divided into two stages. The first stage separates the characters into seven typographical categories based on word structures of a text line. The second stage uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. A fuzzy model of unsupervised character classification, which is more natural in the representation of prototypes for character matching, is defined and the weighted fuzzy similarity measure is explored. The characteristics of the fuzzy model are discussed and used in speeding up the classification process. After classification, the character recognition procedure is simply applied on the limited versions of the fuzzy prototypes. To avoid information loss and extra distortion, a topography-based approach is proposed to apply directly on the fuzzy prototypes to extract the skeletons. First, a convolution by a bell-shaped function is performed to obtain a smooth surface. Second, the ridge points are extracted by rule-based topographic analysis of the structure. Third, a membership function is assigned to ridge points with values indicating the degrees of membership with respect to the skeleton of an object. Finally, the significant ridge points are linked to form strokes of the skeleton, and the clues of eigenvalue variation are used to deal with degradation and preserve connectivity. Experimental results show that our algorithm can reduce the deformation of junction points and correctly extract the whole skeleton even when a character is broken into pieces. For characters that are merged together, the breaking candidates can be easily located by searching for the saddle points. A pruning algorithm is then applied on each breaking position. Lastly, a multiple context confirmation can be applied to increase the reliability of breaking hypotheses
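
The abstract above does not spell out its two-step run-length smoothing. As a hedged illustration, the sketch below assumes the classic run-length smoothing pattern: a horizontal pass and a vertical pass each fill short background gaps between ink pixels, and the two results are combined with a logical AND. The names `smooth_row` and `rlsa`, the thresholds, and the choice to fill only gaps bounded by ink on both sides are assumptions of this example, not details taken from the dissertation.

```python
def smooth_row(row, threshold):
    """Fill background runs of length <= threshold lying between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for x, pixel in enumerate(row):
        if pixel == 1:
            gap = x - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= threshold:
                for k in range(last_ink + 1, x):
                    out[k] = 1
            last_ink = x
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smooth a binary image (rows of 0/1) horizontally and vertically, then AND."""
    horizontal = [smooth_row(row, h_threshold) for row in image]
    # Transpose so the same row routine can smooth each column.
    columns = [list(col) for col in zip(*image)]
    smoothed_columns = [smooth_row(col, v_threshold) for col in columns]
    vertical = [list(row) for row in zip(*smoothed_columns)]
    # Keep a pixel only if both passes marked it as ink.
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


print(smooth_row([1, 0, 0, 1, 0, 0, 0, 1], 2))
```

After smoothing, connected regions of ink can be treated as candidate blocks. The thresholds control how aggressively nearby characters and lines merge, so in practice they would be tuned to the scan resolution and layout.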

    Optical image scanners and character recognition devices : a survey and new taxonomy

    Includes bibliographical references (p. [54]-[56]). Amar Gupta ... [et al.]

    Feature Extraction Methods for Character Recognition

    Not Include

    A study on creating a custom South Sotho spellchecking and correcting software desktop application

    Thesis (B. Tech.) - Central University of Technology, Free State, 200