22 research outputs found

    Implementation of recursive queries for information systems

    Sophisticated information systems require a powerful query language and an efficient implementation strategy. In practice, these information systems are either built on top of an existing database management system or built as an expert system with deductive capabilities. Both implementations must provide a mechanism to express recursive queries, so the system needs an efficient algorithm to evaluate them. In this thesis, we give a detailed description of a bibliographic database, a set of recursive queries, an overview of some standard query-processing algorithms, and an implementation of these queries in DATALOG.
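    The kind of recursive-query evaluation discussed here can be illustrated with semi-naive evaluation, one of the standard Datalog processing algorithms, computing the transitive closure of a citation relation (the relation names are illustrative, not taken from the thesis):

```python
def transitive_closure(edges):
    """Semi-naive evaluation of reaches(x, z) :- edge(x, y), reaches(y, z):
    each round joins only the facts derived in the previous round against
    the base relation, avoiding re-derivation of already-known facts."""
    total = set(edges)   # all facts derived so far
    delta = set(edges)   # facts new in the last round
    while delta:
        new = set()
        for (x, y) in delta:
            for (y2, z) in edges:
                if y == y2 and (x, z) not in total:
                    new.add((x, z))
        total |= new
        delta = new
    return total

# A small "cites" relation for a bibliographic database.
cites = {("a", "b"), ("b", "c"), ("c", "d")}
transitive_closure(cites)  # contains ("a", "d") among the 6 derived pairs
```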

    Using IR techniques for text classification in document analysis

    This paper presents the INFOCLAS system, which applies statistical methods from information retrieval to classify German business letters into message types such as order, offer, or enclosure. INFOCLAS is a first step towards document understanding, proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of index terms) and the classifier (classification of business letters into the given types). The system employs several knowledge sources, including a letter database, word-frequency statistics for German, lists of message-type-specific words, morphological knowledge, and the underlying document structure. As output, the system produces a set of weighted hypotheses about the type of the letter at hand. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
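    The indexer's term extraction and weighting can be sketched with a standard tf-idf scheme. The abstract does not specify INFOCLAS's actual weighting formula, so this is an illustrative stand-in, and the German tokens are invented:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, weighting each term by tf * idf so that terms frequent in
    one letter but rare across the collection get high weight."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

letters = [
    ["hiermit", "bestellen", "wir", "artikel"],      # an order
    ["hiermit", "bieten", "wir", "artikel", "an"],   # an offer
]
weights = tfidf_weights(letters)
# "bestellen" occurs only in the first letter, so it outweighs the
# collection-wide terms "hiermit" and "wir", whose idf is zero here.
```

    A classifier could then compare these weight vectors against per-message-type term lists, e.g. by cosine similarity.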

    Probabilistic retrieval of OCR degraded text using N-grams


    Feature recognition in OCR text

    This thesis investigates the recognition and extraction of special word sequences, representing concepts, from OCR text. Unlike general index terms, concepts can consist of one or more terms that, combined, have higher retrieval value than the terms alone (e.g., acronyms, proper nouns, phrases). An algorithm to recognize acronyms and their definitions is presented, together with an evaluation of the algorithm.
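    An acronym-definition recognizer of the kind described can be sketched by matching a parenthesized uppercase token against the initials of the preceding words. This is a simplified stand-in; the thesis's actual algorithm is not specified in the abstract:

```python
import re

def find_acronyms(text):
    """Find patterns 'Long Form (ACRO)' where the acronym's letters match
    the initials of the words immediately preceding the parentheses."""
    results = {}
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acro = m.group(1)
        # Take as many preceding words as the acronym has letters.
        words = text[:m.start()].split()[-len(acro):]
        if len(words) == len(acro) and all(
            w[0].upper() == c for w, c in zip(words, acro)
        ):
            results[acro] = " ".join(words)
    return results

find_acronyms("Optical Character Recognition (OCR) converts documents.")
# -> {"OCR": "Optical Character Recognition"}
```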

    A relational post-processing approach for forms recognition

    Optical Character Recognition (OCR) is used to convert paper documents into electronic form. Unfortunately, the technology is not perfect and the output can be erroneous. Conversion is therefore generally augmented by manual error detection and correction procedures, which can be very costly. One approach to minimizing this cost is to apply an OCR post-processing system that reduces the amount of manual correction required. The post-processor takes advantage of knowledge associated with a particular project. In this thesis, we investigate the feasibility of using integrity constraints to detect and correct errors in forms recognition. The general idea is to construct a database of form values that can be used to direct recognition and, consequently, to make automatic corrections.
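    The detect-and-correct idea can be sketched as a membership check against the database of known form values, followed by nearest-match substitution. The sample field values and the choice of plain Levenshtein distance are illustrative assumptions, not the thesis's actual constraints:

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct_field(value, valid_values):
    """Detect: the OCR'ed value violates the integrity constraint
    (membership in the database of known form values).
    Correct: substitute the nearest valid value by edit distance."""
    if value in valid_values:
        return value
    return min(valid_values, key=lambda v: edit_distance(value, v))

correct_field("Nevda", {"Nevada", "Utah", "Arizona"})  # -> "Nevada"
```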

    A post processing system for global correction of OCR generated errors

    This thesis discusses the design and implementation of an OCR post-processing system. The system performs automatic spelling-error detection and correction on noisy, OCR-generated text. Unlike previous post-processing systems, this system works in conjunction with an inverted-file database system. The initial results obtained from post-processing 10,000 pages of OCR'ed text are encouraging. They indicate that global and local document information extracted from the inverted-file system can be used effectively to correct OCR-generated spelling errors.
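    A minimal sketch of using global (collection-wide) information for correction, assuming the inverted file exposes its vocabulary with term frequencies. The `difflib` matcher and its cutoff are stand-ins for the thesis's actual candidate-selection method:

```python
import difflib
from collections import Counter

def build_vocabulary(pages):
    """Collection-wide term frequencies, as an inverted file's vocabulary
    would provide (the 'global document information')."""
    vocab = Counter()
    for page in pages:
        vocab.update(page.split())
    return vocab

def global_correct(word, vocab):
    """Flag a token absent from the vocabulary as a likely OCR error and
    replace it with the most frequent similar vocabulary word."""
    if word in vocab:
        return word
    candidates = difflib.get_close_matches(word, vocab.keys(), n=5, cutoff=0.8)
    if not candidates:
        return word  # no plausible correction; leave the token alone
    return max(candidates, key=lambda w: vocab[w])

pages = ["the retrieval system", "the retrieval model", "noisy text"]
global_correct("retrieva1", build_vocabulary(pages))  # -> "retrieval"
```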

    Impact Analysis of OCR Quality on Research Tasks in Digital Archives

    Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has a hidden cost: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it, even though such an estimate is important for assessing whether results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and for the data required for uncertainty estimations. We conducted a case study on a national newspaper archive with example research tasks, from which we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge, on the users' side as well as on the tool makers' and data providers' side, is insufficient and needs to be improved.

    Autotag: A tool for creating structured document collections from printed materials

    Today's optical character recognition (OCR) devices are ordinarily not capable of delimiting or marking up specific structural information about a document, such as its title, its authors, and the titles of its sections. Such information appears in the OCR device's output, but a human would have to go through the output to locate it. This type of information is highly useful for information retrieval (IR), allowing users much more flexibility in querying a retrieval system. This thesis describes the design, implementation, and evaluation of a software system called Autotag, which automatically marks up structural information in OCR-generated text. It also establishes a mapping between objects in page images and their corresponding ASCII representation. This mapping can then be used to design flexible image-based interfaces for information-retrieval-related applications.
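    The kind of structural markup described can be sketched with simple layout heuristics. The tag names and heading rules below are illustrative assumptions, not Autotag's actual design:

```python
def tag_structure(lines):
    """Heuristically mark up title and section headings in OCR output:
    the first non-empty line becomes the title; short Title-Case lines
    become section headings; everything else passes through unchanged."""
    tagged, title_done = [], False
    for line in lines:
        s = line.strip()
        if not s:
            tagged.append(line)
        elif not title_done:
            tagged.append(f"<title>{s}</title>")
            title_done = True
        elif len(s.split()) <= 6 and s == s.title():
            tagged.append(f"<section>{s}</section>")
        else:
            tagged.append(line)
    return tagged

ocr_lines = ["A Study Of Noisy Text", "", "Introduction",
             "This is ordinary body text."]
tag_structure(ocr_lines)
```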

    Post Processing of Optically Recognized Text using First Order Hidden Markov Model

    In this thesis, we report on the design and implementation of a post-processing system for optically recognized text. The system is based on a first-order Hidden Markov Model (HMM). The Maximum Likelihood algorithm is used to train the system on over 150,000 characters, and the system is tested on a file containing 5,688 characters. The percentage of errors detected and corrected is 11.76%, with a recall of 10.16% and a precision of 100%.
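    The decoding step of such a corrector can be sketched as Viterbi decoding over a first-order character HMM, where hidden states are true characters and observations are OCR'ed characters. The two-state model and all probabilities below are illustrative toy values, not the thesis's trained parameters:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden (true) character sequence for the
    OCR'ed observation sequence, using log probabilities for stability."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        col, new_path = {}, {}
        for s in states:
            # Best predecessor state for s under first-order transitions.
            prev, score = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda t: t[1],
            )
            col[s] = score + math.log(emit_p[s][o])
            new_path[s] = path[prev] + [s]
        V.append(col)
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return "".join(path[best])

# Toy model: the true alphabet is {"a", "l"}; OCR sometimes reads "l"
# as the digit "1".
states = ("a", "l")
start_p = {"a": 0.6, "l": 0.4}
trans_p = {"a": {"a": 0.3, "l": 0.7}, "l": {"a": 0.5, "l": 0.5}}
emit_p = {"a": {"a": 0.9, "l": 0.05, "1": 0.05},
          "l": {"a": 0.05, "l": 0.6, "1": 0.35}}

viterbi("a1l", states, start_p, trans_p, emit_p)  # -> "all"
```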