22 research outputs found

    Implementation of recursive queries for information systems

    Sophisticated information systems require a powerful query language and an efficient implementation strategy. In practice, these information systems are either built on top of an existing database management system or built as an expert system with deductive capabilities. Both implementations must provide a mechanism to express recursive queries, so the system needs an efficient algorithm to evaluate them. In this thesis, we give a detailed description of a bibliographic database, a set of recursive queries, an overview of some standard query-processing algorithms, and an implementation of these queries in DATALOG.
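    The kind of recursive-query evaluation discussed here can be illustrated with semi-naive evaluation, one of the standard Datalog processing algorithms, computing the transitive closure of a citation relation (the relation names are illustrative, not taken from the thesis):

```python
def transitive_closure(edges):
    """Semi-naive evaluation of reaches(x, z) :- edge(x, y), reaches(y, z):
    each round joins only the facts derived in the previous round against
    the base relation, avoiding re-derivation of already-known facts."""
    total = set(edges)   # all facts derived so far
    delta = set(edges)   # facts new in the last round
    while delta:
        new = set()
        for (x, y) in delta:
            for (y2, z) in edges:
                if y == y2 and (x, z) not in total:
                    new.add((x, z))
        total |= new
        delta = new
    return total

# A small "cites" relation for a bibliographic database.
cites = {("a", "b"), ("b", "c"), ("c", "d")}
transitive_closure(cites)  # contains ("a", "d") among the 6 derived pairs
```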

    Using IR techniques for text classification in document analysis

    This paper presents the INFOCLAS system, which applies statistical methods from information retrieval to classify German business letters into message types such as order, offer, or enclosure. INFOCLAS is a first step towards document understanding, proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of index terms) and the classifier (classification of business letters into the given types). The system employs several knowledge sources, including a letter database, word-frequency statistics for German, lists of message-type-specific words, morphological knowledge, and the underlying document structure. As output, the system produces a set of weighted hypotheses about the type of the letter at hand. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
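    The indexer's term extraction and weighting can be sketched with a standard tf-idf scheme. The abstract does not specify INFOCLAS's actual weighting formula, so this is an illustrative stand-in, and the German tokens are invented:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, weighting each term by tf * idf so that terms frequent in
    one letter but rare across the collection get high weight."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

letters = [
    ["hiermit", "bestellen", "wir", "artikel"],      # an order
    ["hiermit", "bieten", "wir", "artikel", "an"],   # an offer
]
weights = tfidf_weights(letters)
# "bestellen" occurs only in the first letter, so it outweighs the
# collection-wide terms "hiermit" and "wir", whose idf is zero here.
```

    A classifier could then compare these weight vectors against per-message-type term lists, e.g. by cosine similarity.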

    Probabilistic retrieval of OCR degraded text using N-grams


    Feature recognition in OCR text

    This thesis investigates the recognition and extraction of special word sequences, representing concepts, from OCR text. Unlike general index terms, concepts can consist of one or more terms that, combined, have higher retrieval value than the terms alone (e.g., acronyms, proper nouns, phrases). An algorithm to recognize acronyms and their definitions is presented, together with an evaluation of the algorithm.
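    An acronym-definition recognizer of the kind described can be sketched by matching a parenthesized uppercase token against the initials of the preceding words. This is a simplified stand-in; the thesis's actual algorithm is not specified in the abstract:

```python
import re

def find_acronyms(text):
    """Find patterns 'Long Form (ACRO)' where the acronym's letters match
    the initials of the words immediately preceding the parentheses."""
    results = {}
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acro = m.group(1)
        # Take as many preceding words as the acronym has letters.
        words = text[:m.start()].split()[-len(acro):]
        if len(words) == len(acro) and all(
            w[0].upper() == c for w, c in zip(words, acro)
        ):
            results[acro] = " ".join(words)
    return results

find_acronyms("Optical Character Recognition (OCR) converts documents.")
# -> {"OCR": "Optical Character Recognition"}
```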

    A relational post-processing approach for forms recognition

    Optical Character Recognition (OCR) is used to convert paper documents into electronic form. Unfortunately, the technology is not perfect and the output can be erroneous. Conversion is therefore generally augmented by manual error detection and correction procedures, which can be very costly. One approach to minimizing this cost is to apply an OCR post-processing system that reduces the amount of manual correction required. The post-processor takes advantage of knowledge associated with a particular project. In this thesis, we investigate the feasibility of using integrity constraints to detect and correct errors in forms recognition. The general idea is to construct a database of form values that can be used to direct recognition and, consequently, to make automatic corrections.
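    The detect-and-correct idea can be sketched as a membership check against the database of known form values, followed by nearest-match substitution. The sample field values and the choice of plain Levenshtein distance are illustrative assumptions, not the thesis's actual constraints:

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct_field(value, valid_values):
    """Detect: the OCR'ed value violates the integrity constraint
    (membership in the database of known form values).
    Correct: substitute the nearest valid value by edit distance."""
    if value in valid_values:
        return value
    return min(valid_values, key=lambda v: edit_distance(value, v))

correct_field("Nevda", {"Nevada", "Utah", "Arizona"})  # -> "Nevada"
```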

    A post processing system for global correction of OCR generated errors

    This thesis discusses the design and implementation of an OCR post-processing system. The system performs automatic spelling-error detection and correction on noisy, OCR-generated text. Unlike previous post-processing systems, this system works in conjunction with an inverted-file database system. The initial results obtained from post-processing 10,000 pages of OCR'ed text are encouraging. They indicate that global and local document information extracted from the inverted-file system can be used effectively to correct OCR-generated spelling errors.
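    A minimal sketch of using global (collection-wide) information for correction, assuming the inverted file exposes its vocabulary with term frequencies. The `difflib` matcher and its cutoff are stand-ins for the thesis's actual candidate-selection method:

```python
import difflib
from collections import Counter

def build_vocabulary(pages):
    """Collection-wide term frequencies, as an inverted file's vocabulary
    would provide (the 'global document information')."""
    vocab = Counter()
    for page in pages:
        vocab.update(page.split())
    return vocab

def global_correct(word, vocab):
    """Flag a token absent from the vocabulary as a likely OCR error and
    replace it with the most frequent similar vocabulary word."""
    if word in vocab:
        return word
    candidates = difflib.get_close_matches(word, vocab.keys(), n=5, cutoff=0.8)
    if not candidates:
        return word  # no plausible correction; leave the token alone
    return max(candidates, key=lambda w: vocab[w])

pages = ["the retrieval system", "the retrieval model", "noisy text"]
global_correct("retrieva1", build_vocabulary(pages))  # -> "retrieval"
```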

    Impact Analysis of OCR Quality on Research Tasks in Digital Archives

    Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has a hidden cost: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it, even though such an estimate is important for assessing whether results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and for the data required for uncertainty estimations. We conducted a case study on a national newspaper archive with example research tasks, from which we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge, on the users' side as well as on the tool makers' and data providers' side, is insufficient and needs to be improved.

    Autotag: A tool for creating structured document collections from printed materials

    Today's optical character recognition (OCR) devices are ordinarily not capable of delimiting or marking up specific structural information about a document, such as its title, its authors, and the titles of its sections. Such information appears in the OCR device's output, but a human would have to go through the output to locate it. This type of information is highly useful for information retrieval (IR), allowing users much more flexibility in querying a retrieval system. This thesis describes the design, implementation, and evaluation of a software system called Autotag, which automatically marks up structural information in OCR-generated text. It also establishes a mapping between objects in page images and their corresponding ASCII representation. This mapping can then be used to design flexible image-based interfaces for information-retrieval-related applications.
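    The kind of structural markup described can be sketched with simple layout heuristics. The tag names and heading rules below are illustrative assumptions, not Autotag's actual design:

```python
def tag_structure(lines):
    """Heuristically mark up title and section headings in OCR output:
    the first non-empty line becomes the title; short Title-Case lines
    become section headings; everything else passes through unchanged."""
    tagged, title_done = [], False
    for line in lines:
        s = line.strip()
        if not s:
            tagged.append(line)
        elif not title_done:
            tagged.append(f"<title>{s}</title>")
            title_done = True
        elif len(s.split()) <= 6 and s == s.title():
            tagged.append(f"<section>{s}</section>")
        else:
            tagged.append(line)
    return tagged

ocr_lines = ["A Study Of Noisy Text", "", "Introduction",
             "This is ordinary body text."]
tag_structure(ocr_lines)
```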

    Post Processing of Optically Recognized Text using First Order Hidden Markov Model

    In this thesis, we report on the design and implementation of a post-processing system for optically recognized text. The system is based on a first-order Hidden Markov Model (HMM). The Maximum Likelihood algorithm is used to train the system on over 150,000 characters, and the system is tested on a file containing 5,688 characters. The percentage of errors detected and corrected is 11.76%, with a recall of 10.16% and a precision of 100%.
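    The decoding step of such a corrector can be sketched as Viterbi decoding over a first-order character HMM, where hidden states are true characters and observations are OCR'ed characters. The two-state model and all probabilities below are illustrative toy values, not the thesis's trained parameters:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden (true) character sequence for the
    OCR'ed observation sequence, using log probabilities for stability."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        col, new_path = {}, {}
        for s in states:
            # Best predecessor state for s under first-order transitions.
            prev, score = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda t: t[1],
            )
            col[s] = score + math.log(emit_p[s][o])
            new_path[s] = path[prev] + [s]
        V.append(col)
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return "".join(path[best])

# Toy model: the true alphabet is {"a", "l"}; OCR sometimes reads "l"
# as the digit "1".
states = ("a", "l")
start_p = {"a": 0.6, "l": 0.4}
trans_p = {"a": {"a": 0.3, "l": 0.7}, "l": {"a": 0.5, "l": 0.5}}
emit_p = {"a": {"a": 0.9, "l": 0.05, "1": 0.05},
          "l": {"a": 0.05, "l": 0.6, "1": 0.35}}

viterbi("a1l", states, start_p, trans_p, emit_p)  # -> "all"
```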