
    Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

    Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This applies especially when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions, which are crucial for guaranteeing low computational overhead and reduced quality-degradation risks, combined with a more quantifiable OCR improvement. In particular, this work explains the library's methodology for text-block-level quality assessment. By extending this technique, a regression model that takes into account the enhancement potential of a new OCR engine is also presented. Both mark promising approaches, especially for cultural institutions dealing with historical data of lower quality.
    Comment: Journal of Data Mining and Digital Humanities; Major revision
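    As a concrete illustration of the targeting decision, here is a minimal sketch of such an enhancement-prediction regression: train on blocks that were already re-OCRed, then reprocess only blocks with a sufficiently high predicted gain. The block-level features and the choice of scikit-learn's GradientBoostingRegressor are assumptions made for illustration, not the library's actual pipeline.

```python
# Minimal sketch of block-level enhancement prediction (illustrative;
# features and model choice are assumptions, not the article's pipeline).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-block features: dictionary-word ratio, mean character
# confidence, garbage-token ratio, publication-period index.
X = rng.random((1000, 4))
# Target: measured quality gain (new OCR accuracy minus old OCR accuracy)
# on blocks that have already been reprocessed once.
y = 0.5 * (1 - X[:, 0]) - 0.3 * X[:, 1] + 0.05 * rng.standard_normal(1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)

# Reprocess only blocks whose predicted gain clears a threshold, keeping
# computational overhead and quality-degradation risk low.
predicted_gain = model.predict(X_test)
to_reprocess = np.flatnonzero(predicted_gain > 0.1)
print(f"{to_reprocess.size} of {len(X_test)} blocks selected for re-OCR")
```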

    Information Extraction in an Optical Character Recognition Context

    In this dissertation, we investigate the effectiveness of information extraction in the presence of Optical Character Recognition (OCR) errors. It is well known that OCR errors have little to no effect on general retrieval tasks, mainly due to the redundancy of information in textual documents. Our work shows that information extraction, by contrast, is significantly affected by OCR errors; intuitively, this is because extraction algorithms rely on a small window of text surrounding the objects to be extracted. We show that extraction methodologies based on Hidden Markov Models are not robust enough to deal with extraction in this noisy environment. We also show that both precise shallow parsing and fuzzy shallow parsing can be used to increase recall at the price of a significant drop in precision. Most of our experimental work deals with the extraction of dates of birth and postal addresses. Both of these extractions are part of general methods for identifying privacy information in textual documents. Privacy information is particularly important when large collections of documents are posted on the Internet.
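    The recall/precision trade-off of fuzzy shallow parsing can be illustrated with approximate pattern matching: loosening the trigger phrase catches OCR-corrupted mentions (higher recall) but will also fire on near-miss strings (lower precision). The sketch below uses the third-party regex module's fuzzy matching; the pattern is a toy stand-in, not the dissertation's actual grammars.

```python
# Sketch of fuzzy shallow parsing for dates of birth in OCR'd text
# (illustrative only; real extraction grammars are more elaborate).
# Requires the third-party 'regex' module: pip install regex
import regex

# Allow up to two character errors in the trigger phrase, so OCR noise
# like "Dale of Blrth" still matches; the date pattern itself stays strict.
PATTERN = regex.compile(
    r"(?:date\s+of\s+birth){e<=2}\s*[:,]?\s*"
    r"(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    regex.IGNORECASE,
)

ocr_text = "Name: J. Smith  Dale of Blrth: 12/04/1967"
match = PATTERN.search(ocr_text)
if match:
    print("extracted DOB:", match.group(1))  # -> 12/04/1967
```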

    A novel image matching approach for word spotting

    Word spotting has been adopted by various researchers as a complementary technique to Optical Character Recognition for document analysis and retrieval. Its applications include document indexing, image retrieval and information filtering. The important factors in word spotting techniques are pre-processing, selection and extraction of proper features, and image matching algorithms. The Correlation Similarity Measure (CORR) algorithm is considered a fast matching algorithm, originally defined for finding similarities between binary patterns. In the word spotting literature, the CORR algorithm has been used successfully to compare Gradient, Structural and Concavity (GSC) binary features extracted from binary word images. The problem with this approach, however, is that binarization of images leads to a loss of very useful information. Furthermore, before GSC binary features can be extracted, the word images must be skew corrected and slant normalized, which is not only difficult but in some cases impossible for Arabic and modified Arabic scripts. We present a new approach in which the CORR algorithm is used innovatively to compare gray-scale word images. In this approach, binarization of images, skew correction and slant normalization of word images are not required at all. Various features, i.e., projection profiles, word profiles and transitional features, are extracted from the gray-scale word images and converted into their binary equivalents, which are compared via the CORR algorithm with greater speed and higher accuracy. The experiments were conducted on gray-scale versions of newly created handwritten databases of the Pashto and Dari languages, written in modified Arabic scripts. For each of these languages we used 4599 words belonging to 21 different word classes, collected from 219 writers. The average precision rates achieved for Pashto and Dari were 93.18% and 93.75%, respectively. The time taken to match a pair of images was 1.43 milliseconds. In addition, we present the handwritten databases for two well-known Indo-Iranian languages, Pashto and Dari. These are large databases which contain six types of data, i.e., Dates, Isolated Digits, Numeral Strings, Isolated Characters, Different Words and Special Symbols, written by native speakers of the corresponding languages.
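    For reference, one common formulation of the Correlation Similarity Measure over binary vectors (the phi coefficient computed from bit match/mismatch counts) is sketched below. Whether the thesis uses exactly this variant is an assumption, and the feature vectors shown are toy values rather than real binarized projection-profile features.

```python
# Sketch of the Correlation (CORR) similarity between binary feature
# vectors, written here as the phi coefficient over match/mismatch counts;
# treat this exact formulation as an assumption, not the thesis's definition.
import numpy as np

def corr_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Correlation similarity of two equal-length binary vectors."""
    x = x.astype(bool)
    y = y.astype(bool)
    s11 = np.sum(x & y)    # bit set in both
    s00 = np.sum(~x & ~y)  # bit clear in both
    s10 = np.sum(x & ~y)   # bit set only in x
    s01 = np.sum(~x & y)   # bit set only in y
    denom = np.sqrt(float((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10)))
    return (float(s11 * s00 - s10 * s01) / denom) if denom else 0.0

# Toy binarized feature vectors for two word images.
a = np.array([1, 0, 1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(f"CORR similarity: {corr_similarity(a, b):.3f}")  # -> 0.500
```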

    Functionality Analysis and Information Retrieval in Electronic Document Management Systems

    A document management system (DMS) is nowadays one of the most impactful organisational tools that an enterprise may depend on. De Angeli Prodotti (DAP), a manufacturer of overhead conductors, wanted to implement an open-source DMS with functionalities that best fit their needs. We took this opportunity to also test and evaluate the state of information retrieval capabilities in electronic DMSs.

    Algorithms for document image skew estimation

    A new projection-profile-based skew estimation algorithm was developed. This algorithm extracts fiducial points representing character elements by decoding a JBIG-compressed image without reconstructing the original image. These points are projected along parallel lines into an accumulator array to determine the maximum alignment and the corresponding skew angle. Methods for characterizing the performance of skew estimation techniques were also investigated. In addition to the new skew estimator, three projection-based algorithms were implemented and tested using 1,246 single-column text zones extracted from a sample of 460 page images. Linear regression analyses of the experimental results indicate that our new skew estimation algorithm performs competitively with the other three techniques. These analyses also show that estimators using connected components as a fiducial representation perform worse than the others on the entire set of text zones. All of the algorithms are shown to be sensitive to typographical features, and the number of text lines in a zone significantly affects the accuracy of the connected-component-based methods. We also developed two aggregate measures of skew for entire pages. Experiments performed on the 460 unconstrained pages indicate the need to filter non-text features from consideration, as graphic and noise elements from page images contribute a significant amount of the error for the JBIG algorithm.
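    As a reference point for the projection-based family compared in this study, the sketch below estimates skew by rotating a page over candidate angles and keeping the rotation whose horizontal projection profile has maximum variance. It illustrates the general technique only, not the JBIG-domain algorithm, and the toy image stands in for real page data.

```python
# Minimal sketch of projection-profile skew estimation: try candidate
# rotations and keep the one whose row-projection is "sharpest".
# Illustrates the general method family, not the JBIG-domain algorithm.
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img: np.ndarray,
                  angles=np.arange(-5.0, 5.0, 0.1)) -> float:
    """Return the rotation (degrees) that best re-aligns the text lines,
    i.e. the negative of the page skew."""
    best_angle, best_score = 0.0, -np.inf
    for angle in angles:
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)  # row-wise ink counts
        score = np.var(profile)        # peaky profile -> aligned lines
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Toy page: horizontal "text lines", then skewed by 2 degrees.
img = np.zeros((200, 200))
img[::20] = 1.0
skewed = rotate(img, 2.0, reshape=False, order=0)
print(f"correcting rotation: {estimate_skew(skewed):.1f} degrees")  # ~ -2.0
```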