
    Statistics Oriented Preprocessing of Document Image

    Old printed documents represent an important part of our cultural heritage, and their digitization plays an important role in creating data and metadata. This paper proposes an algorithm for estimating the global text skew. First, the document image is binarized to reduce the impact of noise and uneven illumination. The binary image is then statistically analyzed and processed, and redundant data are excluded. Next, a convex hull is established around each text object, and the hulls are joined into connected components. The connected components in the complementary image are then enlarged with morphological dilation. Finally, the largest connected component is extracted; its orientation, computed from image moments, approximates the global orientation of the text in the document. The efficiency and correctness of the algorithm are verified by testing on a custom dataset.
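
    As a rough illustration, the pipeline described above can be sketched with standard OpenCV primitives; the thresholding choice, structuring-element size, and function name below are assumptions for illustration, not taken from the paper.

```python
# Minimal sketch of the described skew-estimation pipeline (assumed parameters).
import cv2
import numpy as np

def estimate_global_skew(gray):
    # Binarize (Otsu) to reduce the impact of noise and uneven illumination
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Convex hull around each text object, drawn filled into a mask
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    hull_mask = np.zeros_like(binary)
    for c in contours:
        cv2.drawContours(hull_mask, [cv2.convexHull(c)], -1, 255, -1)

    # Morphological dilation joins neighbouring hulls into connected components
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    joined = cv2.dilate(hull_mask, kernel)

    # Keep the largest connected component
    n, labels, stats, _ = cv2.connectedComponentsWithStats(joined)
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    component = np.uint8(labels == largest)

    # Orientation from central image moments, returned in degrees
    m = cv2.moments(component, binaryImage=True)
    angle = 0.5 * np.arctan2(2 * m["mu11"], m["mu20"] - m["mu02"])
    return np.degrees(angle)
```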

    Finding Similarities between Structured Documents as a Crucial Stage for Generic Structured Document Classifier

    One of the open problems in classifying structured documents is the definition of a similarity measure that is applicable in real situations, where query documents are allowed to differ from the database templates. Query forms and documents may be rotated [1], corrupted by noise [2], or manually edited, and test sets are often produced using different schemes, which makes direct comparison a crucial issue [3]. Another problem is that a large number of forms may be written in different languages; in Malaysia, for example, forms may be written in Malay, Chinese, English, and other languages. In such cases, text recognition (such as OCR) cannot be applied to classify the requested documents, even though OCR is generally considered easier and more accurate than layout detection. Keywords: Feature Extraction, Document Processing, Document Classification
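
    For illustration only, an OCR-free similarity measure between a query form and a database template might compare coarse layout signatures; the grid-based ink-density signature and cosine similarity below are assumptions, not the measure developed in this work.

```python
# Illustrative layout-based (OCR-free) similarity between two form images.
import cv2
import numpy as np

def layout_signature(gray, grid=(16, 16)):
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    h, w = binary.shape
    cells = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = binary[i * h // grid[0]:(i + 1) * h // grid[0],
                          j * w // grid[1]:(j + 1) * w // grid[1]]
            cells.append(cell.mean())          # ink density per grid cell
    sig = np.asarray(cells)
    return sig / (np.linalg.norm(sig) + 1e-9)

def layout_similarity(query_gray, template_gray):
    # Cosine similarity between the two layout signatures (1.0 = identical layout)
    return float(layout_signature(query_gray) @ layout_signature(template_gray))
```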

    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a pressing necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, with special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: a fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution for the document binarization problem from both the computational-complexity and threshold-selection points of view, a layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
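
    The staged workflow can be pictured as a pipeline of interchangeable modules. The sketch below follows the stage order named in the abstract, while the class name and the stage interface are assumptions made for illustration.

```python
# Minimal sketch of a staged document-understanding pipeline (assumed interfaces).
from dataclasses import dataclass, field
from typing import Callable, List
import numpy as np

# Each stage receives the current image plus a shared metadata dict,
# and returns the (possibly transformed) image.
Stage = Callable[[np.ndarray, dict], np.ndarray]

@dataclass
class DocumentUnderstandingPipeline:
    stages: List[Stage] = field(default_factory=list)

    def add(self, stage: Stage) -> "DocumentUnderstandingPipeline":
        self.stages.append(stage)
        return self

    def run(self, image: np.ndarray) -> dict:
        meta: dict = {}                  # collects per-stage results
        for stage in self.stages:
            image = stage(image, meta)   # each stage may refine image and metadata
        meta["final_image"] = image
        return meta

# Example wiring mirroring the processing order in the text
# (individual stage implementations are placeholders):
# pipeline = (DocumentUnderstandingPipeline()
#             .add(quality_assurance)
#             .add(image_enhancement)
#             .add(color_reduction_and_binarization)
#             .add(skew_and_orientation_detection)
#             .add(page_segmentation)
#             .add(logical_layout_analysis))
```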

    Techniques for document image processing in compressed domain

    The main objective of image compression is usually considered to be the minimization of storage space. However, as the need to access images frequently increases, it is becoming more important to process the compressed representation directly. In this work, techniques that can be applied directly and efficiently to digital information encoded by a given compression algorithm are investigated. Lossless compression schemes and information processing algorithms for binary document images and text data are two closely related areas, bridged together by the fast processing of coded data. The compressed domains addressed in this work, the ITU fax standards and the JBIG standard, are two major schemes used for document compression. Based on ITU Group IV, a modified coding scheme, MG4, which exploits the two-dimensional correlation between scan lines, is developed. The MG4 coding principle and its feature-preserving behavior in the compressed domain are examined from the viewpoints of compression efficiency and processing flexibility of image operations. Two popular coding schemes in the area of bi-level image compression, run-length and Group IV, are studied and compared with MG4 in three aspects: compression complexity, compression ratio, and feasibility of compressed-domain algorithms. In particular, for the operations of connected component extraction, skew detection, and rotation, MG4 shows a significant speed advantage over conventional algorithms. Some useful techniques for processing JBIG-encoded images directly in the compressed domain, or concurrently while they are being decoded, are proposed and generalized. In the second part of this work, the possibility of facilitating image processing in the wavelet transform domain is investigated. Textured images can be distinguished from each other by examining their wavelet transforms. The basic idea is that highly textured regions can be segmented using feature vectors extracted from high-frequency bands, based on the observation that textured images have large energies in both high and middle frequencies, while images in which the grey level varies smoothly are heavily dominated by the low-frequency channels of the wavelet transform domain. As a result, a new method is developed and implemented to detect textures and abnormalities in document images using polynomial wavelets. Segmentation experiments indicate that this approach is superior to other traditional methods in terms of memory space and processing time.
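
    The subband-energy idea behind the texture segmentation can be sketched with PyWavelets. Note that the sketch uses standard Daubechies wavelets rather than the polynomial wavelets of this work, and the decomposition level and energy threshold are assumptions.

```python
# Sketch of subband-energy texture detection for an image block (assumed parameters).
import numpy as np
import pywt

def is_textured(block, wavelet="db2", level=2, energy_ratio_threshold=0.35):
    coeffs = pywt.wavedec2(block.astype(float), wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]

    low_energy = np.sum(approx ** 2)
    detail_energy = sum(np.sum(band ** 2)
                        for triple in details for band in triple)

    total = low_energy + detail_energy + 1e-9
    # Textured regions concentrate energy in the high/middle frequency bands;
    # smoothly varying regions are dominated by the low-frequency approximation.
    return detail_energy / total > energy_ratio_threshold
```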

    Word matching using single closed contours for indexing handwritten historical documents

    Effective indexing is crucial for providing convenient access to scanned versions of large collections of historically valuable handwritten manuscripts. Since traditional handwriting recognizers based on optical character recognition (OCR) do not perform well on historical documents, a holistic word recognition approach has recently gained popularity as an attractive and more straightforward solution (Lavrenko et al. in Proc. Document Image Analysis for Libraries (DIAL'04), pp. 278–287, 2004). Such techniques attempt to recognize words based on scalar and profile-based features extracted from whole word images. In this paper, we propose a new approach to holistic word recognition for historical handwritten manuscripts based on matching word contours instead of whole images or word profiles. The new method consists of robust extraction of closed word contours and the application of an elastic contour matching technique proposed originally for general shapes (Adamek and O'Connor in IEEE Trans Circuits Syst Video Technol 5:2004). We demonstrate that multiscale contour-based descriptors can effectively capture intrinsic word features while avoiding any segmentation of words into smaller subunits. Our experiments show a recognition accuracy of 83%, which considerably exceeds the performance of other systems reported in the literature.
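
    A minimal sketch of contour-based word matching is given below. It extracts a single closed outer contour per word image and compares contours with Hu-moment matching (cv2.matchShapes) as a simple stand-in for the multiscale elastic contour matching used in the paper; the binarization and closing kernel are assumptions.

```python
# Illustrative closed-contour extraction and matching for word images.
import cv2
import numpy as np

def word_contour(word_gray):
    # Binarize and close small gaps so the word forms one closed outline
    _, binary = cv2.threshold(word_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    return max(contours, key=cv2.contourArea)   # single closed word contour

def contour_distance(word_a, word_b):
    # Smaller values indicate more similar word shapes
    ca, cb = word_contour(word_a), word_contour(word_b)
    return cv2.matchShapes(ca, cb, cv2.CONTOURS_MATCH_I1, 0.0)
```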

    Information Preserving Processing of Noisy Handwritten Document Images

    Many pre-processing techniques that normalize artifacts and clean noise induce anomalies due to discretization of the document image, and important information that could be used at later stages may be lost. A proposed composite-model framework takes into account pre-printed information, user-added data, and digitization characteristics. Its benefits are demonstrated by experiments with statistically significant results. Separating pre-printed ruling lines from user-added handwriting shows how ruling lines affect people's handwriting and how they can be exploited for identifying writers. Ruling line detection based on multi-line linear regression reduces the mean error of counting them from 0.10 to 0.03, 6.70 to 0.06, and 0.13 to 0.02, compared to an HMM-based approach on three standard test datasets, thereby reducing human correction time by 50%, 83%, and 72% on average. On 61 page images from 16 rule-form templates, the precision and recall of form cell recognition are increased by 2.7% and 3.7%, compared to a cross-matrix approach. Compensating for and exploiting ruling lines during feature extraction rather than pre-processing raises the writer identification accuracy from 61.2% to 67.7% on a 61-writer noisy Arabic dataset. Similarly, counteracting page-wise skew by subtracting it or transforming contours in a continuous coordinate system during feature extraction improves the writer identification accuracy. An implementation study of contour-hinge features reveals that utilizing the full probability distribution function matrix improves the writer identification accuracy from 74.9% to 79.5%.
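
    The regression-based ruling-line idea can be sketched as follows. The run-length grouping heuristic and thresholds below are assumptions and only approximate the multi-line linear regression model described here.

```python
# Simplified sketch of regression-based ruling-line detection on a binary page.
import numpy as np

def detect_ruling_lines(binary, min_run=100, row_gap=5):
    h, w = binary.shape
    ys, xs = np.nonzero(binary)

    # Keep only rows with long horizontal ink runs (likely ruling lines)
    row_counts = np.bincount(ys, minlength=h)
    candidate_rows = np.flatnonzero(row_counts > min_run)

    # Group neighbouring candidate rows into one ruling line each
    lines, group = [], []
    for r in candidate_rows:
        if group and r - group[-1] > row_gap:
            lines.append(group)
            group = []
        group.append(r)
    if group:
        lines.append(group)

    # Fit y = a*x + b to the pixels of each group with least squares
    fits = []
    for rows in lines:
        mask = np.isin(ys, rows)
        a, b = np.polyfit(xs[mask], ys[mask], deg=1)
        fits.append((a, b))
    return fits
```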

    Automated framework for robust content-based verification of print-scan degraded text documents

    Fraudulent documents frequently cause severe financial damage and impose security breaches on civil and government organizations. The rapid advances in technology and the widespread availability of personal computers have not reduced the use of printed documents. While digital documents can be verified by many robust and secure methods such as digital signatures and digital watermarks, verification of printed documents still relies on manual inspection of embedded physical security mechanisms.

    The objective of this thesis is to propose an efficient automated framework for robust content-based verification of printed documents. The principal issue is to achieve robustness with respect to the degradations and increased levels of noise that occur from multiple cycles of printing and scanning. It is shown that classic OCR systems fail under such conditions; moreover, OCR systems typically rely heavily on the use of high-level linguistic structures to improve recognition rates. However, inferring knowledge about the contents of the document image from a-priori statistics is contrary to the nature of document verification. Instead, a system is proposed that utilizes specific knowledge of the document to perform highly accurate content verification based on a Print-Scan degradation model and character shape recognition. Such specific knowledge of the document is a reasonable choice for the verification domain, since the document contents are already known in order to verify them.

    The system analyses digital multi-font PDF documents to generate a descriptive summary of the document, referred to as the "Document Description Map" (DDM). The DDM is later used for verifying the content of printed and scanned copies of the original documents. The system utilizes 2-D Discrete Cosine Transform-based features and an adaptive hierarchical classifier trained with synthetic data generated by a Print-Scan degradation model. The system is tested with varying degrees of Print-Scan channel corruption on a variety of documents, with corruption produced by repetitive printing and scanning of the test documents. Results show the approach achieves excellent accuracy and robustness despite the high level of noise.
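
    A sketch of 2-D DCT feature extraction for character images is given below; the patch size and the number of retained low-frequency coefficients are assumptions, and the classifier itself is not shown.

```python
# Sketch of 2-D DCT feature extraction for a character image (assumed parameters).
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # Separable 2-D type-II DCT with orthonormal scaling
    return dct(dct(block, type=2, norm="ortho", axis=0),
               type=2, norm="ortho", axis=1)

def char_features(char_img, size=32, n_coeffs=8):
    # Place the character in a fixed-size patch, then keep the low-frequency
    # n_coeffs x n_coeffs corner of its DCT as the feature vector
    patch = np.zeros((size, size), dtype=float)
    h, w = min(char_img.shape[0], size), min(char_img.shape[1], size)
    patch[:h, :w] = char_img[:h, :w]
    coeffs = dct2(patch)
    return coeffs[:n_coeffs, :n_coeffs].ravel()
```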