1,350 research outputs found

    Recognizing Handwriting Styles in a Historical Scanned Document Using Unsupervised Fuzzy Clustering

    Full text link
    The forensic attribution of the handwriting in a digitized document to multiple scribes is a challenging problem of high dimensionality. Unique handwriting styles may be dissimilar in a blend of several factors including character size, stroke width, loops, ductus, slant angles, and cursive ligatures. Previous work on labeled data with Hidden Markov models, support vector machines, and semi-supervised recurrent neural networks have provided moderate to high success. In this study, we successfully detect hand shifts in a historical manuscript through fuzzy soft clustering in combination with linear principal component analysis. This advance demonstrates the successful deployment of unsupervised methods for writer attribution of historical documents and forensic document analysis.Comment: 26 pages in total, 5 figures and 2 table

    User-driven Page Layout Analysis of historical printed Books

    Get PDF
    International audienceIn this paper, based on the study of the specificity of historical printed books, we first explain the main error sources in classical methods used for page layout analysis. We show that each method (bottom-up and top-down) provides different types of useful information that should not be ignored, if we want to obtain both a generic method and good segmentation results. Next, we propose to use a hybrid segmentation algorithm that builds two maps: a shape map that focuses on connected components and a background map, which provides information about white areas corresponding to block separations in the page. Using this first segmentation, a classification of the extracted blocks can be achieved according to scenarios produced by the user. These scenarios are defined very simply during an interactive stage. The user is able to make processing sequences adapted to the different kinds of images he is likely to meet and according to the user needs. The proposed “user-driven approach” is capable of doing segmentation and labelling of the required user high level concepts efficiently and has achieved above 93% accurate results over different data sets tested. User feedbacks and experimental results demonstrate the effectiveness and usability of our framework mainly because the extraction rules can be defined without difficulty and parameters are not sensitive to page layout variation

    Adaptive Methods for Robust Document Image Understanding

    Get PDF
    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document types available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation and logical layout analysis. We review the state of the art in each area, identify current defficiencies, point out promising directions and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions putting special focus on generality, computational efficiency and the exploitation of all available sources of information. More specifically, we introduce the following original methods: a fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot metal typesetted prints, a theoretically optimal solution for the document binarization problem from both computational complexity- and threshold selection point of view, a layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy

    A word image coding technique and its applications in information retrieval from imaged documents

    Get PDF
    Master'sMASTER OF SCIENC
    • 

    corecore