8 research outputs found

    Automatic Ground-truth Generation for Document Image Analysis and Understanding

    On-the-fly Historical Handwritten Text Annotation

    The performance of information retrieval algorithms depends upon the availability of ground truth labels annotated by experts. This is an important prerequisite, and difficulties arise when the annotated ground truth labels are incorrect or incomplete due to high levels of degradation. To address this problem, this paper presents a simple method for on-the-fly annotation of degraded historical handwritten text in ancient manuscripts. The proposed method aims at quick generation of ground truth and correction of inaccurate annotations, such that the bounding box perfectly encapsulates the word and contains no added noise from the background or surroundings. This method will potentially help historians and researchers generate and correct word labels in a document dynamically. The effectiveness of the annotation method is empirically evaluated on an archival manuscript collection from well-known publicly available datasets.
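The abstract does not say how a bounding box is corrected, so the following is only a minimal sketch of one plausible reading: shrinking a loose word box to the extent of the ink pixels inside it. The function name `tighten_box`, the half-open box convention, and the assumption that the page is already binarized (1 = ink, 0 = background) are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max); the max coordinates are exclusive.
Box = Tuple[int, int, int, int]


def tighten_box(binary: List[List[int]], box: Box) -> Optional[Box]:
    """Shrink `box` to the smallest box containing all ink pixels inside it.

    `binary` is a row-major image where non-zero means ink (an assumption).
    Returns None when the box contains no ink at all.
    """
    x0, y0, x1, y1 = box
    xs: List[int] = []
    ys: List[int] = []
    # Clip the search area to the image so a sloppy box cannot index out of range.
    for y in range(max(y0, 0), min(y1, len(binary))):
        row = binary[y]
        for x in range(max(x0, 0), min(x1, len(row))):
            if row[x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Toy 6x5 page with a small blob of ink in the middle.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
loose = (0, 0, 6, 5)
tight = tighten_box(page, loose)
```

On degraded manuscripts, isolated noise pixels would also count as ink under this sketch, so a real pipeline would need some denoising or connected-component filtering first.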

    Automated Ground Truth Data Generation for Newspaper Document Images

    In document image understanding, public ground-truthed datasets are an important part of scientific work. They are not only helpful for developing new methods, but also provide a common reference point that allows the performance of methods to be compared without reimplementing them. Several datasets exist for document image understanding, each with its own pros and cons. Generating these datasets is time-consuming and costly, so each existing and new dataset is valuable. In this paper we propose a way to generate a ground-truthed dataset for newspapers. The ground truth in focus is layout analysis ground truth. The proposed two-step approach consists of a layout-generating module and an image-matching module that maps the ground truth information from the synthetic data to the scanned version. In the first step, the “MyNews” system generates newspaper layouts from a news corpus. The output consists of a digital newspaper (PDF file) and an XML file containing geometric and logical layout information. In the second step, the PDF files are printed and scanned. Then the scanned document image is aligned with the synthetic image obtained by rendering the PDF. Finally, the geometric and logical layout ground truth is mapped onto the scanned image.
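The final mapping step can be illustrated with a small sketch. It assumes the alignment between the rendered PDF and the scan has already been estimated and can be expressed as a 2x3 affine matrix; the abstract does not state which transformation model the authors use, so the affine model, the function names, and the example numbers are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
# 2x3 affine matrix ((a, b, tx), (c, d, ty)) mapping synthetic -> scanned coords.
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def apply_affine(m: Affine, p: Point) -> Point:
    """Map one point from the synthetic image into the scanned image."""
    x, y = p
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def map_box(m: Affine, box: Box) -> Box:
    """Transform all four corners and return their axis-aligned bounding box.

    Using all corners keeps the result valid when the scan is slightly rotated.
    """
    x0, y0, x1, y1 = box
    corners: List[Point] = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    mapped = [apply_affine(m, c) for c in corners]
    xs = [p[0] for p in mapped]
    ys = [p[1] for p in mapped]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical alignment: the scan is twice the resolution and shifted by (10, 5).
alignment: Affine = ((2.0, 0.0, 10.0), (0.0, 2.0, 5.0))
headline_box: Box = (0.0, 0.0, 100.0, 50.0)  # a layout region from the XML file
scanned_box = map_box(alignment, headline_box)
```

Print-and-scan distortions are often not purely affine, so this should be read as the simplest case of the mapping rather than a description of the image matching module itself.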

    Automatic Ground-truth Generation for Document Image Analysis and Understanding

    Performance evaluation for document image analysis and understanding is a recurring problem. Many ground-truthed document image databases are now used to evaluate algorithms, but these databases are less useful for the design of a complete system in a precise context. This paper proposes an approach for the automatic generation of synthesised document images and associated ground-truth information based on a derivation of publishing tools. An implementation of this approach illustrates the richness of the produced information.

    Évaluation de la qualité des documents anciens numérisés (Quality Evaluation of Digitized Historical Documents)

    This PhD thesis makes several contributions to the quality evaluation of digitized document images. We propose new descriptors (features) that quantify the degradations most commonly encountered in digitized document images. We also propose a methodology that relies on these descriptors to predict the performance of document image processing and analysis algorithms. The descriptors are defined by analyzing the influence of specific degradations on the performance of different algorithms, and are then used to build prediction models with statistical regressors. The relevance of the proposed descriptors and of the prediction methodology is validated in three experiments: first, by predicting the performance of eleven binarization algorithms; second, by creating an automatic procedure that selects the best-performing binarization algorithm for each image; and third, by predicting the performance of two commonly used OCR engines as a function of the severity of bleed-through (ink from the recto diffusing onto the verso of a document). This work on predicting algorithm performance is also an opportunity to discuss the scientific problems of creating ground truth and evaluating performance.
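The per-image selection procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: each algorithm is given a hypothetical linear model (weights plus intercept) over two made-up degradation descriptors, and the algorithm with the highest predicted score is chosen. The names `algo_a` and `algo_b`, the descriptor values, and the coefficients are invented for the example.

```python
from typing import Dict, List, Tuple

# One linear model per algorithm: (weights over descriptors, intercept).
LinearModel = Tuple[List[float], float]


def predict_score(model: LinearModel, descriptors: List[float]) -> float:
    """Predicted performance of one algorithm for one image (higher is better)."""
    weights, intercept = model
    return intercept + sum(w * d for w, d in zip(weights, descriptors))


def select_best(models: Dict[str, LinearModel], descriptors: List[float]) -> str:
    """Return the name of the algorithm with the highest predicted score."""
    return max(models, key=lambda name: predict_score(models[name], descriptors))


# Hypothetical models: algo_a is hurt mostly by the first degradation,
# algo_b mostly by the second.
models: Dict[str, LinearModel] = {
    "algo_a": ([-0.5, -0.1], 0.90),
    "algo_b": ([-0.1, -0.4], 0.85),
}

# Descriptor vectors for two images (values are made up).
heavy_first_degradation = [0.8, 0.1]
heavy_second_degradation = [0.1, 0.9]

choice_1 = select_best(models, heavy_first_degradation)
choice_2 = select_best(models, heavy_second_degradation)
```

In practice the model coefficients would be fitted on images with known ground truth, which is why the abstract ties performance prediction to the problem of creating ground truth.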