
    Análise de layout de página em jornais históricos germano-brasileiros

    Advisor: Daniel Weingaertner. Master's dissertation - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 30/08/2019. Includes references: p. 71-75. Area of concentration: Computer Science.
    Abstract: Mass digitization projects have emerged around the world. In Brazil, one example is the Dokumente.br initiative, which aims at providing access to Brazilian collections in the German language. Part of its collection consists of historical newspapers printed in the Gothic typeface Fraktur, whose characters need to be recognized optically. Good performance in this task depends on the success of the previous step of the OCR workflow, page layout analysis. The available open-source OCR tools are unable to achieve good layout analysis results on this type of material. To address this gap, two approaches to the layout analysis of the newspapers from the Dokumente.br initiative are proposed in this work: the first, which we call GBN-MHS, is an implementation of the "MHS 2017 System" algorithm proposed by Tran et al. (2017); the second is based on deep learning and we call it GBN-DL. To evaluate the performance of the proposed methods, we created the German-Brazilian Newspaper Dataset (GBN 1.0) and prepared its ground truth for both layout analysis and OCR. We compared the results obtained by the layout analyzer of the Tesseract software on the proposed dataset with the results obtained by the GBN-MHS and GBN-DL methods. We created two evaluation scenarios: one consisting of newspapers that were represented in the training set (Scenario 1) and another consisting of pages from newspapers that were not represented in the training set (Scenario 2). GBN-MHS and GBN-DL achieved better results than Tesseract in both scenarios. In Scenario 1, GBN-DL achieved 92.81% accuracy, GBN-MHS achieved 88.12%, and Tesseract only 71.83%. In Scenario 2, GBN-DL reached 96.96%, GBN-MHS 95.16%, and Tesseract 88.15% accuracy. The good results achieved by the proposed methods demonstrate the potential of our approaches, and the experiments also show that the available open-source OCR tools are not fully prepared to work with historical documents. Keywords: digitization of newspapers; OCR systems; page layout analysis; page segmentation of newspapers; OCR; Fraktur OCR; Tesseract; OCRopy.
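    The abstract reports per-scenario accuracy figures but does not spell out how layout-analysis accuracy is computed. As a point of reference only, a minimal pixel-level scoring sketch (a common choice for this task, not necessarily the metric used in the thesis) could look as follows; the class-map encoding is an assumption:

```python
# Minimal sketch of pixel-level layout-analysis accuracy, assuming region labels
# are rasterized into per-pixel class maps (e.g. 0=background, 1=text, 2=image,
# 3=separator). This is illustrative, not the thesis' evaluation protocol.
import numpy as np

def layout_accuracy(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Fraction of non-background ground-truth pixels assigned the correct class."""
    mask = ground_truth != 0                      # score only annotated foreground
    if mask.sum() == 0:
        return 1.0
    return float((predicted[mask] == ground_truth[mask]).mean())

# Toy example: two 4x4 label maps that disagree on a single foreground pixel.
gt = np.array([[0, 1, 1, 0],
               [0, 1, 1, 0],
               [2, 2, 0, 0],
               [2, 2, 0, 0]])
pred = gt.copy()
pred[2, 0] = 1                                    # one image pixel mislabeled as text
print(f"accuracy = {layout_accuracy(pred, gt):.4f}")   # 0.8750
```

    In the toy example, mislabeling one of eight annotated pixels gives 87.5% accuracy; region-level metrics weighted by area are an equally common alternative.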

    Retargeting of Heterogeneous Document to Improve Reading Experience

    People rely more and more on mobile devices to read various kinds of digital content. The variation in parameters such as the screen resolution and aspect ratio of these devices presents new challenges for digital content processing, and image retargeting has become a popular research topic in the past decade. Various content-aware methods have been proposed to reduce the distortion of important objects when an image is scaled. However, heterogeneous documents stored as bitmaps, which contain elements of different kinds, generally have a high resolution and can only be partially displayed on small-screen devices if readability is to be preserved. Users therefore have to switch frequently between zooming and panning to read the whole document, which greatly harms reading efficiency. We propose a retargeting method for heterogeneous documents: we first analyze the document layout locally, then extract the rectangular region the user intends to read and resize it to fit the screen. Our method avoids the tedious zoom and pan operations and thus greatly improves the reading experience. Funding: National Natural Science Foundation of China; National Key Technology R&D Program; Fujian Provincial Economic and Information Technology Commission 2013 special fund for technological innovation.
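    The crop-and-rescale step at the core of this retargeting idea can be sketched as below; the rectangle is assumed to come from the local layout analysis described in the abstract, and the helper is a hypothetical illustration rather than the authors' implementation:

```python
# Minimal sketch: crop one reading region from a high-resolution page scan and
# scale it to fit a small screen while preserving its aspect ratio.
import cv2

def retarget_region(page, rect, screen_w, screen_h):
    """Crop the region given by rect=(x, y, w, h) and fit it to the device screen."""
    x, y, w, h = rect
    region = page[y:y + h, x:x + w]
    # Fit the region inside the screen without distorting it.
    scale = min(screen_w / w, screen_h / h)
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return cv2.resize(region, new_size, interpolation=cv2.INTER_AREA)

if __name__ == "__main__":
    page = cv2.imread("page.png")                 # hypothetical document scan
    column = (120, 300, 800, 1600)                # rectangle found by layout analysis
    view = retarget_region(page, column, 480, 800)
    cv2.imwrite("view.png", view)
```

    INTER_AREA is used because the extracted region is typically downscaled from a high-resolution scan; upscaling cases could instead use a smoother interpolation.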

    Analyse d’images de documents patrimoniaux : une approche structurelle à base de texture

    Over the last few years, there has been tremendous growth in digitizing collections of cultural heritage documents. This has raised many challenges and open issues, such as information retrieval in digital libraries or analyzing the page content of historical books. Recently, an important need has emerged for a computer-aided characterization and categorization tool able to index or group digitized historical book pages according to several criteria, mainly the layout structure and/or the typographic/graphical characteristics of the document image content. The work conducted in this thesis therefore presents an automatic approach for the characterization and categorization of historical book pages. The proposed approach is applicable to a large variety of ancient books and does not assume a priori knowledge about document image layout or content. It is based on texture and graph algorithms to provide a rich and holistic description of the layout and content of the analyzed book pages. The categorization relies on characterizing the digitized page content by texture, shape, geometric and topological descriptors, and this characterization is represented by a structural signature. More precisely, the signature-based characterization approach consists of two main stages: the first extracts homogeneous regions, and the second builds a graph-based page signature from the extracted homogeneous regions, reflecting the page layout and content. By comparing the resulting graph-based signatures using a graph-matching paradigm (graph edit distance), similarities between digitized historical book pages in terms of layout and/or content can be deduced. Pages with similar layout and/or content can then be categorized and grouped, and a table of contents/summary of the analyzed digitized historical book can be generated automatically. As a consequence, numerous signature-based applications (e.g. information retrieval in digital libraries according to several criteria, page categorization) can be implemented for effectively managing a corpus or collection of books. To illustrate the effectiveness of the proposed page signature, a detailed experimental evaluation was conducted for two possible categorization applications: unsupervised page classification and page stream segmentation. In addition, the different steps of the proposed approach were evaluated on a large variety of historical document images.
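    A minimal sketch of the signature-comparison idea follows, assuming nodes are homogeneous regions carrying a descriptor vector and edges encode spatial adjacency; the descriptors, the matching threshold and the use of networkx's generic graph edit distance are illustrative assumptions, not the thesis' actual signature or matching algorithm:

```python
# Minimal sketch: represent a page as a region-adjacency graph and compare two
# page signatures with graph edit distance (nodes matched by descriptor similarity).
import networkx as nx
import numpy as np

def page_signature(regions, adjacency):
    """regions: {region_id: descriptor vector}; adjacency: iterable of (id, id) pairs."""
    g = nx.Graph()
    for rid, desc in regions.items():
        g.add_node(rid, desc=np.asarray(desc, dtype=float))
    g.add_edges_from(adjacency)
    return g

def signature_distance(g1, g2, tol=0.5):
    """Graph edit distance, treating two regions as matching when descriptors are close."""
    node_match = lambda a, b: np.linalg.norm(a["desc"] - b["desc"]) < tol
    return nx.graph_edit_distance(g1, g2, node_match=node_match)

# Toy usage: two pages with two regions each and one adjacency relation.
page_a = page_signature({0: [0.1, 0.9], 1: [0.8, 0.2]}, [(0, 1)])
page_b = page_signature({0: [0.1, 0.8], 1: [0.9, 0.2]}, [(0, 1)])
print(signature_distance(page_a, page_b))         # 0.0 -> similar layout/content class
```

    Pages whose pairwise signature distances fall below a threshold could then be grouped, which is the kind of clustering the unsupervised page classification experiments evaluate.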

    Contributions au tri automatique de documents et de courrier d'entreprises

    This thesis deals with the development of industrial vision systems for the automatic sorting of business documents and mail. These systems require very fast processing and high accuracy and precision of results. Current systems are usually built from sequential modules that need fast and efficient algorithms throughout the processing line, from the low-level to the high-level stages of analysis and content recognition. The existing architectures, whose specificities we survey in the first three chapters of the thesis, have weaknesses that manifest as reading errors and OCR rejections, yet the modules responsible for these rejections and errors are mostly the first to occur in the process: image segmentation and the location of regions of interest. These two interdependent steps are fundamental to system performance and to the efficiency of automatic sorting lines. In this thesis, we chose to focus on the segmentation of mail images and the location of their relevant zones (such as the address block) by investigating a new pyramidal modeling approach based on hierarchical graph coloring; to date, graph coloring has never been exploited in such a context. It is involved in our contribution at every stage of document layout analysis as well as in the decision making for recognition (recognition of the type of document to be processed and recognition of the address block). The recognition stage relies on a training process using a single graph b-coloring model. Our architecture was designed to carry out the layout analysis and recognition stages while guaranteeing a real cooperation between the different analysis and decision modules. It is composed of three main parts: low-level segmentation (binarization and connected component labeling), physical layout extraction by hierarchical graph coloring, and address block location and document classification. The algorithms involved in the system were designed for execution speed (in line with real-time constraints), robustness, and compatibility. The experiments carried out in this context are very encouraging and also open new perspectives toward a wider diversity of document images.
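    The first of the three parts, low-level segmentation, is standard enough to sketch; Otsu thresholding stands in for whatever binarizer the thesis actually uses, and the hierarchical graph coloring built on top of these components is not reproduced here:

```python
# Minimal sketch of the low-level segmentation stage: binarization followed by
# connected-component labeling, with tiny components discarded as speckle noise.
import cv2

def low_level_segmentation(gray, min_area=20):
    """Return the binary image and the bounding boxes of its connected components."""
    # Otsu picks the threshold automatically; THRESH_BINARY_INV makes ink the foreground.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = [tuple(stats[i, :4])                                 # (x, y, w, h)
             for i in range(1, n)                                # skip background label 0
             if stats[i, cv2.CC_STAT_AREA] >= min_area]          # drop speckle noise
    return binary, boxes

if __name__ == "__main__":
    gray = cv2.imread("mail_piece.png", cv2.IMREAD_GRAYSCALE)    # hypothetical mail scan
    binary, boxes = low_level_segmentation(gray)
    print(f"{len(boxes)} connected components kept")
```

    In the architecture described above, such components would then feed the hierarchical graph-coloring stage that extracts the physical layout and locates the address block.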

    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding are a pressing necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, putting special focus on generality, computational efficiency and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both the computational-complexity and threshold-selection points of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
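    Among the listed stages, skew and orientation detection is easy to illustrate; the sketch below uses classical projection-profile maximization and is only a stand-in, since the layout-independent method contributed by the thesis is not detailed in the abstract:

```python
# Minimal sketch of projection-profile skew estimation: rotate the binary page over a
# range of candidate angles and keep the angle whose row ink profile is most peaked.
import numpy as np
from scipy import ndimage

def estimate_skew(binary, max_angle=5.0, step=0.1):
    """Return the rotation angle (degrees) that maximizes the variance of row profiles."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)             # ink per row; peaks align when deskewed
        score = np.var(profile)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Usage: deskewed = ndimage.rotate(binary, -estimate_skew(binary), reshape=False, order=0)
```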