
    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding are a pressing necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we turn our attention to each of the following processing stages: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, with special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution for the document binarization problem from both a computational-complexity and a threshold-selection point of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
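
    The binarization stage mentioned above reduces a grayscale page to foreground ink and background. As a point of reference only (the thesis proposes its own optimal-threshold formulation, which this is not), the sketch below shows classic Otsu threshold selection; it assumes NumPy and an 8-bit grayscale input, and the function names are illustrative:

        import numpy as np

        def otsu_threshold(gray):
            """Classic Otsu threshold selection on an 8-bit grayscale image.

            Evaluates all 256 candidate thresholds after a single histogram pass
            and returns the one maximizing the between-class variance.
            """
            hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
            prob = hist / hist.sum()
            omega = np.cumsum(prob)                    # class-0 probability up to t
            mu = np.cumsum(prob * np.arange(256))      # cumulative mean
            mu_t = mu[-1]
            # Between-class variance for every threshold t (guard against /0 at the extremes).
            with np.errstate(divide="ignore", invalid="ignore"):
                sigma_b2 = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
            sigma_b2 = np.nan_to_num(sigma_b2)
            return int(np.argmax(sigma_b2))

        def binarize(gray):
            t = otsu_threshold(gray)
            return (gray > t).astype(np.uint8)  # 1 = lighter class (paper), 0 = darker class (ink)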

    Neural Networks for Document Image and Text Processing

    Nowadays, major libraries and document archives are investing considerable effort in digitizing their collections. Most of them, however, are scanning the documents and publishing the resulting images without their corresponding transcriptions, which seriously limits the possibilities of exploiting these documents. When a transcription is necessary, it is performed manually by human experts, a very expensive and error-prone task. Obtaining transcriptions of the required quality demands the intervention of human experts to review and correct the output of the recognition engines. To this end, it is extremely useful to provide interactive tools to obtain and edit the transcription. Although text recognition is the final goal, several preliminary steps (known as preprocessing) are necessary in order to get an accurate transcription from a digitized image. Document cleaning, enhancement, and binarization (where needed) are the first stages of the recognition pipeline. Historical handwritten documents, in addition, show degradations, stains, ink bleed-through, and other artifacts. Therefore, more sophisticated and elaborate methods are required when dealing with these kinds of documents, and in some cases even expert supervision is needed. Once the images have been cleaned, the main zones of the image have to be detected: those that contain text, as well as other parts such as images, decorations, and versal letters. Moreover, the relations among them and the final text have to be detected. These preprocessing steps are critical for the final performance of the system, since an error at this point will be propagated throughout the rest of the transcription process. The ultimate goal of the Document Image Analysis pipeline is to obtain the transcription of the text (Optical Character Recognition and Handwritten Text Recognition). In this thesis we aimed to improve the main stages of the recognition pipeline, from the scanned documents as input to the final transcription. We focused our effort on applying Neural Networks and deep learning techniques directly on the document images to extract suitable features for the different tasks addressed in this work: image cleaning and enhancement (document image binarization), layout extraction, text line extraction, text line normalization, and finally decoding (text line recognition). As one can see, the work focuses on incremental improvements across the several Document Image Analysis stages, but it also tackles some of the real challenges: historical manuscripts and documents without clear layouts, or very degraded documents. Neural Networks are a central topic of the whole work collected in this document. Different convolutional models have been applied to document image cleaning and enhancement. Connectionist models have also been used for text line extraction: first, for detecting interest points, combining them into text segments, and finally extracting the lines by means of aggregation techniques; and second, for pixel labeling to extract the main body area of the text and then the limits of the lines. For text line preprocessing, i.e., to normalize the text lines before recognizing them, similar models have been used to detect the main body area and then to height-normalize the images, giving more importance to the central area of the text. Finally, Convolutional Neural Networks and deep multilayer perceptrons have been combined with hidden Markov models to improve our transcription engine significantly.
    The suitability of all these approaches has been tested on different corpora for each of the stages addressed, giving competitive results for most of the methodologies presented.
    Pastor Pellicer, J. (2017). Neural Networks for Document Image and Text Processing [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/90443
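
    The thesis applies convolutional and connectionist models directly to document images for cleaning, binarization, and pixel labeling. As a rough illustration only, and not the architecture used in the thesis, the sketch below shows a minimal per-pixel labeling network in PyTorch; the layer sizes, patch size, and dummy training data are assumptions:

        import torch
        import torch.nn as nn

        class PixelLabeler(nn.Module):
            """Tiny fully convolutional net mapping a grayscale document image
            to a per-pixel foreground (ink) probability map."""
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
                )
            def forward(self, x):
                return self.net(x)

        model = PixelLabeler()
        x = torch.rand(4, 1, 64, 64)                   # batch of grayscale patches in [0, 1]
        y = (torch.rand(4, 1, 64, 64) > 0.9).float()   # dummy ground-truth ink mask
        loss = nn.BCELoss()(model(x), y)               # per-pixel binary cross-entropy
        loss.backward()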

    Collection of abstracts of the 24th European Workshop on Computational Geometry

    The 24th European Workshop on Computational Geometry (EuroCG'08) was held at INRIA Nancy - Grand Est & LORIA on March 18-20, 2008. The present collection of abstracts contains the 63 scientific contributions as well as the three invited talks presented at the workshop.

    Large bichromatic point sets admit empty monochromatic 4-gons

    We consider a variation of a problem stated by Erdős and Szekeres in 1935 about the existence of a number f_ES(k) such that any set S of at least f_ES(k) points in general position in the plane has a subset of k points that are the vertices of a convex k-gon. In our setting the points of S are colored, and we say that a (not necessarily convex) spanned polygon is monochromatic if all its vertices have the same color. Moreover, a polygon is called empty if it does not contain any points of S in its interior. We show that any bichromatic set of n ≥ 5044 points in general position in the plane determines at least one empty, monochromatic quadrilateral (and thus linearly many).
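
    For reference, the main claim of the abstract can be written out formally; the LaTeX below merely transcribes the statement above:

        \begin{theorem}
        Let $S \subset \mathbb{R}^2$ be a bichromatic set of $n \ge 5044$ points in
        general position. Then $S$ determines an empty monochromatic quadrilateral:
        four points of the same color spanning a (not necessarily convex) simple
        quadrilateral whose interior contains no point of $S$; in fact, $S$
        determines $\Omega(n)$ such quadrilaterals.
        \end{theorem}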

    Contribution to structural parameters computation: volume models and methods

    Bio-CAD and in-silico experimentation are attracting growing interest in biomedical applications, where scientific data coming from real samples are used to compute structural parameters that allow physical properties to be evaluated. Non-invasive imaging acquisition technologies such as CT, µCT or MRI, together with the constant growth of computing capabilities, allow the acquisition, processing and visualization of scientific data of increasing complexity. Structural parameter computation is based on the existence of two phases (or spaces) in the sample: the solid phase, which may correspond to the bone or material, and the empty or porous phase; such samples are therefore represented as binary volumes. The most common representation model for these datasets is the voxel model, which is the natural extension to 3D of 2D bitmaps. In this thesis, the Extreme Vertices Model (EVM) and a newly proposed model, the Compact Union of Disjoint Boxes (CUDB), are used to represent binary volumes in a much more compact way. EVM stores only a sorted subset of the vertices of the object's boundary, whereas CUDB keeps a compact list of boxes. In this thesis, methods are proposed to compute the following structural parameters: pore-size distribution, connectivity, orientation, sphericity and roundness. The pore-size distribution helps to interpret the characteristics of porous samples by allowing users to observe the most common pore diameter ranges as peaks in a graph. Connectivity is a topological property related to the genus of the solid space; it measures the level of interconnectivity among elements and is an indicator of the biomechanical characteristics of bone or other materials. The orientation of a shape can be defined by rotation angles around a set of orthogonal axes. Sphericity is a measure of how spherical a particle is, whereas roundness is a measure of the sharpness of a particle's edges and corners. The study of these parameters requires dealing with real samples scanned at high resolution, which usually produces huge datasets that require a lot of memory and long processing times to analyze. For this reason, a new method to simplify binary volumes in a progressive and lossless way is presented. This method generates a level-of-detail sequence of objects, where each object is a bounding volume of the previous objects. Besides being used as support in the structural parameter computation, this method can be useful for tasks such as progressive transmission, collision detection and volume-of-interest computation. As part of multidisciplinary research, two practical applications have been developed to compute structural parameters of real samples: a software tool for the automatic detection of characteristic viscosity points in samples of basalt rocks and glasses, and another to compute the sphericity and roundness of complex forms in a silica dataset.
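
    As an illustration of one of the listed parameters, the sketch below computes a crude sphericity value for a binary voxel volume using Wadell's definition; the choice of that definition, the face-counting surface estimate, and NumPy are assumptions here, whereas the thesis works with its EVM and CUDB representations rather than raw voxel grids:

        import numpy as np

        def sphericity(vox, voxel_size=1.0):
            """Crude Wadell sphericity of a binary voxel volume.

            psi = pi^(1/3) * (6V)^(2/3) / A, where V is the object volume and A its
            surface area. Volume is counted from set voxels; area is estimated by
            counting exposed voxel faces, which overestimates smooth surfaces.
            """
            vox = vox.astype(bool)
            volume = vox.sum() * voxel_size ** 3
            padded = np.pad(vox, 1)            # zero border so boundary faces count
            exposed = 0
            for axis in range(3):
                for shift in (1, -1):
                    neighbor = np.roll(padded, shift, axis=axis)
                    exposed += np.logical_and(padded, ~neighbor).sum()
            area = exposed * voxel_size ** 2
            return (np.pi ** (1.0 / 3.0)) * (6.0 * volume) ** (2.0 / 3.0) / area

        # Example: a solid 20x20x20 voxel cube.
        cube = np.ones((20, 20, 20), dtype=np.uint8)
        print(sphericity(cube))  # ~0.806 for a cube (1.0 would be a perfect sphere)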

    Geometric Layout Analysis of Scanned Documents

    Layout analysis--the division of page images into text blocks and lines, and the determination of their reading order--is a major performance-limiting step in large-scale document digitization projects. This thesis addresses this problem in several ways: it presents new performance measures to identify important classes of layout errors, evaluates the performance of state-of-the-art layout analysis algorithms, presents a number of methods to reduce the error rate and the catastrophic failures occurring during layout analysis, and develops a statistically motivated, trainable layout analysis system that addresses the needs of large-scale document analysis applications. An overview of the key contributions of this thesis is as follows. First, this thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size. Tests on the UW-1 dataset demonstrate a 20-fold speedup compared to traditional local thresholding techniques. Then, this thesis presents a new perspective on document image cleanup. Instead of trying to explicitly detect and remove marginal noise, the approach focuses on locating the page frame, i.e. the actual page contents area. A geometric matching algorithm is presented to extract the page frame of a structured document. It is demonstrated that incorporating a page frame detection step into the document processing chain reduces OCR error rates from 4.3% to 1.7% (n=4,831,618 characters) on the UW-III dataset and layout-based retrieval error rates from 7.5% to 5.3% (n=815 documents) on the MARG dataset. The performance of six widely used page segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database is evaluated in this work using a state-of-the-art evaluation methodology. It is shown that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. Thus, a vectorial score is introduced that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and the page components (lines, blocks, etc.) that are affected. Unlike previous schemes, this evaluation method has a canonical representation of ground truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. Based on a detailed analysis of the errors made by different page segmentation algorithms, this thesis presents a novel combination of the line-based approach of Breuel with the area-based approach of Baird, which solves the over-segmentation problem of area-based approaches. This new approach achieves a mean text-line extraction error rate of 4.4% (n=878 documents) on the UW-III dataset, the lowest among the analyzed algorithms. This thesis also describes a simple, fast, and accurate system for document image zone classification that results from a detailed comparative analysis of the performance of widely used features in document analysis and content-based image retrieval. Using a novel combination of known algorithms, an error rate of 1.46% (n=13,811 zones) is achieved on the UW-III dataset, in comparison to a state-of-the-art system that reports an error rate of 1.55% (n=24,177 zones) using more complicated techniques.
    In addition to layout analysis of Roman-script documents, this work also presents the first high-performance layout analysis method for Urdu script. For that purpose, a geometric text-line model for Urdu script is presented. It is shown that the method can accurately extract Urdu text-lines from documents with different layouts, such as prose books, poetry books, magazines, and newspapers. Finally, this thesis presents a novel algorithm for probabilistic layout analysis that specifically addresses the needs of large-scale digitization projects. The presented approach models known page layouts as a structural mixture model. A probabilistic matching algorithm is presented that gives multiple interpretations of the input layout with associated probabilities. An algorithm based on A* search is presented for finding the most likely layout of a page, given its structural layout model. For training layout models, an EM-like algorithm is presented that is capable of learning the geometric variability of layout structures from data, without the need for a page segmentation ground truth. Evaluation of the algorithm on documents from the MARG dataset shows an accuracy above 95% for geometric layout analysis.
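
    The window-size-independent local thresholding result above hinges on the kind of constant-cost window sums that an integral image (summed-area table) provides. The sketch below illustrates that general idea with a plain local-mean threshold; it is not the algorithm from the thesis, and the window size, bias factor, and NumPy usage are assumptions:

        import numpy as np

        def local_mean_threshold(gray, window=31, bias=0.95):
            """Local adaptive thresholding via an integral image.

            The local mean over any window is obtained with four lookups, so the
            per-pixel cost is constant regardless of the window size.
            """
            h, w = gray.shape
            ii = np.pad(gray.astype(np.float64).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
            r = window // 2
            # Window bounds, clamped to the image.
            y0 = np.clip(np.arange(h) - r, 0, h)[:, None]
            y1 = np.clip(np.arange(h) + r + 1, 0, h)[:, None]
            x0 = np.clip(np.arange(w) - r, 0, w)[None, :]
            x1 = np.clip(np.arange(w) + r + 1, 0, w)[None, :]
            area = (y1 - y0) * (x1 - x0)
            window_sum = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
            local_mean = window_sum / area
            return (gray > bias * local_mean).astype(np.uint8)  # 1 = background (paper)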

    Learning From Almost No Data

    The tremendous recent growth in the fields of artificial intelligence and machine learning has largely been tied to the availability of big data and massive amounts of compute. The increasingly popular approach of training large neural networks on large datasets has provided great returns, but it leaves behind the multitude of researchers, companies, and practitioners who do not have access to sufficient funding, compute power, or volume of data. This thesis aims to rectify this growing imbalance by probing the limits of what machine learning and deep learning methods can achieve with small data. What knowledge does a dataset contain? At the highest level, a dataset is just a collection of samples: images, text, etc. Yet somehow, when we train models on these datasets, they are able to find patterns, make inferences, detect similarities, and otherwise generalize to samples that they have previously never seen. This suggests that datasets may contain some kind of intrinsic knowledge about the systems or distributions from which they are sampled. Moreover, it appears that this knowledge is somehow distributed and duplicated across the samples; we intuitively expect that removing an image from a large training set will have virtually no impact on the final model performance. We develop a framework to explain efficient generalization around three principles: information sharing, information repackaging, and information injection. We use this framework to propose 'less than one'-shot learning, an extreme form of few-shot learning where a learner must recognize N classes from M < N training examples. To achieve this extreme level of efficiency, we develop new framework-consistent methods and theory for lost data restoration, for dataset size reduction, and for few-shot learning with deep neural networks and other popular machine learning models.
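
    The 'less than one'-shot setting above asks a learner to recognize N classes from M < N examples. A toy way to see that this is even possible is a distance-weighted soft-label nearest-neighbor rule, in which two soft-labeled prototypes induce decision regions for three classes; the sketch below is only an illustration of the idea, not the construction or classifier developed in the thesis:

        import numpy as np

        # Two soft-labeled prototypes on a line inducing decision regions for
        # three classes: an illustrative toy with hand-picked soft labels.
        prototypes = np.array([[-1.0], [1.0]])
        soft_labels = np.array([
            [0.6, 0.4, 0.0],   # prototype at -1: mostly class 0, some class 1
            [0.0, 0.4, 0.6],   # prototype at +1: mostly class 2, some class 1
        ])

        def classify(x):
            """Distance-weighted soft-label nearest-neighbor rule."""
            d = np.abs(prototypes[:, 0] - x) + 1e-9
            weights = 1.0 / d                   # inverse-distance weighting
            scores = weights @ soft_labels      # aggregate soft labels
            return int(np.argmax(scores))

        for x in (-1.5, 0.0, 1.5):
            print(x, classify(x))   # expect classes 0, 1, 2 respectively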