446 research outputs found

    Contextual word spotting in historical handwritten documents

    Archives and libraries hold countless collections of historical documents that contain valuable information for historians and researchers. Extracting this information has become a central task for Document Analysis researchers and practitioners. There is growing interest in digitally preserving these documents and providing access to them, but digitization alone is not enough: researchers also need the information they contain to be extracted and indexed. In many cases, and particularly for historical manuscripts, full transcription is extremely difficult because of inherent deficiencies such as poor physical preservation, varied writing styles, and obsolete languages. Word spotting has become a popular and efficient alternative to full transcription. On such collections it inherently involves highly degraded images. The search is formulated holistically as a visual search for a given query shape in a larger image, instead of recognizing the input text and comparing the query word against ASCII strings. However, the performance of classical word spotting approaches depends on the degradation level of the images and is unacceptable in many cases. This thesis proposes a novel paradigm, contextual word spotting, which uses contextual/semantic information to achieve acceptable results where classical word spotting does not.
    The contextual word spotting framework is a segmentation-based approach, so efficient word segmentation is needed, and historical handwritten documents present difficulties that complicate the extraction of words. The thesis proposes a line segmentation approach that formulates the problem as finding the central path in the area between two consecutive text lines. This is solved as a graph traversal problem: a path-finding algorithm finds the optimal path in a previously computed graph between the text lines. Once the text lines are extracted, words are localized within them using a state-of-the-art word segmentation technique.
    Classical word spotting approaches can be improved by using the contextual information of the documents. The thesis introduces a new framework, oriented to highly structured handwritten documents, that extracts information by making use of context. The framework is an efficient tool for semi-automatic transcription that uses contextual information to achieve better results than classical word spotting approaches. The contextual information is discovered automatically by recognizing repetitive structures and categorizing all words according to semantic classes. The most frequent words in each semantic cluster are extracted, and the same text is used to transcribe all of them. The experimental results outperform classical word spotting approaches, demonstrating the suitability of the proposed ensemble architecture for spotting words in historical handwritten documents using contextual information.
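
The line segmentation step in this abstract (an optimal path through the area between two consecutive text lines) can be illustrated with a small sketch. The abstract does not say how the graph is built or which path-finding algorithm is used, so everything here is an assumption: the page is a binary grid (1 = ink), each pixel is a graph node, crossing ink costs `ink_penalty`, and Dijkstra's algorithm stands in for the unspecified path finder. The name `separating_path` is hypothetical.

```python
import heapq

def separating_path(ink, ink_penalty=50):
    """Cheapest left-to-right path of (row, col) cells through a binary grid.

    Assumption: ink[r][c] is 1 for an ink pixel and 0 for background.
    Every step costs 1, plus ink_penalty when the cell contains ink.
    """
    rows, cols = len(ink), len(ink[0])

    def cell_cost(r, c):
        return 1 + ink_penalty * ink[r][c]

    dist, prev, heap = {}, {}, []
    # Any cell in the first column is a valid starting point.
    for r in range(rows):
        dist[(r, 0)] = cell_cost(r, 0)
        heapq.heappush(heap, (dist[(r, 0)], r, 0))

    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale heap entry
        if c == cols - 1:
            # Reached the right edge: walk the predecessor links back.
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        # Allowed moves: up, down, right.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c + 1)):
            if 0 <= nr < rows:
                nd = d + cell_cost(nr, nc)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, nr, nc))
    return []

# Toy example: two text lines with one descender reaching into the gap.
page = [
    [1, 1, 1, 1, 1],  # text line above
    [0, 0, 1, 0, 0],  # descender
    [0, 0, 0, 0, 0],  # clear gap
    [1, 1, 1, 1, 1],  # text line below
]
path = separating_path(page)
```

On this toy grid the path stays in the clear row and avoids the descender. A real system would need a cost that also keeps the path centred between the lines, which this sketch does not model.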

    The impact of the image processing in the indexation system

    This paper presents an efficient word spotting system for handwritten Arabic documents. Images are represented with bags of visual words built from SIFT descriptors, and a sliding-window approach locates the regions most similar to the query, following the query-by-example paradigm. First, a pre-processing step produces a better representation of the most informative features. Second, a region-based framework represents each local region as a bag of visual SIFT descriptors. Experiments are then conducted to show how the codebook size influences the efficiency of the system, by analyzing the curse-of-dimensionality curve. Finally, the similarity score is measured with a floating distance that depends on the number of descriptors in each query. The experimental results demonstrate the efficiency of the proposed processing steps in the word spotting system.
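
The sliding-window, query-by-example search described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: SIFT extraction and codebook learning are omitted, each keypoint is assumed to be already quantized to a visual-word id, the window slides along the x axis only, and plain cosine similarity stands in for the paper's floating distance. The names `best_window` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two sparse histograms (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_window(keypoints, query_hist, width, step=1):
    """Slide a window along x and return (start_x, score) of the best match.

    Assumption: keypoints is a list of (x, visual_word_id) pairs.
    """
    if not keypoints:
        return (0, 0.0)
    max_x = max(x for x, _ in keypoints)
    best = (0, -1.0)
    for start in range(0, max_x + 1, step):
        # Bag-of-visual-words histogram of the keypoints inside the window.
        hist = Counter(w for x, w in keypoints if start <= x < start + width)
        score = cosine_similarity(hist, query_hist)
        if score > best[1]:
            best = (start, score)
    return best

# Toy text line: the query's visual words (1, 2, 1) occur around x = 5..7.
keypoints = [(0, 3), (1, 3), (5, 1), (6, 2), (7, 1), (12, 4)]
query = Counter({1: 2, 2: 1})
start, score = best_window(keypoints, query, width=3)
```

Here the window starting at x = 5 contains exactly the query's visual words, so it receives the highest score.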

    Combined cosine-linear regression model similarity with application to handwritten word spotting

    Similarity and distance measures are widely used to quantify the similarity or dissimilarity between vector sequences. In document image analysis, which deals with image information, both play an important role in matching and pattern recognition. This paper surveys the distance measures used in image matching and explains the limitations of the existing distances. It then introduces the concept of a floating distance, which varies the decision threshold for each query word and is based on a combination of linear regression and cosine distance. Experiments are carried out on handwritten Arabic document images from the Gallica library. They show that the proposed floating distance outperforms traditional distances in a word spotting system.
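
One way to read the floating distance is as a cosine distance whose acceptance threshold is predicted per query by a linear regression. The abstract does not give the exact formulation, so the sketch below is an assumption: the regression input is taken to be the number of descriptors in the query, and the calibration pairs are made-up illustrative numbers, not results from the paper. The names `fit_line`, `cosine_distance` and `is_match` are hypothetical.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def is_match(query_vec, cand_vec, n_descriptors, slope, intercept):
    """Accept the candidate if its distance is under the query's own threshold."""
    threshold = slope * n_descriptors + intercept
    return cosine_distance(query_vec, cand_vec) <= threshold

# Made-up calibration data: (descriptor count, threshold that worked well).
counts = [10, 20, 30, 40]
thresholds = [0.10, 0.15, 0.20, 0.25]
slope, intercept = fit_line(counts, thresholds)

# The same candidate (cosine distance about 0.2) is rejected for a query
# with few descriptors and accepted for one with many.
short_query = is_match([1.0, 0.0], [0.8, 0.6], 10, slope, intercept)
long_query = is_match([1.0, 0.0], [0.8, 0.6], 40, slope, intercept)
```

The point of the sketch is only that the threshold moves with the query, in contrast to a single fixed threshold for all words.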

    Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval

    We summarize math search engines and search interfaces produced by the Document and Pattern Recognition Lab in recent years, in particular the min math search interface and the Tangent search engine. Source code for both systems is publicly available. "The Masses" refers to our emphasis on creating systems for mathematical non-experts, who may be looking to define unfamiliar notation or to browse documents based on the visual appearance of formulae rather than their mathematical semantics. Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer Mathematics (July, Washington DC).

    Graph-based word spotting by inexact matching techniques

    This project develops a new word spotting (word localization) method that takes the structure of the query words into account. Word spotting techniques find handwritten words starting from a given example. The technique presented is intended for use on old documents. An indexing scheme is then presented to speed up the search step: it quickly finds a set of candidates in large document collections, to which the word spotting techniques are then applied. Finally, an example application of the developed techniques is shown for Android devices.

    Contextual Word Spotting in Historical Handwritten Documents

    Advisor/s: Josep Lladós, Alicia Fornés. Date and location of PhD thesis defense: 14 November 2014, Autonomous University of Barcelona.
    Archives and libraries hold countless collections of historical documents that contain valuable information for historians and researchers. Extracting this information has become a central task for Document Analysis researchers and practitioners. There is growing interest in digitally preserving these documents and providing access to them, but digitization alone is not enough: researchers also need the information they contain to be extracted and indexed. In many cases, and particularly for historical manuscripts, full transcription is extremely difficult because of inherent deficiencies such as poor physical preservation, varied writing styles, and obsolete languages.
    Word spotting has become a popular and efficient alternative to full transcription. On such collections it inherently involves highly degraded images. The search is formulated holistically as a visual search for a given query shape in a larger image, instead of recognizing the input text and comparing the query word against ASCII strings. However, the performance of classical word spotting approaches depends on the degradation level of the images and is unacceptable in many cases. This thesis proposes a novel paradigm, contextual word spotting, which uses contextual/semantic information to achieve acceptable results where classical word spotting does not.
    The contextual word spotting framework is a segmentation-based approach, so efficient word segmentation is needed, and historical handwritten documents present difficulties that complicate the extraction of words. The thesis proposes a line segmentation approach that formulates the problem as finding the central path in the area between two consecutive text lines. This is solved as a graph traversal problem: a path-finding algorithm finds the optimal path in a previously computed graph between the text lines. Once the text lines are extracted, words are localized within them using a state-of-the-art word segmentation technique.
    Classical word spotting approaches can be improved by using the contextual information of the documents. The thesis introduces a new framework, oriented to highly structured handwritten documents, that extracts information by making use of context. The framework is an efficient tool for semi-automatic transcription that uses contextual information to achieve better results than classical word spotting approaches. The contextual information is discovered automatically by recognizing repetitive structures and categorizing all words according to semantic classes. The most frequent words in each semantic cluster are extracted, and the same text is used to transcribe all of them.
    The experimental results outperform classical word spotting approaches, demonstrating the suitability of the proposed ensemble architecture for spotting words in historical handwritten documents using contextual information.

    Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts

    This dissertation presents a flexible and robust offline handwriting recognition system, tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging unsolved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make Bangla a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably but can also be used for almost any alphabetic writing system. The framework has been rigorously tested on Bangla, and experiments on the Korean script, whose two-dimensional arrangement of characters makes it a challenge to recognize, demonstrate how it can be adapted to other scripts. The base of this design is a character spotting network which detects the locations of different script elements (such as characters and diacritics) in an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla and achieves a Character Recognition Accuracy (CRA) of 94.8%. This is also one of the most flexible architectures ever presented. Recognition of Korean was achieved with a 91.2% CRA. In addition, a powerful autonomous tagging technique was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and the autonomous tagging brings the entire offline recognition problem very close to a singular solution. A database named the Boise State Bangla Handwriting Dataset was also developed. It is one of the richest offline datasets currently available for Bangla and has been made publicly accessible to accelerate research progress. Many other tools were developed, and experiments were conducted to validate the framework more rigorously by evaluating the method against external datasets (CMATERdb 1.1.1, Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology, and the outcome of this research moves the field significantly ahead.
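
The step in which a transcript is formed from the detected classes and their locations can be sketched in a few lines. This is a deliberately simplified assumption, not the dissertation's method: detections are taken to be (label, x_position, confidence) tuples, low-confidence detections are dropped, and the rest are read left to right. Diacritics and the two-dimensional arrangement of Korean characters would need a richer ordering rule. The name `transcript_from_detections` is hypothetical, and Latin letters are used only to keep the example readable.

```python
def transcript_from_detections(detections, min_confidence=0.5):
    """Order detected script elements by horizontal position and join them.

    Assumption: each detection is a (label, x_position, confidence) tuple.
    """
    kept = [d for d in detections if d[2] >= min_confidence]
    kept.sort(key=lambda d: d[1])  # left-to-right reading order
    return "".join(label for label, _, _ in kept)

# Toy detections in arbitrary order, with one low-confidence false positive.
detections = [("t", 30, 0.90), ("c", 2, 0.80), ("a", 15, 0.95), ("x", 20, 0.10)]
word = transcript_from_detections(detections)
```

Because the transcript is assembled from detections alone, no lexicon is consulted, which matches the lexicon-free setting the abstract describes.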