5,129 research outputs found
Contextual word spotting in historical handwritten documents
Archives and libraries hold countless collections of historical documents that contain valuable information for historians and researchers. Extracting this information has become a central task for researchers and practitioners in document analysis. There is increasing interest in digitally preserving these documents and providing access to them, but digitization alone is not enough: researchers increasingly need the information in these documents to be extracted and/or indexed. In many cases, and particularly for historical manuscripts, full transcription is extremely difficult because of inherent deficiencies: poor physical preservation, different writing styles, obsolete languages, etc. Word spotting has become a popular and efficient alternative to full transcription. The images involved are inherently highly degraded. The search for words is formulated holistically as a visual search for a given query shape in a larger image, instead of recognising the input text and searching for the query word with an ASCII string comparison. However, the performance of classical word spotting approaches depends on the degradation level of the images and is unacceptable in many cases. In this thesis we propose a novel paradigm, contextual word spotting, which uses contextual/semantic information to achieve acceptable results where classical word spotting does not.
The contextual word spotting framework proposed in this thesis is a segmentation-based word spotting approach, so an efficient word segmentation is needed. Historical handwritten documents present some common difficulties that can make the extraction of words harder. We propose a line segmentation approach that formulates the problem as finding the central path in the area between two consecutive lines. This is solved as a graph traversal problem: a path-finding algorithm is used to find the optimal path, in a previously computed graph, between the text lines. Once the text lines are extracted, words are localized inside them using a state-of-the-art word segmentation technique.
Classical word spotting approaches can be improved using the contextual information of the documents. We introduce a new framework, oriented to handwritten documents with a high degree of structure, that extracts information by making use of context. The framework is an efficient tool for semi-automatic transcription that uses contextual information to achieve better results than classical word spotting approaches. The contextual information is discovered automatically by recognizing repetitive structures and categorizing all the words according to semantic classes. The most frequent words in each semantic cluster are extracted, and the same text is used to transcribe all of them. The experimental results achieved in this thesis outperform classical word spotting approaches, demonstrating the suitability of the proposed ensemble architecture for spotting words in historical handwritten documents using contextual information.
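The idea of separating two text lines by a path through the space between them can be illustrated with a small sketch. The abstract does not say which graph or path-finding algorithm the thesis uses, so the version below swaps in a simple dynamic program over a binary pixel grid; the function name `separating_path` and the cost parameters are hypothetical choices made only for this illustration.

```python
def separating_path(ink, ink_cost=100.0, step_cost=1.0, bend_cost=0.5):
    """Return one row index per column for the cheapest left-to-right path.

    `ink` is a list of rows, each a list of 0/1 values (1 = ink pixel).
    Crossing ink is expensive, so the path prefers the white gap between
    two text lines. All costs are illustrative assumptions.
    """
    rows, cols = len(ink), len(ink[0])
    inf = float("inf")
    # cost[r] = cheapest cost of any path ending at row r in the current column
    cost = [step_cost + ink_cost * ink[r][0] for r in range(rows)]
    back = []  # back[c - 1][r] = row in column c - 1 that leads to (r, c)
    for c in range(1, cols):
        new_cost, prev = [inf] * rows, [0] * rows
        for r in range(rows):
            # A step may stay on the same row or move one row up or down.
            for dr in (0, -1, 1):
                pr = r + dr
                if 0 <= pr < rows:
                    cand = cost[pr] + (bend_cost if dr else 0.0)
                    if cand < new_cost[r]:
                        new_cost[r], prev[r] = cand, pr
            new_cost[r] += step_cost + ink_cost * ink[r][c]
        cost = new_cost
        back.append(prev)
    # Backtrack from the cheapest end point in the last column.
    r = min(range(rows), key=cost.__getitem__)
    path = [r]
    for prev in reversed(back):
        r = prev[r]
        path.append(r)
    path.reverse()
    return path


if __name__ == "__main__":
    page = [
        [1, 1, 1, 1, 1, 1],  # upper text line
        [0, 0, 0, 1, 0, 0],  # a descender
        [0, 0, 0, 0, 0, 0],  # free corridor
        [0, 0, 1, 0, 0, 0],  # an ascender
        [1, 1, 1, 1, 1, 1],  # lower text line
    ]
    print(separating_path(page))
```

On real scans the grid would come from a binarized image, and the cost would probably also reward staying near the middle of the gap, which this sketch does not model.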
The impact of the image processing in the indexation system
This paper presents an efficient word spotting system applied to handwritten Arabic documents, in which images are represented with bags of visual SIFT descriptors and a sliding-window approach locates the regions most similar to the query, following the query-by-example paradigm. First, a pre-processing step produces a better representation of the most informative features. Second, a region-based framework represents each local region by a bag of visual SIFT descriptors. Experiments are then conducted to demonstrate the influence of codebook size on the efficiency of the system by analyzing the curse-of-dimensionality curve. Finally, to measure the similarity score, a floating distance based on the number of descriptors for each query is adopted. The experimental results demonstrate the efficiency of the proposed processing steps in the word spotting system.
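A rough sketch of query-by-example spotting with bag-of-visual-words histograms and a sliding window follows. It is not the paper's system: descriptors here are short toy tuples rather than SIFT vectors, the codebook is given rather than learned, and cosine similarity stands in for the paper's "floating distance", whose definition the abstract does not give. The function names are hypothetical.

```python
import math


def nearest_word(desc, codebook):
    """Index of the codebook centroid closest to `desc` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(desc, codebook[k])),
    )


def bovw_histogram(descs, codebook):
    """L2-normalised histogram of visual-word counts for a set of descriptors."""
    hist = [0.0] * len(codebook)
    for d in descs:
        hist[nearest_word(d, codebook)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm else hist


def spot(query_descs, page_keypoints, codebook, window, stride):
    """Slide a window along x and score each position against the query.

    `page_keypoints` is a list of (x, descriptor) pairs.
    Returns (best_x, best_score), scored by cosine similarity (an assumption).
    """
    q = bovw_histogram(query_descs, codebook)
    max_x = max(x for x, _ in page_keypoints)
    best = (0, -1.0)
    x0 = 0
    while x0 <= max_x:
        inside = [d for x, d in page_keypoints if x0 <= x < x0 + window]
        h = bovw_histogram(inside, codebook)
        score = sum(a * b for a, b in zip(q, h))
        if score > best[1]:
            best = (x0, score)
        x0 += stride
    return best


if __name__ == "__main__":
    codebook = [(0, 0), (10, 10), (0, 10)]
    query = [(0, 1), (9, 10), (10, 9)]
    page = [(1, (0, 9)), (3, (1, 10)),
            (21, (0, 0)), (24, (10, 10)), (27, (9, 9))]
    print(spot(query, page, codebook, window=10, stride=5))
```

The codebook size is simply the length of `codebook`, so the effect the abstract studies could be explored by varying how many centroids are supplied.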
Offline Recognition of Malayalam and Kannada Handwritten Documents Using Deep Learning
Handwritten text is digitized for a variety of reasons and in a variety of government entities, including banks, post offices, and archaeological departments. Handwriting recognition, however, is a difficult task because everyone has a different writing style. There are essentially two approaches to handwriting recognition: holistic and analytic. Earlier handwriting recognition methods are time-consuming, but with the progress of deep neural networks the task has become more straightforward. Furthermore, the bulk of existing solutions are limited to a single language. This work employs an analytic approach to recognise multi-language handwritten manuscripts offline. It describes how to convert Malayalam and Kannada handwritten manuscripts into editable text. Lines are first separated from the input document. Word segmentation is then performed, and finally each word is broken down into individual characters. An artificial neural network is used for feature extraction and classification. The result is then converted to a word document.
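The line and word separation steps can be illustrated with projection profiles, a common baseline for clean scans. The abstract does not state how the paper separates lines and words, so this is an assumed technique, and the helper names `ink_runs`, `segment_lines`, and `segment_words` are hypothetical.

```python
def ink_runs(profile, min_gap=1):
    """Return (start, end) pairs (end exclusive) of runs with ink in a profile.

    Gaps of fewer than `min_gap` empty positions are bridged, so small
    breaks inside a word or line do not split it.
    """
    runs, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


def segment_lines(img):
    """Split a binary image (list of 0/1 rows) into text lines by row sums."""
    return ink_runs([sum(row) for row in img])


def segment_words(img, top, bottom, min_gap=2):
    """Split rows top..bottom (exclusive) into words by column sums."""
    width = len(img[0])
    cols = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    return ink_runs(cols, min_gap)


if __name__ == "__main__":
    img = [
        [0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1, 1],
        [1, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 0],
    ]
    for top, bottom in segment_lines(img):
        print((top, bottom), segment_words(img, top, bottom))
```

Character segmentation could reuse `segment_words` with `min_gap=1` on each word, although touching or overlapping characters would need something more robust.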
Cross-document word matching for segmentation and retrieval of Ottoman divans
Cataloged from PDF version of article. Motivated by the need for the automatic indexing and analysis of a huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve and keep alive this heritage, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the word. Divans have multiple copies (versions) by different writers in different writing styles, and word segmentation in some of those versions may be relatively easier to achieve than in others. Using this idea, segmentation of the versions that are difficult, if not impossible, to segment with traditional techniques is performed using information carried over from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words. We improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising in finding word boundaries and in retrieving words across documents.
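The matching scheme over combinations of consecutive sub-words can be sketched as below. This is only an illustration under strong assumptions: each sub-word is reduced to a 1-D profile (a list of numbers), a merged candidate is the concatenation of consecutive profiles, and the distance is an L1 difference after resampling. None of these choices come from the paper, and the function names are hypothetical.

```python
def resample(profile, n=8):
    """Nearest-neighbour resample of a 1-D profile to n samples."""
    m = len(profile)
    return [profile[min(m - 1, i * m // n)] for i in range(n)]


def distance(p, q, n=8):
    """L1 distance between two profiles after resampling to a common length."""
    return sum(abs(a - b) for a, b in zip(resample(p, n), resample(q, n)))


def best_group(query, subwords, max_len=3):
    """Find the run of consecutive sub-words that best matches a query word.

    `query` is the profile of a word taken from the source version;
    `subwords` are profiles of sub-words, in reading order, from the target.
    Returns (start, end, dist) with `end` exclusive, or None if there are
    no sub-words.
    """
    best = None
    for start in range(len(subwords)):
        last = min(len(subwords), start + max_len)
        for end in range(start + 1, last + 1):
            # Merge the candidate run into one profile and score it.
            merged = [v for sw in subwords[start:end] for v in sw]
            d = distance(query, merged)
            if best is None or d < best[2]:
                best = (start, end, d)
    return best


if __name__ == "__main__":
    target = [[1, 1], [5, 5, 5, 5], [2, 2, 2, 2], [9, 9]]
    query = [5, 5, 5, 5, 2, 2, 2, 2]
    print(best_group(query, target))
```

The returned `(start, end)` span gives a word boundary in the target version, which is the sense in which spotting doubles as segmentation.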
A Tale of Two Transcriptions : Machine-Assisted Transcription of Historical Sources
This article is part of the "Norwegian Historical Population Register" project financed by the Norwegian Research Council (grant # 225950) and the Advanced Grant Project "Five Centuries of Marriages" (2011-2016) funded by the European Research Council (# ERC 2010-AdG_20100407). This article explains how two projects implement semi-automated transcription routines: for census sheets in Norway and marriage protocols from Barcelona. The Spanish system was created to transcribe the marriage license books from 1451 to 1905 for the Barcelona area, one of the world's longest series of preserved vital records. Thus, in the project "Five Centuries of Marriages" (5CofM) at the Autonomous University of Barcelona's Center for Demographic Studies, the Barcelona Historical Marriage Database has been built. More than 600,000 records were transcribed by 150 transcribers working online. The Norwegian material is cross-sectional, as it is the 1891 census, recorded on one sheet per person. This format and the underlining of keywords for several variables made it more feasible to semi-automate data entry than when many persons are listed on the same page. While Optical Character Recognition (OCR) for printed text is scientifically mature, computer vision research is now focused on more difficult problems such as handwriting recognition. In the marriage project, document analysis methods have been proposed to automatically recognize the marriage licenses. Fully automatic recognition is still a challenge, but some promising results have been obtained. In Spain, Norway and elsewhere the source material is available as scanned pictures on the Internet, opening up the possibility of further international cooperation on automating the transcription of historic source materials. As in projects to digitize printed materials, the optimal solution for handwritten sources is also likely to be a combination of manual transcription and machine-assisted recognition.
Word matching using single closed contours for indexing handwritten historical documents
Effective indexing is crucial for providing convenient access to scanned versions of large collections of historically valuable handwritten manuscripts. Since traditional handwriting recognizers based on optical character recognition (OCR) do not perform well on historical documents, a holistic word recognition approach has recently gained popularity as an attractive and more straightforward solution (Lavrenko et al. in Proc. Document Image Analysis for Libraries (DIAL’04), pp. 278–287, 2004). Such techniques attempt to recognize words based on scalar and profile-based features extracted from whole word images. In this paper, we propose a new approach to holistic word recognition for historical handwritten manuscripts based on matching word contours instead of whole images or word profiles. The new method consists of robust extraction of closed word contours and the application of an elastic contour matching technique originally proposed for general shapes (Adamek and O’Connor in IEEE Trans Circuits Syst Video Technol 5:2004). We demonstrate that multiscale contour-based descriptors can effectively capture intrinsic word features while avoiding any segmentation of words into smaller subunits. Our experiments show a recognition accuracy of 83%, which considerably exceeds the performance of other systems reported in the literature.
Learning to Read by Spelling: Towards Unsupervised Text Recognition
This work presents a method for visual text recognition without using any
paired supervisory data. We formulate the text recognition task as one of
aligning the conditional distribution of strings predicted from given text
images, with lexically valid strings sampled from target corpora. This enables
fully automated, and unsupervised learning from just line-level text-images,
and unpaired text-string samples, obviating the need for large aligned
datasets. We present detailed analysis for various aspects of the proposed
method, namely - (1) impact of the length of training sequences on convergence,
(2) relation between character frequencies and the order in which they are
learnt, (3) generalisation ability of our recognition network to inputs of
arbitrary lengths, and (4) impact of varying the text corpus on recognition
accuracy. Finally, we demonstrate excellent text recognition accuracy on both
synthetically generated text images, and scanned images of real printed books,
using no labelled training examples.