56 research outputs found
Towards robust real-world historical handwriting recognition
In this thesis, we build a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie that explored Indonesia (1820-1850). In spite of the successes of systems like 'ChatGPT', reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as extremely diverse collections of (pixel) images. Despite their strong results, current deep-learning methods are data hungry, time consuming, heavily dependent on humanities experts for labeling, and require machine-learning experts to design the models. Ideally, the use of deep learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse amount of labeled data. We present several approaches to dealing with these problems, aiming to improve the robustness of current methods and the autonomy of training. We applied our novel word and line text recognition approaches to nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, we achieved a level of accuracy that required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluating each training epoch without the need for labeled data
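The ensemble-voting idea can be pictured with a minimal sketch: each of a handful of recognizers proposes a transcription for the same word image, and the plurality hypothesis wins. The function and the sample outputs below are illustrative assumptions, not the thesis's actual implementation.

```python
from collections import Counter

def ensemble_vote(predictions):
    """Plurality voting over word hypotheses from several recognizers.

    predictions: list of word strings, one per ensemble member.
    Returns the most frequent hypothesis (ties broken by first occurrence).
    """
    counts = Counter(predictions)
    best, _ = counts.most_common(1)[0]
    return best

# Five hypothetical network outputs for the same word image:
votes = ["Commissie", "Commissie", "Cornmissie", "Commissie", "Commissle"]
print(ensemble_vote(votes))  # -> Commissie
```

Even this naive scheme shows why a few diverse models suffice: as long as most members agree on the correct reading, individual OCR-style confusions (rn/m, ie/le) are voted away.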
An Interpretable Deep Architecture for Similarity Learning Built Upon Hierarchical Concepts
In general, developing adequately complex mathematical models, such as deep neural networks, can be an effective way to improve the accuracy of learning models. However, this comes at the cost of reduced post-hoc model interpretability, because what is learned by the model can become less intelligible and tractable to humans as model complexity increases. In this paper, we target a similarity learning task in the context of image retrieval, with a focus on the model interpretability issue. An effective similarity neural network (SNN) is proposed that not only seeks robust retrieval performance but also achieves satisfactory post-hoc interpretability. The network is designed by linking the neuron architecture with the organization of a concept tree and by formulating neuron operations to pass similarity information between concepts. Various ways of understanding and visualizing what is learned by the SNN neurons are proposed. We also exhaustively evaluate the proposed approach on a number of relevant datasets against state-of-the-art approaches; our results show that it offers superior performance. Neuron visualization results are presented to support the understanding of the trained neurons
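As a rough, hypothetical illustration of passing similarity information up a concept hierarchy (the paper's actual neuron operations are not reproduced here), one can average per-concept similarities from the leaves toward the root; the tree and scores below are invented for illustration.

```python
def tree_similarity(tree, leaf_sim, node="root"):
    """Aggregate leaf-concept similarities up a concept tree.

    tree: dict mapping a concept to its child concepts; leaves are absent.
    leaf_sim: per-leaf similarity in [0, 1] for a given image pair.
    Each internal node averages its children, so the root score is traceable
    back to the concepts that produced it.
    """
    children = tree.get(node)
    if not children:  # leaf concept: use its measured similarity
        return leaf_sim.get(node, 0.0)
    return sum(tree_similarity(tree, leaf_sim, c) for c in children) / len(children)

tree = {"root": ["animal", "scene"], "animal": ["fur", "shape"]}
leaf_sim = {"fur": 0.9, "shape": 0.7, "scene": 0.4}
print(tree_similarity(tree, leaf_sim))
```

The interpretability appeal is that every intermediate score is attached to a named concept, so a retrieval decision can be decomposed node by node.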
Historical document analysis based on word matching
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2011. Thesis (Master's) -- Bilkent University, 2011. Includes bibliographical references (leaves 67-76).
Historical documents constitute a heritage which should be preserved, and providing an automatic retrieval and indexing scheme for these archives would be beneficial for researchers from several disciplines and countries. Unfortunately, applying ordinary Optical Character Recognition (OCR) techniques to these documents is nearly impossible, since the documents are degraded and deformed. Recently, word matching methods have been proposed to access these documents. In this thesis, two historical document analysis problems, word segmentation in historical documents and Islamic pattern matching in kufic images, are tackled based on word matching. In the first task, a cross-document word matching approach is proposed to segment historical documents into words. A version of a document in which word segmentation is easy is used as a source data set, and another version in a different writing style, which is more difficult to segment into words, is used as a target data set. The source data set is segmented into words by a simple method, and the extracted words are used as queries to be spotted in the target data set. Experiments on an Ottoman data set show that cross-document word matching is a promising method for segmenting historical documents into words. In the second task, lines are first extracted and sub-patterns are automatically detected in the images. The sub-patterns are then matched based on a line representation in two ways: by their chain code representation and by their shape contexts. Promising results are obtained for finding the instances of a query pattern and for fully automatic detection of repeating patterns on a square kufic image collection. Arifoğlu, Damla. M.S.
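Comparing sub-patterns by their chain code representation can be pictured with a standard edit-distance comparison between Freeman chain-code strings; this generic sketch is not the thesis's exact line-representation algorithm, and the sample codes are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two Freeman chain-code strings.

    Each character encodes a contour step direction (e.g. 0=E, 2=N, 4=W, 6=S);
    a small distance means the two traced strokes have similar shapes.
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # cost of deleting all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # cost of inserting all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete
                           dp[i][j - 1] + 1,      # insert
                           dp[i - 1][j - 1] + cost)  # substitute/match
    return dp[m][n]

# Two chain codes tracing nearly identical strokes:
print(edit_distance("0012234", "012234"))  # -> 1 (one deletion)
```

Thresholding this distance (possibly normalized by code length) gives a simple accept/reject rule for whether two sub-patterns are instances of the same motif.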
DSG: An End-to-End Document Structure Generator
Information in industry, research, and the public sector is widely stored as
rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks,
systems are needed that map rendered documents onto a structured hierarchical
format. However, existing systems for this task are limited by heuristics and
are not end-to-end trainable. In this work, we introduce the Document Structure
Generator (DSG), a novel system for document parsing that is fully end-to-end
trainable. DSG combines a deep neural network for parsing (i) entities in
documents (e.g., figures, text blocks, headers, etc.) and (ii) relations that
capture the sequence and nested structure between entities. Unlike existing
systems that rely on heuristics, our DSG is trained end-to-end, making it
effective and flexible for real-world applications. We further contribute a
new, large-scale dataset called E-Periodica comprising real-world magazines
with complex document structures for evaluation. Our results demonstrate that
our DSG outperforms commercial OCR tools and, on top of that, achieves
state-of-the-art performance. To the best of our knowledge, our DSG system is
the first end-to-end trainable system for hierarchical document parsing.Comment: Accepted at ICDM 202
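The target output, a structured hierarchical format built from predicted entities and nested relations, can be illustrated with a toy structure builder; the entity labels and the (child, parent) relation encoding below are assumptions for illustration, not DSG's actual output schema.

```python
def build_hierarchy(entities, nested):
    """Arrange predicted entities into a tree using (child, parent) relations.

    entities: dict mapping entity id -> label (e.g. 'figure', 'text block').
    nested: list of (child_id, parent_id) pairs predicted by a relation head.
    Returns an indented outline, one entity per line.
    """
    children = {eid: [] for eid in entities}
    child_ids = {c for c, _ in nested}
    for child, parent in nested:
        children[parent].append(child)
    roots = [eid for eid in entities if eid not in child_ids]

    def render(eid, depth=0):
        lines = ["  " * depth + entities[eid]]
        for c in children[eid]:
            lines.extend(render(c, depth + 1))
        return lines

    return [line for r in roots for line in render(r)]

entities = {1: "section", 2: "header", 3: "text block", 4: "figure"}
nested = [(2, 1), (3, 1), (4, 1)]
print("\n".join(build_hierarchy(entities, nested)))
```

The point of predicting relations rather than applying layout heuristics is exactly that this tree falls out of trainable components, so errors can be corrected by gradient descent instead of rule tweaking.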
Text Detection in Natural Scenes and Technical Diagrams with Convolutional Feature Learning and Cascaded Classification
An enormous amount of digital images are being generated and stored every day. Understanding text in these images is an important challenge with large impacts for academic, industrial and domestic applications. Recent studies address the difficulty of separating text targets from noise and background, all of which vary greatly in natural scenes. To tackle this problem, we develop a text detection system to analyze and utilize visual information in a data driven, automatic and intelligent way.
The proposed method incorporates features learned from data, including patch-based coarse-to-fine detection (Text-Conv), connected component extraction using region growing, and graph-based word segmentation (Word-Graph). Text-Conv is a sliding window-based detector, with convolution masks learned using the Convolutional k-means algorithm (Coates et al., 2011). Unlike convolutional neural networks (CNNs), a single vector/layer of convolution mask responses is used to classify patches. An initial coarse detection considers both local and neighboring patch responses, followed by refinement using varying aspect ratios and rotations for a smaller local detection window. Different levels of visual detail from ground truth are utilized in each step, first using constraints on bounding box intersections, and then a combination of bounding box and pixel intersections. Combining masks from different Convolutional k-means initializations, e.g., seeded using random vectors and then support vectors, improves performance. The Word-Graph algorithm uses contextual information to improve word segmentation and prune false character detections based on visual features and spatial context. Our system obtains pixel, character, and word detection f-measures of 93.14%, 90.26%, and 86.77% respectively on the ICDAR 2015 Robust Reading Focused Scene Text dataset, out-performing state-of-the-art systems and producing highly accurate text detection masks at the pixel level.
To investigate the utility of our feature learning approach for other image types, we perform tests on 8-bit greyscale USPTO patent drawing diagram images. An ensemble of AdaBoost classifiers with different convolutional features (MetaBoost) is used to classify patches as text or background. The Tesseract OCR system is used to recognize characters in detected labels and enhance performance. With appropriate pre-processing and post-processing, f-measures of 82% for part label location, and 73% for valid part label locations and strings, are obtained, which are the best obtained to date for the USPTO patent diagram data set used in our experiments.
To sum up, an intelligent refinement of convolutional k-means-based feature learning and novel automatic classification methods are proposed for text detection, obtaining state-of-the-art results without the need for strong prior knowledge. Different ground truth representations, along with features including edges, color, shape, and spatial relationships, are used coherently to improve accuracy. Different variations of feature learning are explored, e.g., support vector-seeded clustering and MetaBoost, with results suggesting that increased diversity in learned features benefits convolution-based text detectors
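The Convolutional k-means idea, spherical k-means on image patches in the spirit of Coates et al. (2011), can be sketched as follows. Whitening and other preprocessing are omitted, and this is a simplified illustration on random data, not the system's actual implementation.

```python
import numpy as np

def convolutional_kmeans(patches, k, iters=10, seed=0):
    """Learn k unit-norm convolution masks from flattened patches.

    Spherical k-means: assign each patch to its most similar mask by dot
    product, then replace each mask with the renormalized sum of its members.
    """
    rng = np.random.default_rng(seed)
    X = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    D = X[rng.choice(len(X), size=k, replace=False)]  # seed masks with random patches
    for _ in range(iters):
        assign = (X @ D.T).argmax(axis=1)             # most similar mask per patch
        for j in range(k):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                D[j] = c / (np.linalg.norm(c) + 1e-8)  # spherical update: renormalize
    return D

patches = np.random.default_rng(1).normal(size=(200, 16))  # stand-in for 4x4 patches
masks = convolutional_kmeans(patches, k=4)
print(masks.shape)  # (4, 16)
```

Seeding `D` with support vectors instead of random patches, as the abstract describes, only changes the initialization line; the update loop is identical, which is why mask sets from different seedings can be pooled.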
On Authorship Attribution
Authorship attribution is the process of identifying the author of a given text; from the machine learning perspective, it can be seen as a classification problem. In the literature, there are many classification methods for which feature extraction techniques are applied. In this thesis, we explore information retrieval techniques such as Doc2Vec, together with other useful feature selection and extraction techniques, for a given text with different classifiers. The main purpose of this work is to lay the foundations of feature extraction techniques in authorship attribution. At the end of this work, we compare our results with related work and show how we improved, to the best of our knowledge, the results on a particular dataset that is well known in this field
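A minimal feature-extraction baseline for authorship attribution, simpler than Doc2Vec but in the same spirit of turning text into comparable vectors, is a character n-gram profile matched by cosine similarity. The texts, author names, and parameters below are invented for illustration.

```python
from collections import Counter
import math

def ngram_profile(text, n=3):
    """Bag of character n-grams: a crude stylometric fingerprint of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two sparse count profiles."""
    num = sum(p[g] * q[g] for g in set(p) & set(q))
    den = math.sqrt(sum(v * v for v in p.values())) * \
          math.sqrt(sum(v * v for v in q.values()))
    return num / den if den else 0.0

def attribute(unknown, candidates):
    """Assign the unknown text to the candidate author with the closest profile."""
    profiles = {a: ngram_profile(t) for a, t in candidates.items()}
    u = ngram_profile(unknown)
    return max(profiles, key=lambda a: cosine(u, profiles[a]))

candidates = {
    "A": "the ship sailed across the stormy sea toward the harbour",
    "B": "compile the program and link the object files into a binary",
}
unknown = "the sea was stormy as the ship sailed toward the harbour"
print(attribute(unknown, candidates))  # -> A
```

Replacing `ngram_profile` with learned document embeddings (e.g. Doc2Vec vectors) keeps the same pipeline shape: extract features, then classify by similarity.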
Analyse d’images de documents patrimoniaux : une approche structurelle à base de texture (Analysis of heritage document images: a structural, texture-based approach)
Over the last few years, there has been tremendous growth in digitizing collections of cultural heritage documents. This has raised many challenges and open issues, such as information retrieval in digital libraries or analyzing the page content of historical books. Recently, an important need has emerged for a computer-aided characterization and categorization tool able to index or group historical digitized book pages according to several criteria, mainly the layout structure and/or the typographic/graphical characteristics of the historical document image content. The work conducted in this thesis therefore presents an automatic approach for the characterization and categorization of historical book pages. The proposed approach is applicable to a large variety of ancient books and assumes no a priori knowledge regarding document image layout and content. It is based on texture and graph algorithms, which provide a rich and holistic description of the layout and content of the analyzed book pages. The categorization rests on characterizing the digitized page content by texture, shape, geometric, and topological descriptors, represented as a structural signature. More precisely, the signature-based characterization approach consists of two main stages: first, homogeneous regions are extracted; second, a graph-based page signature is built from the extracted homogeneous regions, reflecting the page's layout and content. By comparing the resulting graph-based signatures using a graph-matching paradigm, similarities in the layout and/or content of digitized historical book pages can be deduced. Pages with similar layout and/or content can then be categorized and grouped, and a table of contents/summary of the analyzed digitized historical book can be generated automatically.
As a consequence, numerous signature-based applications (e.g. information retrieval in digital libraries according to several criteria, page categorization) can be implemented to manage a corpus or collections of books effectively. To illustrate the effectiveness of the proposed page signature, a detailed experimental evaluation has been conducted in this work to assess two possible categorization applications: unsupervised page classification and page stream segmentation. In addition, the different steps of the proposed approach have been evaluated on a large variety of historical document images.
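A toy version of a graph-based page signature and its comparison can be sketched as follows. This is a crude stand-in for the thesis's graph edit distance (labels become a bag, structure collapses to an edge count), and all labels and pages below are invented.

```python
from collections import Counter

def page_signature(region_labels, adjacency):
    """A toy page signature: a bag of homogeneous-region labels plus an edge count.

    region_labels: one texture/shape label per extracted homogeneous region.
    adjacency: (i, j) pairs of spatially adjacent regions.
    """
    return Counter(region_labels), len(adjacency)

def signature_distance(sig_a, sig_b):
    """Crude proxy for graph edit distance: label substitutions plus edge-count gap."""
    (labels_a, edges_a), (labels_b, edges_b) = sig_a, sig_b
    node_cost = sum((labels_a - labels_b).values()) + sum((labels_b - labels_a).values())
    return node_cost + abs(edges_a - edges_b)

page1 = page_signature(["text", "text", "graphic"], [(0, 1), (1, 2)])
page2 = page_signature(["text", "graphic", "graphic"], [(0, 1), (1, 2)])
print(signature_distance(page1, page1))  # -> 0
print(signature_distance(page1, page2))  # -> 2
```

Clustering pages by such pairwise distances is what enables the unsupervised page classification and page stream segmentation applications the abstract describes.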
Deep Learning Methods for Dialogue Act Recognition using Visual Information
Dialogue act (DA) recognition is an important step of dialogue management and understanding. The task is to automatically assign a label to an utterance (or part of one) based on its function in a dialogue (e.g. statement, question, backchannel, etc.). Such utterance-level classification thus helps to model and identify the structure of spontaneous dialogues. Even though DA recognition is usually realized on audio data using an automatic speech recognition engine, dialogues also exist in the form of images (e.g. comic books).
This thesis deals with automatic dialogue act recognition from image documents. To the best of our knowledge, this is the first attempt to propose DA recognition approaches that use images as input. For this task, it is necessary to extract the text from the images. Therefore, we employ algorithms from the field of computer vision and image processing, such as image thresholding, text segmentation, and optical character recognition (OCR). The main contribution in this field is the design and implementation of a custom OCR model based on convolutional and recurrent neural networks. We also explore different strategies for training such a model, including synthetic data generation and data augmentation techniques. We achieve new state-of-the-art OCR results in settings where only a small amount of training data is available. In summary, our contributions thus also include an overview of how to create an efficient OCR system with minimal costs.
We further deal with multilinguality in the DA recognition field. We successfully employ one general model trained on data from all available languages, as well as several models trained on a single language, where cross-linguality is achieved using semantic space transformations. Moreover, we explore transfer learning for DA recognition where only a small number of annotated examples is available. We use word-level and utterance-level features, and our models contain deep neural network architectures, including Transformers. We obtain new state-of-the-art results in the multi- and cross-lingual DA recognition field.
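Semantic space transformations for cross-linguality are commonly realized as a linear map fitted on pairs of embeddings from the two spaces. This toy least-squares sketch uses random data and makes no claim about the thesis's exact procedure; it only shows the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings for the same pivot utterances in two semantic spaces:
S = rng.normal(size=(50, 8))     # source-language embeddings
W_true = rng.normal(size=(8, 8))
T = S @ W_true                   # target space: here an exact linear image of S

# Fit the transformation from pivot pairs by least squares.
W, *_ = np.linalg.lstsq(S, T, rcond=None)

# Map new source-space vectors into the target space with the learned W.
mapped = S @ W
print(np.allclose(mapped, T, atol=1e-6))  # -> True
```

Once `W` is learned, a classifier trained in the target space can score utterances from the source language by mapping their embeddings first, which is the essence of sharing one DA model across languages.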
For DA recognition from image documents, we propose and implement a novel multimodal model based on convolutional and recurrent neural networks. This model combines text and image inputs: the text branch is fed with text tokens from the OCR, while the visual branch extracts image features that serve as an auxiliary input. Text extracted from dialogues is often erroneous and contains typos or other lexical errors. We show that the multimodal model copes with the erroneous text and that the visual information partially compensates for this loss of information
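How visual features can compensate for noisy OCR text can be illustrated with a simple late-fusion sketch. The class labels, scores, and weights below are invented, and the thesis's actual model fuses the modalities inside the network rather than at the score level as shown here.

```python
def late_fusion(text_scores, image_scores, w_text=0.7, w_image=0.3):
    """Weighted late fusion of per-class scores from a text and a visual branch.

    The weights are illustrative; in practice they would be tuned on held-out data.
    Returns the dialogue-act label with the highest fused score.
    """
    fused = {c: w_text * text_scores[c] + w_image * image_scores.get(c, 0.0)
             for c in text_scores}
    return max(fused, key=fused.get)

# OCR noise leaves the text branch unsure between two dialogue acts;
# the visual branch (e.g. speech-balloon shape features) tips the balance.
text_scores = {"statement": 0.48, "question": 0.52}
image_scores = {"statement": 0.2, "question": 0.8}
print(late_fusion(text_scores, image_scores))  # -> question
```

The same intuition carries over to in-network fusion: when one modality's evidence is degraded, the combined decision can still lean on the other.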