305 research outputs found
Transfer Learning for OCRopus Model Training on Early Printed Books
A method is presented that significantly reduces the character error rates
for OCR text obtained from OCRopus models trained on early printed books when
only small amounts of diplomatic transcriptions are available. This is achieved
by building from already existing models during training instead of starting
from scratch. To overcome the discrepancies between the character set of the
pretrained model and that of the additional ground truth, the OCRopus code is
adapted to allow for alphabet expansion or reduction: characters can now be
flexibly added to or deleted from the pretrained alphabet when an existing
model is loaded. For our experiments we use a self-trained
mixed model on early Latin prints and the two standard OCRopus models on modern
English and German Fraktur texts. The evaluation on seven early printed books
showed that training from the Latin mixed model reduces the average number of
errors by 43% and 26% compared to training from scratch with 60
and 150 lines of ground truth, respectively. Furthermore, it is shown that even
building from mixed models trained on data unrelated to the newly added
training and test data can lead to significantly improved recognition results.
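The alphabet expansion/reduction described above can be sketched as follows. This is a minimal illustration of remapping a softmax output layer when the character set changes, not the actual OCRopus implementation; the function name, the dictionary-of-rows weight representation, and the toy alphabets are illustrative assumptions.

```python
import random

def adapt_output_layer(old_rows, new_alphabet, hidden_size, init_scale=0.01):
    """Remap the rows of a softmax output layer to a new character set.

    old_rows: dict mapping each character of the pretrained alphabet to its
    trained weight row (a list of floats). Characters shared between the two
    alphabets keep their trained rows, newly added characters get small random
    rows, and characters dropped from the alphabet are simply not carried over.
    """
    rng = random.Random(0)
    rows = {}
    for ch in new_alphabet:
        if ch in old_rows:
            rows[ch] = old_rows[ch]  # reuse the trained weights
        else:
            # new character: initialize a fresh row with small random values
            rows[ch] = [rng.gauss(0.0, init_scale) for _ in range(hidden_size)]
    return rows

# toy example: pretrained on "a", "b", "c"; the new ground truth drops "c"
# and adds the long s ("ſ") common in early prints
pretrained = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]}
adapted = adapt_output_layer(pretrained, ["a", "b", "ſ"], hidden_size=2)
```

Reusing the trained rows for shared characters is what lets fine-tuning start from the pretrained model's knowledge instead of from scratch.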
Deep Learning Methods for Dialogue Act Recognition using Visual Information
Dialogue act (DA) recognition is an important step of dialogue management and understanding. This task is to automatically assign a label to an utterance (or its part) based on its function in a dialogue (e.g. statement, question, backchannel, etc.). Such utterance-level classification thus helps to model and identify the structure of spontaneous dialogues. Even though DA recognition is usually realized on audio data using an automatic speech recognition engine, dialogues also exist in the form of images (e.g. comic books).
This thesis deals with automatic dialogue act recognition from image documents.
To the best of our knowledge, this is the first attempt to propose DA recognition approaches using the images as an input.
For this task, it is necessary to extract the text from the images.
Therefore, we employ algorithms from the field of computer vision and image processing, such as image thresholding, text segmentation, and optical character recognition (OCR). The main contribution in this field is the design and implementation of a custom OCR model based on convolutional and recurrent neural networks. We also explore different strategies for training such a model, including synthetic data generation and data augmentation techniques. We achieve new state-of-the-art OCR results in settings where only a small amount of training data is available. Summing up, our contributions hence also include an overview of how to create an efficient OCR system with minimal costs.
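As one concrete example of the image-thresholding step mentioned above, the following is a minimal pure-Python sketch of Otsu's method, a standard global binarization technique; the thesis's exact preprocessing pipeline may differ, and the function names and toy pixel values are illustrative.

```python
def otsu_threshold(gray):
    """Pick the threshold that maximizes between-class variance (Otsu's
    method) for a flat list of 8-bit grayscale values."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0   # running sum of background intensities
    w_bg = 0       # running count of background pixels
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, t):
    """Map pixels at or below the threshold to ink (0), the rest to paper (255)."""
    return [0 if v <= t else 255 for v in gray]

# toy bimodal "image": dark ink pixels around 40, light paper around 200
pixels = [40, 42, 38, 41, 200, 205, 198, 202, 199, 201]
t = otsu_threshold(pixels)
binary = binarize(pixels, t)
```

On such clearly bimodal data the threshold lands between the two intensity clusters, cleanly separating ink from paper before segmentation and OCR.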
We further deal with multilinguality in the DA recognition field. We successfully employ one general model trained on data from all available languages, as well as several models trained on a single language each, where cross-linguality is achieved using semantic space transformations. Moreover, we explore transfer learning for DA recognition where only a small number of annotated samples is available. We use word-level and utterance-level features, and our models build on deep neural network architectures, including Transformers. We obtain new state-of-the-art results in the multi- and cross-lingual DA recognition field.
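A minimal sketch of the semantic-space-transformation idea: learn a least-squares linear map from source-language embeddings into the target-language space using a small seed dictionary of translation pairs, then apply it to unseen embeddings. The 2-D toy embeddings and function names here are illustrative assumptions, not the thesis's actual setup.

```python
def fit_linear_map(X, Y):
    """Least-squares 2x2 map W with X @ W ~= Y, via the normal equations
    W = (X^T X)^-1 (X^T Y). X, Y: paired 2-d source/target embeddings."""
    xtx = [[0.0, 0.0], [0.0, 0.0]]
    xty = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(X, Y):
        for i in range(2):
            for j in range(2):
                xtx[i][j] += x[i] * x[j]
                xty[i][j] += x[i] * y[j]
    # invert the 2x2 matrix X^T X explicitly
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    inv = [[ xtx[1][1] / det, -xtx[0][1] / det],
           [-xtx[1][0] / det,  xtx[0][0] / det]]
    return [[sum(inv[i][k] * xty[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_map(W, x):
    """Map a source-space row vector x into the target space: x @ W."""
    return [x[0] * W[0][0] + x[1] * W[1][0],
            x[0] * W[0][1] + x[1] * W[1][1]]

# toy seed dictionary: the target space is the source space rotated 90 degrees
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
W = fit_linear_map(src, tgt)
mapped = apply_map(W, [2.0, 3.0])  # an "unseen" source embedding
```

Once such a map is fitted, a classifier trained on target-language embeddings can be applied to mapped source-language inputs, which is the essence of cross-lingual transfer via shared semantic spaces.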
For DA recognition from image documents, we propose and implement a novel multimodal model based on convolutional and recurrent neural networks. This model combines text and image inputs: the text branch is fed with text tokens from OCR, while the visual branch extracts image features that serve as an auxiliary input. Text extracted from dialogues is often erroneous and contains typos or other lexical errors. We show that the multimodal model copes with such erroneous text, as the visual information partially compensates for this loss of information.
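The text-plus-image combination can be illustrated with a simple late-fusion sketch: concatenate the two feature vectors and score DA classes over the fused representation. This is a toy stand-in for the actual CNN/RNN model described above; all names, dimensions, and numbers are illustrative.

```python
def fuse_features(text_feats, image_feats):
    """Late fusion by concatenation: the visual features act as an auxiliary
    input that can compensate for noisy OCR-derived text features."""
    return list(text_feats) + list(image_feats)

def linear_scores(feats, weights, biases):
    """Score each DA class with one weight row over the fused vector."""
    return [sum(f * w for f, w in zip(feats, row)) + b
            for row, b in zip(weights, biases)]

# toy example: 2 text features, 2 image features, 2 DA classes
fused = fuse_features([0.2, 0.8], [1.0, 0.0])
weights = [[1.0, 0.0, 0.5, 0.0],   # class 0, e.g. "statement"
           [0.0, 1.0, 0.0, 0.5]]   # class 1, e.g. "question"
scores = linear_scores(fused, weights, [0.0, 0.0])
label = max(range(len(scores)), key=scores.__getitem__)
```

Because both modalities contribute to every class score, a corrupted text feature can be outweighed by reliable visual evidence, which is the intuition behind the robustness claim.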
Adaptive Methods for Robust Document Image Understanding
A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, putting special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both the computational complexity and the threshold selection points of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
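As an illustration of the skew-detection stage, a common layout-independent approach rotates the foreground pixels over a range of candidate angles and picks the angle whose horizontal projection profile is most sharply peaked. This sketch shows that generic projection-profile technique, not necessarily the specific method proposed in the abstract; the function names and toy data are illustrative.

```python
import math

def profile_energy(points, angle_deg):
    """Sum of squared row counts of the horizontal projection profile after
    rotating the foreground pixels by -angle_deg. Pixels concentrated into
    few rows (deskewed text lines) yield high energy."""
    a = math.radians(angle_deg)
    rows = {}
    for x, y in points:
        r = round(-x * math.sin(a) + y * math.cos(a))
        rows[r] = rows.get(r, 0) + 1
    return sum(c * c for c in rows.values())

def estimate_skew(points, search=5.0, step=0.5):
    """Brute-force search for the candidate angle with maximal profile energy."""
    best_angle, best_energy = 0.0, -1
    a = -search
    while a <= search + 1e-9:
        e = profile_energy(points, a)
        if e > best_energy:
            best_energy, best_angle = e, a
        a += step
    return best_angle

# toy page: two "text lines" of foreground pixels skewed by 2 degrees
slope = math.tan(math.radians(2.0))
points = [(x, round(y0 + x * slope)) for y0 in (0, 10) for x in range(60)]
angle = estimate_skew(points)
```

A coarse-to-fine search over the angle grid makes the same idea practical for real page scans, where the profile must be computed over many thousands of pixels.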
Advances in Character Recognition
This book presents advances in character recognition. It consists of 12 chapters covering a wide range of topics on different aspects of character recognition. Hopefully, this book will serve as a reference source for academic research, for professionals working in the character recognition field, and for anyone interested in the subject.
- …