
    Transfer Learning for OCRopus Model Training on Early Printed Books

    A method is presented that significantly reduces the character error rates of OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building on already existing models during training instead of starting from scratch. To overcome the discrepancies between the character set of the pretrained model and the additional ground truth, the OCRopus code is adapted to allow for alphabet expansion or reduction: characters can now be flexibly added to or removed from the pretrained alphabet when an existing model is loaded. For our experiments we use a self-trained mixed model on early Latin prints and the two standard OCRopus models for modern English and German Fraktur texts. The evaluation on seven early printed books showed that training from the Latin mixed model reduces the average number of errors by 43% and 26% compared to training from scratch with 60 and 150 lines of ground truth, respectively. Furthermore, it is shown that even building on mixed models trained on data unrelated to the newly added training and test data can lead to significantly improved recognition results.
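    The alphabet expansion/reduction described above essentially reshapes the pretrained network's output layer to match a new character set while keeping the weights of shared characters. The following Python/NumPy sketch illustrates that idea under stated assumptions; it is not the actual OCRopus implementation, and the function name adapt_output_layer, the codec lists, and the weight shapes are purely illustrative.

```python
# Hypothetical sketch of codec (alphabet) adaptation when loading a pretrained
# line recognizer: output-layer rows are copied for shared characters, newly
# added characters get freshly initialized rows, removed ones are dropped.
import numpy as np

def adapt_output_layer(old_codec, new_codec, W_old, b_old, rng=None):
    """Return output-layer weights/biases matching new_codec."""
    rng = rng or np.random.default_rng(0)
    hidden = W_old.shape[1]
    W_new = rng.normal(0.0, 0.01, size=(len(new_codec), hidden))
    b_new = np.zeros(len(new_codec))
    old_index = {c: i for i, c in enumerate(old_codec)}
    for j, ch in enumerate(new_codec):
        if ch in old_index:                      # character known to the pretrained model
            W_new[j] = W_old[old_index[ch]]
            b_new[j] = b_old[old_index[ch]]
    return W_new, b_new

# Example: a Latin codec gains the long s ('ſ') and loses 'ß'
old_codec = list(" abcdefghijklmnopqrstuvwxyzß")
new_codec = list(" abcdefghijklmnopqrstuvwxyzſ")
W_old = np.random.default_rng(1).normal(size=(len(old_codec), 100))
b_old = np.zeros(len(old_codec))
W_new, b_new = adapt_output_layer(old_codec, new_codec, W_old, b_old)
print(W_new.shape)  # (28, 100)
```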

    Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

    In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcriptions. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide range of printing dates, from 15th-century incunabula to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. The special form of GT as line image/transcription pairs makes it directly usable for training state-of-the-art recognition models for OCR software employing recurrent neural networks with LSTM architecture, such as Tesseract 4 or OCRopus. We also provide pretrained OCRopus models for subcorpora of our dataset, yielding between 95% (early printings) and 98% (19th-century Fraktur printings) character accuracy on unseen test cases, a Perl script to harmonize GT produced by different transcription rules, and hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions. Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on Automatic Text and Layout Recognition
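    Since the dataset is distributed as line images paired with transcriptions, iterating over it for training is straightforward. The sketch below assumes the common OCRopus-style naming convention of a .png line image next to a .gt.txt transcription and uses a placeholder directory name; adjust both to the actual layout of the downloaded corpus.

```python
# Minimal sketch for pairing line images with their transcriptions,
# assuming <name>.png files sitting next to <name>.gt.txt files.
from pathlib import Path

def iter_ground_truth(root):
    """Yield (image_path, transcription) for every line with both files present."""
    for gt_file in sorted(Path(root).rglob("*.gt.txt")):
        img = gt_file.with_name(gt_file.name.replace(".gt.txt", ".png"))
        if img.exists():
            yield img, gt_file.read_text(encoding="utf-8").strip()

if __name__ == "__main__":
    # "GT4HistOCR" is a placeholder path for the unpacked corpus
    for img, text in iter_ground_truth("GT4HistOCR"):
        print(img.name, "->", text)
```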

    Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning

    We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross-fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome, also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement, we select additional training lines on which the voters disagree most, expecting them to offer the highest information gain for subsequent training (active learning). Evaluations on six early printed books yielded the following results: On average, the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further, to less than 1% on average. Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on Automatic Text and Layout Recognition
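    The voting and active-learning steps can be pictured with a toy example. The sketch below is a strong simplification: it assumes the fold models' outputs are already position-aligned (real systems align the sequences first), and the function names and the margin-based disagreement score are illustrative assumptions rather than the published implementation.

```python
# Simplified confidence voting over the top-N alternatives of several fold
# models, plus maximal-disagreement selection for active learning.
from collections import defaultdict

def vote_line(fold_outputs):
    """fold_outputs: one list per fold, each a list of per-position
    [(char, confidence), ...] alternatives. Returns the voted text and a
    disagreement score (higher = committee less certain)."""
    voted, disagreement = [], 0.0
    for alternatives_per_fold in zip(*fold_outputs):   # aligned positions
        scores = defaultdict(float)
        for alts in alternatives_per_fold:             # one voter's top-N
            for ch, conf in alts:
                scores[ch] += conf
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        voted.append(ranked[0][0])
        if len(ranked) > 1:                            # small margin -> high disagreement
            total = sum(scores.values())
            disagreement += 1.0 - (ranked[0][1] - ranked[1][1]) / total
    return "".join(voted), disagreement

def select_for_annotation(lines, k):
    """Active learning: pick the k lines the committee disagrees on most."""
    return sorted(lines, key=lambda line: -vote_line(line)[1])[:k]

# Two folds, two positions, top-2 alternatives each
line = [
    [[("a", 0.9), ("o", 0.1)], [("m", 0.6), ("n", 0.4)]],  # fold 1
    [[("a", 0.8), ("e", 0.2)], [("n", 0.7), ("m", 0.3)]],  # fold 2
]
print(vote_line(line))  # ('an', <disagreement score>)
```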

    Autonomous Quadrocopter for Search, Count and Localization of Objects

    This chapter describes and evaluates the design and implementation of a new fully autonomous quadrocopter, which is capable of self-reliant search, counting, and localization of a predefined object on the ground inside a room.

    Mechanisms and factors determining DSB repair pathway choice in G2

    The aim of this work was to investigate the interplay between the different DNA double-strand break (DSB) repair pathways during the G2 phase of the cell cycle. In G2, DSBs located in euchromatic regions are repaired with fast kinetics via canonical NHEJ (c-NHEJ), whereas heterochromatic DSBs are repaired with slow kinetics via homologous recombination (HR). C-NHEJ ligates both DSB ends without requiring sequence homology. HR is a repair pathway in which the DSB ends are resected to produce ssDNA that invades the sister chromatid and uses its sequence as a template for error-free repair. If cells are deficient in the HR core factors BRCA2 or RAD51, the DSBs are resected but remain unrepaired. This can lead to genomic instability, reduced cell survival, and cancer. The presence of ssDNA itself might explain why c-NHEJ does not repair resected DSBs in a BRCA2-deficient cell to prevent an accumulation of unrepaired DSBs. However, an alternative NHEJ (alt-NHEJ) process has been described which uses microhomologies within the ssDNA to ligate both resected DSB ends. We therefore sought to further characterize resected DSBs in G2 and observed a release of ATM at resected DSBs. In G1, ATM is assembled at DSBs and facilitates the repair of heterochromatic DSBs by relaxing heterochromatin through phosphorylation of the heterochromatin-building factor KAP-1. Contrary to G1, in G2 ATM is needed to initiate resection but is dispensable for later stages of HR. A permanent heterochromatin relaxation, by downregulation of KAP-1 or expression of a phosphomimic form of KAP-1, allows the repair of resected DSBs in BRCA2- or RAD51-deficient cells by error-prone alt-NHEJ. Moreover, in HR-proficient cells a KAP-1 depletion also causes a switch from HR to alt-NHEJ repair. We support a model in which the heterochromatin is initially relaxed but, after extended resection, is reconstituted due to the release of ATM and the dephosphorylation of KAP-1. The restored heterochromatin structure then facilitates error-free HR and prevents the usage of error-prone alt-NHEJ. Secondly, we investigated the mechanistic reason for the ATM release at resected DSBs. The cascade of ATM assembly at DSBs involves first the phosphorylation of H2AX by ATM itself and the binding of MDC1 to this phosphorylation mark. ATM phosphorylates MDC1 to allow the binding of the ubiquitin ligase RNF8, which, together with RNF168, ubiquitinates the histone H2A/H2AX and the demethylase JMJD2A. JMJD2A is bound at H4K20me2 and degraded after its ubiquitination. After the degradation of JMJD2A, 53BP1 can bind H4K20me2, which in turn allows the assembly of ATM at the DSB site. We were able to show that at resected DSBs, 53BP1 is released and RNF8/168 activity is decreased, whereas H2AX phosphorylation and MDC1 binding are not affected. A switch from ATM to ATR activity at resected DSBs allows H2AX phosphorylation and MDC1 binding, but ATR cannot phosphorylate MDC1, so RNF8/168 activation is impaired. Without RNF8/168 activity, 53BP1 cannot bind H4K20me2 and assemble ATM at the resected break. This leads to a heterochromatin reconstitution, which facilitates HR and prevents alt-NHEJ. A co-depletion of JMJD2A and JMJD2B has been described to allow 53BP1 binding in RNF8/168-deficient cells. This co-depletion, or the use of a phosphomimic form of MDC1 that mimics a permanent phosphorylation to allow RNF8/168 activity at resected DSBs, allows the repair of heterochromatic DSBs in BRCA2-deficient cells. We suggest that under such conditions cells switch to alt-NHEJ instead of using HR, analogous to a KAP-1 knockdown. In summary, our results provide a model in which resection is the key step of the HR process, committing a heterochromatic DSB to HR and excluding end-joining repair: not the resection per se, but rather the heterochromatin reconstitution that follows from ATM release at resected DSBs. ATM is released because ATR cannot phosphorylate MDC1 and thereby trigger RNF8/168 activation. We suggest that without RNF8/168 activity, JMJD2A replaces 53BP1 at resected DSBs. Without 53BP1, ATM is released and the heterochromatin structure is reconstituted.

    State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

    In this paper we evaluate Optical Character Recognition (OCR) of 19th century Fraktur scripts without book-specific training, using mixed models, i.e. models trained to recognize a variety of fonts and typesets from previously unseen sources. We describe the training process leading to strong mixed OCR models and compare them to freely available models of the popular open source engines OCRopus and Tesseract, as well as to the commercial state-of-the-art system ABBYY. For evaluation, we use a varied collection of unseen data from books, journals, and a dictionary from the 19th century. The experiments show that training mixed models with real data is superior to training with synthetic data, and that the novel OCR engine Calamari considerably outperforms the other engines, on average reducing ABBYY's character error rate (CER) by over 70% and resulting in an average CER below 1%. Comment: Submitted to DHd 2019 (https://dhd2019.org/) which demands a... creative... submission format. Consequently, some captions might look weird and some links aren't clickable. Extended version with more technical details and some fixes to follow.
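    The character error rate (CER) used for these comparisons is conventionally defined as the Levenshtein edit distance between the recognized text and the ground truth, divided by the length of the ground truth. A minimal Python sketch of that metric (not tied to any particular engine's tooling):

```python
# Character error rate: edit distance (insertions, deletions, substitutions)
# between prediction and ground truth, normalized by ground-truth length.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction, ground_truth):
    return levenshtein(prediction, ground_truth) / max(len(ground_truth), 1)

print(cer("Jn dieſem Buche", "In dieſem Buche"))  # one substitution -> ~0.067
```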
