Transfer Learning for OCRopus Model Training on Early Printed Books
A method is presented that significantly reduces the character error rates
for OCR text obtained from OCRopus models trained on early printed books when
only small amounts of diplomatic transcriptions are available. This is achieved
by building from already existing models during training instead of starting
from scratch. To overcome the discrepancies between the character set of
the pretrained model and that of the additional ground truth, the OCRopus code
is adapted to allow for alphabet expansion or reduction: characters can now be
flexibly added to or deleted from the pretrained alphabet
when an existing model is loaded. For our experiments we use a self-trained
mixed model on early Latin prints and the two standard OCRopus models on modern
English and German Fraktur texts. The evaluation on seven early printed books
showed that training from the Latin mixed model reduces the average number of
errors by 43% and 26% compared to training from scratch with 60
and 150 lines of ground truth, respectively. Furthermore, it is shown that even
building from mixed models trained on data unrelated to the newly added
training and test data can lead to significantly improved recognition results.
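The alphabet expansion and reduction described in this abstract can be pictured with a minimal sketch. It is not the adapted OCRopus code: the function name `adapt_output_layer`, the row-per-character weight layout, and the random initialisation range are all assumptions made for illustration, and special classes such as a CTC blank label are not modelled.

```python
import random


def adapt_output_layer(old_chars, old_rows, new_chars, hidden_size, seed=0):
    """Map a pretrained output layer onto a new alphabet (hypothetical helper).

    old_chars   -- characters known to the pretrained model
    old_rows    -- one weight row (list of floats) per character in old_chars
    new_chars   -- characters required by the new ground truth
    hidden_size -- length of each weight row

    Characters present in both alphabets keep their pretrained row,
    characters missing from new_chars are dropped (reduction), and
    characters unknown to the pretrained model get a freshly initialised
    row (expansion).
    """
    rng = random.Random(seed)
    row_of = dict(zip(old_chars, old_rows))
    new_rows = []
    for ch in new_chars:
        if ch in row_of:
            # Reuse what the pretrained model already learned for this character.
            new_rows.append(list(row_of[ch]))
        else:
            # Assumed initialisation: small uniform random values.
            new_rows.append([rng.uniform(-0.1, 0.1) for _ in range(hidden_size)])
    return list(new_chars), new_rows


# Toy example: "b" is removed, the long s "ſ" is added.
chars, rows = adapt_output_layer(
    ["a", "b", "c"],
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    ["a", "c", "ſ"],
    hidden_size=2,
)
```

In this sketch only the output layer changes; the rest of the pretrained network would be reused unchanged before training continues on the new ground truth.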
Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin
In this paper we describe a dataset of German and Latin ground truth
(GT) for historical OCR in the form of printed text line images paired with
their transcription. This dataset, called GT4HistOCR, consists of
313,173 line pairs covering a wide period of printing dates from incunabula
from the 15th century to 19th century books printed in Fraktur types and is
openly available under a CC-BY 4.0 license. The special form of GT as line
image/transcription pairs makes it directly usable to train state-of-the-art
recognition models for OCR software employing recurrent neural networks in LSTM
architecture such as Tesseract 4 or OCRopus. We also provide some pretrained
OCRopus models for subcorpora of our dataset yielding between 95% (early
printings) and 98% (19th century Fraktur printings) character accuracy rates
on unseen test cases, a Perl script to harmonize GT produced by different
transcription rules, and give hints on how to construct GT for OCR purposes
which has requirements that may differ from linguistically motivated
transcriptions.
Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on
Automatic Text and Layout Recognition
Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning
We combine three methods which significantly improve the OCR accuracy of OCR
models trained on early printed books: (1) The pretraining method utilizes the
information stored in already existing models trained on a variety of typesets
(mixed models) instead of starting the training from scratch. (2) Performing
cross fold training on a single set of ground truth data (line images and their
transcriptions) with a single OCR engine (OCRopus) produces a committee whose
members then vote for the best outcome by also taking the top-N alternatives
and their intrinsic confidence values into account. (3) Following the principle
of maximal disagreement we select additional training lines which the voters
disagree most on, expecting them to offer the highest information gain for a
subsequent training (active learning). Evaluations on six early printed books
yielded the following results: On average the combination of pretraining and
voting improved the character accuracy by 46% when training five folds starting
from the same mixed model. This number rose to 53% when using different models
for pretraining, underlining the importance of diverse voters. Incorporating
active learning improved the obtained results by another 16% on average
(evaluated on three of the six books). Overall, the proposed methods lead to an
average error rate of 2.5% when training on only 60 lines. Using a substantial
ground truth pool of 1,000 lines brought the error rate down even further to
less than 1% on average.
Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on
Automatic Text and Layout Recognition
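Methods (2) and (3) of this abstract can be illustrated with a small sketch. It assumes the voters' outputs are already aligned position by position, that each voter reports its top-N alternatives as (character, confidence) pairs, and that disagreement is measured as the mean pairwise dissimilarity of the voters' line outputs. All function names and the disagreement measure are assumptions, not the procedure used in the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def vote_position(alternatives_per_voter):
    """Pick one character by summing the confidences of all alternatives.

    alternatives_per_voter -- one list per voter of (char, confidence) pairs
    """
    scores = defaultdict(float)
    for alternatives in alternatives_per_voter:
        for char, confidence in alternatives:
            scores[char] += confidence
    return max(scores, key=scores.get)


def vote_line(aligned_positions):
    """Vote on every aligned position and join the winners into a line."""
    return "".join(vote_position(position) for position in aligned_positions)


def disagreement(outputs):
    """Mean pairwise dissimilarity of the voters' outputs for one line."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def select_lines(voter_outputs_by_line, k):
    """Return the k line ids on which the voters disagree most."""
    ranked = sorted(
        voter_outputs_by_line,
        key=lambda line_id: disagreement(voter_outputs_by_line[line_id]),
        reverse=True,
    )
    return ranked[:k]


# Two of three voters rank "e" first, but the summed confidences favour "c".
position = [
    [("e", 0.55), ("c", 0.45)],
    [("c", 0.90), ("e", 0.10)],
    [("e", 0.50), ("c", 0.45)],
]
winner = vote_position(position)

# Lines the committee disagrees on are candidates for manual transcription.
candidates = select_lines(
    {"line1": ["abc", "abc", "abc"], "line2": ["abc", "abd", "xbc"]}, k=1
)
```

The toy position shows why taking alternatives and confidences into account can differ from a plain majority vote over each voter's best guess.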
Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus
Autonomous Quadrocopter for Search, Count and Localization of Objects
This chapter describes and evaluates the design and implementation of a new fully autonomous quadrocopter, which is capable of self-reliant search, count and localization of a predefined object on the ground inside a room.
Mechanisms and factors determining DSB repair pathway choice in G2
The aim of this work was to investigate the interplay between the different DNA double-strand break (DSB) repair pathways during the G2 phase of the cell cycle. In G2, DSBs located in euchromatic regions are repaired with fast kinetics via canonical NHEJ (c-NHEJ), whereas heterochromatic DSBs are repaired with slow kinetics via homologous recombination (HR). C-NHEJ ligates both DSB ends without requiring sequence homology. HR is a repair pathway in which the DSB ends are resected to produce ssDNA that invades the sister chromatid and uses its sequence as a template for error-free repair. If cells are deficient in the HR core factors BRCA2 or RAD51, the DSBs are resected but remain unrepaired. This can lead to genomic instability, reduced cell survival and cancer.
The presence of ssDNA itself might explain why c-NHEJ does not repair resected DSBs in a BRCA2-deficient cell to prevent an accumulation of unrepaired DSBs. However, an alternative NHEJ (alt-NHEJ) process has been described, which uses microhomologies within the ssDNA to ligate both resected DSB ends. We therefore sought to further characterize resected DSBs in G2 and observed an ATM release at resected DSBs. In G1, ATM is assembled at DSBs and facilitates the repair of heterochromatic DSBs through heterochromatin relaxation caused by the phosphorylation of the heterochromatin building factor KAP-1. In contrast to G1, in G2 ATM is needed to initiate resection but is dispensable for later stages of HR. A permanent heterochromatin relaxation by downregulation of KAP-1 or expression of a phosphomimic form of KAP-1 allows the repair of resected DSBs in BRCA2- or RAD51-deficient cells by error-prone alt-NHEJ. Moreover, in HR-proficient cells a KAP-1 depletion also causes a switch from HR to alt-NHEJ repair.
We support a model in which the heterochromatin is initially relaxed, but after extended resection the heterochromatin is reconstituted due to the release of ATM and the dephosphorylation of KAP-1. The restored heterochromatin structure then facilitates error-free HR and prevents the usage of error-prone alt-NHEJ.
Secondarily, we investigated the mechanistic reason for the ATM release at resected DSBs. The cascade of ATM assembly at DSBs first involves the phosphorylation of H2AX by ATM itself and the binding of MDC1 to this phosphorylation. ATM phosphorylates MDC1 to allow the binding of the ubiquitin ligase RNF8, which, together with RNF168, ubiquitinates the histone H2A/H2AX and the demethylase JMJD2A. JMJD2A is bound at H4K20me2 and degraded after its ubiquitination. After the degradation of JMJD2A, 53BP1 is able to bind H4K20me2, which in turn allows the assembly of ATM at the DSB site.
We were able to show that at resected DSBs, 53BP1 is released and RNF8/168 activity is decreased, whereas H2AX phosphorylation and MDC1 binding are not affected. A switch from ATM to ATR activity at resected DSBs allows H2AX phosphorylation and MDC1 binding. However, ATR cannot phosphorylate MDC1, so RNF8/168 activation is impaired. Without RNF8/168 activity, 53BP1 cannot bind H4K20me2 and assemble ATM at the resected break. This leads to a heterochromatin reconstitution, which facilitates HR and prevents alt-NHEJ.
A co-depletion of JMJD2A and JMJD2B has been described to allow 53BP1 binding in RNF8/168-deficient cells. This co-depletion, or the use of a phosphomimic form of MDC1 that mimics a permanent phosphorylation to allow RNF8/168 activity at resected DSBs, allows the repair of heterochromatic DSBs in BRCA2-deficient cells. We suggest that under such conditions cells switch to alt-NHEJ instead of using HR, as they do after a KAP-1 knockdown.
In summary, our results provide a model in which resection is the most important step of the HR process, committing the repair of a heterochromatic DSB to HR and excluding end-joining repair. The decisive factor is not the resection per se, but rather the heterochromatin reconstitution that follows the ATM release at resected DSBs. ATM is released because ATR is unable to phosphorylate MDC1 to trigger RNF8/168 activation. We suggest that without RNF8/168 activity, JMJD2A replaces 53BP1 at resected DSBs. Without 53BP1, ATM is released and the heterochromatin structure is reconstituted.
State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines
In this paper we evaluate Optical Character Recognition (OCR) of 19th century
Fraktur scripts without book-specific training using mixed models, i.e. models
trained to recognize a variety of fonts and typesets from previously unseen
sources. We describe the training process leading to strong mixed OCR models
and compare them to freely available models of the popular open source engines
OCRopus and Tesseract as well as the commercial state of the art system ABBYY.
For evaluation, we use a varied collection of unseen data from books, journals,
and a dictionary from the 19th century. The experiments show that training
mixed models with real data is superior to training with synthetic data and
that the novel OCR engine Calamari outperforms the other engines considerably,
on average reducing ABBYY's character error rate (CER) by over 70%, resulting in
an average CER below 1%.
Comment: Submitted to DHd 2019 (https://dhd2019.org/) which demands a...
creative... submission format. Consequently, some captions might look weird
and some links aren't clickable. Extended version with more technical details
and some fixes to follow
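The character error rate (CER) and the relative reductions quoted in these abstracts can be made concrete with a short sketch. It assumes the common definition of CER as the Levenshtein distance between OCR output and ground truth divided by the ground truth length; the papers' exact evaluation tooling is not shown here, and the numbers in the example are made up for illustration, not taken from the experiments.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cer(prediction, ground_truth):
    """Character error rate of one OCR line against its transcription."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(prediction, ground_truth) / len(ground_truth)


def relative_reduction(baseline_cer, improved_cer):
    """Share of the baseline's errors removed by the improved system."""
    return (baseline_cer - improved_cer) / baseline_cer


# Made-up values: a baseline CER of 4% lowered to 1% is a 75% reduction.
reduction = relative_reduction(0.04, 0.01)
line_cer = cer("Frakiur", "Fraktur")  # one substitution in seven characters
```

A relative reduction says how many of the baseline's errors disappear, so a reduction of over 70% and an absolute CER below 1% are two different statements about the same result.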