Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning
We combine three methods that significantly improve the accuracy of OCR
models trained on early printed books: (1) The pretraining method utilizes the
information stored in already existing models trained on a variety of typesets
(mixed models) instead of starting the training from scratch. (2) Performing
cross-fold training on a single set of ground truth data (line images and their
transcriptions) with a single OCR engine (OCRopus) produces a committee whose
members then vote for the best outcome by also taking the top-N alternatives
and their intrinsic confidence values into account. (3) Following the principle
of maximal disagreement we select additional training lines which the voters
disagree most on, expecting them to offer the highest information gain for a
subsequent training (active learning). Evaluations on six early printed books
yielded the following results: On average the combination of pretraining and
voting improved the character accuracy by 46% when training five folds starting
from the same mixed model. This number rose to 53% when using different models
for pretraining, underlining the importance of diverse voters. Incorporating
active learning improved the obtained results by another 16% on average
(evaluated on three of the six books). Overall, the proposed methods lead to an
average error rate of 2.5% when training on only 60 lines. Using a substantial
ground truth pool of 1,000 lines brought the error rate down even further to
less than 1% on average.
Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on
Automatic Text and Layout Recognition
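The voting and active learning steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the voters' outputs are already aligned position by position, that each voter reports (character, confidence) alternatives per position, and it uses a similarity ratio from Python's difflib as a stand-in disagreement measure. All function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def confidence_vote(position_alternatives):
    """Pick one character for a single aligned position.

    position_alternatives holds one list per voter, each containing
    (character, confidence) pairs for that voter's top-N alternatives.
    The confidences are summed per character and the best total wins.
    """
    scores = defaultdict(float)
    for voter_alternatives in position_alternatives:
        for char, confidence in voter_alternatives:
            scores[char] += confidence
    return max(scores, key=scores.get)


def vote_line(aligned_line):
    """Vote on every position of a line (alignment is assumed to be given)."""
    return "".join(confidence_vote(position) for position in aligned_line)


def disagreement(voter_outputs):
    """Mean pairwise dissimilarity of the voters' text outputs (0.0 = identical)."""
    pairs = list(combinations(voter_outputs, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def select_lines(candidates, k):
    """Return the ids of the k lines the voters disagree on most.

    candidates maps a line id to the list of voter outputs for that line.
    """
    ranked = sorted(candidates, key=lambda line_id: disagreement(candidates[line_id]),
                    reverse=True)
    return ranked[:k]
```

With summed confidences, a character preferred by a minority of voters can still win when those voters are much more confident, which is the difference from plain majority voting.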
Transfer Learning for OCRopus Model Training on Early Printed Books
A method is presented that significantly reduces the character error rates
for OCR text obtained from OCRopus models trained on early printed books when
only small amounts of diplomatic transcriptions are available. This is achieved
by building from already existing models during training instead of starting
from scratch. To overcome the discrepancies between the set of characters of
the pretrained model and the additional ground truth, the OCRopus code is
adapted to allow for alphabet expansion or reduction. Characters can now be
flexibly added to and deleted from the pretrained alphabet
when an existing model is loaded. For our experiments we use a self-trained
mixed model on early Latin prints and the two standard OCRopus models on modern
English and German Fraktur texts. The evaluation on seven early printed books
showed that training from the Latin mixed model reduces the average number of
errors by 43% and 26% compared to training from scratch with 60
and 150 lines of ground truth, respectively. Furthermore, it is shown that even
building from mixed models trained on data unrelated to the newly added
training and test data can lead to significantly improved recognition results.
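The alphabet expansion and reduction mentioned in this abstract can be pictured as rebuilding the output layer of a pretrained network. The following is a minimal sketch under stated assumptions, not the actual OCRopus code: it assumes the output layer has one weight row and one bias per character, ignores any special classes such as a blank label, and the function name and initialization range are hypothetical.

```python
import random


def adapt_output_layer(weights, biases, old_alphabet, new_alphabet, seed=0):
    """Rebuild output-layer parameters for a changed character set.

    weights: one row (list of floats) per character in old_alphabet
    biases:  one float per character in old_alphabet
    Characters kept in new_alphabet reuse their pretrained row, newly
    added characters get a small random row, and characters missing from
    new_alphabet are dropped.
    """
    rng = random.Random(seed)
    old_index = {char: i for i, char in enumerate(old_alphabet)}
    hidden_size = len(weights[0])

    new_weights, new_biases = [], []
    for char in new_alphabet:
        if char in old_index:
            # Known character: copy the pretrained parameters.
            new_weights.append(list(weights[old_index[char]]))
            new_biases.append(biases[old_index[char]])
        else:
            # New character: start from a small random initialization.
            new_weights.append([rng.gauss(0.0, 0.01) for _ in range(hidden_size)])
            new_biases.append(0.0)
    return new_weights, new_biases
```

Only the output layer changes in this sketch. The remaining pretrained parameters would be loaded unchanged and then trained further on the new ground truth.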
Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus
State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines
In this paper we evaluate Optical Character Recognition (OCR) of 19th century
Fraktur scripts without book-specific training using mixed models, i.e. models
trained to recognize a variety of fonts and typesets from previously unseen
sources. We describe the training process leading to strong mixed OCR models
and compare them to freely available models of the popular open source engines
OCRopus and Tesseract as well as the commercial state of the art system ABBYY.
For evaluation, we use a varied collection of unseen data from books, journals,
and a dictionary from the 19th century. The experiments show that training
mixed models with real data is superior to training with synthetic data and
that the novel OCR engine Calamari outperforms the other engines considerably,
on average reducing ABBYY's character error rate (CER) by over 70%, resulting in
an average CER below 1%.
Comment: Submitted to DHd 2019 (https://dhd2019.org/) which demands a...
creative... submission format. Consequently, some captions might look weird
and some links aren't clickable. Extended version with more technical details
and some fixes to follow
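For readers unfamiliar with the metric, the character error rate and a relative reduction of it can be computed as in the sketch below. It assumes the common definition of CER as the Levenshtein distance between OCR output and ground truth divided by the ground truth length. The abstract does not spell out its exact evaluation procedure, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(ground_truth, ocr_output):
    """Character error rate: edit distance divided by ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


def relative_reduction(baseline_cer, new_cer):
    """Fraction by which new_cer improves on baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer
```

As an illustration with made-up numbers, a baseline CER of 4% and a new CER of 1% correspond to a relative reduction of 75%.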