Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin
In this paper we describe a dataset of German and Latin ground truth (GT)
for historical OCR in the form of printed text line images paired with
their transcriptions. This dataset, called GT4HistOCR, consists of
313,173 line pairs covering a wide range of printing dates, from
15th-century incunabula to 19th-century books printed in Fraktur types, and is
openly available under a CC-BY 4.0 license. The special form of GT as line
image/transcription pairs makes it directly usable for training state-of-the-art
recognition models for OCR software employing recurrent neural networks in LSTM
architecture, such as Tesseract 4 or OCRopus. We also provide pretrained
OCRopus models for subcorpora of our dataset yielding between 95% (early
printings) and 98% (19th-century Fraktur printings) character accuracy rates
on unseen test cases, a Perl script to harmonize GT produced by different
transcription rules, and hints on how to construct GT for OCR purposes,
whose requirements may differ from those of linguistically motivated
transcriptions. (Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special
Issue on Automatic Text and Layout Recognition)
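
As a practical note, the line-pair format makes the corpus trivial to load programmatically. The following is a minimal sketch, assuming the usual <basename>.png / <basename>.gt.txt naming convention for line images and transcriptions; the directory path is a placeholder.

```python
from pathlib import Path

def load_gt_pairs(corpus_dir):
    """Yield (line image path, transcription) pairs from a directory
    tree of *.png line images and matching *.gt.txt transcriptions
    (the <basename>.png / <basename>.gt.txt convention)."""
    for txt_path in sorted(Path(corpus_dir).rglob("*.gt.txt")):
        img_path = txt_path.with_name(txt_path.name[:-len(".gt.txt")] + ".png")
        if img_path.exists():
            yield img_path, txt_path.read_text(encoding="utf-8").strip()

if __name__ == "__main__":
    # "GT4HistOCR/corpus" is a placeholder path for an unpacked subcorpus
    pairs = list(load_gt_pairs("GT4HistOCR/corpus"))
    print(f"{len(pairs)} line pairs loaded")
```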
Transfer Learning for OCRopus Model Training on Early Printed Books
A method is presented that significantly reduces the character error rates
for OCR text obtained from OCRopus models trained on early printed books when
only small amounts of diplomatic transcriptions are available. This is achieved
by building from already existing models during training instead of starting
from scratch. To overcome the discrepancies between the character set of the
pretrained model and that of the additional ground truth, the OCRopus code is
adapted to allow for alphabet expansion or reduction: characters can now be
flexibly added to or deleted from the pretrained alphabet when an existing
model is loaded. For our experiments we use a self-trained mixed model on
early Latin prints and the two standard OCRopus models on modern English and
German Fraktur texts. The evaluation on seven early printed books showed that
training from the Latin mixed model reduces the average number of errors by
43% and 26% compared to training from scratch with 60 and 150 lines of ground
truth, respectively. Furthermore, it is shown that even building from mixed
models trained on data unrelated to the newly added training and test data can
lead to significantly improved recognition results.
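
The alphabet expansion can be pictured as surgery on the recognizer's final output layer. The sketch below is not the authors' actual OCRopus patch but a simplified NumPy illustration of the idea: output rows for characters retained from the pretrained codec keep their learned weights, rows for newly added characters are freshly initialized, and dropped characters lose their rows.

```python
import numpy as np

def adapt_output_layer(W_old, b_old, old_codec, new_codec, rng=None):
    """Rebuild the softmax layer of a line recognizer when the alphabet
    changes: rows for characters kept from the old codec retain their
    learned weights, rows for newly added characters get small random
    weights, and characters dropped from the codec receive no row."""
    rng = rng or np.random.default_rng(0)
    n_hidden = W_old.shape[1]
    W_new = rng.normal(0.0, 0.01, size=(len(new_codec), n_hidden))
    b_new = np.zeros(len(new_codec))
    old_index = {ch: i for i, ch in enumerate(old_codec)}
    for j, ch in enumerate(new_codec):
        if ch in old_index:                 # keep pretrained weights
            W_new[j] = W_old[old_index[ch]]
            b_new[j] = b_old[old_index[ch]]
    return W_new, b_new

# toy usage: add the long s to the codec, drop an unused symbol
old_codec = ["<blank>", "a", "b", "@"]
new_codec = ["<blank>", "a", "b", "ſ"]
W, b = adapt_output_layer(np.ones((4, 8)), np.zeros(4), old_codec, new_codec)
```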
Profiling of OCR'ed Historical Texts Revisited
In the absence of ground truth it is not possible to automatically determine
the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for
interactive postcorrection of OCR'ed historical printings it is extremely
useful to have a statistical profile available that provides an estimate of
error classes with associated frequencies, and that points to conjectured
errors and suspicious tokens. The method introduced in Reffle (2013) computes
such a profile, combining lexica, pattern sets and advanced matching techniques
in a specialized Expectation Maximization (EM) procedure. Here we improve this
method in three respects: First, the method in Reffle (2013) is not adaptive:
user feedback obtained by actual postcorrection steps cannot be used to compute
refined profiles. We introduce a variant of the method that is open for
adaptivity, taking correction steps of the user into account. This leads to
higher precision with respect to recognition of erroneous OCR tokens. Second,
during postcorrection often new historical patterns are found. We show that
adding new historical patterns to the linguistic background resources leads to
a second kind of improvement, enabling even higher precision by telling
historical spellings apart from OCR errors. Third, the method in Reffle (2013)
does not make any active use of tokens that cannot be interpreted in the
underlying channel model. We show that adding these uninterpretable tokens to
the set of conjectured errors leads to a significant improvement of the recall
for error detection while at the same time improving precision.
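
A toy version of the EM idea can be written down compactly. In the sketch below, each OCR token carries candidate interpretations, each encoded as the list of pattern IDs (historical spellings or OCR errors) needed to derive it from a lexicon entry; the data structures and normalization are illustrative assumptions, not the actual profiler of Reffle (2013).

```python
from collections import defaultdict

def profile_em(token_candidates, n_iter=10):
    """Toy EM profile estimation: E-step weights each candidate
    interpretation by the product of the current pattern probabilities;
    M-step re-estimates the pattern probabilities from the expected
    counts accumulated over all tokens."""
    probs = defaultdict(lambda: 0.5)        # uninformed start
    for _ in range(n_iter):
        counts = defaultdict(float)
        n_tokens = 0
        for candidates in token_candidates:
            weights = []
            for patterns in candidates:
                w = 1.0
                for p in patterns:
                    w *= probs[p]
                weights.append(w)
            z = sum(weights) or 1.0
            for patterns, w in zip(candidates, weights):
                for p in patterns:
                    counts[p] += w / z      # expected pattern count
            n_tokens += 1
        for p in counts:
            probs[p] = counts[p] / n_tokens
    return dict(probs)

# one token with a single interpretation, one with two competing ones
tokens = [[["hist:v->u"]], [["ocr:n->ri"], ["hist:v->u", "ocr:n->ri"]]]
print(profile_em(tokens))
```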
Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning
We combine three methods which significantly improve the OCR accuracy of OCR
models trained on early printed books: (1) The pretraining method utilizes the
information stored in already existing models trained on a variety of typesets
(mixed models) instead of starting the training from scratch. (2) Performing
cross-fold training on a single set of ground truth data (line images and their
transcriptions) with a single OCR engine (OCRopus) produces a committee whose
members then vote for the best outcome by also taking the top-N alternatives
and their intrinsic confidence values into account. (3) Following the principle
of maximal disagreement we select additional training lines which the voters
disagree most on, expecting them to offer the highest information gain for a
subsequent training (active learning). Evaluations on six early printed books
yielded the following results: On average the combination of pretraining and
voting improved the character accuracy by 46% when training five folds starting
from the same mixed model. This number rose to 53% when using different models
for pretraining, underlining the importance of diverse voters. Incorporating
active learning improved the obtained results by another 16% on average
(evaluated on three of the six books). Overall, the proposed methods lead to an
average error rate of 2.5% when training on only 60 lines. Using a substantial
ground truth pool of 1,000 lines brought the error rate down even further to
less than 1% on average. (Comment: Submitted to JLCL Volume 33 (2018), Issue 1:
Special Issue on Automatic Text and Layout Recognition)
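
The maximal-disagreement criterion can be illustrated independently of any particular OCR engine. The sketch below scores each line by the mean pairwise normalized edit distance among the fold models' outputs and picks the most contested lines; the function names and scoring details are illustrative assumptions, and the actual voters additionally exploit top-N alternatives and confidence values.

```python
def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def select_by_disagreement(predictions, k):
    """predictions maps a line id to the outputs of the fold models for
    that line. Score each line by the mean pairwise normalized edit
    distance among the voters' outputs and return the k most contested
    lines, i.e. those expected to yield the highest information gain
    when transcribed for further training."""
    def score(outs):
        pairs = [(a, b) for i, a in enumerate(outs) for b in outs[i + 1:]]
        return sum(levenshtein(a, b) / max(len(a), len(b), 1)
                   for a, b in pairs) / max(len(pairs), 1)
    return sorted(predictions, key=lambda lid: score(predictions[lid]),
                  reverse=True)[:k]

# toy usage with three voters and two lines
preds = {"l1": ["domini", "domini", "dornini"],
         "l2": ["quoque", "quoque", "quoque"]}
print(select_by_disagreement(preds, 1))   # -> ['l1']
```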
LatMor: A Latin Finite-State Morphology Encoding Vowel Quantity
We present the first large-coverage, open-source finite-state morphology for Latin (called LatMor) which parses as well as generates vowel quantity information. LatMor is based on the Berlin Latin Lexicon, comprising about 70,000 lemmata of classical Latin, compiled by the group of Dietmar Najock in their work on concordances of Latin authors (see Rapsch and Najock, 1991) and recently updated by us. Compared to the well-known Morpheus system of Crane (1991, 1998), which is written in the C programming language, based on the 50,000 lemmata of Lewis and Short (1907), and not well documented and therefore not easily extended, our new morphology has a larger vocabulary and is about 60 to 1,200 times faster. It is built in the form of finite-state transducers, which can analyze as well as generate wordforms and represent the state-of-the-art implementation method in computational morphology. The current coverage of LatMor is evaluated against Morpheus and other existing systems (some of which are not openly accessible) and is shown to rank first among all systems together with the Pisa LEMLAT morphology (not yet openly accessible). Recall has been analyzed taking the Latin Dependency Treebank as gold data, and the remaining defect classes have been identified. LatMor is available under an open-source license to allow its wide usage by all interested parties.
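
To make the analyze/generate duality concrete: a finite-state transducer maps between morphological analyses and surface forms in both directions. The toy stand-in below uses a pair of Python dictionaries and an invented tag notation (with macrons marking vowel quantity); the real LatMor is a full transducer covering about 70,000 lemmata.

```python
# Toy stand-in for a bidirectional morphology: the real LatMor is a
# finite-state transducer; two dictionaries illustrate the analyze and
# generate directions. Tag notation and stems are invented for this example.
PARADIGM = {
    "amā<V><pres><ind><1><sg>": "amō",
    "amā<V><pres><ind><2><sg>": "amās",
    "amā<V><pres><ind><3><sg>": "amat",
}
SURFACE = {}
for analysis, form in PARADIGM.items():
    SURFACE.setdefault(form, []).append(analysis)

def generate(analysis):
    """Lemma-plus-tags -> wordform (the generation direction)."""
    return PARADIGM.get(analysis)

def analyze(form):
    """Wordform -> list of possible analyses (the parsing direction)."""
    return SURFACE.get(form, [])

print(generate("amā<V><pres><ind><2><sg>"))   # amās
print(analyze("amās"))
```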
State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines
In this paper we evaluate Optical Character Recognition (OCR) of 19th century
Fraktur scripts without book-specific training using mixed models, i.e. models
trained to recognize a variety of fonts and typesets from previously unseen
sources. We describe the training process leading to strong mixed OCR models
and compare them to freely available models of the popular open source engines
OCRopus and Tesseract as well as the commercial state of the art system ABBYY.
For evaluation, we use a varied collection of unseen data from books, journals,
and a dictionary from the 19th century. The experiments show that training
mixed models with real data is superior to training with synthetic data and
that the novel OCR engine Calamari outperforms the other engines considerably,
on average reducing ABBYY's character error rate (CER) by over 70% and
resulting in an average CER below 1%. (Comment: Submitted to DHd 2019
(https://dhd2019.org/), which demands a... creative... submission format.
Consequently, some captions might look weird and some links aren't clickable.
Extended version with more technical details and some fixes to follow.)
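
For reference, the character error rate (CER) reported in these evaluations is the character-level edit distance between the OCR output and the ground truth, normalized by the ground-truth length; a self-contained sketch:

```python
def cer(recognized, ground_truth):
    """Character error rate: edit distance between the OCR output and
    the ground truth, normalized by the ground-truth length."""
    prev = list(range(len(ground_truth) + 1))
    for i, cr in enumerate(recognized, 1):
        cur = [i]
        for j, cg in enumerate(ground_truth, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cr != cg)))
        prev = cur
    return prev[-1] / max(len(ground_truth), 1)

print(f"{cer('Jn dem Walde', 'In dem Walde'):.3f}")   # one substitution -> 0.083
```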
Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts
Systems for post-correction of OCR results for historical texts are based on statistical correction models obtained by supervised learning. For training, suitable collections of ground truth materials are needed. In this paper we investigate how the power of automated OCR post-correction depends on the form of the ground truth data and other training settings used for the computation of a post-correction model. The post-correction system A-PoCoTo considered here is based on a profiler service that computes a statistical profile for an OCR-ed input text. We also look in detail at the influence of the profiler resources and other settings selected for training and evaluation. As a practical result of several fine-tuning steps, a general post-correction model is achieved for which experiments on a large and heterogeneous collection of OCR-ed historical texts show a consistent improvement of base OCR accuracy. The results presented are meant to provide insights for libraries that want to apply OCR post-correction to a larger spectrum of distinct OCR-ed historical printings and ask for "representative" results.
CIS OCR Workshop v1.0: OCR and postcorrection of early printings for digital humanities
The two-day CIS OCR Workshop on "OCR and postcorrection of early printings for digital humanities", originally held at LMU Munich on 14/15 September 2015 (see http://www.cis.lmu.de/ocrworkshop).
Release date: 2016-02-25
CIS OCR Workshop by Uwe Springmann and Florian Fink is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.