40,240 research outputs found
An automatic part-of-speech tagger for Middle Low German
Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them
Handwriting Recognition of Historical Documents with few labeled data
Historical documents present many challenges for offline handwriting
recognition systems, among them, the segmentation and labeling steps. Carefully
annotated textlines are needed to train an HTR system. In some scenarios,
transcripts are only available at the paragraph level with no text-line
information. In this work, we demonstrate how to train an HTR system with few
labeled data. Specifically, we train a deep convolutional recurrent neural
network (CRNN) system on only 10% of manually labeled text-line data from a
dataset and propose an incremental training procedure that covers the rest of
the data. Performance is further increased by augmenting the training set with
specially crafted multiscale data. We also propose a model-based normalization
scheme which considers the variability in the writing scale at the recognition
phase. We apply this approach to the publicly available READ dataset. Our
system achieved the second best result during the ICDAR2017 competition
- …