10 research outputs found

    Neue Frakturmodelle für Tesseract (New Fraktur Models for Tesseract)

    New models for improved recognition of historical typefaces with Tesseract OCR are presented.

    State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

    In this paper we evaluate Optical Character Recognition (OCR) of 19th century Fraktur scripts without book-specific training using mixed models, i.e. models trained to recognize a variety of fonts and typesets from previously unseen sources. We describe the training process leading to strong mixed OCR models and compare them to freely available models of the popular open source engines OCRopus and Tesseract as well as the commercial state of the art system ABBYY. For evaluation, we use a varied collection of unseen data from books, journals, and a dictionary from the 19th century. The experiments show that training mixed models with real data is superior to training with synthetic data and that the novel OCR engine Calamari outperforms the other engines considerably, on average reducing ABBYY's character error rate (CER) by over 70%, resulting in an average CER below 1%. Comment: Submitted to DHd 2019 (https://dhd2019.org/) which demands a... creative... submission format. Consequently, some captions might look weird and some links aren't clickable. Extended version with more technical details and some fixes to follow.
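
    The CER figures quoted in the abstract above can be made concrete with a minimal sketch. It assumes the common definition of CER as the Levenshtein edit distance between OCR output and ground truth, divided by the ground-truth length; the abstract does not state the exact definition the authors used. The function names and all sample values below are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate: edit distance over ground-truth length (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction by which new_cer improves on baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical sample: one substituted character in a seven-character word.
sample_cer = cer("Fraktur", "Fraftur")

# Hypothetical CER values: dropping from 3% to 0.9% is a 70% relative reduction.
reduction = relative_reduction(0.03, 0.009)
```

    Under this definition, a relative reduction of over 70% combined with a final CER below 1% is consistent with a baseline CER of a few percent, as in the made-up values above.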

    Ground Truth erstellen, OCR-Modelle verbessern (Creating Ground Truth, Improving OCR Models)

    Using concrete examples, the talk describes how training artificial neural networks allows automated text recognition to deliver the best possible results for historical prints.

    ICDAR 2019 Competition on Post-OCR Text Correction

    This paper describes the second round of the ICDAR 2019 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The present challenge consists of two tasks: 1) error detection and 2) error correction. An original dataset of 22M OCR-ed symbols along with an aligned ground truth was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. Different sources were aggregated and contain newspapers, historical printed documents as well as manuscripts and shopping receipts, covering 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). Five teams submitted results; the error detection scores vary from 41 to 95% and the best error correction improvement is 44%. This competition, which counted 34 registrations, illustrates the strong interest of the community in improving OCR output, which is a key issue for any digitization process involving textual data.

    Advances and Limitations in Open Source Arabic-Script OCR: A Case Study

    This work presents an accuracy study of the open source OCR engine, Kraken, on the leading Arabic scholarly journal, al-Abhath. In contrast with other commercially available OCR engines, Kraken is shown to be capable of producing highly accurate Arabic-script OCR. The study also assesses the relative accuracy of typeface-specific and generalized models on the al-Abhath data and provides a microanalysis of the “error instances” and the contextual features that may have contributed to OCR misrecognition. Building on this analysis, the paper argues that Arabic-script OCR can be significantly improved through (1) a more systematic approach to training data production, and (2) the development of key technological components, especially multi-language models and improved line segmentation and layout analysis. Kiessling, Benjamin, Gennady Kurin, Matthew Miller, and Kader Smail. 2021. “Advances and Limitations in Open Source Arabic-Script OCR: A Case Study.”

    Qualität in der Inhaltserschließung (Quality in Subject Indexing)

    This edited volume deals with issues relating to the quality of subject cataloging in the digital age, where heterogeneous articles from different processes meet, and attempts to define important quality standards. Topics range from metadata and the cataloging policies of the German National Library, the GND, and the head offices of the German library association, to the presentation of a range of different projects, such as QURATOR and SoNAR.