    OCR Processing of Swedish Historical Newspapers Using Deep Hybrid CNN-LSTM Networks

    Deep CNN-LSTM hybrid neural networks have been shown to improve the accuracy of Optical Character Recognition (OCR) models for different languages. In this paper we examine to what extent these networks improve OCR accuracy rates on Swedish historical newspapers. By experimenting with the open source OCR engine Calamari, we are able to show that mixed deep CNN-LSTM hybrid models outperform previous models on the task of character recognition of Swedish historical newspapers spanning 1818-1848. We achieved an average character accuracy rate (CAR) of 97.43%, which is a new state-of-the-art result on 19th century Swedish newspaper text.

    Verb Morphology of Hebrew and Maltese - Towards an Open Source Type Theoretical Resource Grammar in GF.

    One of the first issues that a programmer must tackle when writing a complete computer program that processes natural language is how to design the morphological component. A typical morphological component should cover three main aspects in a given language: (1) the lexicon, i.e. how morphemes are encoded, (2) orthographic changes, and (3) morphotactic variations. This is particularly challenging when dealing with Semitic languages because of their non-concatenative morphology, called root-and-pattern morphology. In this paper we describe the design of two morphological components for Hebrew and Maltese verbs in the context of the Grammatical Framework (GF). The components are implemented as part of larger grammars and are currently under development. We found that although Hebrew and Maltese share some common characteristics in their morphology, it seems difficult to generalize morphosyntactic rules across Semitic verbs when the focus is on lexicons motivated by computational linguistics. We describe and compare the verb morphology of Hebrew and Maltese and motivate our implementation efforts towards complete open source type-theoretical resource grammars for Semitic languages. Future work will focus on semantic aspects of morphological processing.

    A Novel Machine Learning Based Approach for Post-OCR Error Detection

    Post-processing is the most conventional approach for correcting errors caused by Optical Character Recognition (OCR) systems. Two steps are usually taken to correct OCR errors: detection and correction. For the first task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel system for error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR-error detection by notable margins. We achieved state-of-the-art F1-scores for eight out of the ten European languages involved. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum is for Polish, from 0.82 to 0.84.

    High-quality translation: Molto tools and applications

    MOLTO (Multilingual On Line Translation, FP7-ICT-247914, www.molto-project.eu) is a European project focusing on translation on the web. MOLTO targets translation that has production quality, that is, usable for quick and reliable dissemination of information. MOLTO’s main focus is to increase the productivity of such translation systems, building on the technology of GF (Grammatical Framework) and its Resource Grammar Library. But MOLTO also develops hybrid methods which increase the quality of Statistical Machine Translation (SMT) by adding linguistic information, or bootstrap grammatical models from statistical models. This paper gives a brief overview of MOLTO’s latest achievements, many of which are more thoroughly described in separate papers and available as web-based demos and as open-source software.