9 research outputs found

Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals – Collected Notes on Quality Improvement

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Publication venue: CEUR-WS.org
Publication date: 06/03/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Optical character recognition with neural networks and post-correction with finite state methods

Author: Drobac Senka
Linden Krister
Publication date: 01/12/2020

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is too low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far none of the methods has managed to train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed-language model for the entire corpus and finding a voting setup that further improves the results. Peer reviewed
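
The abstract above reports results as character error rate (CER). As a reading aid, the following is a minimal sketch of how CER is commonly computed: the Levenshtein edit distance between a reference transcription and the OCR output, divided by the reference length. The function names and the sample strings are hypothetical, and the paper's own evaluation tooling may normalize text or count errors differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into the empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in an 11-character word.
print(f"{cer('sanomalehti', 'sanomalebti'):.3f}")
```

Under this definition, a CER of 1.7% corresponds to roughly 17 character edits per 1,000 reference characters.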

Helsingin yliopiston digitaalinen arkisto

Wanca in Korp : Text corpora for underresourced Uralic languages

Author: Jauhiainen Heidi
Jauhiainen Tommi
Linden Krister
Publication venue: University of Oulu
Publication date: 01/01/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Border crossing and trespassing? : Expanding digital humanities research to developing peripheries with the novel digital technologies

Author: Hyyryläinen Torsti
Ryynänen Toni
Publication venue: University of Oulu
Publication date: 01/01/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Implementación de una plataforma web de almacenamiento y difusión de planos arquitectonicos antiguos usando OCR y Tecnologías WEB

Author: Agramonte Flores Martín
Publication venue: 'Baishideng Publishing Group Inc.'
Publication date: 09/06/2022

The objective of this thesis is to integrate OCR technology and web technologies into a single web platform for storing and disseminating old architectural plans. The plans are currently registered manually on index cards (see Annex 1), so the web platform that was developed will provide a more secure backup and faster queries. Overall development followed the agile Scrum methodology, which made it possible to produce increasingly complete functional deliverables as work on the requirements progressed. Having opted for web technologies, the programming languages used were PHP, JavaScript, and HTML; the Laravel development framework was ultimately chosen because it provided an MVC architecture and made it easy to integrate the OCR. The OCR technology was selected through an analysis with t-tests comparing two technologies, Tesseract OCR and OCR Space Api; the final result was the selection of OCR Space Api, which achieved a hit rate of 80.28% against 71.58% for Tesseract. Although its hit rate was within an acceptable range, Tesseract OCR was discarded because, as a set of libraries, it depends on the processing speed of the computer on which it runs, whereas with OCR Space, being an API, most of the image-scanning processing takes place in the cloud, and its only limitation is the number of queries made per day. The final result was a web platform that uses OCR technology to autocomplete some text fields by selecting information on a scanned plan; it also meets the objective of storing that information so that it can later be consulted by any registered user through a responsive, user-friendly interface.
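
The abstract above says the two OCR technologies were compared with t-tests but does not say which variant was used. As an illustration only, the sketch below computes Welch's two-sample t statistic from per-document accuracy scores using the Python standard library. The choice of Welch's test, the function name, and the sample values are all assumptions made for this example and are not taken from the thesis.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two independent samples (unequal variances)."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two observations")
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        raise ValueError("both samples have zero variance")
    return (mean(a) - mean(b)) / se


# Hypothetical toy scores, NOT data from the thesis.
engine_a = [2.0, 4.0, 6.0]
engine_b = [1.0, 2.0, 3.0]
print(round(welch_t(engine_a, engine_b), 4))
```

The statistic alone does not give a p-value; that additionally requires the Welch-Satterthwaite degrees of freedom and a t-distribution lookup, which a statistics package would normally provide.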

Repositorio de Tesis - Universidad Católica de Santa María

Proceedings of the Research Data And Humanities (RDHUM) 2019 Conference: Data, Methods And Tools

Author: Ali Zeeshan Ijaz
Iiro Tiihonen
Leo Lahti
Mikko Tolonen
Publication venue: Suomen kasvatuksen ja koulutuksen historian seura
Publication date: 28/10/2022

Analytical bibliography aims to understand the production of books. Systematic methods can be used to determine an overall view of the publication history. In this paper, we present a state-of-the-art analytical approach to the determination of editions using the ESTC metadata. The preliminary results illustrate that metadata cleanup and analysis can provide opportunities for edition determination. This would significantly help projects aiming to do large-scale text mining.

UTUPub