
    Generating a training corpus for OCR post-correction using encoder-decoder model

    In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
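
    As a rough illustration of the character-level approach this abstract describes, the sketch below trains a bidirectional LSTM to map noisy OCR character sequences to clean ones. It is a sequence-labelling simplification of the encoder-decoder setup the title mentions, not the paper's actual code; the vocabulary size, dimensions, and random stand-in data are all assumptions.

```python
# Minimal sketch: a character-level biLSTM for OCR post-correction,
# mapping each noisy character position to a clean output character.
import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Concatenated forward/backward states -> distribution over chars.
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, char_ids):               # (batch, seq_len)
        hidden, _ = self.lstm(self.embed(char_ids))
        return self.out(hidden)                # (batch, seq_len, vocab)

VOCAB = 100                                    # assumed character inventory
model = CharBiLSTM(VOCAB)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-ins for (noisy OCR text, clean reference text) pairs.
noisy = torch.randint(0, VOCAB, (8, 50))
clean = torch.randint(0, VOCAB, (8, 50))

logits = model(noisy)
loss = loss_fn(logits.reshape(-1, VOCAB), clean.reshape(-1))
loss.backward()
optimizer.step()
```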

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    An OCR Post-correction Approach using Deep Learning for Processing Medical Reports

    According to a recent Deloitte study, the COVID-19 pandemic continues to place a huge strain on the global health care sector. COVID-19 has also catalysed digital transformation across the sector to improve operational efficiency. As a result, the amount of digitally stored patient data, such as discharge letters, scan images, test results, and free-text entries by doctors, has grown significantly. In 2020, 2,314 exabytes of medical data were generated globally. This medical data does not conform to a generic structure and is mostly in the form of unstructured, digitally generated or scanned paper documents stored as part of a patient’s medical reports. This unstructured data is digitised using an Optical Character Recognition (OCR) process. A key challenge here is that the accuracy of the OCR process varies because current OCR engines cannot reliably transcribe scanned or handwritten documents in which text may be skewed, obscured, or illegible. This is compounded by the fact that the processed text comprises specific medical terminology that does not necessarily form part of general-language lexicons. The proposed work uses a deep-neural-network-based, self-supervised pre-training technique, Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa), which learns to predict hidden (masked) sections of text and can therefore fill in the gaps of non-transcribable parts of the documents being processed. Evaluating the proposed method on domain-specific datasets that include real medical documents shows a significantly reduced word error rate, demonstrating the effectiveness of the approach.
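
    To make the masked-prediction idea concrete, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline with an off-the-shelf `roberta-base` checkpoint. The example sentence and model choice are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: treat a non-transcribable OCR span as a masked token and let
# a pretrained RoBERTa model propose fillers for it.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

# "<mask>" stands in for a region the OCR engine could not transcribe.
line = "The patient was discharged with a prescription for <mask> use."
for candidate in fill(line, top_k=3):
    print(f"{candidate['token_str']!r}  score={candidate['score']:.3f}")
```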

    Automated Identification of Severe Errors in Speech to Text Transcripts

    In this thesis we explore how problematic misplaced words can be automatically identified in speech-to-text transcripts. Automatic Speech Recognition (ASR) systems automatically generate text from human speech. Because natural spoken language is complex, owing to dialects, variations in speaking speed, and differences between how humans talk and the training data, such ASR systems can introduce errors. Sometimes these errors are severe enough to become problematic. Post-processing an ASR system means finding such errors after the system has generated the text. We want to find out to what degree word probabilities computed with pre-trained language models can be used to solve this problem, and to what degree these probabilities can be used to build a classifier that detects problematic words. We present our solution, in which we synthetically introduce problematic words into text documents. We then compute probabilities for both problematic and non-problematic words in these documents to investigate whether the models treat them differently. We show that the models generally assign lower probabilities to problematic words and higher probabilities to non-problematic words. We train a logistic regression classifier on these probabilities to classify words. Our results show that, using probabilities from NorBERT1 and NorBERT2, a logistic regression classifier can accurately detect problematic words. We also show that NB-BERT performs worse than a baseline bigram model.
    Master's thesis in Information Science (INFO390)
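
    A minimal sketch of the two steps described above, under assumed components (the `ltg/norbert2` checkpoint id and the tiny stand-in training data are illustrative only): score each token by masking it and reading off the masked language model's probability, then fit a logistic regression classifier on those scores.

```python
# Step 1: pseudo-log-probabilities from a masked LM (one mask per token).
# Step 2: logistic regression over those scores to flag problematic words.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("ltg/norbert2")      # assumed id
mlm = AutoModelForMaskedLM.from_pretrained("ltg/norbert2")

def token_logprobs(text):
    """Mask each token in turn and return its log-probability."""
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    scores = []
    for i in range(1, len(ids) - 1):           # skip [CLS]/[SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        scores.append(torch.log_softmax(logits, -1)[ids[i]].item())
    return scores

# Stand-in feature/label pairs (1 = problematic word). In the thesis these
# would come from documents with synthetically inserted errors, scored
# with token_logprobs above.
X = [[-0.4], [-8.1], [-1.2], [-7.3], [-0.9]]
y = [0, 1, 0, 1, 0]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[-6.0]]))                   # low probability -> flagged
```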

    Proceedings of the 2nd IUI Workshop on Interacting with Smart Objects

    These are the Proceedings of the 2nd IUI Workshop on Interacting with Smart Objects. Objects that we use in our everyday lives are expanding beyond their once-restricted interaction capabilities and now provide functionality that goes far beyond their original purpose. They feature computing capabilities and are thus able to capture information, process and store it, and interact with their environments, turning them into smart objects.

    OCR Post-processing Using Large Language Models

    Optical Character Recognition (OCR) technology transforms images of text into an electronically readable, non-graphical format. This allows the content to be edited and otherwise manipulated by language-technology software such as machine translation, text comprehension, question-answering systems, and search engines. While OCR systems continually progress towards greater precision, several complications persist when dealing with low-resolution source images or those with multicolored backgrounds. Consequently, the text derived from OCR requires additional refinement to optimize its accuracy, which benefits various subsequent applications. It is recognized that the character accuracy of OCR-generated text may influence certain natural language processing tasks, including Information Retrieval, Named-Entity Recognition, and Sentiment Analysis. OCR post-processing techniques consist of three fundamental stages: identifying incorrect words, producing a list of candidate corrections, and selecting the correct word from the list to replace the erroneous one. In this work, we use large language models and word embeddings to detect recognition errors caused by the OCR software. In addition, we use the generative capabilities of these language models to suggest correction candidates that may fix the errors. Our work also includes the development of tools that can be used to further improve OCR post-processing technologies.
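
    The three stages lend themselves to a compact illustration. The sketch below is a toy version only: a fixed vocabulary and `difflib` string matching stand in for the large language models and word embeddings the abstract refers to, but it shows the detect / generate / select pipeline shape.

```python
# Toy three-stage OCR post-processing pipeline: detect suspect words,
# generate correction candidates, select a replacement.
import difflib

VOCAB = {"patient", "was", "admitted", "with", "acute", "symptoms"}

def detect(words):
    """Stage 1: flag words that fall outside the known vocabulary."""
    return [w for w in words if w.lower() not in VOCAB]

def candidates(word, n=3):
    """Stage 2: propose close vocabulary matches as corrections."""
    return difflib.get_close_matches(word.lower(), VOCAB, n=n, cutoff=0.6)

def select(word, cands):
    """Stage 3: pick the best candidate (here: the closest match)."""
    return cands[0] if cands else word

text = "Patient was admitte with acut symptoms"
for w in detect(text.split()):
    print(w, "->", select(w, candidates(w)))
```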