
    Practical Digital Curation Skills for Archivists in the 21st Century

    The University of Maryland's Digital Curation Innovation Center (DCIC) in the College of Information Studies promotes research and education in digital curation and fosters interdisciplinary collaboration using Big Records and archival analytics through public, industry, and government partnerships. In this session, technologists, archivists, and students from the DCIC will present case studies involving archival records in a digital environment, focusing on redlining in urban neighborhoods, the Japanese American experience in World War II internment camps, and navigating Holocaust-era digital data.

    USA Olympic Medal Winners 2000 - 2012

    This data set contains the list of Olympic medal winners from 2000 to 2012 for the United States of America. It is a subset of the original data set downloaded from Tableau Software.

    Using Transfer Learning to Contextually Optimize Optical Character Recognition (OCR) Output and Perform New Feature Extraction on a Digitized Cultural and Historical Dataset

    Humans read handwritten and printed text with ease, but computers cannot yet match that level of accuracy. While there are many Optical Character Recognition (OCR) tools, such as PyTesseract and Abbyy FineReader, which extract text as digital characters from images of handwritten or printed material, none of them is free of unrecognizable characters or misspelled words. Spelling correction is a well-known task in Natural Language Processing. Spelling correction of an individual word can be performed with existing tools; however, correcting a word based on the context of the sentence is a challenging task that requires a human-level understanding of the language. In this paper, we introduce a novel experiment that applies Natural Language Processing, using a machine learning concept called Transfer Learning, to the text extracted by OCR tools, thereby optimizing the output by reducing the number of misspelled words. The experiment is conducted on the OCR output of a sample of newspaper images published from the late 18th to the 19th century. These images were obtained from the Maryland State Archives' digital archives project, the Legacy of Slavery. The approach uses pre-trained transformer language models such as BERT and RoBERTa as word-prediction models that correct spelling based on the context of the surrounding words in the OCR output. We compare the performance of BERT and RoBERTa on the outputs of two OCR tools, namely PyTesseract and Abbyy FineReader. A comparative evaluation shows that both models perform fairly well at correcting misspelled words, given the irregularities in the text produced by OCR. Additionally, a separate process is run on the Transfer Learning output text to create a new feature that did not exist in the original dataset, using spaCy's Entity Recognizer (ER). The newly extracted values are added to the dataset as a new feature, and an existing feature's values are compared against spaCy's ER output and the original hand-transcribed data.
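    The contextual correction step described above can be pictured with a minimal sketch: mask a suspected OCR error and let a pre-trained masked language model such as BERT propose replacements from the surrounding context. The snippet below assumes the Hugging Face transformers package and the bert-base-uncased checkpoint; the suggest_corrections helper, the candidate-selection heuristic, and the example sentence are illustrative assumptions, not the paper's exact pipeline.

    # Minimal sketch of context-aware spelling correction over OCR output,
    # assuming the Hugging Face transformers package; helper name and example
    # sentence are hypothetical, not taken from the paper.
    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    def suggest_corrections(sentence, suspect_word, top_k=5):
        # Replace the suspected OCR error with [MASK] and let BERT predict
        # replacement tokens from the surrounding context.
        masked = sentence.replace(suspect_word, tokenizer.mask_token, 1)
        inputs = tokenizer(masked, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        with torch.no_grad():
            logits = model(**inputs).logits
        top_ids = logits[0, mask_pos[0]].topk(top_k).indices
        return [tokenizer.decode([int(i)]).strip() for i in top_ids]

    # Example: an OCR misreading such as "Baltim0re" is masked and re-predicted.
    print(suggest_corrections("The ship arrived in the port of Baltim0re on Tuesday.", "Baltim0re"))

    The same idea extends to RoBERTa by swapping in its tokenizer and masked-language-model head, and the feature-extraction step described in the abstract would then pass the corrected text through spaCy's entity recognizer to populate the new feature column.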