118 research outputs found

    Improving the quality of the text, a pilot project to assess and correct the OCR in a multilingual environment

    The user expectation of a digitized collection is that a full-text search can be performed and that it will retrieve all the relevant results. In reality, however, the errors introduced during Optical Character Recognition (OCR) degrade the results significantly, and users do not get what they expect. The National Library of Luxembourg started its digitization program in 2000 and began performing OCR on the scanned images in 2005. The OCR was always performed by the scanning suppliers, so over the years quite a number of different OCR programs, in different versions, have been used. The manual parts of the digitization chain (handling, scanning, zoning, …) are difficult, costly and largely irreducible, so the library felt that the supplier should concentrate on a high quality level for these parts. OCR, by contrast, is an automated process, and since OCR software improves over the years, the library believed that the recognized text could later be improved automatically. This is why the library has never asked the supplier for a minimum recognition rate. The author proposes to test this assumption by first evaluating the base quality of the text extracted by the original supplier, then running a contemporary OCR program, and finally comparing its quality to the first extraction. The corpus used is the collection of digitized newspapers from Luxembourg, published from the 18th to the 20th century. A complicating element is that the corpus spans three main languages, German, French and Luxembourgish, which often appear together on a single newspaper page. A preliminary step is therefore added to detect the language used in each block of text so that the correct dictionaries and OCR engines can be used.
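    A minimal sketch of the kind of quality comparison described above, assuming a small hand-corrected ground-truth transcription is available for each sampled text block: the character error rate (CER) of both the original supplier OCR and the re-run contemporary OCR is computed against the ground truth using Levenshtein edit distance. The file names and the pure-Python distance function are illustrative assumptions, not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def cer(ocr_text: str, ground_truth: str) -> float:
    """Character error rate: edit distance normalised by ground-truth length."""
    return levenshtein(ocr_text, ground_truth) / max(len(ground_truth), 1)

# Hypothetical file layout: one text file per newspaper block.
supplier = open("block_supplier_ocr.txt", encoding="utf-8").read()
reocr = open("block_contemporary_ocr.txt", encoding="utf-8").read()
truth = open("block_ground_truth.txt", encoding="utf-8").read()

print(f"supplier OCR CER:     {cer(supplier, truth):.2%}")
print(f"contemporary OCR CER: {cer(reocr, truth):.2%}")
```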

    Cultural Heritage on line

    The 2nd International Conference "Cultural Heritage online – Empowering users: an active role for user communities" was held in Florence on 15-16 December 2009. It was organised by the Fondazione Rinascimento Digitale, the Italian Ministry for Cultural Heritage and Activities and the Library of Congress, through the National Digital Information Infrastructure and Preservation Program (NDIIPP) partners. The conference topics related to digital libraries, digital preservation and the changing paradigms, focussing on user needs and expectations and analysing how to involve users and the cultural heritage community in creating and sharing digital resources. The sessions also investigated new organisational issues and roles, as well as cultural and economic limits, from an international perspective.

    CronoClock: A Multiagent Mobile Model to Assist Drivers in Park Zones

    We present a multiagent-based model to assist users of park zones. A multiagent system is built so that the different roles and devices belonging to the system can communicate. The result is a model that assists the different users with their payments and identifies the vehicle plate through image recognition and NFC (Near Field Communication) technology. An evaluation of the system is provided to assess the efficiency of the model.

    Digital Histories

    Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods to digitised source materials, or using internet resources and digital tools. Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or in collaboration with, for example, information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day and cover various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining. The volume argues that digital history is entering a mature phase, digital history ‘in action’, where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of digital tools and the necessity of new forms of digital source criticism. Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution that takes stock of how digital research currently advances historical scholarship.

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews conducted with selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of various processes including XML processing in the case of an existing OCR engine, bounding-box pre-processing, text clean-up, OCR error correction, spell checking, type checking, pattern-based matching and, finally, a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is later used to extract the relevant data. While this methodology is robust, the companies surveyed were not satisfied with its accuracy and sought new, optimised solutions. To confirm the results, the engines were used to return XML-based files with the identified text and metadata. The output XML data was then fed into the new system for information extraction. This system uses the existing OCR engine together with a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing its clients' procurement costs. This data was fed into the system to obtain a deeper level of spend classification and categorisation. This helped the company reduce its reliance on human effort and allowed for greater efficiency in comparison with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimising SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
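    As a rough illustration of the pattern-based matching and type-checking stages mentioned above, the sketch below matches OCR text against per-field regular expressions and returns whichever fields it finds in key-value format. The field names, patterns and date formats are assumptions made for the example, not those used in the thesis.

```python
import re
from datetime import datetime

# Illustrative patterns only; a production system would maintain these per
# supplier/template and learn new ones over time.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"invoice\s*(?:no\.?|number)[:\s]*([A-Z0-9-]+)", re.I),
    "invoice_date":   re.compile(r"date[:\s]*(\d{2}[./-]\d{2}[./-]\d{4})", re.I),
    "total_amount":   re.compile(r"total[:\s]*([\d.,]+)", re.I),
}

def type_check(field: str, raw: str):
    """Very small type check: dates must parse, amounts must be numeric."""
    if field == "invoice_date":
        for fmt in ("%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y"):
            try:
                return datetime.strptime(raw, fmt).date().isoformat()
            except ValueError:
                continue
        return None
    if field == "total_amount":
        try:
            return float(raw.replace(",", "").rstrip("."))
        except ValueError:
            return None
    return raw.strip()

def extract_fields(ocr_text: str) -> dict:
    """Return whichever fields match as key-value pairs; misses are simply absent."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            value = type_check(field, match.group(1))
            if value is not None:
                result[field] = value
    return result

sample = "INVOICE NO: INV-2041\nDate: 03/11/2021\nTotal: 1,234.50"
print(extract_fields(sample))
# {'invoice_number': 'INV-2041', 'invoice_date': '2021-11-03', 'total_amount': 1234.5}
```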

    Corpus linguistics for History: the methodology of investigating place-name discourses in digitised nineteenth-century newspapers

    The increasing availability of historical sources in digital form has led to calls for new forms of reading in history. This thesis responds to these calls by exploring the potential of approaches from the field of corpus linguistics to be useful to historical research. Specifically, two sets of methodological issues are considered that arise when corpus linguistic methods are used on digitised historical sources. The first set of issues surrounds optical character recognition (OCR), the computerised transcription of text from image reproductions of the original printed source. This process is error-prone, which leads to potentially unreliable word-counts. I find that OCR errors are very varied, and more different from their corrections than natural spelling variation is from a standard form. As a result of OCR errors, the test OCR corpus examined has a slightly inflated overall token count (compared to a hand-corrected gold standard) and a vastly inflated type count. Not all spurious types are infrequent: around 7% of types occurring at least 10 times in my test OCR corpus are spurious. I also find evidence that real-word errors occur. Assessing the impact of OCR errors on two common collocation statistics, Mutual Information (MI) and Log-Likelihood (LL), I find that both are affected by OCR errors. This analysis also provides evidence that OCR errors are not homogeneously distributed throughout the corpus. Nevertheless, for small collocation spans, MI rankings are broadly reliable in OCR data, especially when used in combination with an LL threshold. Large spans are best avoided, as both statistics become increasingly unreliable in OCR data as the span grows. Both statistics attract non-negligible rates of false positives. Using a frequency floor will eliminate many OCR errors, but does not reduce the rates of MI and LL false positives. Assessing the potential of two post-OCR correction methods, I find that VARD, a program designed to standardise natural spelling variation, proves unpromising for dealing with OCR errors. By contrast, Overproof, a commercial system designed for OCR errors, is effective, and its application leads to substantial improvements in the reliability of MI and LL, particularly for large spans. The second set of issues relates to the effectiveness of approaches to analysing the discourses surrounding place-names in digitised nineteenth-century newspapers. I single out three approaches to identifying place-names mentioned in large amounts of text without the need for a geo-parser system. The first relies on USAS, a semantic tagger, which has a 'Z2' tag for geographic names. This approach cannot identify multi-word place-names, but it is scalable. A difficulty is that frequency counts of place-names do not account for their possible polysemy; I suggest a procedure involving reading a random sample of concordance lines for each place-name in order to estimate the actual number of mentions of that place-name in reference to a specific place. This method is best used to identify the most frequent place-names. A second, related, approach is to automatically compare a list of words tagged 'Z2' with a gazetteer, a reference list of place-names. This method, however, suffers from the same difficulties as the previous one, and is best used when accurate frequency counts are not required. A third approach involves starting from a principled, text-external list of place-names, such as a population table, and then attempting to locate each place in the set of texts. The scalability of this method depends on the length of the list of place-names, but it can accommodate any quantity of text. Its advantage over the other two methods is that it helps to contextualise the findings and can identify place-names that are not mentioned in the texts. Finally, I consider two approaches to investigating the discourses surrounding place-names in large quantities of text. Both are scalable operationalisations of proximity-based collocation. The first approach starts with the whole corpus, searching for the place-name of interest and generating a list of statistical collocates of that place-name; these collocates can then be further categorised and analysed via concordance analysis. The second approach starts with small samples of concordance lines for the place-name of interest and involves analysing these lines to develop a framework for describing the phraseologies within which place-names are mentioned. Both methods are useful and scalable; the findings they yield are to some extent overlapping, but also complementary. This suggests that the two methods may be fruitfully used together, although neither is ideally suited to comparing results across corpora. Both approaches are well-suited to exploratory research.
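    For readers unfamiliar with the two collocation statistics discussed above, the sketch below shows the standard way Mutual Information and Dunning's Log-Likelihood are computed for a node/collocate pair from a 2×2 contingency table. The counts and the frequency-floor remark are illustrative assumptions, not figures from the thesis.

```python
import math

def collocation_stats(o11: int, node_freq: int, coll_freq: int, corpus_size: int):
    """
    Mutual Information and Log-Likelihood for a node/collocate pair.

    o11         observed co-occurrences of node and collocate within the span
    node_freq   total frequency of the node word in the corpus
    coll_freq   total frequency of the collocate in the corpus
    corpus_size total number of tokens
    """
    # Remaining cells of the 2x2 contingency table.
    o12 = node_freq - o11                              # node without collocate
    o21 = coll_freq - o11                              # collocate without node
    o22 = corpus_size - node_freq - coll_freq + o11    # neither

    # MI: log2 of observed over expected co-occurrence under independence.
    e11 = node_freq * coll_freq / corpus_size
    mi = math.log2(o11 / e11) if o11 > 0 else float("-inf")

    # Dunning's log-likelihood over the full table.
    ll = 0.0
    for o, row, col in ((o11, node_freq, coll_freq),
                        (o12, node_freq, corpus_size - coll_freq),
                        (o21, corpus_size - node_freq, coll_freq),
                        (o22, corpus_size - node_freq, corpus_size - coll_freq)):
        e = row * col / corpus_size
        if o > 0:
            ll += o * math.log(o / e)
    return mi, 2 * ll

# Hypothetical counts; a frequency floor (e.g. o11 >= 5) would be applied first.
mi, ll = collocation_stats(o11=24, node_freq=800, coll_freq=1500, corpus_size=2_000_000)
print(f"MI = {mi:.2f}, LL = {ll:.2f}")
```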