
    A modular methodology for converting large, complex books into usable, accessible and standards-compliant ebooks

    This report describes the methodology used to create ebooks for the Glasgow Digital Library (GDL) and provides detailed instructions on how the same methodology could be applied elsewhere. The document includes a description and explanation of the processes for ebook creation, followed by a tutorial.

    Extraction and parsing of herbarium specimen data: Exploring the use of the Dublin core application profile framework

    Herbaria around the world house millions of plant specimens; botanists and other researchers value these resources as ingredients in biodiversity research. Even when the specimen sheets are digitized and made available online, the critical information stored on each sheet is not in a usable (i.e., machine-processable) form. This paper describes a current research and development project that is designing and testing high-throughput workflows combining machine and human processes to extract and parse the specimen label data. The primary focus of the paper is the metadata needs of the workflow and the creation of structured metadata records describing the plant specimens. In the project, we are exploring the use of the new Dublin Core Metadata Initiative framework for application profiles. First articulated as the Singapore Framework for Dublin Core Application Profiles in 2007, this framework is still in the early stages of adoption. Its promise of maximum interoperability, of documenting metadata usage for maximum reusability, and of supporting metadata applications that conform to Web architectural principles provides the incentive to explore it and to contribute implementation experience.
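
    The abstract above centres on turning parsed label text into structured, interoperable metadata records. As a purely illustrative sketch, assuming invented field names, values and term mappings rather than the project's actual application profile, the snippet below shows one way a parsed herbarium label might be serialized against Dublin Core / Darwin Core-style terms.

```python
# Illustrative sketch only: a minimal mapping of parsed herbarium label fields
# onto Dublin Core / Darwin Core-style terms. The field names, values and term
# choices are assumptions for demonstration, not the project's application profile.

from xml.sax.saxutils import escape

# Hypothetical output of the label-parsing step for one specimen sheet.
parsed_label = {
    "scientificName": "Quercus alba L.",
    "recordedBy": "J. Smith",
    "eventDate": "1923-06-14",
    "locality": "Ravine near Bloomington, Indiana",
}

# Hypothetical mapping from parsed fields to namespaced metadata terms.
term_map = {
    "scientificName": "dwc:scientificName",
    "recordedBy": "dwc:recordedBy",
    "eventDate": "dc:date",
    "locality": "dct:spatial",
}

def to_simple_xml(label: dict) -> str:
    """Serialize a parsed label as a flat, XML-like metadata record."""
    lines = ["<record>"]
    for field, value in label.items():
        term = term_map[field]
        lines.append(f"  <{term}>{escape(value)}</{term}>")
    lines.append("</record>")
    return "\n".join(lines)

print(to_simple_xml(parsed_label))
```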

    OCRspell: An interactive spelling correction system for OCR errors in text

    In this thesis we describe a spelling correction system designed specifically for OCR (Optical Character Recognition)-generated text that selects candidate words using information gathered from multiple knowledge sources. This system for text correction is based on static and dynamic device mappings, approximate string matching, and n-gram analysis. Our statistically based, Bayesian system incorporates a learning feature that collects confusion information at the collection and document levels. An evaluation of the new system is also presented.
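
    As a rough illustration of the kind of Bayesian, confusion-driven candidate ranking described above, the sketch below scores lexicon words for an OCR token by combining a unigram prior with penalties derived from character confusion counts. The lexicon, confusion table and smoothing constants are invented for demonstration; this is not OCRspell's actual data, model or API.

```python
# Minimal sketch of noisy-channel candidate ranking for an OCR token.
# All data below is made up for illustration.

from difflib import SequenceMatcher

# Hypothetical unigram frequencies (language-model prior P(word)).
word_freq = {"the": 500, "tie": 20, "tho": 5, "she": 90}
total = sum(word_freq.values())

# Hypothetical character confusion counts gathered at the collection level,
# e.g. OCR frequently misreads 'e' as 'c' and 'h' as 'i'.
confusion = {("e", "c"): 25, ("h", "i"): 30}

def channel_score(candidate: str, observed: str) -> float:
    """Crude P(observed | candidate): reward known confusion pairs,
    penalize other substitutions, insertions and deletions."""
    score = 1.0
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, candidate, observed).get_opcodes():
        if tag == "replace":
            for a, b in zip(candidate[i1:i2], observed[j1:j2]):
                score *= (confusion.get((a, b), 1) + 1) / 100.0
        elif tag in ("insert", "delete"):
            score *= 0.01
    return score

def rank_candidates(observed: str):
    """Rank lexicon words by prior * channel probability."""
    scored = [(word_freq[w] / total * channel_score(w, observed), w) for w in word_freq]
    return sorted(scored, reverse=True)

print(rank_candidates("thc"))  # "the" ranks first for the OCR token "thc"
```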

    Transcription enhancement of a digitised multi-lingual pamphlet collection: a case study and guide for similar projects

    UCL Library Services holds an extensive collection of over 9,000 Jewish pamphlets, many of them extremely rare. Over the past five years, UCL has embarked on a project to widen access to this collection through an extensive programme of cataloguing, conservation and digitisation. With the cataloguing complete and the most fragile items conserved, the focus is now on making these texts available to global audiences via the UCL Digital Collections website. The pamphlets were ranked for rarity, significance and fragility, and the highest-scoring were selected for digitisation. Unique identifiers allocated at the point of cataloguing were used to track individual pamphlets through the stages of the project. This guide details the text-enhancement methods used, highlighting particular issues relating to Hebrew scripts and early-printed texts.

    Initial attempts to enable images of these pamphlets to be searched digitally relied on the Optical Character Recognition (OCR) embedded within the software used to create the PDF files. Whilst satisfactory for texts chiefly in Roman script, it provided no reliable means of searching the extensive corpus of texts in Hebrew. Generous advice offered by the National Library of Israel led to our adoption of ABBYY FineReader software as a means of enhancing the transcriptions embedded within the PDF files. Following image capture, JPEG files were used to create multi-page PDF files of each pamphlet. Pre-processing in ABBYY FineReader consisted of setting the language and colour mode; detecting page orientation; selecting and refining the areas of text to be read; and reading the text to produce a transcription. The resultant files were stored in folders according to the language of the text.

    The software highlighted spelling errors and doubtful readings, and a verification tool allowed transcribers to correct these as required. However, some erroneous or doubtful readings were themselves genuine words and so were not highlighted; it was therefore essential to proofread the text, particularly for early-printed scripts. Transcribers maintained logs of common errors; problems with Hebrew vocalisations, cursive and Gothic scripts were also noted. During initial quality checks of the transcriptions, many text searches failed because of previously unidentified spacings occurring within words, generally linked to the font size being too small. Maintaining logs of the font sizes used led to the adoption of a minimum of Arial 8 or Times New Roman 10 in transcribed text, and the methodology was revised to include a preliminary quality check of one page. We concluded that it was difficult to develop a standardised procedure applicable to all texts, given the variance in language, script and typography; however, the font Arial gave the most successful accuracy ratings for Hebrew script, with a minimum text size of 17 and a minimum title size of 25.

    ABBYY file preparation took a minimum of 1.5 hours per pamphlet; transcription correction took an average of 10.4 minutes per page; the final quality check took 30 minutes per pamphlet. In all, the work on each pamphlet took a minimum of 6 hours to complete. As a result of the project, average accuracy ratings improved from 60% to 89%, with the greatest improvement for pre-1800 and Hebrew-script publications. We are therefore inclined to focus future transcription-enhancement activity on these types of publication for the remainder of our Jewish Pamphlet Collections.
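
    For planning purposes, the per-pamphlet timings quoted above can be combined into a simple cost estimate. The sketch below uses the reported figures (about 1.5 hours of ABBYY preparation, 10.4 minutes of correction per page and 30 minutes of final quality checking per pamphlet); the 24-page example is hypothetical.

```python
# Simple arithmetic sketch based on the timings reported in the case study.
# The page count used in the example is an assumption.

PREP_HOURS = 1.5                 # ABBYY file preparation per pamphlet
CORRECTION_MIN_PER_PAGE = 10.4   # transcription correction per page
FINAL_QC_HOURS = 0.5             # final quality check per pamphlet

def estimated_hours(pages: int) -> float:
    """Estimate total transcription-enhancement time for one pamphlet."""
    return PREP_HOURS + pages * CORRECTION_MIN_PER_PAGE / 60 + FINAL_QC_HOURS

# A hypothetical 24-page pamphlet comes out at roughly 6.2 hours, consistent
# with the reported minimum of about 6 hours per pamphlet.
print(round(estimated_hours(24), 1))
```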

    Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

    The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts seeking to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as a first essential step. While the identification and categorization of segments of interest in document images have seen significant progress in recent years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other things, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of the multimodal models over a strong visual baseline, as well as better robustness to high material variance.
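
    As a hedged sketch of how visual and textual features can be fused for pixel-wise segmentation, the snippet below concatenates a visual feature map with a rasterized text-embedding map and applies a small convolutional head. The channel sizes, the toy head and the random tensors are assumptions for illustration; this is not the architecture evaluated in the paper.

```python
# Minimal late-fusion sketch for multimodal semantic segmentation (illustrative only).

import torch
import torch.nn as nn

class MultimodalSegHead(nn.Module):
    def __init__(self, visual_ch=64, text_ch=16, n_classes=5):
        super().__init__()
        # 1x1 convolutions over the concatenated visual + textual channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(visual_ch + text_ch, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, n_classes, kernel_size=1),
        )

    def forward(self, visual_feats, text_feats):
        # Both inputs: (batch, channels, H, W). In a real pipeline, text_feats
        # would be OCR token embeddings rasterized onto the page grid.
        fused = torch.cat([visual_feats, text_feats], dim=1)
        return self.fuse(fused)  # per-pixel class logits

# Toy usage with random tensors standing in for real feature maps.
head = MultimodalSegHead()
logits = head(torch.randn(1, 64, 128, 96), torch.randn(1, 16, 128, 96))
print(logits.shape)  # torch.Size([1, 5, 128, 96])
```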

    The Wiltshire Wills Feasibility Study

    The Wiltshire and Swindon Record Office has nearly ninety thousand wills in its care. These records are neither adequately catalogued nor secured against loss by facsimile microfilm copies. With support from the Heritage Lottery Fund, the Record Office has begun to produce suitable finding aids for the material. Beginning with this feasibility study, the Record Office is developing a strategy to ensure that facsimiles are created to protect the collection against the risk of loss or damage and to improve public access. This feasibility study explores the different methodologies that can be used to assist the preservation and conservation of the collection and to improve public access to it. The study aims to produce a strategy that will enable the Record Office to create digital facsimiles of the Wills in its care for access purposes and also to create preservation-quality microfilms. The strategy seeks the most cost-effective and time-efficient approach to the problem and identifies ways to optimise the processes by drawing on the experience of other similar projects. This report provides a set of guidelines and recommendations to ensure the best use of the resources available, to provide the most robust preservation strategy, and to ensure that future access to the Wills as an information resource can be flexible, both local and remote, and sustainable.

    Thesaurus-aided learning for rule-based categorization of OCR texts

    The question posed in this thesis is whether the effectiveness of the rule-based approach to automatic text categorization on OCR collections can be improved by using domain-specific thesauri. A rule-based categorizer was constructed, consisting of a C++ program called C-KANT which consults documents and creates a program that can be executed by the CLIPS expert system shell. A series of tests using domain-specific thesauri revealed that a query-expansion approach to rule-based automatic text categorization using domain-dependent thesauri does not improve the categorization of OCR texts. Although some improvement in categorization could be achieved using rules over a mixture of thesauri, the improvements were not significant.
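
    The sketch below illustrates the general idea of thesaurus-based query expansion for keyword rules, independent of the C-KANT/CLIPS implementation the thesis describes; the categories, trigger terms and thesaurus entries are invented for demonstration.

```python
# Illustrative sketch of thesaurus-driven expansion of keyword rules for
# text categorization. All rules and thesaurus entries below are made up.

# Hypothetical rule base: category -> trigger terms.
rules = {
    "agriculture": {"crop", "soil"},
    "energy": {"reactor", "turbine"},
}

# Hypothetical domain-specific thesaurus: term -> synonyms / related terms.
thesaurus = {
    "crop": {"harvest", "yield"},
    "reactor": {"fission", "core"},
}

def expand_rules(rules, thesaurus):
    """Query-expansion step: add thesaurus terms to each category's triggers."""
    return {
        cat: terms | set().union(*(thesaurus.get(t, set()) for t in terms))
        for cat, terms in rules.items()
    }

def categorize(text, expanded):
    """Assign every category whose expanded trigger set overlaps the text."""
    tokens = set(text.lower().split())
    return [cat for cat, terms in expanded.items() if tokens & terms]

expanded = expand_rules(rules, thesaurus)
print(categorize("Estimated grain harvest fell sharply this year", expanded))
# -> ['agriculture'] via the thesaurus term "harvest"
```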

    Mining Images in Biomedical Publications: Detection and Analysis of Gel Diagrams

    Authors of biomedical publications use gel images to report experimental results such as protein-protein interactions or protein expression under different conditions. Gel images offer a concise way to communicate such findings, not all of which need to be explicitly discussed in the article text. This fact, together with the abundance of gel images and their shared common patterns, makes them prime candidates for automated image mining and parsing. We introduce an approach for the detection of gel images and present a workflow to analyze them. We are able to detect gel segments and panels with high accuracy, and we present preliminary results for the identification of gene names in these images. While we cannot provide a complete solution at this point, we present evidence that this kind of image mining is feasible.
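
    As an illustration of the gene-name identification step mentioned above, the sketch below matches OCR tokens from a detected gel panel against a small lexicon of gene symbols after light normalization. The token list, lexicon and normalization rule are assumptions for demonstration, not the paper's pipeline.

```python
# Sketch of matching OCR tokens from a gel panel against a gene-symbol lexicon.
# The lexicon and tokens below are invented for illustration.

import re

# Hypothetical gene-symbol lexicon (in practice this would come from a
# resource such as an organism-specific gene catalogue).
gene_symbols = {"TP53", "GAPDH", "MYC", "ACTB"}

def normalize(token: str) -> str:
    """Strip common OCR artifacts (stray punctuation, case differences)."""
    return re.sub(r"[^A-Za-z0-9-]", "", token).upper()

def find_gene_labels(ocr_tokens):
    """Return tokens from a gel panel that look like known gene symbols."""
    hits = []
    for tok in ocr_tokens:
        norm = normalize(tok)
        if norm in gene_symbols:
            hits.append((tok, norm))
    return hits

# Tokens as they might come back from OCR over a panel's lane labels.
print(find_gene_labels(["tp53.", "loading", "GapdH", "kDa"]))
# -> [('tp53.', 'TP53'), ('GapdH', 'GAPDH')]
```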