12 research outputs found

    Acquisition plan for Digital Document Storage (DDS) prototype system

    NASA Headquarters maintains a continuing interest in, and commitment to, exploring the use of new technology to support productivity improvements in meeting the service requirements tasked to the NASA Scientific and Technical Information (STI) Facility, and to support cost-effective approaches to developing and delivering enhanced levels of service at the STI Facility. The DDS project has been pursued with this interest and commitment in mind. It is believed that DDS will provide improved archival blowback quality and service for ad hoc requests for paper copies of documents archived and serviced centrally at the STI Facility. It will also develop an operating capability to scan, digitize, store, and reproduce paper copies of the 5000 NASA technical reports archived annually at the STI Facility and serviced to the user community. Additionally, it will provide NASA Headquarters and field installations with on-demand, remote, electronic retrieval of digitized, bilevel, bit-mapped report images, along with branched, nonsequential retrieval of report subparts.
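
    As a rough illustration of the bilevel, bit-mapped storage the plan describes, the sketch below (Python with Pillow; the file names and the choice of Group 4 compression are assumptions, not from the plan) converts a scanned page to a 1-bit image and saves it in a common archival format for bilevel scans.

        from PIL import Image

        # Hypothetical scanned report page; not part of the DDS plan itself.
        page = Image.open("report_page.png")

        # Threshold to a bilevel (1-bit) bit-mapped image.
        bilevel = page.convert("1")

        # CCITT Group 4 is a common lossless compression for bilevel archival
        # scans (requires a Pillow build with libtiff support).
        bilevel.save("report_page.tif", compression="group4")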

    Wooster Magazine: Winter 2018

    The winter 2018 issue of Wooster magazine examines the different types of data and tools used to answer research questions through the experiences of alumni who curate a variety of collections. Alumni Kathy Bremar Hollis ’03, John Sime ’09, Erica Clites ’06, and Kristina (Brady) Shannon ’03 share their experiences as curators of scientific collections. Similarly, Wooster biology professors Rick Lehtinen and Jennifer Ison talk about the biological collections on campus. Also featured is alumna Connie Schulz ’64 and her use of digitization to improve our understanding of the past and its people through the digital publication of The Papers of Eliza Lucas Pinckney and Harriott Pinckney Horry with the University of Virginia Press. Wooster English Professor Jennifer Hayward explains how her students are using similar digital curation techniques to study British Chilean newspapers from the late 19th and early 20th centuries. Professor Olivia Navarro-Farr is also featured as co-principal investigator for the U.S.-Guatemalan El Perú-Waka’ Archaeological Project.

    Assessment of OCR Quality and Font Identification in Historical Documents

    Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, and warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. To improve the OCR output, in this thesis we develop machine-learning methods to assess the quality of historical documents and to tag documents with page-level problems in the EEBO/ECCO collections: 45 million pages available through the Early Modern OCR Project at Texas A&M University. We present an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-based classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating local information about each BB: its spatial location, shape, and size. When evaluated on a dataset containing over 72,000 manually labeled BBs from 159 historical documents, the algorithm classifies BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions in 85% of the cases.

    This thesis also aims at generating font metadata for historical documents. Knowledge of the font can help an OCR system produce more accurate text transcriptions, but obtaining font information for 45 million documents is a daunting task. We present an active-learning-based font identification system that can classify document images into fonts. In active learning, a learner queries the human for labels on the examples it finds most informative. We capture the characteristics of the fonts using word-image features related to character width, angled strokes, and Zernike moments. To extract page-level features, we use a bag-of-features (BoF) model. A font classification model trained using BoF and active learning requires only 443 labeled instances to achieve 89.3% test accuracy.
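
    A minimal sketch of the two-stage bounding-box labelling idea described above, not the thesis code: a rule-based pass assigns initial text/noise labels from box geometry, and an iterative pass then relabels each box by majority vote over its spatial neighbours until the labels stabilise. All thresholds and features here are illustrative assumptions.

        from dataclasses import dataclass

        @dataclass
        class BBox:
            x: int
            y: int
            w: int
            h: int
            label: str = "noise"

        def rule_based_init(boxes, min_w=5, max_h=60, max_aspect=15.0):
            # Stage 1: crude geometry rules generate initial text/noise labels.
            for b in boxes:
                aspect = b.w / max(b.h, 1)
                ok = b.w >= min_w and b.h <= max_h and aspect <= max_aspect
                b.label = "text" if ok else "noise"

        def neighbours(b, boxes, radius=60):
            # Boxes whose centres fall within `radius` pixels of b's centre.
            cx, cy = b.x + b.w / 2, b.y + b.h / 2
            return [o for o in boxes if o is not b
                    and abs(o.x + o.w / 2 - cx) < radius
                    and abs(o.y + o.h / 2 - cy) < radius]

        def iterative_refine(boxes, rounds=5):
            # Stage 2: relabel each box by majority vote of its neighbours,
            # iterating until labels stabilise (or a round limit is hit).
            for _ in range(rounds):
                changed = False
                for b in boxes:
                    near = neighbours(b, boxes)
                    if not near:
                        continue
                    votes = sum(o.label == "text" for o in near)
                    new = "text" if votes > len(near) / 2 else "noise"
                    if new != b.label:
                        b.label, changed = new, True
                if not changed:
                    break

        boxes = [BBox(10, 10, 40, 12), BBox(55, 11, 38, 12), BBox(300, 400, 3, 90)]
        rule_based_init(boxes)
        iterative_refine(boxes)
        print([b.label for b in boxes])  # e.g. ['text', 'text', 'noise']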

    Semantics-enriched workflow creation and management system with an application to document image analysis and recognition

    Scientific workflow systems are an established means to model and execute experiments or processing pipelines. Nevertheless, designing workflows can be a daunting task for users due to the complexity of the systems and the sheer number of available processing nodes, each with different compatibility and applicability characteristics. This thesis explores how concepts of the Semantic Web can be used to augment workflow systems in order to assist researchers as well as non-expert users in creating valid and effective workflows. A prototype workflow creation/management system has been developed, including components for ontology modelling, workflow composition, and workflow repositories. Semantics are incorporated as a lightweight layer permeating all aspects of the system and its workflows, including retrieval, composition, and validation. Document image analysis and recognition is used as a representative application domain to evaluate the validity of the system. A new semantic model is proposed, covering a wide range of aspects of the target domain and adjacent fields. Real-world use cases demonstrate the assistive features and automated workflow creation. On that basis, the prototype workflow creation/management system is compared to other state-of-the-art workflow systems, and it is shown how those could benefit from the semantic model. The thesis concludes with a discussion of how a complete infrastructure based on semantics-enriched datasets, workflow systems, and sharing platforms could represent the next step in automation within document image analysis and other domains.
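
    As a toy illustration of the kind of assistance such a semantic layer can give (plain Python; the class and node names are invented for illustration, not the thesis ontology), the sketch below encodes processing nodes with typed inputs/outputs in a small hand-rolled subclass hierarchy and rejects workflows whose consecutive steps are type-incompatible.

        # Tiny hand-rolled "ontology": each class maps to its direct superclass.
        SUBCLASS_OF = {
            "BinaryImage": "Image",
            "GreyscaleImage": "Image",
            "Image": "Data",
            "TextRegion": "Data",
        }

        def is_a(cls, ancestor):
            # Walk the subclass chain upwards.
            while cls is not None:
                if cls == ancestor:
                    return True
                cls = SUBCLASS_OF.get(cls)
            return False

        # Hypothetical document-analysis nodes with typed ports.
        NODES = {
            "Binarizer":      {"consumes": "GreyscaleImage", "produces": "BinaryImage"},
            "LayoutAnalysis": {"consumes": "BinaryImage",    "produces": "TextRegion"},
            "OCR":            {"consumes": "TextRegion",     "produces": "Data"},
        }

        def validate(pipeline):
            # A workflow is valid if each node's output type is (a subclass of)
            # the next node's expected input type.
            for a, b in zip(pipeline, pipeline[1:]):
                out, needed = NODES[a]["produces"], NODES[b]["consumes"]
                if not is_a(out, needed):
                    raise TypeError(f"{a} -> {b}: {out} is not a {needed}")
            return True

        print(validate(["Binarizer", "LayoutAnalysis", "OCR"]))  # True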

    Scene text localization and recognition in images and videos

    Scene text localization and recognition methods find all areas in an image or a video that a human would consider text, mark the boundaries of those areas, and output a sequence of characters associated with their content. They are used to process images and videos taken by a digital camera or a mobile phone and to "read" the content of each text area into a digital format, typically a list of Unicode character sequences, that can be processed in further applications. Three different methods for scene text localization and recognition were proposed in the course of the research, each advancing the state of the art and improving accuracy. The first method detects individual characters as Extremal Regions (ERs), where the probability of each ER being a character is estimated using novel features with O(1) complexity, and only ERs with locally maximal probability are selected across several image projections for the second stage, where the classification is improved using more computationally expensive features. The method was the first published method to address the complete problem of scene text localization and recognition as a whole; all previous work in the literature focused solely on different subproblems. Secondly, a novel easy-to-implement stroke detector was proposed. The detector is significantly faster and produces significantly fewer false detections than the commonly used ER detector. It efficiently produces character stroke segmentations, which are exploited in a subsequent classification phase based on features effectively calculated as part of the segmentation process. Additionally, an efficient text clustering algorithm based on text direction voting is proposed, which, like the previous stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts. The third method exploits a deep-learning model trained for both text detection and recognition in a single trainable pipeline. The method localizes and recognizes text in an image in a single feed-forward pass, is trained purely on synthetic data so it does not require expensive human annotations, and achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets, whilst being an order of magnitude faster than previous methods: the whole pipeline runs at 10 frames per second.
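
    The first method classifies Extremal Regions as characters; as a hedged illustration of the general idea (not the authors' detector or features), the sketch below uses OpenCV's closely related MSER detector to produce character-level candidate regions and applies a crude geometric filter. The input file name and all thresholds are assumptions.

        import cv2

        img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input

        # MSER yields stable extremal regions, a common source of character
        # candidates in scene-text pipelines.
        mser = cv2.MSER_create()
        regions, _ = mser.detectRegions(img)

        candidates = []
        for pts in regions:
            x, y, w, h = cv2.boundingRect(pts)
            aspect = w / float(h)
            # Keep roughly character-sized, character-shaped regions.
            if 8 <= h <= 100 and 0.1 <= aspect <= 2.5:
                candidates.append((x, y, w, h))

        print(f"{len(candidates)} character candidates")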

    Electronic Libraries Programme: Synthesis of 1997 Project Annual Reports

    Belgium Herbarium image of Meise Botanic Garden

    Improving Digital Library Support for Historic Newspaper Collections

    DVD-ROM appendix available with the print copy of this thesis. National and international initiatives are underway around the globe to digitise the vast treasure troves of historical artefacts held in libraries and archives and to make them available as digital libraries (DLs). The developed DLs are often constructed from facsimile pages with pre-existing metadata, such as historic newspapers stored on microfiche, or generated from the non-destructive scanning of precious manuscripts. Access to the source documents is therefore limited to methods constructed from the metadata. Other projects look to introduce full-text indexing through the application of off-the-shelf commercial optical character recognition (OCR) software. While this has greater potential for the end-user experience than the metadata-only versions, the approach currently taken is best effort in the time available rather than a process informed by detailed analysis of the issues. In this thesis, we investigate whether a richer level of support and service can be achieved by more closely integrating image processing techniques with DL software. The thesis presents a variety of experiments implemented within the recently published open-source OCR system Ocropus. In particular, existing segmentation algorithms are compared against our own, based on the Hough transform, using a corpus we assembled from several major online digital historic newspaper archives.
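
    As an illustrative sketch of the Hough-transform idea mentioned above (not the thesis implementation; the file name and all parameters are assumptions), the following detects the long near-vertical rules that typically separate newspaper columns.

        import cv2
        import numpy as np

        page = cv2.imread("newspaper_page.png", cv2.IMREAD_GRAYSCALE)
        edges = cv2.Canny(page, 50, 150)

        # Probabilistic Hough transform: look for long straight line segments.
        lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=200,
                                minLineLength=page.shape[0] // 2, maxLineGap=10)

        column_rules = []
        for line in (lines if lines is not None else []):
            x1, y1, x2, y2 = line[0]
            # Near-vertical segments are candidate column separators.
            if abs(x1 - x2) < 5:
                column_rules.append((x1, y1, x2, y2))

        print(f"{len(column_rules)} candidate column separators")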