Acquisition plan for Digital Document Storage (DDS) prototype system
NASA Headquarters maintains a continuing interest in and commitment to exploring the use of new technology to support productivity improvements in meeting service requirements tasked to the NASA Scientific and Technical Information (STI) Facility, and to support cost-effective approaches to the development and delivery of enhanced levels of service provided by the STI Facility. The DDS project has been pursued with this interest and commitment in mind. It is believed that DDS will provide improved archival blowback quality and service for ad hoc requests for paper copies of documents archived and serviced centrally at the STI Facility. It will also develop an operating capability to scan, digitize, store, and reproduce paper copies of the 5000 NASA technical reports archived annually at the STI Facility and serviced to the user community. Additionally, it will provide NASA Headquarters and field installations with on-demand, remote, electronic retrieval of digitized, bilevel, bit-mapped report images, along with branched, nonsequential retrieval of report subparts.
Wooster Magazine: Winter 2018
The winter 2018 issue of Wooster magazine examines the different types of data and tools used to answer research questions through the experiences of alumni who curate different types of collections. Alumni Kathy Bremar Hollis ’03, John Sime ’09, Erica Clites ’06, and Kristina (Brady) Shannon ’03 share their experiences as curators of scientific collections. Similarly, Wooster biology professors Rick Lehtinen and Jennifer Ison talk about the biological collections on campus. Also featured is alumna Connie Schulz ’64 and her use of digitization to improve our understanding of the past and its people through the digital publication of The Papers of Eliza Lucas Pinckney and Harriott Pinckney Horry with the University of Virginia Press. Wooster English Professor Jennifer Hayward explains how her students are using similar digital curation techniques to study British Chilean newspapers from the late 19th and early 20th centuries. Also featured is Professor Olivia Navarro-Farr as co-principal investigator for the U.S.-Guatemalan El Perú-Waka’ Archaeological Project.
Assessment of OCR Quality and Font Identification in Historical Documents
Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. To improve the OCR output, in this thesis we develop machine-learning methods to assess the quality of historical documents and label/tag documents (with the page problems) in the EEBO/ECCO collections—45 million pages available through the Early Modern OCR Project at Texas A&M University.
We present an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-based classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating local information about each BB: its spatial location, shape, and size. When evaluated on a dataset containing over 72,000 manually labeled BBs from 159 historical documents, the algorithm can classify BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and improve OCR transcriptions in 85% of the cases.
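The two-stage labelling scheme described above can be sketched in a few lines of Python. The geometric thresholds, the neighbourhood radius, and the voting rule below are illustrative stand-ins, not the thesis's actual learned rules or features:

```python
from dataclasses import dataclass
import math

@dataclass
class BB:
    """A bounding box produced by OCR, with a text/noise label."""
    x: float; y: float; w: float; h: float
    label: str = "text"

def rule_based_init(boxes, min_area=20.0, max_aspect=8.0):
    """Stage 1: initial text/noise labels from simple geometric rules."""
    for b in boxes:
        aspect = max(b.w / b.h, b.h / b.w)
        b.label = "noise" if (b.w * b.h < min_area or aspect > max_aspect) else "text"

def refine(boxes, radius=100.0, min_support=2, max_iters=10):
    """Stage 2: iteratively relabel each box using its spatial neighbourhood.
    A box is text if enough nearby text boxes have a similar height, so small
    boxes embedded in a text line are rescued while isolated specks stay noise."""
    for _ in range(max_iters):
        changed = False
        for b in boxes:
            support = 0
            for o in boxes:
                if o is b:
                    continue
                d = math.hypot((o.x + o.w / 2) - (b.x + b.w / 2),
                               (o.y + o.h / 2) - (b.y + b.h / 2))
                similar = max(o.h, b.h) <= 3.0 * min(o.h, b.h)
                if d <= radius and o.label == "text" and similar:
                    support += 1
            new = "text" if support >= min_support else "noise"
            if new != b.label:
                b.label, changed = new, True
        if not changed:
            break
    return [b.label for b in boxes]
```

For example, a punctuation-sized box sitting inside a line of word boxes is initially labelled noise by the area rule but flipped to text by the neighbourhood vote, while a distant speck remains noise.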
This thesis also aims at generating font metadata for historical documents. Knowledge of the font can help an OCR system produce very accurate text transcriptions, but obtaining font information for 45 million documents is a daunting task. We present an active-learning-based font identification system that can classify document images into fonts. In active learning, a learner queries the human for labels on the examples it finds most informative. We capture the characteristics of the fonts using word-image features related to character width, angled strokes, and Zernike moments. To extract page-level features, we use a bag-of-features (BoF) model. A font classification model trained using BoF and active learning requires only 443 labeled instances to achieve 89.3% test accuracy.
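The query loop at the heart of active learning can be sketched with a toy one-dimensional classifier. Everything here is illustrative: a single invented feature and a threshold model stand in for the thesis's BoF and Zernike-moment pipeline, and only the uncertainty-sampling strategy itself is the point:

```python
import statistics

def train(labeled):
    """Fit a 1-D threshold classifier: midpoint between the per-class means."""
    lo = statistics.mean(x for x, y in labeled if y == 0)
    hi = statistics.mean(x for x, y in labeled if y == 1)
    return (lo + hi) / 2

def active_learn(pool, oracle, labeled, budget):
    """Uncertainty sampling: with a fixed labelling budget, query the human
    oracle only for the pool example nearest the current decision boundary."""
    pool = list(pool)
    for _ in range(budget):
        t = train(labeled)
        q = min(pool, key=lambda x: abs(x - t))  # most uncertain example
        pool.remove(q)
        labeled.append((q, oracle(q)))
    return train(labeled)
```

With a hypothetical feature where fonts of class 1 score at or above 5, three queries spent near the boundary refine the threshold far more than three random labels would, which is why the thesis's classifier reaches high accuracy from only 443 labels.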
Semantics-enriched workflow creation and management system with an application to document image analysis and recognition
Scientific workflow systems are an established means to model and execute experiments or processing pipelines. Nevertheless, designing workflows can be a daunting task for users due to the complexities of the systems and the sheer number of available processing nodes, each having different compatibility/applicability characteristics. This thesis explores how concepts of the Semantic Web can be used to augment workflow systems in order to assist researchers as well as non-expert users in creating valid and effective workflows. A prototype workflow creation/management system has been developed, including components for ontology modelling, workflow composition, and workflow repositories. Semantics are incorporated as a lightweight layer, permeating all aspects of the system and workflows, including retrieval, composition, and validation. Document image analysis and recognition is used as a representative application domain to evaluate the validity of the system. A new semantic model is proposed, covering a wide range of aspects of the target domain and adjacent fields. Real-world use cases demonstrate the assistive features and the automated workflow creation. On that basis, the prototype workflow creation/management system is compared to other state-of-the-art workflow systems, and it is shown how those could benefit from the semantic model. The thesis concludes with a discussion of how a complete infrastructure based on semantics-enriched datasets, workflow systems, and sharing platforms could represent the next step in automation within document image analysis and other domains.
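The core validation idea, checking that each node's output is acceptable to the next node's input via a type hierarchy, can be sketched without any Semantic Web machinery. The node names and the mini hierarchy below are hypothetical; the thesis uses full ontologies, whereas this reduces the check to a subtype walk:

```python
# Hypothetical mini type hierarchy for a document-image pipeline
# (child -> parent); the thesis models this as an OWL ontology.
HIERARCHY = {
    "BinarizedImage": "Image",
    "GreyscaleImage": "Image",
    "Image": "Data",
    "TextRegions": "Data",
}

def is_subtype(t, expected):
    """Walk up the hierarchy: t is acceptable wherever expected is required."""
    while t is not None:
        if t == expected:
            return True
        t = HIERARCHY.get(t)
    return False

def validate(workflow):
    """Check that each node's declared input accepts the previous node's output."""
    errors = []
    for prev, node in zip(workflow, workflow[1:]):
        if not is_subtype(prev["out"], node["in"]):
            errors.append(f'{prev["name"]} -> {node["name"]}: '
                          f'{prev["out"]} is not a {node["in"]}')
    return errors
```

A workflow system with such a check can reject invalid compositions at design time and, conversely, suggest only compatible nodes, which is the assistive behaviour the prototype demonstrates.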
Scene text localization and recognition in images and videos
Scene text localization and recognition methods find all areas in an image or a video that would be considered text by a human, mark the boundaries of those areas, and output a sequence of characters associated with their content. They are used to process images and videos taken by a digital camera or a mobile phone and to "read" the content of each text area into a digital format, typically a list of Unicode character sequences, that can be processed in further applications.

Three different methods for scene text localization and recognition were proposed in the course of the research, each one advancing the state of the art and improving the accuracy. The first method detects individual characters as Extremal Regions (ERs), where the probability of each ER being a character is estimated using novel features with O(1) complexity, and only ERs with locally maximal probability are selected across several image projections for the second stage, where the classification is improved using more computationally expensive features. This was the first published method to address the complete problem of scene text localization and recognition as a whole; all previous work in the literature focused solely on different subproblems.

Secondly, a novel, easy-to-implement stroke detector was proposed. The detector is significantly faster and produces significantly fewer false detections than the commonly used ER detector. It efficiently produces character stroke segmentations, which are exploited in a subsequent classification phase based on features effectively calculated as part of the segmentation process. Additionally, an efficient text clustering algorithm based on text direction voting is proposed, which, like the previous stages, is scale- and rotation-invariant and supports a wide variety of scripts and fonts.

The third method exploits a deep-learning model trained for both text detection and recognition in a single trainable pipeline. The method localizes and recognizes text in an image in a single feed-forward pass; it is trained purely on synthetic data, so it does not require expensive human annotations, and it achieves state-of-the-art accuracy in end-to-end text recognition on two standard datasets while being an order of magnitude faster than previous methods: the whole pipeline runs at 10 frames per second.
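The extremal-region idea behind the first method can be illustrated with a minimal sketch: threshold the image at several levels, extract connected components (the extremal regions), and score each with cheap geometry. This is only an illustration; the actual method updates its features incrementally in O(1) per pixel as the threshold sweeps and uses a trained classifier, whereas the fill-ratio score below is an invented stand-in:

```python
from collections import deque

def components(img, t):
    """Connected components of pixels darker than threshold t (4-connectivity)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] <= t and not seen[sy][sx]:
                q, pix = deque([(sy, sx)]), []
                seen[sy][sx] = True
                while q:
                    y, x = q.popleft()
                    pix.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                                and img[ny][nx] <= t:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                comps.append(pix)
    return comps

def char_score(pix):
    """Toy 'character probability': bounding-box fill ratio, penalising
    extreme aspect ratios (a stand-in for the learned classifier)."""
    ys = [y for y, _ in pix]
    xs = [x for _, x in pix]
    h = max(ys) - min(ys) + 1
    w = max(xs) - min(xs) + 1
    return (len(pix) / (h * w)) / (max(h, w) / min(h, w))

def detect(img, thresholds=(80, 120, 160)):
    """Enumerate extremal regions over several thresholds, best-scoring first."""
    scored = [(char_score(c), t, c)
              for t in thresholds for c in components(img, t)]
    scored.sort(key=lambda r: -r[0])
    return scored
```

On a toy image containing a compact dark blob and a thin dark bar on a light background, the blob scores highest, since its bounding box is full and square, mirroring how the real classifier prefers character-like regions.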
Electronic Libraries Programme: Synthesis of 1997 Project Annual Reports
Belgium Herbarium image of Meise Botanic Garden
Improving Digital Library Support for Historic Newspaper Collections
DVD-ROM Appendix available with the print copy of this thesis.
National and international initiatives are underway around the globe to digitise vast treasure troves of historical artefacts and make them available as digital libraries (DLs). The developed DLs are often constructed from facsimile pages with pre-existing metadata, such as historic newspapers stored on microfiche or generated from the non-destructive scanning of precious manuscripts. Access to the source documents is therefore limited to methods built on that metadata. Other projects introduce full-text indexing through the application of off-the-shelf commercial Optical Character Recognition (OCR) software. While this has greater potential for the end-user experience than the metadata-only versions, the approach currently taken is best-effort in the time available rather than a process informed by detailed analysis of the issues. In this thesis, we investigate whether a richer level of support and service can be achieved by more closely integrating image processing techniques with DL software.
The thesis presents a variety of experiments implemented within the recently published open-source OCR system Ocropus. In particular, existing segmentation algorithms are compared against our own, based on the Hough Transform, using a corpus we assembled from several major online digital historic newspaper archives.
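The Hough Transform underlying that segmentation work can be illustrated with a minimal accumulator. This is a generic sketch of the transform itself, not the thesis's segmentation algorithm, and the example points are invented: each foreground pixel votes for every line (rho, theta) that could pass through it, and collinear pixels, such as a column rule or separator line on a newspaper page, pile their votes into a single bin:

```python
import math
from collections import Counter

def hough_lines(points, theta_steps=180):
    """Vote in (rho, theta) space for every (y, x) foreground pixel.
    rho = x*cos(theta) + y*sin(theta); peaks in the accumulator
    correspond to straight lines through many of the points."""
    acc = Counter()
    for y, x in points:
        for i in range(theta_steps):
            theta = math.pi * i / theta_steps
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, i)] += 1
    return acc
```

For a horizontal rule along row y = 5, every pixel votes for the bin rho = 5 at theta = 90 degrees, so that bin collects one vote per pixel, which is the peak a segmenter would threshold on.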