3 research outputs found

    Toward a service-based workflow for automated information extraction from herbarium specimens

    Over the past years, herbarium collections worldwide have started to digitize millions of specimens on an industrial scale. Although imaging costs are steadily falling, capturing the accompanying label information is still predominantly done manually and has become the principal cost factor. To streamline the capture of herbarium specimen metadata, we specified a formal, extensible workflow integrating a wide range of automated specimen image analysis services. We implemented the workflow on the basis of OpenRefine together with a plugin for handling service calls and responses. The evolving system presently covers the generation of optical character recognition (OCR) text from specimen images, the identification of regions of interest in images, and the extraction of meaningful information items from the OCR text. These implementations were developed as part of the Deutsche Forschungsgemeinschaft-funded project "A standardised and optimised process for data acquisition from digital images of herbarium specimens" (StanDAP-Herb).
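    The last step named in the abstract, extracting information items from OCR text, can be illustrated with a minimal sketch. It is not the StanDAP-Herb implementation: the function name `extract_label_fields`, the two regular expressions, and the sample label text are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; real labels vary far more than this.
# A date such as "12 May 1923": day, capitalised month word, four-digit year.
DATE_RE = re.compile(r"\b\d{1,2}\s+[A-Z][a-z]{2,8}\.?\s+\d{4}\b")
# An altitude such as "Alt. 850 m": the number is captured in group 1.
ALTITUDE_RE = re.compile(r"\bAlt\.?\s*(\d+)\s*m\b")


def extract_label_fields(ocr_text: str) -> dict:
    """Return the first date and altitude found in OCR text, or None for each."""
    date_match = DATE_RE.search(ocr_text)
    altitude_match = ALTITUDE_RE.search(ocr_text)
    return {
        "date": date_match.group(0) if date_match else None,
        "altitude_m": int(altitude_match.group(1)) if altitude_match else None,
    }


if __name__ == "__main__":
    # Invented sample label, standing in for the output of an OCR service.
    sample = "Example Herbarium\nLeg. A. Collector 1234\n12 May 1923\nAlt. 850 m"
    print(extract_label_fields(sample))
```

    In a service-based workflow of the kind described, a function like this would sit behind a service call, so that it could be replaced by a more capable extractor without changing the rest of the pipeline.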

    Conversion of Cadastral Survey Information into LandXML Files using Machine Learning

    Although new cadastral surveys can readily be produced in the industry-standard LandXML format, there is a vast amount of pre-existing information which is stored only as image files. Automating the back-capture of this information would improve a process which is labour-intensive and prone to human error. This project proposes a workflow to automate this process for Victorian cadastral survey information. Specific algorithms and outcomes are examined using a simplified sample cadastral plan. The literature review reveals that similar documentation processes have been undertaken in other fields, such as music (Calvo-Zaragoza et al., 2018). In the cadastral context, only true-to-scale cadastral maps have been digitised, but not surveyors' sketches or field records (Ignjatić et al., 2018). A simple plan was created containing a closed parcel and two instrument points for creation and testing of the workflow. An analysis of the tasks required to extract the information needed for the LandXML files was undertaken. A pipeline, dubbed Double Filter Capture, was designed to perform the data extraction in a machine learning environment. It consists of two main workflows that handle the graphical information and the text elements separately, by means of Computer Vision and Optical Character Recognition algorithms, respectively. An implementation of the actions in the pipeline was trialled and the barriers encountered are discussed. Several Machine Learning algorithms were used for the required tasks, such as line detection, corner detection, image rotation, text detection and text extraction. The project gives some idea of the possibilities and limitations that a larger-scale automated back-capture would face when dealing with records of significantly greater complexity. It also points the way to further research required to refine the extraction process outlined here, for example by including elements omitted in this project, such as occupation and other auxiliary information and hand-written records. This project demonstrates that automated, accurate data extraction from an image file is possible; however, an extensive investment would be required in the programming stage, given the complexity and inconsistencies of existing plans that require back-capture.
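    The abstract describes two workflows that process graphical and text elements separately, which implies that their results must be brought back together at some point. A minimal sketch of one possible merge step follows. It is an assumption, not the Double Filter Capture implementation: the names `Segment`, `TextItem` and `attach_labels` are hypothetical, and the nearest-midpoint rule is chosen here only because it is simple to state.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Segment:
    """A line found by the graphical workflow, in image coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

    def midpoint(self) -> tuple:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


@dataclass(frozen=True)
class TextItem:
    """A string found by the text workflow, with the position of its centre."""
    text: str
    x: float
    y: float


def attach_labels(segments: list, texts: list) -> dict:
    """Assign each text item to the segment whose midpoint is nearest (assumed rule)."""
    labels = {index: [] for index in range(len(segments))}
    for item in texts:
        def distance(index: int) -> float:
            mx, my = segments[index].midpoint()
            return hypot(mx - item.x, my - item.y)
        nearest = min(range(len(segments)), key=distance)
        labels[nearest].append(item.text)
    return labels


if __name__ == "__main__":
    # Invented inputs standing in for the outputs of the two workflows.
    segments = [Segment(0, 0, 10, 0), Segment(10, 0, 10, 10)]
    texts = [TextItem("10.00", 5, 1), TextItem("90°00'", 11, 5)]
    print(attach_labels(segments, texts))
```

    A nearest-midpoint rule would likely be too naive for real plans, where labels can be offset, rotated, or shared between lines; it is shown only to make the separation and recombination of the two workflows concrete.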