
ComTax: community-driven curation for taxonomic databases

Abstract

This poster presents the work of the ComTax project to develop a community-driven curation process among practicing scientists and citizen scientists. The project provides tools to help scientists identify and validate appropriate taxonomic names from the scanned historical literature. The system operates on scanned documents, typically taken from the Biodiversity Heritage Library, although documents sourced from other repositories could be used. It is intended to be used on the uncorrected text produced by optical character recognition (OCR) of the scanned images. The key stages are:

1. Identify possible taxonomic names in the scanned text using machine learning techniques.
2. Verify the extracted names against existing databases. If a name is present, the source scanned text can be automatically marked up with the name.
3. Handle unverified names. A name may fail verification because it is not currently recorded in the verification databases, typically because the old name in the literature has been reclassified, or because erroneous OCR means that the name is incorrectly transcribed in the scanned text. In either case:
   3.1. Present the proposed name to domain experts or citizen scientists for validation or correction, potentially through a voting mechanism to collect expert judgments on the putative taxonomic name.
   3.2. Mark up the scanned text with the corrected spelling of the name and offer validated taxonomic names for further use by the community.

This poster will describe the technical challenges facing the ComTax project and highlight potential extensions of the work to the curation of other entities of interest in the legacy literature or in the literature of other disciplines.
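The stages described above can be illustrated with a minimal Python sketch. It is not the ComTax implementation: a simple regular expression for Latin binomials stands in for the machine learning name finder, a small in-memory set (`KNOWN_NAMES`) stands in for the external verification databases, and the function names, vote thresholds, and similarity cutoff are all hypothetical choices made for illustration.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical stand-in for the external verification databases.
KNOWN_NAMES = {"Quercus robur", "Homo sapiens", "Passer domesticus"}

# Heuristic for Latin binomials: capitalised genus followed by a lowercase
# epithet. This replaces the machine learning step purely for illustration.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")


def find_candidates(text):
    """Stage 1: return possible taxonomic names found in OCR text."""
    return BINOMIAL.findall(text)


def verify(name, known=KNOWN_NAMES):
    """Stage 2: check a name against the known names.

    Returns (status, suggestion). For an unverified name, the suggestion is
    the closest known name (a likely OCR correction), or None if no known
    name is similar enough.
    """
    if name in known:
        return "verified", name
    close = get_close_matches(name, sorted(known), n=1, cutoff=0.85)
    return "unverified", (close[0] if close else None)


def tally_votes(votes, min_votes=3, threshold=0.5):
    """Stage 3.1: accept the most popular spelling if enough reviewers
    voted and it holds more than `threshold` of the votes; else None."""
    if len(votes) < min_votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count / len(votes) > threshold else None


# Sample OCR text containing one correct name and one OCR error ("vs" for "us").
SAMPLE = "Specimens of Quercus robur and Passer domesticvs were collected."

if __name__ == "__main__":
    for candidate in find_candidates(SAMPLE):
        status, suggestion = verify(candidate)
        print(candidate, status, suggestion)
```

In this sketch an unverified name is never corrected automatically. The fuzzy match only proposes a spelling, which reviewers would then confirm or reject through `tally_votes`, mirroring the role of expert judgment in stage 3.1.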
