Automating Metadata Extraction: Genre Classification

Kim, Dr Yunhyong; Ross, Seamus

Authors: Dr Yunhyong Kim; Seamus Ross
Publication date: 1 January 2006
Abstract

A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge this gap, it often helps to mine scientific texts to aid understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF ([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
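
As a rough illustration of the multi-directional idea, the sketch below treats each direction as an independent classifier that may propose a genre, and combines the proposals by simple voting. Everything concrete here is an assumption made for illustration: the genre labels, the dictionary keys (columns, first_lines), the toy rules, and the voting scheme itself. The abstract does not say how the directions are combined, and only the first two directions are sketched.

```python
from typing import Callable, Dict, Optional

# A document is modelled as a dict of already-extracted properties.
# The keys used below are hypothetical, chosen only for this sketch.
Document = Dict[str, object]
ViewClassifier = Callable[[Document], Optional[str]]


def visual_view(doc: Document) -> Optional[str]:
    """Toy rule for the 'visual format' direction (an assumption)."""
    return "journal_paper" if doc.get("columns") == 2 else None


def grammar_view(doc: Document) -> Optional[str]:
    """Toy rule for the 'layout of strings' direction (an assumption)."""
    text = str(doc.get("first_lines", "")).lower()
    if text.startswith("from:"):
        return "email"
    if "abstract" in text:
        return "journal_paper"
    return None


def classify_genre(doc: Document,
                   views: Dict[str, ViewClassifier]) -> Optional[str]:
    """Return the genre proposed by the most views, or None if no view votes."""
    votes: Dict[str, int] = {}
    for view in views.values():
        label = view(doc)
        if label is not None:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None
    return max(votes, key=lambda label: votes[label])


# Only two of the five directions are sketched; the others would be
# registered in the same way.
VIEWS: Dict[str, ViewClassifier] = {
    "visual": visual_view,
    "grammar": grammar_view,
}

paper = {"columns": 2, "first_lines": "Abstract\nWe study the problem of"}
email = {"columns": 1, "first_lines": "From: alice@example.org"}
print(classify_genre(paper, VIEWS))
print(classify_genre(email, VIEWS))
```

Keeping each direction behind the same small interface means a further direction, such as stylometric signatures, could be added to VIEWS without changing classify_genre.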

Available Versions

Electronic Resource Preservation and Access Network ePRINTS Service

oai:eprints.erpanet.org:111

Last time updated on 02/07/2012