
Automating Metadata Extraction: Genre Classification

By Dr Yunhyong Kim and Seamus Ross


A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge this gap, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text is in turn made significantly easier by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF ([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at the documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.

Topics: EE Description, CG Harvesting, P Curation Issues, EA Metadata, E Data Description, Documentation and Standards
Year: 2006
OAI identifier: oai:eprints.erpanet.org:111
