Genre Classification in Automated Ingest and Appraisal Metadata

By Dr Yunhyong Kim and Seamus Ross

Abstract

Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments relating to the first two directions are described here; they are meant to indicate the promise of this multifaceted approach.

Topics: CG Harvesting, P Curation Issues, EG Representation Information, O Costs, EA Metadata
Year: 2006
DOI identifier: 10.1007/11863878_6
OAI identifier: oai:eprints.erpanet.org:110

