"The Naming of Cats": Automated Genre Classification.

By Dr Yunhyong Kim and Seamus Ross

Abstract

This paper builds on work presented at ECDL 2006 ([29]) on automated genre classification as a step toward automating metadata extraction from digital documents for ingest into digital repositories such as those run by archives, libraries and eprint services. We divide document features into five types: visual layout features, linguistically modeled syntactic features, stylometric features, semantic structure features, and contextual features that treat the document as an object linked to previously classified objects and other external sources. Results concerning the first two types have been described elsewhere ([29]). The current paper discusses results from testing classifiers based on image and stylometric features and shows that genres for which image features fail to cluster are the genres for which stylometric features cluster very well.

Topics: M Resource Discovery, EA Metadata
Year: 2006
DOI identifier: 10.2218/ijdc.v2i1.13
OAI identifier: oai:eprints.erpanet.org:123

Citations

  1. (2004). A Shallow Approach To Syntactic Feature Extraction For Genre Classification.
  2. Adobe Acrobat PDF specification:
  3. (2003). An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis.
  4. (2004). Anaphora Resolution for Automatic Citation Linking. Masters Thesis, MSc. for Speech and Language Processing,
  5. (2004). Automatic Categorization of Email into Folders. Benchmark Experiments on Enron and SRI Corpora,
  6. (2000). Automatic Document Metadata Extraction using Support Vector Machines.
  7. Automatic Metadata Generation:
  8. (2006). Automating Metadata Extraction: Genre Classification Poster, UK e-Science All Hands Meeting
  9. (2001). Automating the production of bibliographic records.
  10. B.: XPDF PDF document viewer.
  11. (2005). Clustering Document Images Using a Bag of Symbols Representation.
  12. (1995). Dimensions of Register Variation: a Cross-Linguistic Comparison.
  13. (2002). Document Understanding for a Broad Class of Documents.
  14. (2003). Domain oriented information extraction from the Internet.
  15. (2005). E.: Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition,
  16. (2003). E.: Invest to Save: Report
  17. ERPANET: Packaged Object Ingest Project.
  18. (2001). Fine-Grained Document Genre Classification Using First Order Random Graphs.
  19. (2006). Genre Classification in Automated Ingest and Appraisal Metadata.
  20. Graphics Recognition
  21. (2003). Groups: Reference Models for Digital Libraries: Actors and Roles
  22. Implicit Reference to Citations: A study of astronomy papers. (preprint, reference available upon request)
  23. Initiative: http://dublincore.org/tools/#automaticextraction
  24. (2001). Integrating Automatic Genre Analysis into Digital Libraries.
  25. (2003). Investigating GIS and Smoothing for Maximum Entropy Taggers. Proceedings, Annual Meeting,
  26. (2000). Knowledge-based Metadata Extraction from PostScript File.
  27. (2003). Learning Subjective Nouns using Extraction Pattern Bootstrapping. 7th CoNLL,
  28. (2006). Learning to Classify Documents According to Genre.
  29. (2002). Machine Learning in Automated Text Categorization.
  30. MetadataExtractor: http://pami.uwaterloo.ca/ (follow the link for Text Mining)
  31. of New Zealand: Metadata Extraction Tool.
  32. (2006). PERC: A Personal Email Classifier.
  33. (2006). Performance Comparison of Six Algorithms for Page Segmentation. 7th IAPR Workshop on Document Analysis Systems (DAS)
  34. PREMIS (PREservation Metadata: Implementation Strategy) Working Group: http://www.oclc.org/research/projects/pmwg/
  35. (2005). Preservation Research and Sustainable Digital Libraries.
  36. Python Imaging Library:
  37. (1994). Recognizing Text Genres with Simple Metric using Discriminant Analysis.
  38. (2004). State-of-the-art on Automatic Genre Identification.
  39. (2005). Stereotyping the web: genre classification of web documents. Master's thesis,
