Searching for Ground Truth: a stepping stone in automating genre classification

By Dr Yunhyong Kim and Seamus Ross


This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward automated extraction of descriptive metadata for digital material. Here, we present results from experiments with human labellers. These experiments were conducted to assist in genre characterisation, to predict the obstacles that an automated system needs to overcome, and to contribute to building a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers, based on image and stylistic modeling features, in labelling the data on which three human labellers agreed across fifteen genre classes.

Topics: EE Description, EB Identification, LA Ingest, EC Cataloguing, LB Management, EA Metadata
Year: 2007
DOI identifier: 10.1007/978-3-540-77088-6_24
OAI identifier: oai:eprints.erpanet.org:135

