Formulating representative features with respect to document genre classification

By Dr Yunhyong Kim and Seamus Ross

Abstract

Genre classification (e.g. deciding whether a document is a scientific article or a magazine article) is closely bound to the physical and conceptual structure of a document, as well as to the level of depth of the text. Hence, it provides a means of ranking documents retrieved by search tools according to metrics other than topical similarity. Moreover, the structural information derived from genre classification can be used to locate target information within the text. Previous studies have attempted to detect genre classes using some normalised frequency of terms, or combinations of terms, in the document (here, "term" refers to words, phrases, syntactic units, sentences and paragraphs, as well as other patterns derived from deeper linguistic or semantic analysis). These approaches largely neglect how a term is distributed throughout the document. Here, we report the results of automated experiments based on distributive statistics of words, presenting evidence that a term's distribution pattern is a better indicator of genre class than its frequency.
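
The contrast between term frequency and term distribution can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say which distributive statistics were used, so the function names, the equal-sized segmentation and the toy documents below are all assumptions made purely for illustration.

```python
def term_frequency(tokens, term):
    """Normalised frequency: occurrences of `term` divided by document length."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


def term_distribution(tokens, term, n_segments=4):
    """Count occurrences of `term` in each of `n_segments` contiguous,
    roughly equal-sized segments of the document (an assumed, simple
    stand-in for a distributive statistic)."""
    counts = [0] * n_segments
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map token position i to a segment index in [0, n_segments - 1].
            counts[i * n_segments // len(tokens)] += 1
    return counts


# Two hypothetical 8-token documents in which "method" occurs twice.
doc_a = ["method", "method", "x", "x", "x", "x", "x", "x"]  # clustered at the start
doc_b = ["method", "x", "x", "x", "method", "x", "x", "x"]  # spread out

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, term_frequency(doc, "method"), term_distribution(doc, "method"))
```

Both toy documents give the same normalised frequency for "method" (0.25), so a frequency-only feature cannot tell them apart, whereas the per-segment counts differ ([2, 0, 0, 0] versus [1, 0, 1, 0]). That difference is the kind of signal a distribution-based feature could expose to a classifier.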

Topics: M Resource Discovery, LA Ingest, EA Metadata
Year: 2008
DOI identifier: 10.1007/978-90-481-9178-9_6
OAI identifier: oai:eprints.erpanet.org:154
