Variations of word frequencies in Genre classification tasks.

By Dr Yunhyong Kim and Seamus Ross


This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In this report, we analyse word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between word frequency metrics and the degree of their significance in carrying out classification in varying environments.

Topics: LA Ingest, LB Management, EA Metadata
Year: 2007
OAI identifier: oai:eprints.erpanet.org:134
Full text: http://eprints.erpanet.org/134... (external link)