This paper builds on work presented at ECDL 2006 on automated genre classification as a step toward automating metadata extraction from digital documents for ingest into digital repositories such as those run by archives, libraries and eprint services. We divide document features into five types: visual layout features, linguistically modeled syntactic features, stylometric features, semantic structure features, and contextual features that treat the document as an object linked to previously classified objects and other external sources. Results concerning the first two types have been described elsewhere. The current paper discusses results from testing classifiers based on image and stylometric features and shows that the genres for which image features fail to cluster are the genres for which stylometric features cluster very well.