Feature Combination for Genre Classification

Abstract

In this paper, we describe an experiment on genre classification of Swedish texts, using as predictors the frequencies of the 50 most frequent words in the Stockholm-Umeå Corpus (SUC). The purpose of the experiment was to find out whether combining features in a fully connected feedforward multi-layer perceptron (MLP) gives better classification than using single features in a decision tree. The 1,040 text samples in SUC, classified into 9 major genres, were divided into 10 sets and used for 10-fold cross-validation training of 10 MLPs (50-7-9), where the hidden layer is intended to correspond to the 7 stylistic dimensions of Biber (1995). The MLPs performed better than the decision tree from a previous experiment (48.6% vs. 58.8% misclassification). Given the simplicity of the predictors, the sparse data, and the skewed distribution of genres in the text collection, the result is rather promising. To explain the knowledge learnt by the MLPs, we also extracted decision trees from the input and output of the MLPs. Extra input was generated by sampling from the feature space of the original training data. Compared with the tree from the previous experiment, the resulting trees made finer distinctions (more branches), used about the same features but with additional split points, and included a few more features.
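The setup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: whitespace tokenization, relative (rather than absolute) frequencies, the interleaved fold assignment, and the presence of bias units in the 50-7-9 network are all assumptions, and every function name is hypothetical.

```python
from collections import Counter
import random


def top_k_words(texts, k):
    """Return the k most frequent tokens across the whole collection.

    Assumption: lowercased whitespace tokenization, which the abstract
    does not specify.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return [word for word, _ in counts.most_common(k)]


def frequency_features(text, vocab):
    """Relative frequency of each vocabulary word within one text.

    Assumption: frequencies are normalized by text length.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / n for word in vocab]


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mlp_parameter_count(layers):
    """Number of weights in a fully connected network.

    Assumption: one bias unit feeds every non-input layer.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))
```

Under these assumptions, `k_fold_indices(1040, 10)` yields ten folds of 104 samples each, and `mlp_parameter_count((50, 7, 9))` gives 429 trainable weights, which is small relative to the 1,040 samples.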
