Wikipedia-based hybrid document representation for textual news classification

Abstract

The sheer number of news items published every day makes automating their classification worthwhile. The common approach consists of representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts, or units of meaning, have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, however, the BoW representation has proven remarkably strong for classifying news items, with several studies reporting that it outperforms different "flavours" of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created specifically for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more "concept-friendly" Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) on average, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence.
Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.

Atlantic Research Center for Information and Communication Technologies
Xunta de Galicia | Ref. R2014/034 (RedPlir)
Xunta de Galicia | Ref. R2014/029 (TELGalicia)
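
The hybrid representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the concept dictionary `TOY_CONCEPTS`, the greedy longest-match lookup, and the `w:` / `c:` feature prefixes are all assumptions made for illustration, standing in for a real Wikipedia-based semantic annotator. The sketch only shows how word counts and concept counts might be combined into one feature set, and how synonyms and multiword terms can collapse into a single concept feature.

```python
import re
from collections import Counter

# Hypothetical toy mapping from surface forms to concept identifiers.
# A real system would obtain concepts from a Wikipedia-based annotator.
TOY_CONCEPTS = {
    "crude oil": "Petroleum",
    "petroleum": "Petroleum",
    "central bank": "Central_bank",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs (the BoW representation)."""
    return Counter(tokenize(text))


def bag_of_concepts(text, concepts=TOY_CONCEPTS):
    """Count concepts using a greedy longest-match dictionary lookup."""
    tokens = tokenize(text)
    max_len = max(len(key.split()) for key in concepts)
    counts = Counter()
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in concepts:
                counts[concepts[phrase]] += 1
                i += n
                break
        else:
            # No concept starts at this token; move on.
            i += 1
    return counts


def hybrid_features(text):
    """Merge word and concept counts, prefixed to keep the two spaces apart."""
    features = Counter()
    for word, count in bag_of_words(text).items():
        features["w:" + word] = count
    for concept, count in bag_of_concepts(text).items():
        features["c:" + concept] = count
    return features


doc = "Crude oil prices rose as the central bank met; petroleum exports fell."
features = hybrid_features(doc)
```

In this toy document, "crude oil" and "petroleum" both contribute to the same concept feature, while the individual words remain available as ordinary BoW features. The sketch does not attempt disambiguation, so polysemy is left unaddressed. The resulting feature dictionary could then be vectorised and passed to any supervised learner.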
