6,374 research outputs found

    Wikipedia-based hybrid document representation for textual news classification

    Get PDF
    Automatic classification of news articles is a relevant problem due to the large amount of news generated every day, so it is crucial that these news are classified to allow for users to access to information of interest quickly and effectively. On the one hand, traditional classification systems represent documents as bag-of-words (BoW), which are oblivious to two problems of language: synonymy and polysemy. On the other hand, several authors propose the use of a bag-of-concepts (BoC) representation of documents, which tackles synonymy and polysemy. This paper shows the benefits of using a hybrid representation of documents to the classification of textual news, leveraging the advantages of both approaches-the traditional BoW representation and a BoC approach based on Wikipedia knowledge. To evaluate the proposal, we used three of the most relevant algorithms in the state-of-the art-SVM, Random Forest and Naïve Bayes-and two corpora: the Reuters-21578 corpus and a purpose-built corpus, Reuters-27000. Results obtained show that the performance of the classification algorithm depends on the dataset used, and also demonstrate that the enrichment of the BoW representation with the concepts extracted from documents through the semantic annotator adds useful information to the classifier and improves their performance. Experiments conducted show performance increases up to 4.12% when classifying the Reuters-21578 corpus with the SVM algorithm and up to 49.35% when classifying the corpus Reuters-27000 with the Random Forest algorithm.Atlantic Research Center for Information and Communication TechnologiesXunta de Galicia | Ref. R2014/034 (RedPlir)Xunta de Galicia | Ref. R2014/029 (TELGalicia

    Wikipedia-based hybrid document representation for textual news classification

    Get PDF
    The sheer amount of news items that are published every day makes worth the task of automating their classification. The common approach consists in representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts—or units of meaning—have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. The reality is that, when classifying news items, the BoW representation has proven to be really strong, with several studies reporting it to perform above different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text—leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier, comparing it with BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study: the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher in the more “concept-friendly” Reuters-27000; (2) the Hybrid-WikiBoC approach proposed offers performance increases over BoW up to 4.12 and 49.35% when classifying Reuters-21578 and Reuters-27000 corpora, respectively; and (3) for average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.Atlantic Research Center for Information and Communication TechnologiesXunta de Galicia | Ref. R2014/034 (RedPlir)Xunta de Galicia | Ref. R2014/029 (TELGalicia

    Text Summarization Techniques: A Brief Survey

    Get PDF
    In recent years, there has been a explosion in the amount of text data from a variety of sources. This volume of text is an invaluable source of information and knowledge which needs to be effectively summarized to be useful. In this review, the main approaches to automatic text summarization are described. We review the different processes for summarization and describe the effectiveness and shortcomings of the different methods.Comment: Some of references format have update

    Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

    Full text link
    The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance
    • 

    corecore