Frequent semantic patterns for document relevance ranking
Modelling user interest has been a challenge for improving the performance of information filtering systems (IFs). Current approaches to modelling user interest are term-based, phrase-based, or pattern-based [2, 5, 13]. Patterns have been said to convey more specific and relevant information about a user’s interest [5]. However, existing patterns, such as frequent and closed patterns, are generated solely from statistical features such as frequency, and their semantic meaning is ignored. This study proposes a new information filtering model named Frequent Semantic Patterns for Document Relevance Ranking, abbreviated as FSPnIF. In particular, a new type of pattern, called the frequent semantic pattern (FSP), is proposed to represent a user’s interest. The patterns are representative because they are generated from the most frequent words in the training corpus. They also convey semantic meaning because they are verified against meaningful concepts in an ontology. A new method for measuring document relevance based on FSPs is also proposed to filter relevant documents in IFs. The model was evaluated in IFs using the RCV1 and R8 datasets. The results of extensive experiments show that the proposed model significantly outperformed all the state-of-the-art baseline models on five main evaluation measures.
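As a rough illustration of the idea in this abstract, the sketch below mines word pairs from the most frequent words of a toy corpus, keeps only those that match a concept, and scores a document by the patterns it contains. It is not the authors' algorithm: the pair-only patterns, the `concepts` set standing in for an ontology, the support-sum relevance score, and the parameters `top_k` and `min_support` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def frequent_semantic_patterns(docs, concepts, top_k=4, min_support=2):
    """Return {pattern: support} for word pairs that are frequent and match a concept.

    `concepts` is a hypothetical stand-in for an ontology: a set of frozensets of words.
    """
    # Restrict candidates to the top_k most frequent words in the corpus.
    word_freq = Counter(w for doc in docs for w in doc)
    top_words = {w for w, _ in word_freq.most_common(top_k)}

    # Count in how many documents each pair of top words co-occurs.
    support = Counter()
    for doc in docs:
        present = sorted(set(doc) & top_words)
        for pair in combinations(present, 2):
            support[frozenset(pair)] += 1

    # Keep pairs that are frequent enough AND verified by a concept.
    return {p: s for p, s in support.items()
            if s >= min_support and p in concepts}

def relevance(doc, patterns):
    """Assumed scoring rule: sum the supports of all patterns contained in doc."""
    words = set(doc)
    return sum(s for p, s in patterns.items() if p <= words)

# Toy training corpus and toy "ontology" (both invented for this example).
docs = [
    ["stock", "market", "price"],
    ["stock", "market", "bank"],
    ["market", "price", "rise"],
    ["bank", "loan"],
]
concepts = {frozenset({"stock", "market"}), frozenset({"bank", "loan"})}

patterns = frequent_semantic_patterns(docs, concepts)
score_relevant = relevance(["stock", "market", "crash"], patterns)
score_other = relevance(["bank", "loan"], patterns)
```

In this toy run the pair of "market" and "price" is just as frequent as the pair of "stock" and "market", but only the latter survives because it matches a concept, which is the distinction the abstract draws between purely statistical patterns and semantic ones.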
A Bidimensional View of Documents for Text Categorisation
This paper addresses the problem of finding a bidimensional representation of textual documents for text categorisation. Documents are projected into this representation in a sequence of steps. The main idea is to consider two aspects of the importance of a word: its local importance within a category, and its global importance in the remaining categories. This information is combined and summarised in two coordinates. A machine learning method can then be applied in this simple bidimensional space to classify the documents. The results obtained in this space are satisfactory when compared with the best state-of-the-art performance.
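The projection described above can be sketched as follows. This is only a minimal illustration under stated assumptions: word importance is taken to be the word's relative frequency within a category, the global coordinate is averaged over the other categories, and the function names `word_weights` and `project` are hypothetical. The paper's actual weighting and combination steps are not given in the abstract.

```python
from collections import Counter

def word_weights(docs_by_category):
    """Assumed importance measure: relative frequency of each word per category."""
    weights = {}
    for cat, docs in docs_by_category.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        weights[cat] = {w: n / total for w, n in counts.items()}
    return weights

def project(doc, category, weights):
    """Return (local, global) coordinates of doc with respect to one category."""
    others = [c for c in weights if c != category]
    # Local coordinate: importance of the document's words inside the category.
    local = sum(weights[category].get(w, 0.0) for w in doc)
    # Global coordinate: average importance of the same words in the other categories.
    glob = sum(weights[c].get(w, 0.0) for w in doc for c in others)
    if others:
        glob /= len(others)
    return local, glob

# Toy labelled corpus (invented for this example).
train = {
    "sport": [["goal", "match", "team"], ["team", "win"]],
    "finance": [["stock", "market"], ["market", "win", "bank"]],
}
weights = word_weights(train)

x_sport, y_sport = project(["team", "goal"], "sport", weights)
x_fin, y_fin = project(["win"], "finance", weights)
```

Each document then becomes a point in the plane for a given category, and any standard classifier could be trained on those two coordinates; a document whose local coordinate dominates its global one is a plausible member of that category.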