
    Text Categorization based on Clustering Feature Selection

    In this paper, we discuss a text categorization method based on k-means clustering for feature selection. K-means is a classical algorithm for clustering text data, but it is seldom used for feature selection. For text data, words that express the correct semantics of a class are usually good features. We use k-means to capture several cluster centroids for each class, and then choose the high-frequency words in those centroids as the text features for categorization. The words extracted by k-means not only represent each class cluster well but also have high quality for semantic expression. On three standard text datasets, classifiers based on our feature selection method outperform the original classifiers for text categorization.
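The centroid-based selection described above can be sketched in a few lines: cluster one class's documents with a naive k-means over term-frequency vectors, then keep the highest-weight words of each centroid. This is a minimal stand-in, not the paper's implementation; the helper name, the initialization rule, and `top_n` are all assumptions for illustration.

```python
from collections import Counter

def centroid_top_terms(class_docs, k=2, top_n=2, iters=10):
    """Pick high-weight terms from k-means centroids of ONE class.

    class_docs: list of token lists belonging to a single class.
    Hypothetical helper; the paper's exact weighting may differ.
    """
    vocab = sorted({w for doc in class_docs for w in doc})
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = []
    for doc in class_docs:
        v = [0.0] * len(vocab)
        for w, c in Counter(doc).items():
            v[idx[w]] = float(c)
        vecs.append(v)

    # naive k-means: initialize centroids from the first k documents
    cents = [list(vecs[i % len(vecs)]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vecs:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, cents[c])))
            groups[j].append(v)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                cents[j] = [sum(col) / len(g) for col in zip(*g)]

    # collect the highest-weight terms of each centroid as class features
    feats = set()
    for c in cents:
        order = sorted(range(len(vocab)), key=lambda i: -c[i])
        feats.update(vocab[i] for i in order[:top_n])
    return sorted(feats)
```

Running this per class and taking the union of the returned terms yields the reduced feature set the classifiers would then be trained on.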

    A New Feature Selection Method based on Intuitionistic Fuzzy Entropy to Categorize Text Documents

    Selecting highly discriminative features in text documents plays a major and challenging role in categorization. Feature selection is an important task that reduces the dimensionality of the feature matrix, which in turn enhances categorization performance. This article presents a new feature selection method based on Intuitionistic Fuzzy Entropy (IFE) for text categorization. First, the Intuitionistic Fuzzy C-Means (IFCM) clustering method is employed to compute intuitionistic membership values. These membership values are then used to estimate intuitionistic fuzzy entropy via the Match degree, and features with lower entropy values are selected to categorize the text documents. To assess the efficacy of the proposed method, experiments were conducted on three standard benchmark datasets using three classifiers, with F-measure used to evaluate classifier performance. The proposed method shows impressive results compared to other well-known feature selection methods. Moreover, the Intuitionistic Fuzzy Set (IFS) property addresses the uncertainty limitations of traditional fuzzy sets.
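The selection rule, rank features by intuitionistic fuzzy entropy and keep the lowest, can be sketched as follows. The abstract does not give the Match-degree formula, so this sketch substitutes the widely used Szmidt-Kacprzyk entropy over (membership, non-membership) pairs; both function names and the entropy definition are assumptions, not the paper's method.

```python
def ifs_entropy(memberships):
    """Intuitionistic fuzzy entropy of one feature.

    memberships: list of (mu, nu) pairs with hesitation
    pi = 1 - mu - nu. Uses the Szmidt-Kacprzyk definition as a
    stand-in for the paper's Match-degree formulation.
    Returns a value in [0, 1]; lower means more discriminative.
    """
    total = 0.0
    for mu, nu in memberships:
        pi = 1.0 - mu - nu
        total += (min(mu, nu) + pi) / (max(mu, nu) + pi)
    return total / len(memberships)

def select_low_entropy(feature_memberships, top_n):
    """Keep the top_n features with the LOWEST entropy.

    feature_memberships: list of (name, [(mu, nu), ...]) pairs,
    e.g. produced by an IFCM clustering step.
    """
    ranked = sorted(feature_memberships, key=lambda kv: ifs_entropy(kv[1]))
    return [name for name, _ in ranked[:top_n]]
```

A feature with a crisp membership profile (mu near 1, nu near 0) scores low entropy and survives selection, while an ambiguous one (mu close to nu) scores near 1 and is discarded.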

    Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification

    An ensemble method is an approach in which several classifiers are built from the training data; the ensemble is often more accurate than any single classifier, especially if the base classifiers are accurate and differ from one another. Meanwhile, feature clustering can reduce the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The methodology consists of preprocessing the text documents, generating feature subspaces using genetic algorithm-based iterative refinement, implementing base classifiers with feature clustering, and integrating the classification results of the base classifiers using both static selection and majority voting. Experimental results show that, when classifying the dataset into 2 and 3 categories, the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than classification without feature selection. Using static selection, the ensemble feature selection method with genetic algorithm-based iterative refinement achieves 10% and 10.66% better accuracy than the single classifier for 2 and 3 categories, respectively. Using majority voting for the same experiment, the ensemble achieves 10% and 12% better accuracy than the single classifier, respectively.
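The majority-voting integration step mentioned above is the simplest of the two combination rules and can be sketched directly; the tie-breaking rule (lexicographically first label) is an arbitrary choice for this illustration, not something the abstract specifies.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists by majority vote.

    predictions: list of lists, where predictions[c][d] is the label
    base classifier c assigns to document d. Ties are broken
    deterministically: highest vote count, then lexicographic label
    (an assumed rule; the paper does not state one).
    """
    n_docs = len(predictions[0])
    combined = []
    for d in range(n_docs):
        votes = Counter(p[d] for p in predictions)
        label = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        combined.append(label)
    return combined
```

For three base classifiers predicting `["a","b"]`, `["a","c"]`, and `["b","b"]`, the combined output is `["a","b"]`: label "a" wins document 0 with two votes and "b" wins document 1.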

    Feature Selection for Categorization of Online News Articles in Myanmar Language

    In text mining, feature selection plays an important role in reducing the high dimensionality of the feature space. It can improve the accuracy of document clustering and help avoid overfitting. Nowadays, an enormous number of news articles is available on the internet due to the rapid development of the web, so there is an urgent need to extract useful content from overloaded information. Categorizing online text documents is crucial to avoid information overload and helps readers quickly find topics of interest; the problem that arises in text categorization is the large feature space. This study has two phases: document preprocessing and feature selection. Document preprocessing comprises document collection, syllable segmentation, word segmentation, and stop-word removal to extract features from a collection of Myanmar online news documents covering sport, health, crime, etc. In this study, the TF-IDF weighting method is adapted for feature selection. Experimental results show that the adapted TF-IDF method outperforms the base TF-IDF method.
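The base TF-IDF weighting that the study adapts is standard and can be sketched for already-segmented documents; the adapted Myanmar-specific variant is not described in the abstract and is not reproduced here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for tokenised documents.

    docs: list of token lists (e.g. the output of word segmentation).
    Returns one {term: weight} dict per document, using the classic
    tf * log(N / df) scheme -- the base method the study adapts.
    """
    n = len(docs)
    df = Counter()                 # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)          # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term occurring in every document gets idf = log(N/N) = 0 and so carries no weight, which is exactly why TF-IDF discards undiscriminative high-frequency terms during feature selection.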

    Improving web search by categorization, clustering, and personalization

    This research combines Web snippet categorization, clustering, and personalization techniques to recommend relevant results to users. RIB (Recommender Intelligent Browser), which categorizes Web snippets using a socially constructed Web directory such as the Open Directory Project (ODP), is to be developed. By comparing the similarity between the semantics of each ODP category, represented by its category-documents, and the Web snippets, the snippets are organized into a hierarchy. Meanwhile, the snippets are clustered to boost the quality of the categorization. Based on an automatically formed user profile that takes into consideration desktop computer information and concept drift, the proposed search strategy recommends relevant search results to users. This research also intends to verify text categorization, clustering, and feature selection algorithms in a context where only Web snippets are available.
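The core matching step, comparing a snippet against each category-document and assigning the most similar category, can be sketched with cosine similarity over term frequencies. The abstract does not name the similarity measure, so cosine is an assumption here, and the category names in the usage example are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize_snippet(snippet_tokens, category_docs):
    """Assign a snippet to the most similar directory category.

    category_docs: {category_name: token list of its category-document},
    e.g. built from an ODP-style directory (hypothetical data layout).
    """
    snip = Counter(snippet_tokens)
    return max(category_docs,
               key=lambda c: cosine(snip, Counter(category_docs[c])))
```

In a full system, each category-document would be aggregated from the pages filed under that directory node, and the winning category places the snippet into the hierarchy.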

    A new unsupervised feature selection method for text clustering based on genetic algorithms

    Nowadays a vast amount of textual information is collected and stored in databases around the world, with the Internet as the largest database of all. This rapidly increasing growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, and the nuggets of insight or new knowledge risk languishing undiscovered in the literature. Text mining offers a solution by replacing or supplementing the human reader with automatic systems undeterred by the text explosion: it analyzes large collections of documents to discover previously unknown information. Text clustering is one of the most important areas in text mining; it includes text preprocessing, dimension reduction by selecting terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in this process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms to select proper terms from a corpus, but the evaluation of terms in groups has not been investigated in previous work. In this paper a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms, and a genetic algorithm is designed and implemented to find the most valuable groups of terms under the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify the approach, the proposed method and a conventional term variance method were implemented and tested on the Reuters-21578 collection. For a more accurate comparison, the methods were tested on three corpora; for each corpus the clustering task was run ten times and the results averaged. The comparison results are very promising and show that our method produces better average accuracy and F1-measure than the conventional term variance method.
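The conventional baseline in the comparison above, term variance, scores each term by how much its frequency varies across documents; a group of terms (one GA chromosome) then needs a group-level score. The abstract does not define Modified Term Variance, so the group score below simply sums member variances as a labeled stand-in.

```python
def term_variance(freqs):
    """Classic term variance of one term.

    freqs: the term's frequency in each document. A term with uniform
    frequency everywhere scores 0 (no discriminating power); a term
    concentrated in a few documents scores high.
    """
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)

def group_score(term_freqs, group):
    """Score a candidate group of terms (e.g. a GA chromosome).

    term_freqs: {term: per-document frequency list}.
    Summed variance is a simple stand-in for the paper's
    Modified Term Variance measure, which is not given in the abstract.
    """
    return sum(term_variance(term_freqs[t]) for t in group)
```

A genetic algorithm would evolve bit-string chromosomes over the vocabulary, using `group_score` as the fitness function, and the best-scoring group becomes the feature vector for clustering.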