
    A study on mutual information-based feature selection for text categorization

    Feature selection plays an important role in text categorization. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization. Many existing experiments show that IG is one of the most effective methods; by contrast, MI has been reported to perform relatively poorly. According to one existing MI method, the mutual information of a category c and a term t can be negative, which conflicts with the definition of MI in information theory, where it is always non-negative. We show that this form of MI used in TC is not derived correctly from information theory. Two different MI-based feature selection criteria are both referred to as MI in the TC literature; one of them should correctly be termed "pointwise mutual information" (PMI). In this paper, we clarify the terminological confusion surrounding the notion of "mutual information" in TC and detail an MI method derived correctly from information theory. Experiments with the Reuters-21578 and OHSUMED collections show that the corrected MI method performs similarly to IG and considerably better than PMI.
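    The terminological point can be made concrete with a small computation. The sketch below (using invented counts, not data from the paper) contrasts the pointwise quantity for a single term-category event, which can be negative, with the expectation over all four outcomes, which cannot.

    # A minimal sketch contrasting the two criteria the abstract distinguishes.
    # The counts below are made-up illustration values, not data from the paper.
    import math

    # 2x2 contingency table for a term t and a category c:
    # n11 = docs in c containing t, n10 = docs in c without t,
    # n01 = docs not in c containing t, n00 = docs not in c without t.
    n11, n10, n01, n00 = 5, 185, 95, 715
    n = n11 + n10 + n01 + n00

    def p(count):
        return count / n

    p_t = p(n11 + n01)  # P(t)
    p_c = p(n11 + n10)  # P(c)

    # Pointwise mutual information of the single event (t present, doc in c);
    # this is the quantity often mislabelled "MI" in the TC literature,
    # and it can be negative (as it is for these counts).
    pmi = math.log2(p(n11) / (p_t * p_c))

    # Mutual information I(T; C): the expectation of PMI over all four
    # term/category outcomes, which is always non-negative.
    joint = {(1, 1): p(n11), (1, 0): p(n01), (0, 1): p(n10), (0, 0): p(n00)}
    marg_t = {1: p_t, 0: 1 - p_t}
    marg_c = {1: p_c, 0: 1 - p_c}
    mi = sum(pj * math.log2(pj / (marg_t[t] * marg_c[c]))
             for (t, c), pj in joint.items() if pj > 0)

    print(f"PMI(t, c) = {pmi:.4f} (can be negative)")
    print(f"I(T; C)   = {mi:.4f} (always >= 0)")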

    Feature Selection based on Mutual Information

    The application of machine learning models such as support vector machines (SVM) and artificial neural networks (ANN) to predicting reservoir properties has proven effective in recent years compared with traditional empirical methods. Nevertheless, machine learning models suffer in the presence of uncertain data, a common characteristic of well log datasets; sources of this uncertainty include missing scales, data interpretation problems, and measurement errors. Feature selection aims at selecting the feature subset that is relevant to the property being predicted. In this paper a feature selection method based on a mutual information criterion is proposed; its strength lies in choosing the threshold for the typical greedy forward feature selection method according to a statistically sound criterion. Experimental results indicate that the proposed method improves the performance of the machine learning models in terms of prediction accuracy and reduced training time.
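    As a rough illustration of this kind of greedy forward procedure, the sketch below ranks features by estimated mutual information with the target and stops once the best remaining score falls below a threshold. The paper's statistically derived threshold is not given in the abstract, so the fixed `threshold` value here is a hypothetical placeholder, and the data are synthetic.

    # A hedged sketch of greedy forward feature selection by mutual information.
    # Note: marginal MI scores do not change between iterations, so this
    # simplified loop reduces to a ranking; a redundancy-aware criterion
    # (e.g. mRMR) would re-score candidates after each selection.
    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    def greedy_mi_selection(X, y, threshold=0.01):
        """Add the feature with the highest MI to the target, one at a
        time, until no remaining feature's MI exceeds the threshold."""
        remaining = list(range(X.shape[1]))
        selected = []
        while remaining:
            mi = mutual_info_regression(X[:, remaining], y)
            best = int(np.argmax(mi))
            if mi[best] < threshold:
                break  # no remaining feature clears the threshold: stop
            selected.append(remaining.pop(best))
        return selected

    # Toy well-log-style example with random data (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    y = 2.0 * X[:, 1] + 0.5 * X[:, 4] + rng.normal(scale=0.1, size=200)
    print(greedy_mi_selection(X, y))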

    Feature Selection Based on Dynamic Mutual Information

    This work analyzes and discusses the implementation of the feature selection method called Dynamic Mutual Information (DMIFS). The original description of DMIFS contains several irregularities, so the algorithm cannot be reproduced exactly; the results of the DMIFS implemented in this work are therefore compared with the results reported in the paper where DMIFS was published, and the implemented DMIFS achieves results similar to the original method. The work then proposes two new methods based on the DMIFS principle. The first, called DmRMR, merges mRMR and DMIFS; tests confirm that DmRMR performs better than DMIFS but is less stable. The second, called WDMIFS, is a weighted variant of DMIFS based on the AdaBoost algorithm; it did not improve performance. Finally, a manual for implementing DMIFS in Weka and RapidMiner is provided.
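    For orientation, the following sketch captures the dynamic idea: relevance is re-estimated only on the samples that the already-selected features have not yet separated. Since the abstract itself notes that the original description is ambiguous, the rule used here for discarding "distinguished" samples (dropping rows whose selected-feature value co-occurs with a single class) is one plausible reading, not the published algorithm.

    # A hedged sketch of dynamic mutual information feature selection.
    import numpy as np
    from sklearn.metrics import mutual_info_score

    def dmifs_sketch(X, y, n_features):
        remaining_rows = np.arange(len(y))
        candidates = list(range(X.shape[1]))
        selected = []
        while candidates and len(selected) < n_features and len(remaining_rows) > 0:
            Xr, yr = X[remaining_rows], y[remaining_rows]
            # Score each candidate by MI with the class, estimated only
            # on the samples not yet separated.
            scores = [mutual_info_score(Xr[:, f], yr) for f in candidates]
            best = candidates.pop(int(np.argmax(scores)))
            selected.append(best)
            # Drop samples the new feature already distinguishes: rows
            # whose value of the selected feature is class-pure.
            keep = []
            for v in np.unique(Xr[:, best]):
                rows = remaining_rows[Xr[:, best] == v]
                if len(np.unique(y[rows])) > 1:
                    keep.extend(rows)
            remaining_rows = np.array(keep, dtype=int)
        return selected

    # Toy categorical data (illustrative only).
    rng = np.random.default_rng(1)
    X = rng.integers(0, 3, size=(100, 6))
    y = (X[:, 0] + X[:, 3]) % 2
    print(dmifs_sketch(X, y, 3))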

    FEATURE SELECTION METHODS BASED ON MUTUAL INFORMATION FOR CLASSIFYING HETEROGENEOUS FEATURES

    Datasets with heterogeneous features can yield inappropriate feature selection results because heterogeneous features are difficult to evaluate concurrently. Feature transformation (FT) is one way to handle heterogeneous feature subset selection, but transforming non-numerical features into numerical ones may introduce redundancy with the original numerical features. In this paper, we propose a method for selecting feature subsets based on mutual information (MI) for classifying heterogeneous features. We use the unsupervised feature transformation (UFT) method to transform non-numerical features into numerical ones, and the joint mutual information maximisation (JMIM) method to select the feature subset with consideration of the class label. The transformed and original features are combined, a feature subset is determined using JMIM, and the result is classified with the support vector machine (SVM) algorithm. Classification accuracy is measured for each number of selected features and compared between the UFT-JMIM and Dummy-JMIM methods. Averaged over all experiments in this study, UFT-JMIM achieves about 84.47% classification accuracy and Dummy-JMIM about 84.24%. This result shows that UFT-JMIM can minimize the information loss between transformed and original features and select a feature subset that avoids redundant and irrelevant features.
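    A minimal sketch of the JMIM selection step, assuming discrete features (e.g. after UFT and binning) and synthetic data: each candidate is scored by the minimum, over already-selected features, of the joint mutual information I(candidate, selected; class), and the candidate with the largest such minimum is added.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def joint_mi(a, b, y):
        """I((a, b); y) for discrete arrays, by pairing a and b into one label."""
        paired = a.astype(int) * (int(b.max()) + 1) + b.astype(int)
        return mutual_info_score(paired, y)

    def jmim(X, y, n_features):
        candidates = list(range(X.shape[1]))
        # Seed with the single most relevant feature.
        first = max(candidates, key=lambda f: mutual_info_score(X[:, f], y))
        selected = [first]
        candidates.remove(first)
        while candidates and len(selected) < n_features:
            # Max-min criterion: pick the candidate whose weakest joint MI
            # with any selected feature is largest.
            best = max(candidates,
                       key=lambda f: min(joint_mi(X[:, f], X[:, s], y)
                                         for s in selected))
            selected.append(best)
            candidates.remove(best)
        return selected

    # Toy discrete data (illustrative only).
    rng = np.random.default_rng(2)
    X = rng.integers(0, 4, size=(150, 7))
    y = (X[:, 2] > 1).astype(int) ^ (X[:, 5] % 2)
    print(jmim(X, y, 3))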