
    Feature selection in high-dimensional dataset using MapReduce

    This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open-source implementation based on Hadoop/Spark and illustrate its scalability on datasets involving millions of observations or features.
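    The tall/narrow case maps naturally onto partition-wise sufficient statistics: each worker counts joint (feature value, label) occurrences over its rows, the counts are summed in a reduce step, and mutual-information relevance is computed from the merged counts. A minimal stdlib sketch of that pattern (the two-partition toy data and the helper names are illustrative assumptions, not the paper's Hadoop/Spark implementation):

    ```python
    from collections import Counter
    from functools import reduce
    import math

    def partial_counts(rows):
        """Map step: joint (feature, label) counts for one partition of rows."""
        return Counter(rows)

    def merge(a, b):
        """Reduce step: counts are additive, so partition results combine by summing."""
        return a + b

    def mutual_information(joint):
        """Relevance score I(X; Y) in nats, computed from the merged joint counts."""
        n = sum(joint.values())
        px, py = Counter(), Counter()
        for (x, y), c in joint.items():
            px[x] += c
            py[y] += c
        return sum((c / n) * math.log(n * c / (px[x] * py[y]))
                   for (x, y), c in joint.items())

    # Toy data: one feature's (value, label) pairs split across two "workers".
    part_a = [(0, 0), (0, 0), (1, 1)]
    part_b = [(1, 1), (0, 0), (1, 1)]
    joint = reduce(merge, [partial_counts(part_a), partial_counts(part_b)])
    relevance = mutual_information(joint)  # log 2 ≈ 0.693 for a perfect predictor
    ```

    Because the per-partition results are plain additive counts, the same merge works for any partitioning of the rows, which is what makes the computation embarrassingly parallel.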

    Data Mining on Ice

    In an atmospheric neutrino analysis for IceCube’s 59-string configuration, the impact of detailed feature selection on the performance of machine learning algorithms has been investigated. Feature selection is guided by the principle of maximum relevance and minimum redundancy. A Random Forest was studied as an example of a more complex learner. Benchmarks were obtained using the simpler learners k-NN and Naive Bayes. Furthermore, a Random Forest was trained and tested in a 5-fold cross validation using 3.5 × 10⁴ simulated signal and 3.5 × 10⁴ simulated background events.

    The mRMR variable selection method: a comparative study for functional data

    The use of variable selection methods is particularly appealing in statistical problems with functional data. The obvious general criterion for variable selection is to choose the ‘most representative’ or ‘most relevant’ variables. However, it is also clear that a purely relevance-oriented criterion could lead to selecting many redundant variables. The minimum Redundancy Maximum Relevance (mRMR) procedure, proposed by Ding and Peng (2005) and Peng et al. (2005), is an algorithm to systematically perform variable selection, achieving a reasonable trade-off between relevance and redundancy. In its original form, this procedure is based on the use of the so-called mutual information criterion to assess relevance and redundancy. Keeping the focus on functional data problems, we propose here a modified version of the mRMR method, obtained by replacing the mutual information with the new association measure (called distance correlation) suggested by Székely et al. (2007). We have also performed an extensive simulation study, including 1600 functional experiments (100 functional models × 4 sample sizes × 4 classifiers) and three real-data examples, aimed at comparing the different versions of the mRMR methodology. The results are quite conclusive in favour of the newly proposed alternative. This research has been partially supported by Spanish grant MTM2010-1736.
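    In its common difference form, the mRMR criterion greedily adds the candidate feature maximising relevance I(f; y) minus its mean redundancy with the features already selected. A minimal sketch using plain mutual information on discrete data (the toy dataset and helper names are invented for illustration; the paper's functional-data variant would swap `mi` for distance correlation):

    ```python
    from collections import Counter
    import math

    def mi(xs, ys):
        """Empirical mutual information (nats) between two discrete sequences."""
        n = len(xs)
        joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
        return sum((c / n) * math.log(n * c / (px[x] * py[y]))
                   for (x, y), c in joint.items())

    def mrmr(features, y, k):
        """Greedy mRMR (difference form): repeatedly add the feature maximising
        relevance I(f; y) minus mean redundancy with the selected set."""
        selected, remaining = [], list(features)
        while remaining and len(selected) < k:
            def score(name):
                rel = mi(features[name], y)
                red = (sum(mi(features[name], features[s]) for s in selected)
                       / len(selected)) if selected else 0.0
                return rel - red
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy data: f2 duplicates f1 exactly; f3 is weaker but carries
    # information independent of f1.
    y = [0, 0, 0, 1, 1, 1]
    feats = {"f1": [0, 0, 1, 1, 1, 1],
             "f2": [0, 0, 1, 1, 1, 1],
             "f3": [0, 1, 0, 1, 0, 1]}
    chosen = mrmr(feats, y, 2)  # the redundant duplicate f2 is skipped for f3
    ```

    A purely relevance-ranked selection would pick f1 and f2; the redundancy penalty is what makes the duplicate lose to the less relevant but complementary f3.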

    Analysis and Implementation of minimum-Redundancy-Maximum-Relevance (mRMR) Feature Selection on Data Classification

    ABSTRACT: Data is one of the resources used to obtain information, but not all data can be exploited directly: data with a complex structure is hard to interpret. An example of such complex data is the microarray data used in this final project, which has a high-dimensional, multi-label structure. A feature selection process is therefore needed that reduces the size of the data while still yielding good classification accuracy, or at most a small drop in accuracy. This final project compares the Max-Relevance feature selection method using several attribute evaluators, namely GainRatioAttributeEval, InfoGainAttributeEval, Mutual Information, and SymmetricalUncertAttributeEval. These feature selection methods are applied as filters. Classification is performed with the Weka toolkit using Naïve Bayes, a technique based on counting the occurrences of attribute values. By comparing the feature selection methods, we can determine which method is more reliable at handling high-dimensional data, in particular microarray data. The evaluation measure compared is the accuracy of the classification process. Keywords: microarray, feature selection, classification, naïve bayes, accuracy

    Hopfield Networks in Relevance and Redundancy Feature Selection Applied to Classification of Biomedical High-Resolution Micro-CT Images

    We study filter-based feature selection methods for classification of biomedical images. For feature selection, we use two filters: a relevance filter, which measures the usefulness of individual features for target prediction, and a redundancy filter, which measures similarity between features. As a selection method that combines relevance and redundancy, we try out a Hopfield network. We experimentally compare selection methods, running unitary redundancy and relevance filters, against a greedy algorithm with redundancy thresholds [9], the min-redundancy max-relevance integration [8,23,36], and our Hopfield network selection. We conclude that, on the whole, Hopfield selection was one of the most successful methods, outperforming min-redundancy max-relevance when more features are selected.
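    The idea can be sketched as a binary Hopfield-style network: one unit per feature, with the unit's bias set to its relevance score and inhibitory weights set to pairwise redundancy, so that asynchronous updates descend an energy function E(s) = -Σᵢ rᵢsᵢ + Σᵢ<ⱼ dᵢⱼsᵢsⱼ that trades the two off. A hypothetical stdlib illustration (the scores, weights, and update schedule are invented for the example, not taken from the paper):

    ```python
    def hopfield_select(relevance, redundancy, sweeps=10):
        """Hopfield-style feature selection sketch: binary units, bias = relevance,
        symmetric inhibitory weight = pairwise redundancy. Asynchronous threshold
        updates descend E(s) = -sum_i r_i s_i + sum_{i<j} d_ij s_i s_j."""
        n = len(relevance)
        state = [0] * n  # start with no features selected
        for _ in range(sweeps):
            changed = False
            for i in range(n):
                # Local field: own relevance minus redundancy with active units.
                field = relevance[i] - sum(redundancy[i][j] * state[j]
                                           for j in range(n) if j != i)
                new = 1 if field > 0 else 0
                if new != state[i]:
                    state[i], changed = new, True
            if not changed:  # fixed point reached: energy can no longer decrease
                break
        return [i for i, s in enumerate(state) if s]

    # Toy scores: features 0 and 1 are near-duplicates, feature 2 is independent.
    rel = [0.9, 0.8, 0.5]
    red = [[0.0, 1.0, 0.1],
           [1.0, 0.0, 0.1],
           [0.1, 0.1, 0.0]]
    selected = hopfield_select(rel, red)  # → [0, 2]
    ```

    The strong mutual inhibition between the duplicate pair lets only the more relevant of the two stay active, while the weakly redundant feature 2 survives alongside it.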

    Feature Selection for Computer-Aided Polyp Detection using MRMR

    In building robust classifiers for computer-aided detection (CAD) of lesions, selection of relevant features is of fundamental importance. Typically one is interested in determining which, of a large number of potentially redundant or noisy features, are most discriminative for classification. Searching all possible subsets of features is computationally impractical. This paper proposes a feature selection scheme combining AdaBoost with the Minimum Redundancy Maximum Relevance (MRMR) criterion to focus on the most discriminative features. A fitness function is designed to determine the optimal number of features in a forward wrapper search. Bagging is applied to reduce the variance of the classifier and make a reliable selection. Experiments demonstrate that by selecting just 11 percent of the total features, the classifier can achieve better prediction on independent test data compared to the 70 percent of the total features selected by AdaBoost.