
    An Enhanced Random Linear Oracle Ensemble Method using Feature Selection Approach based on Naïve Bayes Classifier

    The Random Linear Oracle (RLO) ensemble replaces each classifier with a mini-ensemble of two classifiers, allowing the base classifiers to be trained on different data sets and improving the diversity of the trained classifiers. The Naïve Bayes (NB) classifier was chosen as the base classifier for this research due to its simplicity and low computational cost. Different feature selection algorithms were applied to the RLO ensemble to investigate the effect of differently sized feature sets on its performance. Experiments were carried out using 30 data sets from the UCI repository and six learning algorithms, namely the NB classifier, the RLO ensemble, the RLO ensemble trained with Genetic Algorithm (GA) feature selection using the accuracy of the NB classifier as the fitness function, the RLO ensemble trained with GA feature selection using the accuracy of the RLO ensemble as the fitness function, the RLO ensemble trained with t-test feature selection, and the RLO ensemble trained with Kruskal-Wallis test feature selection. The results showed that the RLO ensemble can significantly improve the diversity of the NB classifier in dealing with distinctively selected feature sets through its fusion-selection paradigm. Consequently, feature selection algorithms can greatly benefit the RLO ensemble: with a properly selected number of features from the filter approach, or GA natural selection from the wrapper approach, it achieved substantial improvements in classification accuracy as well as growth in diversity.
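    To make the RLO mechanism concrete, here is a minimal sketch of an RLO ensemble with Naïve Bayes base classifiers, assuming scikit-learn, a numeric feature matrix, and integer class labels; the class name and parameters are illustrative, and the GA/filter feature selection wrappers evaluated in the paper are omitted.

```python
# Minimal Random Linear Oracle (RLO) ensemble sketch with Naive Bayes
# base classifiers. Assumes numeric features and integer class labels.
import numpy as np
from sklearn.naive_bayes import GaussianNB

class RLOEnsemble:
    def __init__(self, n_members=10, random_state=0):
        self.n_members = n_members
        self.rng = np.random.default_rng(random_state)
        self.members = []  # (w, b, clf_pos, clf_neg) per ensemble member

    def fit(self, X, y):
        for _ in range(self.n_members):
            # Random linear oracle: the perpendicular bisector of two
            # randomly drawn training points splits the data in two.
            i, j = self.rng.choice(len(X), size=2, replace=False)
            w = X[i] - X[j]
            b = -w @ (X[i] + X[j]) / 2.0
            side = X @ w + b >= 0
            clfs = []
            for mask in (side, ~side):
                # Fall back to the full data if a side is degenerate.
                if len(np.unique(y[mask])) < 2:
                    clfs.append(GaussianNB().fit(X, y))
                else:
                    clfs.append(GaussianNB().fit(X[mask], y[mask]))
            self.members.append((w, b, clfs[0], clfs[1]))
        return self

    def predict(self, X):
        all_votes = []
        for w, b, clf_pos, clf_neg in self.members:
            side = X @ w + b >= 0
            # Route each instance to the classifier on its side of the plane.
            pred = np.where(side, clf_pos.predict(X), clf_neg.predict(X))
            all_votes.append(pred)
        all_votes = np.asarray(all_votes, dtype=int)
        # Plurality vote over the members (integer labels assumed).
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(),
                                   axis=0, arr=all_votes)
```

    Each member draws a random hyperplane, trains one NB model per side, and routes each test instance to the model on its own side before the ensemble takes a plurality vote; this routing is what lets the two mini-classifiers specialize on different regions of the data.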

    Exploiting the ensemble paradigm for stable feature selection: A case study on high-dimensional genomic data

    Ensemble classification is a well-established approach that involves fusing the decisions of multiple predictive models. A similar “ensemble logic” has recently been applied to challenging feature selection tasks aimed at identifying the most informative variables (or features) for a given domain of interest. In this work, we discuss the rationale of ensemble feature selection and evaluate the effects and implications of a specific ensemble approach, namely the data perturbation strategy. It consists of combining multiple selectors that exploit the same core algorithm but are trained on different perturbed versions of the original data. The real potential of this approach, still an object of debate in the feature selection literature, is here investigated in conjunction with different kinds of core selection algorithms (both univariate and multivariate). In particular, we evaluate the extent to which the ensemble implementation improves the overall performance of the selection process, in terms of predictive accuracy and stability (i.e., robustness with respect to changes in the training data). Furthermore, we measure the impact of the ensemble approach on the final selection outcome, i.e. the composition of the selected feature subsets. The results obtained on ten public genomic benchmarks provide useful insight into both the benefits and the limitations of such an ensemble approach, paving the way to the exploration of new and wider ensemble schemes.
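    To make the data perturbation strategy concrete, here is a minimal sketch assuming scikit-learn and SciPy: the same univariate core selector (the ANOVA F-score as a stand-in) is run on bootstrap-perturbed copies of the data and the per-run rankings are aggregated by mean rank. The function name and parameters are illustrative.

```python
# Data-perturbation ensemble feature selection: one core selector,
# many bootstrap samples, mean-rank aggregation of the results.
import numpy as np
from scipy.stats import rankdata
from sklearn.feature_selection import f_classif

def perturbation_ensemble_select(X, y, n_runs=50, k=20, random_state=0):
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    ranks = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap perturbation
        scores, _ = f_classif(X[idx], y[idx])       # core selection algorithm
        ranks.append(rankdata(-scores))             # rank 1 = most informative
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:k]                # indices of the top-k features
```

    Stability can then be estimated by comparing the subsets returned across different training splits, which is exactly the robustness criterion the abstract refers to.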

    Analysis and Implementation of Genetic Algorithm-Sequential Ensemble Feature Selection (GA-SEFS) for Ensemble Feature Selection

    ABSTRACT: Present-day data can have a large number of features. The many features that a single object instance carries are not necessarily the relevant information a data mining system needs. Feature selection is the process of choosing a subset of relevant features/attributes according to certain criteria. Feature selection reduces the number of irrelevant features, eliminates data redundancy, and improves learning accuracy. Classification is one of the stages of data mining, whose function is to predict the class membership of data. Several studies have shown that an ensemble (set) of classifiers is generally more accurate than a single classifier. One way to build an ensemble is to choose several different feature subsets from the original dataset and train a classifier on each subset. This approach is known as ensemble feature selection. Here, the author implements a genetic algorithm to optimize feature selection in forming the ensemble, namely Genetic Algorithm-Sequential Ensemble Feature Selection (GA-SEFS). Conventional feature selection algorithms aim to find the single best feature subset, whereas ensemble feature selection aims to find the best set of feature subsets that improves classification accuracy. GA-SEFS has six important parameters. The population size, number of generations, and number of offspring do not directly affect the accuracy of the resulting classification ensemble. The ensemble-size parameter can help increase accuracy, because votes over diverse feature subsets improve accuracy. The alpha parameter can further raise the accuracy obtained by the combination of the four parameters above (ensemble size, population size, number of generations, and number of offspring). For the three different datasets in this final-project experiment, the beta parameter yielded higher accuracy at negative beta values. Keywords: feature subset selection, ensemble, genetic search
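    As a rough illustration of the genetic search at the core of this approach, here is a minimal sketch of a GA over binary feature masks with cross-validated classifier accuracy as the fitness function, assuming scikit-learn. GA-SEFS itself builds the ensemble sequentially and folds diversity into the fitness via the alpha and beta parameters discussed above; this sketch shows only the basic selection/crossover/mutation loop, with illustrative names and GaussianNB standing in for an arbitrary base learner.

```python
# Genetic search over binary feature masks, fitness = CV accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def ga_feature_search(X, y, pop_size=20, n_gen=30, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5      # random initial masks

    def fitness(mask):
        if not mask.any():
            return 0.0                              # empty subset is useless
        return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

    for _ in range(n_gen):
        fit = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(fit)[::-1][: pop_size // 2]]  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.choice(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                 # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child = child ^ (rng.random(n_feat) < p_mut)  # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)
    best = max(pop, key=fitness)
    return np.flatnonzero(best)                     # indices of selected features
```

    Running several such searches and keeping each run's best mask would yield the set of diverse feature subsets that the ensemble then votes over.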

    Minimalist Ensemble Algorithms for Genome-Wide Protein Localization Prediction

    Background: Computational prediction of protein subcellular localization can greatly help to elucidate protein function. Despite the existence of dozens of protein localization prediction algorithms, prediction accuracy and coverage remain low. Several ensemble algorithms have been proposed to improve prediction performance, usually including as many as 10 or more individual localization algorithms. However, their performance is still limited by running complexity and redundancy among the individual prediction algorithms.
    Results: This paper proposes a novel method for the rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm combines a feature-selection-based filter with a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets, Yeast and Human, and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of the individual predictors used by current ensemble algorithms, greatly reducing computational complexity and running time. We found that high-performance ensemble algorithms are usually composed of predictors that together cover most of the available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from an AUC score of 0.558 to 0.707 on the Yeast dataset and from 0.628 to 0.646 on the Human dataset. Compared with popular weighted-voting-based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from the inclusion of too many individual predictors.
    Conclusions: We proposed a method for the rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of the individual predictors required by other ensemble algorithms. The results also suggest that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi
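    The classifier-based combination step is simple enough to sketch. Assuming scikit-learn and a precomputed matrix of individual predictor outputs (one column per predictor), a filter keeps the k most informative predictors and a logistic regression is trained on the reduced set; the names and the choice of mutual information as the filter are illustrative, not the paper's exact pipeline.

```python
# Minimalist ensemble: select a small subset of individual predictors,
# then combine them with a logistic regression meta-classifier.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def minimalist_lr_ensemble(pred_scores, labels, k=5):
    # pred_scores: (n_proteins, n_predictors) outputs of individual
    # localization predictors; labels: true localization classes.
    model = make_pipeline(
        SelectKBest(mutual_info_classif, k=k),   # drop redundant predictors
        LogisticRegression(max_iter=1000),       # meta-classifier
    )
    return model.fit(pred_scores, labels)
```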

    Multiple Relevant Feature Ensemble Selection Based on Multilayer Co-Evolutionary Consensus MapReduce

    Although feature selection for large data has been intensively investigated in data mining, machine learning, and pattern recognition, the challenge is not just to invent new algorithms to handle noisy and uncertain large data in applications, but rather to link multiple relevant feature sources, structured or unstructured, to develop an effective feature reduction method. In this paper, we propose a multiple relevant feature ensemble selection (MRFES) algorithm based on multilayer co-evolutionary consensus MapReduce (MCCM). We construct an effective MCCM model to handle feature ensemble selection on large-scale datasets with multiple relevant feature sources, and explore the unified consistency aggregation between the local solutions and the global dominance solutions achieved by the co-evolutionary memeplexes, which participate in the cooperative feature ensemble selection process. The model attempts to reach a mutual decision agreement among the co-evolutionary memeplexes, which calls for mechanisms to detect non-cooperative co-evolutionary behaviors and achieve better Nash equilibrium resolutions. Extensive experimental comparative studies substantiate the effectiveness of MRFES in solving large-scale dataset problems with complex noise and multiple relevant feature sources on several well-known benchmark datasets. The algorithm greatly facilitates the selection of relevant feature subsets from the original feature space with better accuracy, efficiency, and interpretability. Moreover, we apply MRFES to human cerebral cortex-based classification prediction. Such applications are expected to significantly scale up classification prediction for large-scale and complex brain data in terms of efficiency and feasibility.
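    The full MCCM machinery (MapReduce layers, co-evolving memeplexes, Nash equilibrium checks) is well beyond a short example, but the final consensus step can be sketched under strong simplifying assumptions: independent workers each return a local feature subset, and the global subset keeps the features that a minimum fraction of workers agree on. Everything here is an illustrative stand-in, not the MRFES algorithm itself.

```python
# Consensus aggregation of local feature subsets by vote threshold.
import numpy as np

def consensus_subset(local_subsets, n_features, min_agreement=0.5):
    votes = np.zeros(n_features)
    for subset in local_subsets:            # "reduce" over the workers' outputs
        votes[list(subset)] += 1
    threshold = min_agreement * len(local_subsets)
    return np.flatnonzero(votes >= threshold)   # features most agreed upon
```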

    Heuristic ensembles of filters for accurate and reliable feature selection

    Feature selection has become increasingly important in data mining in recent years. However, the accuracy and stability of feature selection methods vary considerably when used individually, and no rule exists to indicate which one should be used for a particular dataset. Thus, an ensemble method that combines the outputs of several individual feature selection methods appears to be a promising way to address the issue, and it is investigated in this research. This research aims to develop an effective ensemble that can improve the accuracy and stability of feature selection. We propose a novel heuristic ensemble of filters (HEF). It combines two types of filters, subset filters and ranking filters, with a heuristic consensus algorithm in order to utilise the strength of each type. The ensemble is tested on ten benchmark datasets and its performance is evaluated by two stability measures and three classifiers. The experimental results demonstrate that HEF improves the stability and accuracy of the selected features and in most cases outperforms the other ensemble algorithms, the individual filters, and the full feature set. The research on the HEF algorithm is extended in several dimensions, including more filter members, three novel schemes of mean rank aggregation with partial lists, and three novel schemes for a weighted heuristic ensemble of filters. However, the experimental results demonstrate that adding weights to the filters in HEF does not achieve the expected improvement in accuracy, but increases time and space complexity and clearly decreases stability. Therefore, the core ensemble algorithm (HEF) is demonstrated to be not just simpler but also more reliable and consistent than the later, more complicated, weighted ensembles. In addition, we investigated how to use data in feature selection, using ALL or PART of it. Systematic experiments with thirty-five synthetic and benchmark real-world datasets were carried out.
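    One ingredient of HEF that is easy to illustrate is mean rank aggregation over several ranking filters, sketched below with scikit-learn scorers; the paper's heuristic consensus also incorporates subset filters, which this sketch omits, and the filter choices here are illustrative.

```python
# Mean rank aggregation over three ranking filters.
import numpy as np
from scipy.stats import rankdata
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def mean_rank_features(X, y, k=10):
    scorers = [
        lambda X, y: f_classif(X, y)[0],   # ANOVA F-score filter
        lambda X, y: chi2(X, y)[0],        # chi-squared filter (X must be non-negative)
        mutual_info_classif,               # mutual information filter
    ]
    ranks = [rankdata(-scorer(X, y)) for scorer in scorers]  # rank 1 = best
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:k]       # indices of the top-k features
```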