
    Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensemble: A Survey

    Data analysis plays a prominent role in interpreting various phenomena. Data mining is the process of extracting useful knowledge from large volumes of data. Based on classical statistical models, data can be exploited beyond mere storage and management. Cluster analysis, a primary investigation conducted with little or no prior knowledge, spans research and development across a wide variety of communities. Cluster ensembles combine individual solutions obtained from different clusterings to produce a final high-quality clustering, which is required in a wide range of applications. The approach arose from the need for greater robustness, scalability, and accuracy. This paper gives a brief overview of the generation methods and consensus functions used in cluster ensembles, and surveys the various techniques and cluster ensemble methods.
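    One widely used family of consensus functions (a generic illustration, not a method prescribed by this survey) accumulates evidence across the base clusterings in a co-association matrix and derives the final partition from it. The sketch below uses a simple threshold-plus-connected-components consensus; the function names and the threshold value are illustrative choices:

```python
import numpy as np

def co_association(labelings):
    """Co-association matrix: entry (i, j) is the fraction of base
    clusterings that place samples i and j in the same cluster."""
    labelings = np.asarray(labelings)           # shape: (n_clusterings, n_samples)
    n = labelings.shape[1]
    ca = np.zeros((n, n))
    for labels in labelings:
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / len(labelings)

def consensus_clusters(labelings, threshold=0.5):
    """Simple consensus: link samples whose co-association exceeds the
    threshold, then take connected components as the final clusters."""
    ca = co_association(labelings)
    n = ca.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] == -1:
            stack = [i]
            labels[i] = current
            while stack:                        # flood-fill one component
                j = stack.pop()
                for k in np.where(ca[j] > threshold)[0]:
                    if labels[k] == -1:
                        labels[k] = current
                        stack.append(k)
            current += 1
    return labels
```

    Note that the consensus is robust to label permutations across base clusterings: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` contribute identical co-association evidence.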

    Evolutionary feature creation for ensembles

    Evolutionary feature creation for ensembles concerns the generation of new attributes, based on evolutionary algorithms, that are useful for building classifiers and ensembles of classifiers (EoC). The new attributes are transformations of the original raw features into a different space of the same or smaller cardinality, so that the subsequent classification process is simpler to execute and provides better results. The feature creation process is aimed at generating ensembles from the built classifiers. Bot's method is based on Genetic Programming (GP) (Koza, 1992), which "builds the features" that define the classifier. GP is used because it has the ability to discover underlying data relationships and express them mathematically (Kishore et al., 2000), establishing the structure and values of the solution (Guo et al., 2005). As the evolution progresses, GP discards the raw features that are not useful for solving the problem. Thus, by applying genetic programming we perform feature construction and a form of feature selection at the same time. The method for generating the classifiers is based on one proposed in (Bot, 2001) and is called Bot's method here after its author. Bot's method uses GP and consists of creating one feature at a time, guiding the evolution of each new feature by the improvement in recognition rate it yields in conjunction with the features already evolved. Bot's method improves performance with each new evolved feature by adding diversity and avoiding over-fitting. We have improved Bot's method in two ways: adding a global validation procedure to control over-fitting, and combining it with the island model. The classifiers created by building features with GP are called evolved classifiers, and they are the elements of the ensembles to be generated.
    We chose the random subspaces method (Ho, 1998b) to generate ensembles and propose two strategies to create EoC. In the first, we combine the votes from each evolved classifier feature by feature. The performance obtained is slightly better than that of an ensemble of raw random subspaces; moreover, the performance level of an ensemble of raw random subspaces is matched after only a few features. As a result, we can build EoC that assure a given performance with the minimum number of evolved features, reducing the complexity of the ensemble without reducing performance. The second strategy is to create the ensembles by finding, for each base classifier, the maximum number of evolved features before over-fitting on the optimization data set. The base classifiers then have different numbers of evolved features, but each provides the best recognition rate while controlling over-fitting as much as possible. In this case too, we built ensembles with better performance than the ensemble of raw random subspaces. Furthermore, our results with 9 to 12 evolved classifiers are close to the best ensembles reported in (Tremblay, 2004), which use on the order of 30 base classifiers.
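    The random subspaces method (Ho, 1998b) used above to generate the ensembles can be sketched as follows. This is a generic minimal version, not the authors' implementation: each base learner is trained on a random subset of the features and predictions are combined by majority vote, with a nearest-centroid classifier standing in for the evolved GP classifiers:

```python
import numpy as np

class RandomSubspaceEnsemble:
    """Random subspace ensemble: each base learner sees only a random
    subset of the features; predictions are fused by majority vote.
    The base learner here is a nearest-centroid classifier for brevity."""

    def __init__(self, n_estimators=10, subspace_size=2, seed=0):
        self.n_estimators = n_estimators
        self.subspace_size = subspace_size
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.members_ = []                      # (feature indices, class centroids)
        for _ in range(self.n_estimators):
            feats = self.rng.choice(X.shape[1], self.subspace_size, replace=False)
            centroids = np.stack([X[y == c][:, feats].mean(axis=0)
                                  for c in self.classes_])
            self.members_.append((feats, centroids))
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        votes = np.zeros((X.shape[0], len(self.classes_)), dtype=int)
        for feats, centroids in self.members_:
            # distance of every sample to every class centroid in this subspace
            d = np.linalg.norm(X[:, None, feats] - centroids[None], axis=2)
            votes[np.arange(X.shape[0]), d.argmin(axis=1)] += 1
        return self.classes_[votes.argmax(axis=1)]
```

    The key design point is that diversity comes from the feature subsets rather than from resampling the training examples, which matches the feature-by-feature vote combination described above.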

    Toward a General-Purpose Heterogeneous Ensemble for Pattern Classification

    We perform an extensive study of the performance of different classification approaches on twenty-five datasets (fourteen image datasets and eleven UCI data mining datasets). The aim is to find General-Purpose (GP) heterogeneous ensembles (requiring little to no parameter tuning) that perform competitively across multiple datasets. The state-of-the-art classifiers examined in this study include the support vector machine, Gaussian process classifiers, random subspaces of AdaBoost, random subspaces of rotation boosting, and deep learning classifiers. We demonstrate that a heterogeneous ensemble based on the simple fusion by sum rule of different classifiers performs consistently well across all twenty-five datasets. The most important result of our investigation is demonstrating that some very recent approaches, including the heterogeneous ensemble we propose in this paper, are capable of outperforming an SVM classifier (implemented with LibSVM), even when both kernel selection and SVM parameters are carefully tuned for each dataset.
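    The sum-rule fusion mentioned above reduces to adding the per-class posterior estimates of the individual classifiers and taking the argmax. A minimal sketch (the variable names are illustrative; the abstract does not specify normalization details):

```python
import numpy as np

def sum_rule_fusion(probability_outputs):
    """Fuse heterogeneous classifiers by the sum rule: add their per-class
    probability estimates and pick the class with the largest total.
    Each element of `probability_outputs` has shape (n_samples, n_classes)."""
    stacked = np.stack([np.asarray(p, float) for p in probability_outputs])
    return stacked.sum(axis=0).argmax(axis=1)
```

    In practice each classifier's scores are usually normalized (e.g. to sum to one per sample) before fusion so that no single member dominates the sum.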

    Penerapan Ensemble Feature Selection dan Klasterisasi Fitur pada Klasifikasi Dokumen Teks (Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification)

    An ensemble method is an approach in which several classifiers are created from the training data; the ensemble is often more accurate than any of its single classifiers, especially if the base classifiers are accurate and different from each other. Meanwhile, feature clustering can reduce the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of text document preprocessing, feature subspace generation using genetic algorithm-based iterative refinement, implementation of base classifiers that apply feature clustering, and integration of the classification results of each base classifier using both the static selection and majority voting methods. Experimental results show that the computational time consumed in classifying the dataset into 2 and 3 categories using the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than without the feature selection method. Also, using the static selection method, the ensemble feature selection method with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier in classifying the dataset into 2 and 3 categories, respectively. Using the majority voting method for the same experiment, the ensemble method produces 10% and 12% better accuracy than the single classifier, respectively.
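    The two integration schemes named above, majority voting and static selection, can be sketched generically. This is an illustrative simplification, not the paper's implementation: here static selection keeps the single classifier that scored best on a validation set, whereas in general it may keep a subset:

```python
import numpy as np

def majority_vote(predictions):
    """Combine base-classifier label predictions by majority voting.
    `predictions` has shape (n_classifiers, n_samples)."""
    predictions = np.asarray(predictions)
    fused = []
    for column in predictions.T:                # one column per sample
        values, counts = np.unique(column, return_counts=True)
        fused.append(values[counts.argmax()])
    return np.array(fused)

def static_selection(predictions, validation_accuracies):
    """Static selection (simplest form): pick the classifier with the best
    validation accuracy, fixed once, and use its predictions directly."""
    best = int(np.argmax(validation_accuracies))
    return np.asarray(predictions)[best]
```

    The trade-off reported in the abstract (10% vs. 10-12% accuracy gains) reflects this difference: voting exploits diversity among all members, while static selection depends entirely on how well validation accuracy predicts test accuracy.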

    EapGAFS: Microarray Dataset for Ensemble Classification for Diseases Prediction

    Microarray data store the measured expression levels of thousands of genes simultaneously, which helps researchers gain insight into biological and prognostic information. Cancer is a deadly disease that develops over time and involves the uncontrolled division of body cells. In cancer, many genes are responsible for cell growth and division, but different kinds of cancer are caused by different sets of genes. To better understand, diagnose, and treat cancer, it is therefore essential to know which genes in the cancer cells are working abnormally. Advances in data mining, machine learning, soft computing, and pattern recognition have addressed the challenges researchers face in developing computationally effective models to identify new classes of disease and develop diagnostic or therapeutic targets. This paper proposes Ensemble Apriori Genetic Algorithm Feature Selection (EapGAFS) for microarray dataset classification. The proposed algorithm comprises a genetic algorithm implemented with Apriori learning for classifying microarray attributes. EapGAFS uses rule-set mining within the genetic algorithm for microarray dataset processing: through the framed rule set, the model extracts the attribute features in the dataset. Finally, the microarray datasets are classified with an ensemble classifier model. The performance of the proposed EapGAFS is compared with conventional classifiers on collected microarray datasets for breast cancer, hepatitis, diabetes, and BUPA. The comparative analysis shows that EapGAFS exhibits improved performance in microarray dataset classification, outperforming conventional classifiers such as AdaBoost and ensemble baselines by roughly 4-6%.
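    The genetic-algorithm feature-selection loop at the core of such methods can be sketched generically. This is not the authors' EapGAFS: the fitness function below is a hypothetical placeholder that in the paper's setting would score a mask by the quality of Apriori-derived rules (or classifier accuracy) on the selected genes:

```python
import numpy as np

def ga_feature_selection(n_features, fitness, n_pop=20, n_gen=30, p_mut=0.1, seed=0):
    """Minimal GA feature selection: individuals are binary masks over the
    features; truncation selection, uniform crossover, bit-flip mutation.
    `fitness(mask)` scores a candidate feature subset (in practice, e.g.
    validation accuracy of a classifier trained on the selected features)."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(n_pop, n_features))
    for _ in range(n_gen):
        scores = np.array([fitness(mask) for mask in pop])
        parents = pop[scores.argsort()[::-1]][: n_pop // 2]   # keep the best half
        children = []
        while len(children) < n_pop - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cross = rng.integers(0, 2, n_features).astype(bool)
            child = np.where(cross, a, b)                      # uniform crossover
            flip = rng.random(n_features) < p_mut              # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.vstack([parents] + children)                  # elitist survivors
    scores = np.array([fitness(mask) for mask in pop])
    return pop[scores.argmax()]
```

    Because the best half of each generation survives unchanged, the best mask found so far is never lost, which is what makes such loops usable with noisy fitness signals like cross-validated accuracy on small microarray cohorts.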