95,002 research outputs found

    A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

    The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance. Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013 International Conference on Data Minin

    A SimRank based Ensemble Method for Resolving Challenges of Partition Clustering Methods

    pp. 323–327. Traditional clustering techniques alone cannot resolve all the challenges of partition-based clustering methods. In partition-based clustering, particularly in variants of K-means, the selection of initial cluster centres is crucial. The final clusters depend entirely on the initial cluster centres; hence, this step is regarded as the most significant in the entire clustering operation. Random selection of initial cluster centres is unstable, since each run of the algorithm yields different cluster centres. Ensemble-based clustering methods resolve these challenges of partition-based methods. Clustering ensembles join several partitions generated by different clustering algorithms into a single clustering solution. The proposed ensemble methodology resolves the initial-centroid problem and improves the efficiency of the cluster results. The method selects centroids using an overall mean distance measure. The SimRank based similarity matrix find that the bipartite graph helps to ensemble
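
The idea of joining several base partitions into a single clustering solution can be illustrated with a minimal sketch. This is not the SimRank-based method of the abstract: it swaps in a simple co-association (evidence accumulation) scheme, and the function names, the 0.5 threshold, and the toy partitions are all assumptions made for illustration.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[c / len(partitions) for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Link points whose co-association exceeds the threshold (an assumed
    value) and return the connected components as consensus labels."""
    matrix = co_association(partitions)
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three hypothetical base partitions of six points; label values are arbitrary
# and need not agree between partitions.
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [2, 2, 2, 0, 0, 1],
]
consensus_labels = consensus(base_partitions)
print(consensus_labels)
```

Because the consensus depends only on how often two points are grouped together, it is insensitive to the arbitrary label names that each base run produces, which is one reason ensembles can smooth over unstable initial centres.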

    Toward Unbiased Galaxy Cluster Masses from Line of Sight Velocity Dispersions

    We study the use of red sequence selected galaxy spectroscopy for unbiased estimation of galaxy cluster masses. We use the publicly available galaxy catalog produced using the semi-analytic model of De Lucia & Blaizot (2007) on the Millennium Simulation (Springel et al. 2005). We explore the impacts on selection using galaxy color, projected separation from the cluster center, and galaxy luminosity. We study the relationship between cluster mass and velocity dispersion and identify and characterize the following sources of bias and scatter: halo triaxiality, dynamical friction of red luminous galaxies and interlopers. We show that due to halo triaxiality the intrinsic scatter of estimated line of sight dynamical mass is about three times larger (30-40%) than the one estimated using the 3D velocity dispersion (~12%) and a small bias (~1%) is induced. We find evidence of increasing scatter as a function of redshift and provide a fitting formula to account for it. We characterize the amount of bias and scatter introduced by dynamical friction when using subsamples of red-luminous galaxies to estimate the velocity dispersion. We study the presence of interlopers in spectroscopic samples and their effect on the estimated cluster dynamical mass. Our results show that while cluster velocity dispersions extracted from a few dozen red sequence selected galaxies do not provide precise masses on a single cluster basis, an ensemble of cluster velocity dispersions can be combined to produce a precise calibration of a cluster survey mass observable relation. Currently, disagreements in the literature on simulated subhalo velocity dispersion mass relations place a systematic floor on velocity dispersion mass calibration at the 15% level in mass. We show that the selection related uncertainties are small by comparison, providing hope that with further improvements this systematic floor can be reduced. Comment: submitted to Ap

    Towards Multiple-Star Population Synthesis

    The multiplicities of stars, and some other properties, were collected recently by Eggleton & Tokovinin, for the set of 4559 stars with Hipparcos magnitude brighter than 6.0 (4558 excluding the Sun). In this paper I give a numerical recipe for constructing, by a Monte Carlo technique, a theoretical ensemble of multiple stars that resembles the observed sample. Only multiplicities up to 8 are allowed; the observed set contains only multiplicities up to 7. In addition, recipes are suggested for dealing with the selection effects and observational uncertainties that attend the determination of multiplicity. These recipes imply, for example, that to achieve the observed average multiplicity of 1.53, it would be necessary to suppose that the real population has an average multiplicity slightly over 2.0. This numerical model may be useful for (a) comparison with the results of star and star cluster formation theory, (b) population synthesis that does not ignore multiplicity above 2, and (c) initial conditions for dynamical cluster simulations

    Ensemble Generation Methods and Cluster Ensemble Selection with Constraints

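    Cluster ensemble first generates a large library of different clustering solutions and then combines them into a more accurate consensus clustering. It is commonly accepted that for cluster ensemble to work well the member partitions should be different from each other, and meanwhile the quality of each partition should remain at an acceptable level. Many algorithms can be used to generate different base partitions. As with classifier ensembles, many studies focus on how the different members are generated, for example by clustering different subsets of the data (random sampling) or different subsets of the features (random projection). However, few studies have compared these two sampling approaches in terms of quality and diversity. In this thesis, we propose a new member-generation method based on random sampling, which fills in the class information of samples missed during sampling by finding their nearest-neighbour samples (abbreviated RS-NN). By comparison with traditional K-means based clus... Degree: Master of Engineering. Department and major: School of Information Science and Technology, Pattern Recognition and Intelligent Systems. Student ID: 2322011115323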

    Penerapan Ensemble Feature Selection dan Klasterisasi Fitur pada Klasifikasi Dokumen Teks [Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification]

    An ensemble method is an approach in which several classifiers are created from the training data; the ensemble is often more accurate than any single classifier, especially if the base classifiers are accurate and different from one another. Meanwhile, feature clustering can reduce the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of text document preprocessing, feature subspace generation using genetic algorithm-based iterative refinement, implementation of base classifiers by applying feature clustering, and integration of the classification results of each base classifier using both the static selection and majority voting methods. Experimental results show that classifying the dataset into 2 and 3 categories using the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than without the feature selection method. Also, using the static selection method, the ensemble feature selection method with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier in classifying the dataset into 2 and 3 categories, respectively. Using the majority voting method for the same experiment, the same ensemble method produces 10% and 12% better accuracy than the single classifier, respectively
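
The majority voting step named in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `majority_vote`, the category labels, and the three hypothetical base classifiers are invented for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per document.

    predictions[k][i] is the label that base classifier k assigns to
    document i. With an odd number of classifiers and two categories,
    as in the toy data below, no ties can occur.
    """
    combined = []
    for votes in zip(*predictions):
        # Pick the label that the most classifiers voted for.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on four documents.
base_predictions = [
    ["sport", "politics", "sport", "sport"],
    ["sport", "sport", "politics", "sport"],
    ["politics", "politics", "politics", "sport"],
]
ensemble_labels = majority_vote(base_predictions)
print(ensemble_labels)
```

Each base classifier here errs on a different document, so the vote recovers a label that no single classifier is guaranteed to produce on its own; this is why the abstract stresses that base classifiers should be both accurate and different from one another.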

    Statistical Thermodynamics of Clustered Populations

    We present a thermodynamic theory for a generic population of M individuals distributed into N groups (clusters). We construct the ensemble of all distributions with fixed M and N, introduce a selection functional that embodies the physics that governs the population, and obtain the distribution that emerges in the scaling limit as the most probable among all distributions consistent with the given physics. We develop the thermodynamics of the ensemble and establish a rigorous mapping to thermodynamics. We treat the emergence of a so-called "giant component" as a formal phase transition and show that the criteria for its emergence are entirely analogous to the equilibrium conditions in molecular systems. We demonstrate the theory by an analytic model and confirm the predictions by Monte Carlo simulation. Comment: Minor edits to tex