A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics
The combination of multiple classifiers using ensemble methods is
increasingly important for making progress in a variety of difficult prediction
problems. We present a comparative analysis of several ensemble methods through
two case studies in genomics, namely the prediction of genetic interactions and
protein functions, to demonstrate their efficacy on real-world datasets and
draw useful conclusions about their behavior. These methods include simple
aggregation, meta-learning, cluster-based meta-learning, and ensemble selection
using heterogeneous classifiers trained on resampled data to improve the
diversity of their predictions. We present a detailed analysis of these methods
across 4 genomics datasets and find the best of these methods offer
statistically significant improvements over the state of the art in their
respective domains. In addition, we establish a novel connection between
ensemble selection and meta-learning, demonstrating how both of these disparate
methods establish a balance between ensemble diversity and performance.Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013
International Conference on Data Minin
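The aggregation and selection strategies named above can be illustrated with a minimal sketch (not the paper's code; the toy probabilities, greedy rounds, and the 0.5 decision threshold are illustrative assumptions): simple aggregation averages base-classifier probabilities, while ensemble selection greedily adds, with replacement, the member whose inclusion most improves accuracy on a hillclimb set.

```python
def simple_aggregation(prob_lists):
    """Average the base classifiers' predicted probabilities (unweighted)."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]

def ensemble_selection(prob_lists, labels, rounds=3):
    """Greedy forward selection with replacement (Caruana-style sketch):
    at each round, add the base classifier that maximises the accuracy of
    the averaged ensemble on the hillclimb labels."""
    def accuracy(probs):
        return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)
    chosen = []
    for _ in range(rounds):
        best = max(
            range(len(prob_lists)),
            key=lambda i: accuracy(
                simple_aggregation([prob_lists[j] for j in chosen] + [prob_lists[i]])
            ),
        )
        chosen.append(best)
    return chosen, simple_aggregation([prob_lists[j] for j in chosen])
```

Selection with replacement lets a strong member be weighted more heavily, which is one concrete way an ensemble trades diversity against individual performance.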
A SimRank based Ensemble Method for Resolving Challenges of Partition Clustering Methods
Traditional clustering techniques alone cannot resolve all challenges of partition-based clustering methods. In partition-based clustering, particularly in variants of K-means, initial cluster-centre selection is a significant and crucial step. The final clusters depend entirely on the initial cluster centres; hence, this step is the most significant in the entire clustering operation. Random selection of initial cluster centres is unstable, since different centre points are obtained on each run of the algorithm. Ensemble-based clustering methods resolve these challenges of partition-based methods: a clustering ensemble joins several partitions generated by different clustering algorithms into a single clustering solution. The proposed ensemble methodology resolves the initial-centroid problem and improves the quality of the clustering results. The method selects centroids through an overall mean distance measure, and a SimRank-based similarity matrix over a bipartite graph is used to combine the partitions into the ensemble
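One common way to join several partitions into a single solution, sketched here with a co-association matrix and union-find grouping (an illustrative consensus scheme, not the SimRank/bipartite-graph method this abstract proposes):

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of points co-clusters."""
    n = len(partitions[0])
    m = len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]

def consensus_clusters(partitions, threshold=0.5):
    """Single-link style grouping: points whose co-association exceeds the
    threshold end up in the same consensus cluster (via union-find)."""
    n = len(partitions[0])
    ca = co_association(partitions)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if ca[i][j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Pairs that most base partitions agree on stay together, so the consensus is insensitive to any single unstable centre initialisation.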
Toward Unbiased Galaxy Cluster Masses from Line of Sight Velocity Dispersions
We study the use of red sequence selected galaxy spectroscopy for unbiased
estimation of galaxy cluster masses. We use the publicly available galaxy
catalog produced using the semi-analytic model of De Lucia & Blaizot (2007) on
the Millennium Simulation (Springel et al. 2005). We explore the impact of
selecting galaxies by color, projected separation from the cluster center, and
luminosity. We study the relationship between cluster mass and velocity
dispersion and identify and characterize the following sources of bias and
scatter: halo triaxiality, dynamical friction of red luminous galaxies and
interlopers. We show that due to halo triaxiality the intrinsic scatter of
estimated line of sight dynamical mass is about three times larger (30-40%)
than the one estimated using the 3D velocity dispersion (~12%) and a small bias
(~1%) is induced. We find evidence of increasing scatter as a function of
redshift and provide a fitting formula to account for it. We characterize the
amount of bias and scatter introduced by dynamical friction when using
subsamples of red-luminous galaxies to estimate the velocity dispersion. We
study the presence of interlopers in spectroscopic samples and their effect on
the estimated cluster dynamical mass. Our results show that while cluster
velocity dispersions extracted from a few dozen red sequence selected galaxies
do not provide precise masses on a single cluster basis, an ensemble of cluster
velocity dispersions can be combined to produce a precise calibration of a
cluster survey mass observable relation. Currently, disagreements in the
literature on simulated subhalo velocity dispersion mass relations place a
systematic floor on velocity dispersion mass calibration at the 15% level in
mass. We show that the selection related uncertainties are small by comparison,
providing hope that with further improvements this systematic floor can be
reduced.
Comment: submitted to Ap
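The mass-observable calibration above rests on inverting a velocity-dispersion-to-mass scaling. A minimal sketch, assuming an Evrard et al. (2008)-style relation sigma = sigma_15 * (h(z) M / 1e15 Msun)^alpha with h(z) = 1 and illustrative parameter values (not this paper's fit):

```python
import math

def los_velocity_dispersion(velocities):
    """Plain rms line-of-sight dispersion about the mean (km/s); real
    analyses often use robust estimators (biweight, gapper) instead."""
    n = len(velocities)
    mean = sum(velocities) / n
    return math.sqrt(sum((v - mean) ** 2 for v in velocities) / (n - 1))

def dynamical_mass(sigma_los, sigma_15=1082.9, alpha=0.3361):
    """Invert sigma = sigma_15 * (h(z) M / 1e15 Msun)^alpha for the mass,
    in units of 1e15 Msun (h(z) = 1 assumed for illustration)."""
    return (sigma_los / sigma_15) ** (1.0 / alpha)
```

Because M scales roughly as sigma^3, a 10% dispersion error triples into a ~30% mass error, which is why single-cluster masses are imprecise while an ensemble of dispersions can still calibrate the mean relation.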
Towards Multiple-Star Population Synthesis
The multiplicities of stars, and some other properties, were collected
recently by Eggleton & Tokovinin, for the set of 4559 stars with Hipparcos
magnitude brighter than 6.0 (4558 excluding the Sun). In this paper I give a
numerical recipe for constructing, by a Monte Carlo technique, a theoretical
ensemble of multiple stars that resembles the observed sample. Only
multiplicities up to 8 are allowed; the observed set contains only
multiplicities up to 7. In addition, recipes are suggested for dealing with the
selection effects and observational uncertainties that attend the determination
of multiplicity. These recipes imply, for example, that to achieve the observed
average multiplicity of 1.53, it would be necessary to suppose that the real
population has an average multiplicity slightly over 2.0.
This numerical model may be useful for (a) comparison with the results of
star and star cluster formation theory, (b) population synthesis that does not
ignore multiplicity above 2, and (c) initial conditions for dynamical cluster
simulations.
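The flavor of such a Monte Carlo recipe, and the gap between an observed mean multiplicity near 1.53 and a true one slightly above 2.0, can be sketched with a hypothetical truncated-geometric multiplicity distribution (an illustrative stand-in, not Eggleton's actual recipe):

```python
import random

def draw_multiplicity(p=0.5, max_mult=8, rng=None):
    """Draw one system's multiplicity: each additional companion is kept
    with probability p, truncated at max_mult stars (the paper caps the
    theoretical ensemble at multiplicity 8)."""
    rng = rng or random
    m = 1
    while m < max_mult and rng.random() < p:
        m += 1
    return m

def mean_multiplicity(n_systems=100000, seed=42, **kw):
    """Average multiplicity of a Monte Carlo ensemble of systems."""
    rng = random.Random(seed)
    draws = [draw_multiplicity(rng=rng, **kw) for _ in range(n_systems)]
    return sum(draws) / len(draws)
```

With p = 0.5 the intrinsic mean multiplicity sits just under 2.0; applying selection effects that hide faint companions would lower the *observed* mean toward the 1.53 quoted above.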
Ensemble Generation Methods and Cluster Ensemble Selection with Constraints
A cluster ensemble first generates a library of different clustering solutions (ensemble members) and then combines them into a more accurate consensus partition. It is commonly accepted that for a cluster ensemble to work well, the member partitions should differ from each other while the quality of each partition remains at an acceptable level. Many algorithms can be used to generate different base partitions; as in classification ensembles, much research has focused on member generation, e.g., clustering different data subsets (random sampling) and clustering different feature subsets (random projection). However, few studies have compared the quality and diversity achieved by these two sampling approaches. In this thesis, we propose a new random-sampling-based method for generating ensemble members that fills in the cluster labels of samples missing from a subsample by finding their nearest-neighbour samples (RS-NN for short). We compare it with traditional K-means-based clu… Degree: Master of Engineering. Department: School of Information Science and Technology, Pattern Recognition and Intelligent Systems. Student ID: 2322011115323
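The RS-NN idea, cluster a random subsample and then fill in labels for the left-out points from their nearest sampled neighbours, might be sketched as follows (1-D toy data; `cluster_fn`, `sample_frac`, and the squared-distance metric are illustrative assumptions, not the thesis implementation):

```python
import random

def rs_nn_partition(data, cluster_fn, sample_frac=0.7, seed=0):
    """RS-NN-style ensemble-member generation (sketch): cluster a random
    subsample, then give each held-out point the label of its nearest
    sampled point. `data` is a list of 1-D values; `cluster_fn` maps a
    list of values to a list of cluster labels."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    k = max(1, int(sample_frac * len(data)))
    sampled, held_out = idx[:k], idx[k:]
    labels_sub = cluster_fn([data[i] for i in sampled])
    labels = [None] * len(data)
    for pos, i in enumerate(sampled):
        labels[i] = labels_sub[pos]
    for i in held_out:
        nearest = min(sampled, key=lambda j: (data[i] - data[j]) ** 2)
        labels[i] = labels[nearest]
    return labels
```

Running this with different seeds yields full partitions of the whole dataset that differ from one another, which is exactly the diversity a cluster ensemble needs.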
Penerapan Ensemble Feature Selection dan Klasterisasi Fitur pada Klasifikasi Dokumen Teks (Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification)
An ensemble method is an approach in which several classifiers are built from the training data; the ensemble can often be more accurate than any single classifier, especially if the base classifiers are accurate and different from one another. Meanwhile, feature clustering can reduce the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of preprocessing the text documents, generating feature subspaces using genetic algorithm-based iterative refinement, implementing base classifiers that apply feature clustering, and integrating the classification results of the base classifiers using both static selection and majority voting. Experimental results show that the computational time consumed in classifying the dataset into 2 and 3 categories using the feature clustering method is 1.18 and 27.04 seconds faster, respectively, compared to not employing the feature selection method. Also, with static selection, the ensemble feature selection method with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier in classifying the dataset into 2 and 3 categories, respectively. With majority voting on the same experiment, the ensemble method produces 10% and 12% better accuracy than the single classifier, respectively
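The two integration schemes used above, static selection and majority voting, can be sketched minimally (hypothetical helper names; ties in the vote fall to the first-seen label, and the paper's static selection may pick a subset rather than a single classifier):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label predictions column-wise by majority
    vote. `predictions` is a list of equal-length label lists, one per
    base classifier."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

def static_selection(predictions, val_true):
    """Static selection (minimal form): keep the prediction list of the
    single base classifier with the best validation accuracy."""
    def acc(p):
        return sum(a == b for a, b in zip(p, val_true)) / len(val_true)
    return max(predictions, key=acc)
```

Static selection commits to members once on a validation set, while majority voting consults every member at prediction time; the accuracy gaps reported above compare both against a single classifier.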
Statistical Thermodynamics of Clustered Populations
We present a thermodynamic theory for a generic population of individuals
distributed into groups (clusters). We construct the ensemble of all
distributions with a fixed number of individuals and a fixed number of
clusters, introduce a selection functional that
embodies the physics that governs the population, and obtain the distribution
that emerges in the scaling limit as the most probable among all distributions
consistent with the given physics. We develop the thermodynamics of the
ensemble and establish a rigorous mapping to thermodynamics. We treat the
emergence of a so-called "giant component" as a formal phase transition and
show that the criteria for its emergence are entirely analogous to the
equilibrium conditions in molecular systems. We demonstrate the theory by an
analytic model and confirm the predictions by Monte Carlo simulation.
Comment: Minor edits to text
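The "most probable distribution under a selection functional" picture can be illustrated with a small Metropolis sketch over compositions of M members into N clusters (an illustrative toy that ignores the ensemble's multinomial multiplicity factors; a convex weight such as w(n) = n! mimics the rich-get-richer bias behind a giant component):

```python
import math
import random

def sample_partitions(M=20, N=5, steps=5000, w=lambda n: 1.0, seed=1):
    """Metropolis walk over compositions (n_1..n_N) of M members in N
    clusters. A move shifts one member between clusters and is accepted
    with probability min(1, W(new)/W(old)), where W(n) = prod_i w(n_i)
    is the selection functional (w == 1 gives the unweighted ensemble).
    Returns the time-averaged size of the largest cluster."""
    rng = random.Random(seed)
    sizes = [M // N] * N
    sizes[0] += M - sum(sizes)          # absorb any remainder
    def W(ns):
        out = 1.0
        for n in ns:
            out *= w(n)
        return out
    largest = 0.0
    for _ in range(steps):
        i, j = rng.randrange(N), rng.randrange(N)
        if i != j and sizes[i] > 1:     # keep every cluster non-empty
            new = sizes[:]
            new[i] -= 1
            new[j] += 1
            if rng.random() < min(1.0, W(new) / W(sizes)):
                sizes = new
        largest += max(sizes)
    return largest / steps
```

With w(n) = math.factorial(n) the acceptance ratio reduces to (n_j + 1)/n_i, which favours feeding the largest cluster, so the walk condenses most of the population into one near-giant cluster, the toy analogue of the phase transition described above.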