    Identifying hidden contexts

    In this study we investigate how to identify hidden contexts from the data in classification tasks. Contexts are artifacts in the data that do not directly predict the class label. For instance, in a speech recognition task, speakers might have different accents, which do not directly discriminate between the spoken words. Identifying hidden contexts is treated as a data preprocessing task that can help to build more accurate classifiers tailored for particular contexts and give insight into the data structure. We present three techniques for identifying hidden contexts; each hides class label information from the input data and partitions the data using clustering techniques. We form a collection of performance measures to ensure that the resulting contexts are valid. We evaluate the performance of the proposed techniques on thirty real datasets, and we present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.
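
    A minimal sketch of the general idea, assuming scikit-learn (the function name find_hidden_contexts and the specific LDA-plus-k-means combination are illustrative assumptions, not the paper's three techniques): class-discriminative directions are projected out so that label information is "hidden", and the residual data are then partitioned into candidate contexts.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def find_hidden_contexts(X, y, n_contexts=3):
            # Estimate the class-discriminative directions with LDA.
            X_c = X - X.mean(axis=0)
            lda = LinearDiscriminantAnalysis().fit(X_c, y)
            basis, _ = np.linalg.qr(lda.scalings_)        # orthonormal discriminative basis
            # "Hide" class information: remove the discriminative component
            # from every sample (project onto its orthogonal complement).
            X_residual = X_c - (X_c @ basis) @ basis.T
            # Partition what remains; the clusters are candidate contexts.
            return KMeans(n_clusters=n_contexts, n_init=10,
                          random_state=0).fit_predict(X_residual)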

    Hierarchical growing cell structures: TreeGCS

    We propose a hierarchical clustering algorithm (TreeGCS) based upon the Growing Cell Structure (GCS) neural network of Fritzke. Our algorithm refines and builds upon the GCS base, overcoming an inconsistency in the original GCS algorithm, where the network topology is susceptible to the ordering of the input vectors. Our algorithm is unsupervised, flexible, and dynamic, and it imposes no additional parameters on the underlying GCS algorithm. Our ultimate aim is a hierarchical clustering neural network that is both consistent and stable and that identifies the innate hierarchical structure present in vector-based data. We demonstrate improved stability of the GCS foundation and evaluate our algorithm against the hierarchy generated by an ascendant hierarchical clustering dendrogram. Our approach emulates the hierarchical clustering of the dendrogram. It demonstrates the importance of the parameter settings for GCS and how they affect the stability of the clustering.
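
    As context for the evaluation, a minimal sketch of the ascendant (agglomerative) hierarchical clustering baseline that the TreeGCS hierarchy is compared against, using SciPy; this is the benchmark dendrogram, not the GCS/TreeGCS algorithm itself, and the function name is an illustrative assumption.

        from scipy.cluster.hierarchy import fcluster, linkage

        def reference_hierarchy(X, n_clusters=4):
            # Ascendant hierarchical clustering on the raw observation vectors.
            Z = linkage(X, method="ward")                 # full merge history (the dendrogram)
            # Cut the tree into a flat partition for side-by-side comparison.
            labels = fcluster(Z, t=n_clusters, criterion="maxclust")
            return Z, labels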

    How Many Topics? Stability Analysis for Topic Models

    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
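
    A minimal sketch of the stability idea under stated assumptions (scikit-learn NMF as the matrix factorization, resampled copies of the document-term matrix as the perturbations, and a greedy Jaccard match over top-term sets in place of the paper's term-ranking agreement measure): a candidate number of topics k scores highly when models fitted to perturbed corpora keep recovering the same top terms.

        import numpy as np
        from sklearn.decomposition import NMF
        from sklearn.utils import resample

        def top_terms(H, n_terms=20):
            # Index set of the highest-weighted terms for each topic.
            return [set(np.argsort(row)[::-1][:n_terms]) for row in H]

        def stability_score(X, k, n_runs=5, n_terms=20, seed=0):
            # X: non-negative document-term matrix (e.g. TF-IDF).
            ref = top_terms(NMF(n_components=k, init="nndsvd",
                                random_state=0).fit(X).components_, n_terms)
            scores = []
            for run in range(n_runs):
                Xs = resample(X, random_state=seed + run)   # perturbed corpus sample
                cand = top_terms(NMF(n_components=k, init="nndsvd",
                                     random_state=0).fit(Xs).components_, n_terms)
                # Greedy best-match Jaccard agreement between topic term sets.
                scores.append(np.mean([max(len(r & c) / len(r | c) for c in cand)
                                       for r in ref]))
            return float(np.mean(scores))

    In use, the score would be computed over a range of candidate k values and the most stable k selected.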

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been used as the workhorse in biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; it is only recently that this issue has received more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic for convenient reference; (2) to categorize existing methods under an expandable framework for future research and development.
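
    A minimal sketch of how selection stability under sampling variation can be quantified, assuming scikit-learn; the univariate f_classif filter and the pairwise Jaccard measure are illustrative assumptions, stand-ins for the many selectors and stability indices surveyed in the article.

        import numpy as np
        from itertools import combinations
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.utils import resample

        def selection_stability(X, y, k=50, n_runs=10, seed=0):
            # Re-run a simple univariate filter on resampled data and measure
            # how much the selected feature subsets agree with each other.
            subsets = []
            for run in range(n_runs):
                Xs, ys = resample(X, y, random_state=seed + run)   # sampling variation
                mask = SelectKBest(f_classif, k=k).fit(Xs, ys).get_support()
                subsets.append(set(np.flatnonzero(mask)))
            # Mean pairwise Jaccard similarity (1.0 = perfectly stable selection).
            return float(np.mean([len(a & b) / len(a | b)
                                  for a, b in combinations(subsets, 2)]))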

    Constraining the Power Spectrum using Clusters

    (Shortened Abstract.) We analyze a redshift sample of Abell/ACO clusters and compare them with numerical simulations based on the truncated Zel'dovich approximation (TZA) for a list of eleven dark matter (DM) models. For each model we run several realizations, on which we estimate cosmic variance effects. We analyze correlation statistics, the probability density function, and supercluster properties from percolation analysis. As a general result, we find that the distribution of galaxy clusters constrains only the shape of the power spectrum, not its amplitude: the Abell/ACO data require a shape parameter 0.18 < Γ < 0.25 and an effective spectral index at 20 Mpc/h in the range [-1.1, -0.9]. In order to obtain complementary constraints on the spectrum amplitude, we consider the cluster abundance as estimated using the Press-Schechter approach, whose reliability is explicitly tested against N-body simulations. We conclude that, of the cosmological models considered here, the only viable models are either Cold+Hot DM models with Ω_ν = [0.2-0.3] (better if shared between two massive neutrinos) or flat low-density CDM models with Ω_0 = [0.3-0.5].
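
    For reference, the Press-Schechter step relates the cluster abundance to the spectrum amplitude through the halo mass function; a standard textbook form (quoted from the general literature, not from this paper) is

        \frac{dn}{dM} = \sqrt{\frac{2}{\pi}}\,\frac{\bar\rho}{M^{2}}\,
            \frac{\delta_c}{\sigma(M)}\,\left|\frac{d\ln\sigma}{d\ln M}\right|\,
            \exp\!\left(-\frac{\delta_c^{2}}{2\,\sigma^{2}(M)}\right),

    where σ(M) is the rms linear density fluctuation on mass scale M, set by the power spectrum normalization, and δ_c ≈ 1.686 is the critical collapse overdensity; matching the predicted abundance of cluster-mass halos to the observed one is what pins down the amplitude.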