Identifying hidden contexts
In this study we investigate how to identify hidden contexts from the data in classification tasks.
Contexts are artifacts in the data that do not predict the class label directly.
For instance, in a speech recognition task, speakers may have different accents, which do not directly discriminate between the spoken words.
We treat identifying hidden contexts as a data preprocessing task, which can help to build more accurate classifiers tailored to particular contexts and gives insight into the data structure.
We present three techniques for identifying hidden contexts; each hides the class label information from the input data and partitions the data using clustering.
We assemble a collection of performance measures to ensure that the resulting contexts are valid.
We evaluate the proposed techniques on thirty real datasets.
We present a case study illustrating how the identified contexts can be used to build specialized, more accurate classifiers.
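As a rough illustration of the general idea (not the paper's three specific techniques), the sketch below hides the class label from a toy dataset and partitions the remaining features with a small, self-contained k-means; the resulting clusters then play the role of candidate contexts. All data and names here are invented for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points; returns a cluster id per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        for i, (x, y) in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: (x - centers[c][0]) ** 2 +
                                          (y - centers[c][1]) ** 2)
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return assign

# Labelled toy data: (features, class). The class label is *hidden* before
# clustering, so the partition reflects context (here: two feature regimes),
# not the class itself -- note both classes occur in both regimes.
data = [((0.1, 0.2), "a"), ((0.2, 0.1), "b"), ((0.0, 0.3), "a"),
        ((5.1, 5.2), "a"), ((5.3, 5.0), "b"), ((5.0, 4.9), "b")]
features = [x for x, _ in data]          # class label dropped here
contexts = kmeans(features, k=2)
print(contexts)
```

A per-context classifier could then be trained on each cluster separately, as in the case study described above.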
Hierarchical growing cell structures: TreeGCS
We propose a hierarchical clustering algorithm (TreeGCS) based upon the Growing Cell Structure (GCS) neural network of Fritzke. Our algorithm refines and builds upon the GCS base, overcoming an inconsistency in the original GCS algorithm, where the network topology is susceptible to the ordering of the input vectors. Our algorithm is unsupervised, flexible, and dynamic, and we impose no additional parameters on the underlying GCS algorithm. Our ultimate aim is a hierarchical clustering neural network that is both consistent and stable and that identifies the innate hierarchical structure present in vector-based data. We demonstrate improved stability of the GCS foundation and evaluate our algorithm against the hierarchy generated by an ascendant hierarchical clustering dendrogram. Our approach emulates the hierarchical clustering of the dendrogram. It demonstrates the importance of the parameter settings for GCS and how they affect the stability of the clustering.
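For reference, the ascendant (bottom-up) hierarchical clustering that produces the comparison dendrogram can be sketched as single-linkage agglomeration. This is the evaluation baseline, not the TreeGCS algorithm itself, and the 1-D data are invented:

```python
def single_linkage(points):
    """Ascendant (agglomerative) clustering with single linkage:
    repeatedly merge the two closest clusters and record each merge,
    yielding the merge history a dendrogram would visualise."""
    clusters = [[p] for p in points]          # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

history = single_linkage([0.0, 0.1, 1.0, 5.0])
for d, left, right in history:
    print(round(d, 2), left, right)
```

The merge history is exactly what a dendrogram visualises: nearby points join at low heights and distant clusters join last.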
How Many Topics? Stability Analysis for Topic Models
Topic modeling refers to the task of discovering the underlying thematic
structure in a text corpus, where the output is commonly presented as a report
of the top terms appearing in each topic. Despite the diversity of topic
modeling algorithms that have been proposed, a common challenge in successfully
applying these techniques is the selection of an appropriate number of topics
for a given corpus. Choosing too few topics will produce results that are
overly broad, while choosing too many will result in the "over-clustering" of a
corpus into many small, highly-similar topics. In this paper, we propose a
term-centric stability analysis strategy to address this issue, the idea being
that a model with an appropriate number of topics will be more robust to
perturbations in the data. Using a topic modeling approach based on matrix
factorization, evaluations performed on a range of corpora show that this
strategy can successfully guide the model selection process.
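A minimal sketch of the term-centric idea, assuming stability is scored as the average best-match Jaccard overlap between top-term lists from repeated runs; the paper's exact agreement measure may differ, and the runs below are invented:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of top terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topic_stability(runs):
    """Term-centric stability across repeated runs: match each topic in one
    run to its best-overlapping topic in another run and average the
    agreement over all pairs of runs."""
    scores = []
    for run_a, run_b in combinations(runs, 2):
        matched = [max(jaccard(t_a, t_b) for t_b in run_b) for t_a in run_a]
        scores.append(sum(matched) / len(matched))
    return sum(scores) / len(scores)

# Three hypothetical runs with k = 2 topics: the top terms barely change
# between runs, so the model is comparatively stable at this k.
runs_k2 = [
    [["match", "team", "goal"], ["market", "bank", "rate"]],
    [["match", "team", "win"],  ["market", "bank", "rate"]],
    [["match", "team", "goal"], ["market", "bank", "price"]],
]
print(round(topic_stability(runs_k2), 3))  # prints 0.667
```

Repeating this score for each candidate k and choosing the k with the highest stability is the model-selection recipe the abstract describes.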
Stable Feature Selection for Biomarker Discovery
Feature selection techniques have been used as the workhorse in biomarker
discovery applications for a long time. Surprisingly, the stability of feature
selection with respect to sampling variations has long been under-considered.
It is only until recently that this issue has received more and more attention.
In this article, we review existing stable feature selection methods for
biomarker discovery using a generic hierarchical framework. We have two
objectives: (1) providing an overview of this new yet fast-growing topic for
convenient reference; (2) categorizing existing methods under an expandable
framework for future research and development.
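One common way to quantify the stability discussed here is the average pairwise Jaccard similarity between the feature subsets selected on different resamples of the data; the sketch below uses that measure (the gene names and subsets are invented for illustration):

```python
from itertools import combinations

def selection_stability(subsets):
    """Average pairwise Jaccard similarity between the feature subsets
    selected on different resamples: 1.0 means the selector always picks
    the same biomarkers; values near 0 mean it is unstable."""
    pairs = list(combinations(subsets, 2))
    sims = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs]
    return sum(sims) / len(sims)

# Hypothetical gene subsets chosen on three bootstrap samples of a dataset.
stable   = [{"TP53", "BRCA1", "EGFR"}, {"TP53", "BRCA1", "EGFR"},
            {"TP53", "BRCA1", "MYC"}]
unstable = [{"TP53", "KRAS", "EGFR"}, {"BRCA1", "MYC", "PTEN"},
            {"ALK", "RET", "BRAF"}]
print(round(selection_stability(stable), 3),
      round(selection_stability(unstable), 3))  # prints 0.667 0.0
```

A biomarker panel with low stability under resampling is unlikely to generalise, which is why the review treats stability as a first-class evaluation criterion.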
Constraining the Power Spectrum using Clusters
(Shortened Abstract). We analyze a redshift sample of Abell/ACO clusters and
compare them with numerical simulations based on the truncated Zel'dovich
approximation (TZA), for a list of eleven dark matter (DM) models. For each
model we run several realizations, on which we estimate cosmic variance
effects. We analyze correlation statistics, the probability density function,
and supercluster properties from percolation analysis. As a general result, we
find that the distribution of galaxy clusters provides a constraint only on the
shape of the power spectrum, but not on its amplitude: a shape parameter 0.18 <
\Gamma < 0.25 and an effective spectral index at 20 Mpc/h in the range
[-1.1,-0.9] are required by the Abell/ACO data. In order to obtain
complementary constraints on the spectrum amplitude, we consider the cluster
abundance as estimated using the Press--Schechter approach, whose reliability
is explicitly tested against N--body simulations. We conclude that, of the
cosmological models considered here, the only viable models are either Cold+Hot
DM ones with \Omega_\nu = [0.2-0.3], better if shared between two massive
neutrinos, or flat low-density CDM models with \Omega_0 = [0.3-0.5]. (New
Astronomy, in press.)
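For reference, the Press--Schechter approach mentioned above estimates the comoving abundance of collapsed objects of mass M from the linear density field. Its standard textbook form (a reminder, not quoted from the paper) is:

```latex
% Standard Press--Schechter mass function:
% \bar{\rho} is the mean comoving matter density, \sigma(M) the rms linear
% fluctuation on mass scale M, and \delta_c \simeq 1.686 the critical
% linear overdensity for spherical collapse.
n(M)\,\mathrm{d}M \;=\;
  \sqrt{\frac{2}{\pi}}\;\frac{\bar{\rho}}{M^{2}}\;
  \frac{\delta_c}{\sigma(M)}\,
  \left|\frac{\mathrm{d}\ln\sigma}{\mathrm{d}\ln M}\right|\,
  \exp\!\left(-\frac{\delta_c^{2}}{2\sigma^{2}(M)}\right)\mathrm{d}M
```

Because the predicted abundance depends exponentially on \sigma(M), the observed cluster number density constrains the amplitude of the power spectrum, complementing the shape constraint from the clustering statistics.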