Bootstrapping estimates of stability for clusters, observations and model selection
Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate for performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures the stability of the individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used in connection with two possible bootstrap approaches for stability. The first approach, scheme 1, can be used to assess confidence (stability) around the clustering from the original dataset based on bootstrap replications. A second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions that can be motivated by an investigator’s trust (or lack thereof) in the original data and additional computational considerations. We propose a hierarchical visualization extrapolated from the stability profiles that gives insights into the separation of groups, and projected visualizations for the inspection of the stability of individual observations. Our approaches show good performance in simulation and on real data. These approaches can be implemented using the R package bootcluster that is available on the Comprehensive R Archive Network (CRAN).
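To illustrate what a per-observation stability score can look like, here is a minimal Python sketch. It is not the bootcluster implementation and does not reproduce either scheme from the abstract. It assumes a generic, label-free measure: for each observation, the Jaccard similarity between its cluster in a reference clustering and its cluster in each bootstrap clustering, averaged over the bootstrap clusterings. The function name `observation_stability` and the input layout (one label per observation, every bootstrap labeling covering all observations) are hypothetical.

```python
def observation_stability(ref_labels, boot_labelings):
    """Average Jaccard overlap between each observation's reference cluster
    and its cluster in every bootstrap labeling (assumed, generic measure)."""
    n = len(ref_labels)

    def members_by_label(labels):
        # Map each cluster label to the set of observation indices it holds.
        groups = {}
        for i, lab in enumerate(labels):
            groups.setdefault(lab, set()).add(i)
        return groups

    ref_groups = members_by_label(ref_labels)
    totals = [0.0] * n
    for boot in boot_labelings:
        boot_groups = members_by_label(boot)
        for i in range(n):
            a = ref_groups[ref_labels[i]]   # cluster of i in the reference
            b = boot_groups[boot[i]]        # cluster of i in this bootstrap
            totals[i] += len(a & b) / len(a | b)
    return [t / len(boot_labelings) for t in totals]


# Hypothetical example: four observations, two bootstrap labelings.
scores = observation_stability([0, 0, 1, 1], [[1, 1, 0, 0], [0, 0, 0, 1]])
```

Because the comparison uses cluster membership sets rather than label values, a bootstrap clustering that merely permutes the labels counts as fully stable. Averaging the scores within a cluster would give one simple cluster-level summary under the same assumptions.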
Cosmic shear analysis of archival HST/ACS data: I. Comparison of early ACS pure parallel data to the HST/GEMS Survey
This is the first paper of a series describing our measurement of weak
lensing by large-scale structure using archival observations from the Advanced
Camera for Surveys (ACS) on board the Hubble Space Telescope (HST).
In this work we present results from a pilot study testing the capabilities
of the ACS for cosmic shear measurements with early parallel observations and
presenting a re-analysis of HST/ACS data from the GEMS survey and the GOODS
observations of the Chandra Deep Field South (CDFS). We describe our new
correction scheme for the time-dependent ACS PSF based on observations of
stellar fields. This is currently the only technique which takes the full time
variation of the PSF between individual ACS exposures into account. We estimate
that our PSF correction scheme reduces the systematic contribution to the shear
correlation functions due to PSF distortions to < 2*10^{-6} for galaxy fields
containing at least 10 stars. We perform a number of diagnostic tests
indicating that the remaining level of systematics is consistent with zero for
the GEMS and GOODS data confirming the success of our PSF correction scheme.
For the parallel data we detect a low level of remaining systematics which we
interpret to be caused by a lack of sufficient dithering of the data.
Combining the shear estimate of the GEMS and GOODS observations using 96
galaxies arcmin^{-2} with the photometric redshift catalogue of the GOODS-MUSIC
sample, we determine a local single field estimate for the mass power spectrum
normalisation sigma_{8,CDFS}=0.52^{+0.11}_{-0.15} (stat) +/- 0.07 (sys) (68%
confidence assuming Gaussian cosmic variance) at fixed Omega_m=0.3 for a
LambdaCDM cosmology. We interpret this exceptionally low estimate to be due to
a local under-density of the foreground structures in the CDFS.
Comment: Version accepted for publication in Astronomy & Astrophysics with 28
pages, 25 figures. A version with full resolution figures can be downloaded
from http://www.astro.uni-bonn.de/~schrabba/papers/cosmic_shear_acs1_v2.pd
Assessment of Stability in Partitional Clustering Using Resampling Techniques
The assessment of stability in cluster analysis is strongly related to the difficult problem of determining the number of clusters present in the data. The latter is the subject of many investigations and papers that consider different resampling techniques as practical tools. In this paper, we consider non-parametric resampling from the empirical distribution of a given dataset in order to investigate the stability of the results of partitional clustering. Specifically, we investigate only the very popular K-means method. The estimation of the sampling distribution of the adjusted Rand index (ARI) and the averaged Jaccard index seems to be the most general way to do this. In addition, we compare bootstrapping with different subsampling schemes (i.e., with different cardinality of the drawn samples) with respect to their performance in finding the true number of clusters for both synthetic and real data.
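The resampling idea in this abstract can be sketched in plain Python. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses a basic Lloyd-style K-means with random initial centers, draws bootstrap samples with replacement, assigns the original points to each set of bootstrap centers, and records the ARI against a reference clustering of the full data. All function names (`adjusted_rand_index`, `kmeans`, `bootstrap_ari`) and defaults (`n_boot=30`, `n_iter=50`) are hypothetical choices.

```python
import math
import random
from collections import Counter


def adjusted_rand_index(a, b):
    """ARI between two labelings of the same items (needs at least 2 items)."""
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / math.comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def nearest(p, centers):
    """Index of the center closest to point p (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))


def kmeans(points, k, rng, n_iter=50):
    """Basic Lloyd iterations; an empty cluster keeps its previous center."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def bootstrap_ari(points, k, n_boot=30, seed=0):
    """ARI of each bootstrap clustering against the full-data clustering."""
    rng = random.Random(seed)
    ref_centers = kmeans(points, k, rng)
    ref_labels = [nearest(p, ref_centers) for p in points]
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]  # draw with replacement
        centers = kmeans(sample, k, rng)
        boot_labels = [nearest(p, centers) for p in points]
        scores.append(adjusted_rand_index(ref_labels, boot_labels))
    return scores
```

Running `bootstrap_ari` for several candidate values of `k` and comparing the resulting ARI distributions is one way to use such scores when choosing the number of clusters. A subsampling variant would draw fewer points without replacement in place of the bootstrap sample.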
Inferential stability in systems biology
The modern biological sciences are fraught with statistical difficulties. Biomolecular
stochasticity, experimental noise, and the “large p, small n” problem all contribute to
the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful
conclusions from observations. In this thesis, we explore methods for assessing
the effects of data variability upon downstream inference, in an attempt to quantify and
promote the stability of the inferences we make.
We start with a review of existing methods for addressing this problem, focusing upon the
bootstrap and similar methods. The key requirement for all such approaches is a statistical
model that approximates the data generating process.
We move on to consider biomarker discovery problems. We present a novel algorithm for
proposing putative biomarkers on the strength of both their predictive ability and the stability
with which they are selected. In a simulation study, we find our approach to perform
favourably in comparison to strategies that select on the basis of predictive performance
alone.
We then consider the real problem of identifying protein peak biomarkers for HAM/TSP,
an inflammatory condition of the central nervous system caused by HTLV-1 infection.
We apply our algorithm to a set of SELDI mass spectral data, and identify a number of
putative biomarkers. Additional experimental work, together with known results from the
literature, provides corroborating evidence for the validity of these putative biomarkers.
Having focused on static observations, we then make the natural progression to time
course data sets. We propose a (Bayesian) bootstrap approach for such data, and then
apply our method in the context of gene network inference and the estimation of parameters
in ordinary differential equation models. We find that the inferred gene networks
are relatively unstable, and demonstrate the importance of finding distributions of ODE
parameter estimates, rather than single point estimates.
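As a point of reference for the Bayesian bootstrap mentioned above, the following Python sketch shows the standard form of the technique for a simple statistic. It is not the thesis's method for time course data. It assumes the usual construction in which each replicate reweights the observations with flat Dirichlet weights, generated here by normalizing independent exponential draws, and it returns a distribution of weighted means instead of a single point estimate. The function name `bayesian_bootstrap_mean` and its defaults are hypothetical.

```python
import random


def bayesian_bootstrap_mean(values, n_draws=1000, seed=0):
    """Draws of the weighted mean under flat Dirichlet weights."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Normalized Exp(1) variables are distributed as Dirichlet(1, ..., 1).
        gaps = [rng.expovariate(1.0) for _ in values]
        total = sum(gaps)
        draws.append(sum(g / total * v for g, v in zip(gaps, values)))
    return draws


# Hypothetical data: the spread of `draws` reflects uncertainty in the mean.
draws = bayesian_bootstrap_mean([1.0, 2.0, 3.0, 4.0])
```

Unlike the classical bootstrap, every observation receives a non-zero weight in every replicate, which can be convenient when the data are few or ordered in time.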