8,142 research outputs found

    Bootstrapping estimates of stability for clusters, observations and model selection

    Get PDF
    Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate to performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures stability of the individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used in connection with two possible bootstrap approaches for stability. The first approach, scheme 1, can be used to assess confidence (stability) around clustering from the original dataset based on bootstrap replications. A second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions that can be motivated by an investigator’s trust (or lack thereof) in the original data and additional computational considerations. We propose a hierarchical visualization extrapolated from the stability profiles that give insights into the separation of groups, and projected visualizations for the inspection of the stability of individual operations. Our approaches show good performance in simulation and on real data. These approaches can be implemented using the R package bootcluster that is available on the Comprehensive R Archive Network (CRAN)

    Cosmic shear analysis of archival HST/ACS data: I. Comparison of early ACS pure parallel data to the HST/GEMS Survey

    Get PDF
    This is the first paper of a series describing our measurement of weak lensing by large-scale structure using archival observations from the Advanced Camera for Surveys (ACS) on board the Hubble Space Telescope (HST). In this work we present results from a pilot study testing the capabilities of the ACS for cosmic shear measurements with early parallel observations and presenting a re-analysis of HST/ACS data from the GEMS survey and the GOODS observations of the Chandra Deep Field South (CDFS). We describe our new correction scheme for the time-dependent ACS PSF based on observations of stellar fields. This is currently the only technique which takes the full time variation of the PSF between individual ACS exposures into account. We estimate that our PSF correction scheme reduces the systematic contribution to the shear correlation functions due to PSF distortions to < 2*10^{-6} for galaxy fields containing at least 10 stars. We perform a number of diagnostic tests indicating that the remaining level of systematics is consistent with zero for the GEMS and GOODS data confirming the success of our PSF correction scheme. For the parallel data we detect a low level of remaining systematics which we interpret to be caused by a lack of sufficient dithering of the data. Combining the shear estimate of the GEMS and GOODS observations using 96 galaxies arcmin^{-2} with the photometric redshift catalogue of the GOODS-MUSIC sample, we determine a local single field estimate for the mass power spectrum normalisation sigma_{8,CDFS}=0.52^{+0.11}_{-0.15} (stat) +/- 0.07 (sys) (68% confidence assuming Gaussian cosmic variance) at fixed Omega_m=0.3 for a LambdaCDM cosmology. We interpret this exceptionally low estimate to be due to a local under-density of the foreground structures in the CDFS.Comment: Version accepted for publication in Astronomy & Astrophysics with 28 pages, 25 figures. A version with full resolution figures can be downloaded from http://www.astro.uni-bonn.de/~schrabba/papers/cosmic_shear_acs1_v2.pd

    Assessment of Stability in Partitional Clustering Using Resampling Techniques

    Get PDF
    The assessment of stability in cluster analysis is strongly related to the main difficult problem of determining the number of clusters present in the data. The latter is subject of many investigations and papers considering different resampling techniques as practical tools. In this paper, we consider non-parametric resampling from the empirical distribution of a given dataset in order to investigate the stability of results of partitional clustering. In detail, we investigate here only the very popular K-means method. The estimation of the sampling distribution of the adjusted Rand index (ARI) and the averaged Jaccard index seems to be the most general way to do this. In addition, we compare bootstrapping with different subsampling schemes (i.e., with different cardinality of the drawn samples) with respect to their performance in finding the true number of clusters for both synthetic and real data

    Resampling Methods for Unsupervised Learning from Sample Data

    Get PDF

    Inferential stability in systems biology

    Get PDF
    The modern biological sciences are fraught with statistical difficulties. Biomolecular stochasticity, experimental noise, and the “large p, small n” problem all contribute to the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful conclusions from observations. In this thesis, we explore methods for assessing the effects of data variability upon downstream inference, in an attempt to quantify and promote the stability of the inferences we make. We start with a review of existing methods for addressing this problem, focusing upon the bootstrap and similar methods. The key requirement for all such approaches is a statistical model that approximates the data generating process. We move on to consider biomarker discovery problems. We present a novel algorithm for proposing putative biomarkers on the strength of both their predictive ability and the stability with which they are selected. In a simulation study, we find our approach to perform favourably in comparison to strategies that select on the basis of predictive performance alone. We then consider the real problem of identifying protein peak biomarkers for HAM/TSP, an inflammatory condition of the central nervous system caused by HTLV-1 infection. We apply our algorithm to a set of SELDI mass spectral data, and identify a number of putative biomarkers. Additional experimental work, together with known results from the literature, provides corroborating evidence for the validity of these putative biomarkers. Having focused on static observations, we then make the natural progression to time course data sets. We propose a (Bayesian) bootstrap approach for such data, and then apply our method in the context of gene network inference and the estimation of parameters in ordinary differential equation models. We find that the inferred gene networks are relatively unstable, and demonstrate the importance of finding distributions of ODE parameter estimates, rather than single point estimates

    Data analytical stability of measuring brain activation in fMRI studies

    Get PDF
    • …
    corecore