    Ensemble Estimation of Information Divergence

    Recent work has focused on the problem of nonparametric estimation of information divergence functionals between the densities of two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at its boundary, which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator is derived for general bounded density support sets, where knowledge of the support boundary, and therefore boundary correction, is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for tuning-parameter selection and the asymptotic distribution of this estimator are provided. Based on this theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of MSE, especially in high dimensions, and is shown to be robust to the choice of tuning parameters. Extensive simulation results verify the theoretical results. Finally, we apply the proposed estimator to estimate bounds on the Bayes error rate in a cell classification problem.
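To make the plug-in idea concrete, the following is a minimal sketch (not the paper's estimator) of a kernel density plug-in estimate of Rényi-α divergence, using the identity D_α(f‖g) = log E_f[(f/g)^(α−1)]/(α−1), followed by a crude uniform-weight ensemble over bandwidths. All function names and parameter values here are illustrative assumptions; the paper's method additionally applies a leave-one-out correction and solves for optimal ensemble weights that cancel lower-order bias terms.

```python
import numpy as np

def kde(points, x, h):
    # 1-D Gaussian kernel density estimate built from `points`, evaluated at `x`.
    diffs = (x[:, None] - points[None, :]) / h
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (len(points) * h * np.sqrt(2 * np.pi))

def renyi_divergence_plugin(xs, ys, alpha=0.8, h=0.3):
    # Plug-in estimate of D_alpha(f || g) from samples xs ~ f, ys ~ g,
    # via D_alpha = log E_f[(f/g)^(alpha - 1)] / (alpha - 1).
    # (The paper's leave-one-out correction is omitted in this sketch.)
    f_hat = kde(xs, xs, h)
    g_hat = kde(ys, xs, h)
    return np.log(np.mean((f_hat / g_hat) ** (alpha - 1))) / (alpha - 1)

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 2000)   # samples from f = N(0, 1)
ys = rng.normal(0.5, 1.0, 2000)   # samples from g = N(0.5, 1)

# Single-bandwidth estimate, then a uniform-weight ensemble over bandwidths;
# the paper instead optimizes the ensemble weights so that lower-order bias
# terms cancel, which is what yields the parametric MSE rate.
est = renyi_divergence_plugin(xs, ys)
ensemble_est = np.mean([renyi_divergence_plugin(xs, ys, h=h) for h in (0.2, 0.3, 0.45)])
```

For these two equal-variance Gaussians the closed-form value is α(Δμ)²/(2σ²) = 0.8 · 0.25/2 = 0.1, so both estimates should land near 0.1, with the kernel smoothing introducing a small downward bias.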

    Flexible model-based joint probabilistic clustering of binary and continuous inputs and its application to genetic regulation and cancer

    Clustering is used widely in ‘omics’ studies and is often tackled with standard methods such as hierarchical clustering or k-means, which are limited to a single data type. These methods are further limited by the need to either cut the dendrogram (tree diagram) at a specific level or pre-define the number of clusters, respectively. The increasing need to integrate multiple data sets leads to a requirement for clustering methods applicable to mixed data types, where the straightforward application of standard methods is not necessarily the best approach. A particularly common problem involves clustering entities characterized by a mixture of binary data (for example, presence or absence of mutations, binding motifs, and/or epigenetic marks) and continuous data (for example, gene expression, protein abundance, and/or metabolite levels). In this work, we present a generic method based on a probabilistic model for clustering this mixture of data types, and illustrate its application to genetic regulation and the clustering of cancer samples. It uses penalized maximum likelihood (ML) estimation of mixture model parameters, with an information criterion as the model selection objective function, and meta-heuristic searches for the optimum clusters. We tested the compatibility of several information criteria with our model-based joint clustering, including the well-known Akaike Information Criterion (AIC) and its empirically determined derivatives (AICλ), the Bayesian Information Criterion (BIC) and its derivative (CAIC), and the Hannan-Quinn Criterion (HQC). We show experimentally with simulated data that AIC and AICλ (λ = 2.5) work well with our method. The resulting clusters lead to useful hypotheses: in the case of genetic regulation these concern the regulation of groups of genes by specific sets of transcription factors, and in the case of cancer samples, combinations of gene mutations are related to patterns of gene expression. The clusters have potential mechanistic significance and, in the latter case, are significantly linked to survival.
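The model class described above can be sketched as a mixture whose components combine independent Bernoulli distributions (for the binary columns) with diagonal Gaussians (for the continuous columns), fitted by plain EM and scored with AIC. This is a simplified illustration, not the paper's method: the paper uses penalized ML with a family of information criteria and meta-heuristic search, whereas the function below uses unpenalized EM with a fixed K, and all names and initialization choices are assumptions.

```python
import numpy as np

def em_mixed_mixture(Xb, Xc, K, n_iter=60, seed=0):
    # EM for a K-component mixture whose components are products of
    # independent Bernoullis (binary columns Xb) and diagonal Gaussians
    # (continuous columns Xc). Returns hard cluster labels and the AIC.
    rng = np.random.default_rng(seed)
    n, db = Xb.shape
    dc = Xc.shape[1]
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.25, 0.75, (K, db))   # Bernoulli probabilities
    order = np.argsort(Xc[:, 0])               # spread initial means along one axis
    mu = Xc[order[np.linspace(0, n - 1, K).astype(int)]]
    sigma = np.tile(Xc.std(axis=0) + 1e-3, (K, 1))
    for _ in range(n_iter):
        # E-step: responsibilities computed in log space for stability.
        log_r = np.tile(np.log(pi), (n, 1))
        for k in range(K):
            log_r[:, k] += (Xb * np.log(theta[k]) + (1 - Xb) * np.log(1 - theta[k])).sum(1)
            log_r[:, k] += (-0.5 * ((Xc - mu[k]) / sigma[k]) ** 2
                            - np.log(sigma[k]) - 0.5 * np.log(2 * np.pi)).sum(1)
        log_norm = np.logaddexp.reduce(log_r, axis=1, keepdims=True)
        r = np.exp(log_r - log_norm)
        # M-step: responsibility-weighted ML updates.
        nk = r.sum(0) + 1e-10
        pi = nk / n
        theta = np.clip((r.T @ Xb) / nk[:, None], 1e-4, 1 - 1e-4)
        mu = (r.T @ Xc) / nk[:, None]
        for k in range(K):
            var = (r[:, k:k + 1] * (Xc - mu[k]) ** 2).sum(0) / nk[k]
            sigma[k] = np.sqrt(var + 1e-6)
    loglik = log_norm.sum()
    n_params = (K - 1) + K * db + 2 * K * dc   # mixing weights, Bernoullis, means, stds
    aic = 2 * n_params - 2 * loglik            # model selection objective
    return r.argmax(1), aic

# Two planted clusters carrying signal in both the binary and continuous parts.
rng = np.random.default_rng(1)
n = 300
z = rng.integers(0, 2, n)
Xb = rng.binomial(1, np.where(z[:, None] == 1, 0.8, 0.2), (n, 4))
Xc = rng.normal(np.where(z[:, None] == 1, 2.0, -2.0), 1.0, (n, 3))
labels, aic = em_mixed_mixture(Xb, Xc, K=2)
```

In practice one would fit several values of K (and, as in the paper, a penalized criterion such as AICλ, BIC, CAIC, or HQC) and keep the fit minimizing the chosen criterion.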