
    On clustering procedures and nonparametric mixture estimation

    This paper deals with nonparametric estimation of conditional densities in mixture models in the case when additional covariates are available. The proposed approach consists of performing a preliminary clustering algorithm on the additional covariates to guess the mixture component of each observation. Conditional densities of the mixture model are then estimated using kernel density estimates applied separately to each cluster. We investigate the expected L1-error of the resulting estimates and derive optimal rates of convergence over classical nonparametric density classes provided the clustering method is accurate. Performances of clustering algorithms are measured by the maximal misclassification error. We obtain upper bounds of this quantity for a single linkage hierarchical clustering algorithm. Lastly, applications of the proposed method to mixture models involving electricity distribution data and simulated data are presented.
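
The two-step idea in this abstract (cluster on the covariates, then estimate a density per cluster) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the Gaussian kernel, the fixed bandwidth, and the toy data are all assumptions made here for the example.

```python
import math
import random


def single_linkage(points, k):
    """Cut a single-linkage hierarchy into k clusters (Kruskal-style merging)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    clusters = n
    for _, i, j in edges:
        if clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two closest clusters
            clusters -= 1
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def gaussian_kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate from `sample`."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density


def conditional_densities(covariates, responses, k, bandwidth):
    """Cluster on covariates, then fit one KDE per cluster on the responses."""
    labels = single_linkage(covariates, k)
    groups = {}
    for label, y in zip(labels, responses):
        groups.setdefault(label, []).append(y)
    return labels, {c: gaussian_kde(ys, bandwidth) for c, ys in groups.items()}


# Toy data (hypothetical): two components with well-separated covariates.
rng = random.Random(0)
covariates = [(rng.gauss(0, 0.5),) for _ in range(50)] + [(rng.gauss(10, 0.5),) for _ in range(50)]
responses = [rng.gauss(-2, 1) for _ in range(50)] + [rng.gauss(3, 1) for _ in range(50)]
labels, densities = conditional_densities(covariates, responses, k=2, bandwidth=0.5)
```

Calling `densities[labels[0]](-2.0)` evaluates the estimated conditional density of the first observation's cluster at -2. The quality of each estimate depends on how few observations the clustering step misassigns, which is the quantity the abstract bounds.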

    Evolutionary constraints on the complexity of genetic regulatory networks allow predictions of the total number of genetic interactions

    Genetic regulatory networks (GRNs) have been widely studied, yet the final size and properties of these networks remain poorly understood, mainly because no network is currently complete. In this study, we analyzed the distribution of GRN structural properties across a large set of distinct prokaryotic organisms and found a set of constrained characteristics such as network density and number of regulators. Our results allowed us to estimate the number of interactions that complete networks would have, a valuable insight that could aid in the daunting task of network curation, prediction, and validation. Using state-of-the-art statistical approaches, we also provided new evidence to settle a previously stated controversy that raised the possibility of complete biological networks being random, which would attribute the observed scale-free properties to an artifact emerging from the sampling process during network discovery. Furthermore, we identified a set of properties that enabled us to assess the consistency of the connectivity distribution for various GRNs against different alternative statistical distributions. Our results favor the hypothesis that highly connected nodes (hubs) are not a consequence of network incompleteness. Finally, an interaction coverage computed for the GRNs as a proxy for completeness revealed that high-throughput based reconstructions of GRNs could yield biased networks with a low average clustering coefficient, showing that classical targeted discovery of interactions is still needed. Comment: 28 pages, 5 figures, 12 pages supplementary information

    Mapping Topographic Structure in White Matter Pathways with Level Set Trees

    Fiber tractography on diffusion imaging data offers rich potential for describing white matter pathways in the human brain, but characterizing the spatial organization in these large and complex data sets remains a challenge. We show that level set trees, which provide a concise representation of the hierarchical mode structure of probability density functions, offer a statistically principled framework for visualizing and analyzing topography in fiber streamlines. Using diffusion spectrum imaging data collected on neurologically healthy controls (N=30), we mapped white matter pathways from the cortex into the striatum using a deterministic tractography algorithm that estimates fiber bundles as dimensionless streamlines. Level set trees were used for interactive exploration of patterns in the endpoint distributions of the mapped fiber tracks and an efficient segmentation of the tracks that has empirical accuracy comparable to standard nonparametric clustering methods. We show that level set trees can also be generalized to model pseudo-density functions in order to analyze a broader array of data types, including entire fiber streamlines. Finally, resampling methods show the reliability of the level set tree as a descriptive measure of topographic structure, illustrating its potential as a statistical descriptor in brain imaging analysis. These results highlight the broad applicability of level set trees for visualizing and analyzing high-dimensional data like fiber tractography output.

    Integrative Model-based clustering of microarray methylation and expression data

    In many fields, researchers are interested in large and complex biological processes. Two important examples are gene expression and DNA methylation in genetics. One key problem is to identify aberrant patterns of these processes and discover biologically distinct groups. In this article we develop a model-based method for clustering such data. The basis of our method involves the construction of a likelihood for any given partition of the subjects. We introduce cluster-specific latent indicators that, along with some standard assumptions, impose a specific mixture distribution on each cluster. Estimation is carried out using the EM algorithm. The methods extend naturally to multiple data types of a similar nature, which leads to an integrated analysis over multiple data platforms, resulting in higher discriminating power. Comment: Published at http://dx.doi.org/10.1214/11-AOAS533 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
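
For readers unfamiliar with EM-based mixture estimation, the sketch below fits a plain two-component one-dimensional Gaussian mixture. It is a generic textbook illustration swapped in for clarity, not the article's model with cluster-specific latent indicators; the function names, starting values, and simulated data are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative only)."""
    n = len(data)
    weights = [0.5, 0.5]
    means = [min(data), max(data)]  # crude but deterministic starting values
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    variances = [var_all, var_all]
    for _ in range(iterations):
        # E-step: posterior probability that each point belongs to component 0.
        resp0 = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            resp0.append(p0 / (p0 + p1))
        # M-step: re-estimate weight, mean, and variance of each component.
        for c, resp in enumerate((resp0, [1 - r for r in resp0])):
            total = sum(resp)
            weights[c] = total / n
            means[c] = sum(r * x for r, x in zip(resp, data)) / total
            var = sum(r * (x - means[c]) ** 2 for r, x in zip(resp, data)) / total
            variances[c] = max(var, 1e-6)  # floor guards against collapse
    return weights, means, variances


# Simulated data (hypothetical): two groups centered at 0 and 5.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]
weights, means, variances = em_two_gaussians(data)
```

After fitting, each observation can be assigned to the component with the larger posterior probability, which is the sense in which a fitted mixture yields a clustering.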