
    Coverage statistics for sequence census methods

    Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution, and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census-based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.
    Comment: 10 pages, 4 figures
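    The null model the abstract describes can be illustrated with a small simulation: fragment start sites drawn from a per-base Poisson process and fragment lengths from an arbitrary distribution, yielding a per-base coverage function. This is a minimal sketch, not the paper's analysis; the genome length, start rate, and exponential length distribution are illustrative choices.

    ```python
    import math
    import random

    random.seed(0)

    GENOME_LEN = 10_000   # bases
    RATE = 0.05           # expected fragment starts per base

    def sample_poisson(lam):
        """Knuth's algorithm; adequate for small lambda."""
        threshold = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1

    def simulate_coverage(genome_len, rate, length_sampler):
        """Per-base coverage from Poisson-distributed fragment starts."""
        coverage = [0] * genome_len
        for start in range(genome_len):
            for _ in range(sample_poisson(rate)):
                length = length_sampler()
                for pos in range(start, min(start + length, genome_len)):
                    coverage[pos] += 1
        return coverage

    # Illustrative length distribution: roughly exponential, mean ~200 bp.
    mean_len = 200
    length_sampler = lambda: 1 + int(random.expovariate(1 / mean_len))

    cov = simulate_coverage(GENOME_LEN, RATE, length_sampler)
    mean_cov = sum(cov) / len(cov)
    # Mean coverage should be close to rate * mean fragment length (~10x here),
    # independent of the shape of the length distribution.
    ```

    Swapping in a different `length_sampler` (fixed length, bimodal, heavy-tailed) leaves the mean coverage formula intact, which is the distribution-independence the abstract points to.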

    Towards A Census of Earth-mass Exo-planets with Gravitational Microlensing

    Thirteen exo-planets have been discovered using the gravitational microlensing technique (of which 7 have been published). These planets already demonstrate that super-Earths (with mass up to ~10 Earth masses) beyond the snow line are common and that multiple-planet systems are not rare. In this White Paper we introduce the basic concepts of the gravitational microlensing technique, summarise the current mode of discovery and outline future steps towards a complete census of planets, including Earth-mass planets. In the near term (over the next 5 years) we advocate a strategy of automated follow-up with existing and upgraded telescopes, which will significantly increase the current planet detection efficiency. In the medium (5-10 year) term, we envision an international network of wide-field 2m-class telescopes to discover Earth-mass and free-floating exo-planets. In the long (10-15 year) term, we strongly advocate a space microlensing telescope which, when combined with Kepler, will provide a complete census of planets down to Earth mass at almost all separations. Such a survey could be undertaken as a science programme on Euclid, a dark energy probe with a wide-field imager which has been proposed to ESA's Cosmic Vision Programme.
    Comment: 10 pages. White Paper submission to the ESA Exo-Planet Roadmap Advisory Team. See also "Inferring statistics of planet populations by means of automated microlensing searches" by M. Dominik et al. (arXiv:0808.0004)

    Shape-based peak identification for ChIP-Seq

    We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html
    Comment: 12 pages, 6 figures
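    The persistence idea underlying this approach can be shown on a toy 1D coverage signal: sweep positions from high to low, let each local maximum "give birth" to a component, and record the height drop at which a component merges into a taller one. This is a generic 0-dimensional persistence sketch, not the T-PIC algorithm itself (which uses tree-shape statistics); consult the linked software for the actual method.

    ```python
    def persistence_peaks(signal):
        """Return (peak_index, persistence) pairs, most persistent first,
        via a union-find sweep from the highest value downward."""
        n = len(signal)
        order = sorted(range(n), key=lambda i: signal[i], reverse=True)
        parent = [-1] * n          # -1 marks a position not yet swept
        births = {}                # component root -> height at birth
        results = []

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i in order:
            left_done = i > 0 and parent[i - 1] != -1
            right_done = i < n - 1 and parent[i + 1] != -1
            parent[i] = i
            if not left_done and not right_done:
                births[i] = signal[i]          # a new peak is born
            elif left_done and right_done:
                rl, rr = find(i - 1), find(i + 1)
                if births[rl] < births[rr]:
                    rl, rr = rr, rl
                # The lower-born peak dies at this saddle.
                results.append((rr, births[rr] - signal[i]))
                del births[rr]
                parent[rr] = rl
                parent[i] = rl
            else:
                parent[i] = find(i - 1 if left_done else i + 1)

        # Peaks that never merge persist down to the global minimum.
        lowest = min(signal)
        for root, birth in births.items():
            results.append((root, birth - lowest))
        results.sort(key=lambda t: -t[1])
        return results

    signal = [1, 3, 1, 5, 2, 4, 1]
    results = persistence_peaks(signal)
    # The global maximum (index 3) comes out as the most persistent peak;
    # noisy bumps get small persistence and can be thresholded away.
    ```

    Thresholding on persistence rather than raw height is what makes this style of peak calling robust to local noise in the coverage signal.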

    An Empirical Comparison of Multiple Imputation Methods for Categorical Data

    Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data: chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online.
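    The core structure all three compared engines share — draw each missing value from a predictive model fit to observed data, and repeat to produce m completed datasets — can be sketched minimally. This toy version draws from empirical conditional distributions rather than fitting GLMs, CART, or a DP mixture; the column names and data are invented for illustration.

    ```python
    import random

    random.seed(1)

    # Toy categorical data: (education, employment); None marks missing.
    data = [
        ("HS", "employed"), ("HS", "unemployed"), ("BA", "employed"),
        ("BA", "employed"), ("HS", None), ("BA", None), (None, "employed"),
    ]

    def conditional_draw(target_col, given_col, given_val, rows):
        """Draw the target category from its empirical distribution among
        observed rows sharing given_val; fall back to the marginal."""
        pool = [r[target_col] for r in rows
                if r[target_col] is not None and r[given_col] == given_val]
        if not pool:
            pool = [r[target_col] for r in rows if r[target_col] is not None]
        return random.choice(pool)

    def impute_once(rows):
        """One completed dataset: every missing cell filled by a random
        draw, so repeated calls yield different plausible completions."""
        completed = []
        for edu, emp in rows:
            if edu is None:
                edu = conditional_draw(0, 1, emp, rows)
            if emp is None:
                emp = conditional_draw(1, 0, edu, rows)
            completed.append((edu, emp))
        return completed

    # m completed datasets; analyses are run on each and then pooled.
    m = 5
    completed_sets = [impute_once(data) for _ in range(m)]
    ```

    The randomness in each draw is what distinguishes multiple imputation from single imputation: variation across the m completions propagates imputation uncertainty into the pooled estimates.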