Coverage statistics for sequence census methods
Background: We study the statistical properties of fragment coverage in
genome sequencing experiments. In an extension of the classic Lander-Waterman
model, we consider the effect of the length distribution of fragments. We also
introduce the notion of the shape of a coverage function, which can be used to
detect aberrations in coverage. The probability theory underlying these
problems is essential for constructing models of current high-throughput
sequencing experiments, where both sample preparation protocols and sequencing
technology particulars can affect fragment length distributions.
Results: We show that regardless of fragment length distribution and under
the mild assumption that fragment start sites are Poisson distributed, the
fragments produced in a sequencing experiment can be viewed as resulting from a
two-dimensional spatial Poisson process. We then study the jump skeleton of
the coverage function, and show that the induced trees are Galton-Watson trees
whose parameters can be computed.
Conclusions: Our results extend standard analyses of shotgun sequencing that
focus on coverage statistics at individual sites, and provide a null model for
detecting deviations from random coverage in high-throughput sequence census
based experiments. By focusing on fragments, we are also led to a new approach
for visualizing sequencing data that should be of independent interest.
Comment: 10 pages, 4 figures
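The abstract's key assumption, Poisson-distributed fragment start sites combined with an arbitrary fragment length distribution, can be illustrated with a short simulation. This is a hypothetical sketch, not the authors' code; all function names and parameters are illustrative:

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw from a Poisson distribution via Knuth's inversion method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_coverage(genome_length, rate, fragment_lengths, seed=0):
    """Per-base coverage when fragment start counts are Poisson(rate)
    per base and lengths are drawn from an empirical distribution."""
    rng = random.Random(seed)
    coverage = [0] * genome_length
    for pos in range(genome_length):
        for _ in range(poisson_draw(rng, rate)):
            length = rng.choice(fragment_lengths)
            for i in range(pos, min(pos + length, genome_length)):
                coverage[i] += 1
    return coverage

# Mean coverage approaches rate * mean fragment length (0.05 * 100 = 5 here),
# regardless of the shape of the length distribution.
cov = simulate_coverage(10_000, 0.05, [50, 100, 150])
mean_cov = sum(cov) / len(cov)
```

Replacing `[50, 100, 150]` with a single length recovers the classic fixed-length Lander-Waterman setting, which is the sense in which the paper's model is an extension of it.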
Towards A Census of Earth-mass Exo-planets with Gravitational Microlensing
Thirteen exo-planets have been discovered using the gravitational
microlensing technique (of which 7 have been published). These planets
already demonstrate that super-Earths (with mass up to ~10 Earth masses) beyond
the snow line are common and multiple planet systems are not rare. In this
White Paper we introduce the basic concepts of the gravitational microlensing
technique, summarise the current mode of discovery and outline future steps
towards a complete census of planets including Earth-mass planets. In the
near-term (over the next 5 years) we advocate a strategy of automated follow-up
with existing and upgraded telescopes which will significantly increase the
current planet detection efficiency. In the medium 5-10 year term, we envision
an international network of wide-field 2m class telescopes to discover
Earth-mass and free-floating exo-planets. In the long (10-15 year) term, we
strongly advocate a space microlensing telescope which, when combined with
Kepler, will provide a complete census of planets down to Earth mass at almost
all separations. Such a survey could be undertaken as a science programme on
Euclid, a dark energy probe with a wide-field imager which has been proposed to
ESA's Cosmic Vision Programme.
Comment: 10 pages. White Paper submission to the ESA Exo-Planet Roadmap
Advisory Team. See also "Inferring statistics of planet populations by means
of automated microlensing searches" by M. Dominik et al. (arXiv:0808.0004)
Shape-based peak identification for ChIP-Seq
We present a new algorithm for the identification of bound regions from
ChIP-seq experiments. Our method for identifying statistically significant
peaks from read coverage is inspired by the notion of persistence in
topological data analysis and provides a non-parametric approach that is robust
to noise in experiments. Specifically, our method reduces the peak calling
problem to the study of tree-based statistics derived from the data. We
demonstrate the accuracy of our method on existing datasets, and we show that
it can discover previously missed regions and can more clearly discriminate
between multiple binding events. The software T-PIC (Tree shape Peak
Identification for ChIP-Seq) is available at
http://math.berkeley.edu/~vhower/tpic.html
Comment: 12 pages, 6 figures
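The persistence idea behind this approach can be sketched as a toy 0-dimensional persistence computation on a 1-D coverage signal: sweep the level from high to low, merge components with union-find, and keep peaks whose persistence clears a threshold. This is an illustrative reimplementation of the general technique, not the T-PIC code, and all names are hypothetical:

```python
def persistent_peaks(signal, min_persistence):
    """Peaks of a 1-D signal ranked by topological persistence.

    Sweeping the level from high to low, each local maximum births a
    component; when two components meet, the one with the lower peak
    dies, and its persistence is its peak height minus the merge level.
    """
    order = sorted(range(len(signal)), key=lambda i: signal[i], reverse=True)
    parent = {}

    def find(i):
        root = i
        while parent[root] != root:
            root = parent[root]
        while parent[i] != root:          # path compression
            parent[i], i = root, parent[i]
        return root

    peaks = []
    for i in order:
        parent[i] = i
        for j in (i - 1, i + 1):
            if j not in parent:           # neighbour still below the level
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # The component whose peak is lower dies at level signal[i].
            low, high = (ri, rj) if signal[ri] < signal[rj] else (rj, ri)
            persistence = signal[low] - signal[i]
            if persistence >= min_persistence:
                peaks.append((low, persistence))
            parent[low] = high
    # The global maximum never merges; report its height as persistence.
    top = max(range(len(signal)), key=lambda i: signal[i])
    peaks.append((top, signal[top] - min(signal)))
    return sorted(peaks)

# Two peaks (indices 1 and 3); the dip at index 2 separates them.
found = persistent_peaks([0, 2, 1, 3, 0], min_persistence=1)
```

Raising `min_persistence` merges shallow local maxima into their taller neighbours, which is what makes persistence-based calling robust to noise in the read coverage.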
An Empirical Comparison of Multiple Imputation Methods for Categorical Data
Multiple imputation is a common approach for dealing with missing values in
statistical databases. The imputer fills in missing values with draws from
predictive models estimated from the observed data, resulting in multiple,
completed versions of the database. Researchers have developed a variety of
default routines to implement multiple imputation; however, there has been
limited research comparing the performance of these methods, particularly for
categorical data. We use simulation studies to compare repeated sampling
properties of three default multiple imputation methods for categorical data,
including chained equations using generalized linear models, chained equations
using classification and regression trees, and a fully Bayesian joint
distribution based on Dirichlet Process mixture models. We base the simulations
on categorical data from the American Community Survey. In the circumstances of
this study, the results suggest that default chained equations approaches based
on generalized linear models are dominated by the default regression tree and
Bayesian mixture model approaches. They also suggest competing advantages for
the regression tree and Bayesian mixture model approaches, making both
reasonable default engines for multiple imputation of categorical data.
Supplementary material for this article is available online.
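The "multiple completed versions of the database" idea can be sketched in a few lines. This is a deliberately crude stand-in: real chained-equations software fits a predictive model per column, whereas this toy fills each missing cell with a draw from that column's observed values; all names are hypothetical:

```python
import random

MISSING = None

def impute_once(rows, rng):
    """One completed copy of the data: each missing categorical cell is
    filled with a random draw from the observed values of its column."""
    ncols = len(rows[0])
    observed = [[row[c] for row in rows if row[c] is not MISSING]
                for c in range(ncols)]
    return [[v if v is not MISSING else rng.choice(observed[c])
             for c, v in enumerate(row)] for row in rows]

def multiple_impute(rows, m=3, seed=0):
    """m independently completed datasets, as in multiple imputation."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(m)]

data = [["a", "x"], ["b", MISSING], [MISSING, "y"], ["a", "y"]]
completed = multiple_impute(data, m=3)
```

In practice the analysis of interest is run on each completed dataset and the results are pooled, which is the setting in which the abstract's repeated-sampling comparisons are made.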