45 research outputs found
Estimation of Distribution Overlap of Urn Models
A classical problem in statistics is estimating the expected coverage of a
sample, which has had applications in gene expression, microbial ecology,
optimization, and even numismatics. Here we consider a related extension of
this problem to random samples of two discrete distributions. Specifically, we
estimate what we call the dissimilarity probability of a sample, i.e., the
probability of a draw from one distribution not being observed in k draws from
another distribution. We show our estimator of dissimilarity to be a
U-statistic and a uniformly minimum variance unbiased estimator of
dissimilarity over the largest appropriate range of k. Furthermore, despite the
non-Markovian nature of our estimator when applied sequentially over k, we show
it converges uniformly in probability to the dissimilarity parameter, and we
present criteria when it is approximately normally distributed and admits a
consistent jackknife estimator of its variance. As proof of concept, we analyze
V35 16S rRNA data to discern between various microbial environments. Other
potential applications concern any situation where dissimilarity of two
discrete distributions may be of interest. For instance, in SELEX experiments,
each urn could represent a random RNA pool and each draw a possible solution to
a particular binding site problem over that pool. The dissimilarity of these
pools is then related to the probability of finding binding site solutions in
one pool that are absent in the other. Comment: 27 pages, 4 figures
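The dissimilarity probability described above can be illustrated with a minimal sketch. It assumes the kernel is the indicator that a draw from the first urn is missing from a size-k subset of the draws from the second urn, and averages that indicator over all draws and all k-subsets. This is one natural U-statistic for the parameter, not necessarily the authors' exact estimator; the function name `dissimilarity_estimate` and its arguments are hypothetical.

```python
from collections import Counter
from math import comb


def dissimilarity_estimate(x_sample, y_sample, k):
    """Estimate P(a draw from X is not observed in k draws from Y).

    Averages the indicator 1[x not in S] over every x in x_sample and
    every k-element subset S of y_sample (an assumed kernel, see lead-in).
    """
    m = len(y_sample)
    if not x_sample or not 1 <= k <= m:
        raise ValueError("need a non-empty x_sample and 1 <= k <= len(y_sample)")

    y_counts = Counter(y_sample)
    total_subsets = comb(m, k)

    avoided = 0
    for x in x_sample:
        c = y_counts[x]  # copies of x's value among the y draws (0 if absent)
        # k-subsets of y_sample that avoid all c copies; comb returns 0 if m - c < k
        avoided += comb(m - c, k)

    return avoided / (len(x_sample) * total_subsets)


if __name__ == "__main__":
    x = ["a", "b"]
    y = ["a", "c", "c"]
    for k in (1, 2, 3):
        print(k, dissimilarity_estimate(x, y, k))
```

Counting the subsets in closed form avoids enumerating all k-subsets, so the cost is linear in the two sample sizes for each value of k.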
Hidden Independence in Unstructured Probabilistic Models
We describe a novel way to represent the probability distribution of a random binary string as a mixture having a maximally weighted component associated with independent (though not necessarily identically distributed) Bernoulli characters. We refer to this as the latent independent weight of the probabilistic source producing the string, and derive a combinatorial algorithm to compute it. The decomposition we propose may serve as an alternative to the Boolean paradigm of hypothesis testing, or to assess the fraction of uncorrupted samples originating from a source with independent marginal distributions. In this sense, the latent independent weight quantifies the maximal amount of independence contained within a probabilistic source, which, properly speaking, may not have independent marginal distributions.
On Contamination of Symbolic Datasets
Data taking values on discrete sample spaces are the embodiment of modern
biological research. "Omics" experiments produce millions of symbolic outcomes
in the form of reads (i.e., DNA sequences of a few dozen to a few hundred
nucleotides). Unfortunately, these intrinsically non-numerical datasets are
often highly contaminated, and the possible sources of contamination are
usually poorly characterized. This contrasts with numerical datasets where
Gaussian-type noise is often well-justified. To overcome this hurdle, we
introduce the notion of latent weight, which measures the largest expected
fraction of samples from a contaminated probabilistic source that conform to a
model in a well-structured class of desired models. We examine various
properties of latent weights, which we specialize to the class of exchangeable
probability distributions. As proof of concept, we analyze DNA methylation data
from the 22 human autosome pairs. Contrary to what is usually assumed, we
provide strong evidence that highly specific methylation patterns are
overrepresented at some genomic locations when contamination is taken into
account. Comment: 18 pages, 4 figures, 1 table
The diameter of random Cayley digraphs of given degree
We consider random Cayley digraphs of order with uniformly distributed
generating set of size . Specifically, we are interested in the asymptotics
of the probability such a Cayley digraph has diameter two as and
. We find a sharp phase transition from 0 to 1 at around . In particular, if is asymptotically linear in , the
probability converges exponentially fast to 1. Comment: 11 pages
Quantum Computation
A quantum computer is a machine that exploits quantum phenomena to store information and perform computations.
The chief goal of this article is to provide a brief but comprehensive introduction to quantum computing. It overviews some mathematical underpinnings of quantum computation for readers with only a basic knowledge of linear algebra and probability. However, it does not attempt to be exhaustive nor make an exposition of quantum physics. It further does not attempt to stay current with physical implementations of quantum computers.
After providing a brief historical background, the article introduces the notion of a quantum bit (qubit) and the linear operators (gates) that act on these. It then addresses systems of multiple qubits and their corresponding gates. Along the way, the article covers the essential concepts of separable systems, as well as quantum interference and decoherence. It also describes how to represent gates as quantum circuits. To conclude, it explains the underpinnings of two simple but insightful quantum algorithms. A final section suggests further readings for those who wish to delve deeper into quantum computing.
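As a small illustration of a qubit, a gate, and interference, the sketch below represents a single-qubit state as a pair of amplitudes and applies the standard Hadamard gate twice. It is not taken from the article; the helper names `apply_gate` and `probabilities` are hypothetical, and amplitudes are kept real for simplicity.

```python
from math import sqrt

# Hadamard gate as a 2x2 matrix acting on the amplitudes of |0> and |1>.
H = [
    [1 / sqrt(2), 1 / sqrt(2)],
    [1 / sqrt(2), -1 / sqrt(2)],
]


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a 2-component state vector."""
    return [sum(gate[r][c] * state[c] for c in range(2)) for r in range(2)]


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return [abs(a) ** 2 for a in state]


zero = [1.0, 0.0]               # the basis state |0>
plus = apply_gate(H, zero)      # equal superposition of |0> and |1>
back = apply_gate(H, plus)      # amplitudes of |1> cancel, returning to |0>

if __name__ == "__main__":
    print(probabilities(plus))
    print(probabilities(back))
```

The second application of the gate shows interference: the two contributions to the |1> amplitude have opposite signs and cancel, while those to |0> add.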
A communal catalogue reveals Earth's multiscale microbial diversity
Our growing awareness of the microbial world's importance and diversity contrasts starkly with our limited understanding of its fundamental structure. Despite recent advances in DNA sequencing, a lack of standardized protocols and common analytical frameworks impedes comparisons among studies, hindering the development of global inferences about microbial life on Earth. Here we present a meta-analysis of microbial community samples collected by hundreds of researchers for the Earth Microbiome Project. Coordinated protocols and new analytical methods, particularly the use of exact sequences instead of clustered operational taxonomic units, enable bacterial and archaeal ribosomal RNA gene sequences to be followed across multiple studies and allow us to explore patterns of diversity at an unprecedented scale. The result is both a reference database giving global context to DNA sequence data and a framework for incorporating data from future studies, fostering increasingly complete characterization of Earth's microbial diversity. Peer reviewed