Determining the Number of Samples Required to Estimate Entropy in Natural Sequences
Calculating the Shannon entropy for symbolic sequences has been widely
considered in many fields. For descriptive statistical problems such as
estimating the N-gram entropy of English language text, a common approach is to
use as much data as possible to obtain progressively more accurate estimates.
However, in some instances only short sequences may be available. This gives
rise to the question of how many samples are needed to compute entropy. In this
paper, we examine this problem and propose a method for estimating the number
of samples required to compute Shannon entropy for a set of ranked symbolic
natural events. The result is developed using a modified Zipf-Mandelbrot law
and the Dvoretzky-Kiefer-Wolfowitz inequality, and we propose an algorithm
which yields an estimate for the minimum number of samples required to obtain
an estimate of entropy with a given confidence level and degree of accuracy.
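To make the sample-size logic concrete: the Dvoretzky-Kiefer-Wolfowitz inequality bounds the deviation of the empirical CDF from the true CDF, P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2), so n >= ln(2/alpha) / (2 eps^2) samples suffice for accuracy eps at confidence 1 - alpha. The sketch below illustrates this standard bound together with the entropy of an (unmodified) Zipf-Mandelbrot rank distribution; the paper's modified law and full algorithm are not reproduced here, and the parameter values are hypothetical.

    import math

    def dkw_min_samples(epsilon: float, alpha: float) -> int:
        """Smallest n such that, by the DKW inequality
        P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2),
        the empirical CDF is within epsilon of the truth with
        probability at least 1 - alpha."""
        return math.ceil(math.log(2.0 / alpha) / (2.0 * epsilon ** 2))

    def zipf_mandelbrot_entropy(n_ranks: int, s: float, q: float) -> float:
        """Shannon entropy (bits) of p(k) proportional to 1/(k+q)^s,
        k = 1..n_ranks (the unmodified Zipf-Mandelbrot law)."""
        w = [(k + q) ** -s for k in range(1, n_ranks + 1)]
        z = sum(w)
        return -sum(x / z * math.log2(x / z) for x in w)

    # Hypothetical tolerances: 1% CDF accuracy at 95% confidence.
    print(dkw_min_samples(epsilon=0.01, alpha=0.05))         # -> 18445
    print(zipf_mandelbrot_entropy(n_ranks=10000, s=1.2, q=2.7))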
Selection of sequence motifs and generative Hopfield-Potts models for protein families
Statistical models for families of evolutionarily related proteins have
recently gained interest: in particular, pairwise Potts models, such as those
inferred by Direct-Coupling Analysis, have been able to extract information
about the three-dimensional structure of folded proteins, and about the effect
of amino-acid substitutions in proteins. These models are typically required
to reproduce the one- and two-point statistics of the amino-acid usage in a
protein family, i.e., to capture the so-called residue conservation and
covariation statistics of proteins of common evolutionary origin. Pairwise
Potts models are the maximum-entropy models achieving this. While being
successful, these models depend on a huge number of parameters introduced ad
hoc, which have to be estimated from a finite amount of data and whose
biophysical interpretation remains unclear. Here we propose an approach to
parameter reduction, which is based on selecting collective sequence motifs. It
naturally leads to the formulation of statistical sequence models in terms of
Hopfield-Potts models. These models can be accurately inferred using a mapping
to restricted Boltzmann machines and persistent contrastive divergence. We show
that, when applied to protein data, as few as 20-40 patterns are sufficient to
obtain statistically close-to-generative models. The Hopfield patterns form
interpretable sequence motifs and may be used to cluster amino-acid
sequences into functional sub-families. However, the distributed collective
nature of these motifs intrinsically limits the ability of Hopfield-Potts
models in predicting contact maps, showing the necessity of developing models
going beyond the Hopfield-Potts models discussed here.
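The inference route named in the abstract, a mapping to restricted Boltzmann machines trained by persistent contrastive divergence, can be sketched generically. Below is a minimal binary-binary RBM with PCD in numpy; the paper's Hopfield-Potts models have 21-state Potts visible units (20 amino acids plus gap), so this only illustrates the training scheme, not the paper's model.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class BinaryRBM:
        """Binary-binary RBM trained by persistent contrastive
        divergence (PCD); a simplification of the Potts-visible
        machines used for protein sequences."""

        def __init__(self, n_visible, n_hidden):
            self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
            self.b = np.zeros(n_visible)   # visible biases
            self.c = np.zeros(n_hidden)    # hidden biases

        def sample_h(self, v):
            p = sigmoid(v @ self.W + self.c)
            return p, (rng.random(p.shape) < p).astype(float)

        def sample_v(self, h):
            p = sigmoid(h @ self.W.T + self.b)
            return p, (rng.random(p.shape) < p).astype(float)

        def fit(self, data, n_epochs=100, lr=0.05, n_chains=64):
            # Persistent fantasy particles survive across updates,
            # which is what distinguishes PCD from plain CD-1.
            v_neg = (rng.random((n_chains, self.b.size)) < 0.5).astype(float)
            for _ in range(n_epochs):
                ph_pos, _ = self.sample_h(data)        # positive phase
                _, h_neg = self.sample_h(v_neg)        # negative phase:
                _, v_neg = self.sample_v(h_neg)        # one Gibbs sweep
                ph_neg, _ = self.sample_h(v_neg)
                self.W += lr * (data.T @ ph_pos / len(data)
                                - v_neg.T @ ph_neg / n_chains)
                self.b += lr * (data.mean(0) - v_neg.mean(0))
                self.c += lr * (ph_pos.mean(0) - ph_neg.mean(0))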
Maximum entropy models for antibody diversity
Recognition of pathogens relies on families of proteins showing great
diversity. Here we construct maximum entropy models of the sequence repertoire,
building on recent experiments that provide a nearly exhaustive sampling of the
IgM sequences in zebrafish. These models are based solely on pairwise
correlations between residue positions, but correctly capture the higher-order
statistical properties of the repertoire. Exploiting the interpretation of
these models as statistical physics problems, we make several predictions for
the collective properties of the sequence ensemble: the distribution of
sequences obeys Zipf's law, the repertoire decomposes into several clusters,
and there is a massive restriction of diversity due to the correlations. These
predictions are completely inconsistent with models in which amino acid
substitutions are made independently at each site, and are in good agreement
with the data. Our results suggest that antibody diversity is not limited by
the sequences encoded in the genome, and may reflect rapid adaptation to
antigenic challenges. This approach should be applicable to the study of the
global properties of other protein families.
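One of the predictions, that the distribution of sequences obeys Zipf's law, can be checked directly on repertoire counts: on a log-log rank-frequency plot, Zipf's law appears as a straight line of slope near -1. A minimal sketch, with hypothetical toy counts:

    import numpy as np

    def zipf_exponent(counts):
        """Slope of log(frequency) vs log(rank); Zipf's law
        corresponds to an exponent close to 1."""
        freqs = np.sort(np.asarray(counts, float))[::-1]
        freqs = freqs / freqs.sum()
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
        return -slope

    # Toy counts constructed to be exactly Zipfian (hypothetical data):
    counts = 1e6 / np.arange(1, 1001)
    print(zipf_exponent(counts))   # ~1.0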
Entropy-based parametric estimation of spike train statistics
We consider the evolution of a network of neurons, focusing on the asymptotic
behavior of spike dynamics instead of membrane potential dynamics. The spike
response is not sought as a deterministic response in this context, but as a
conditional probability: "Reading out the code" consists of inferring such a
probability. This probability is computed from empirical raster plots, by using
the framework of thermodynamic formalism in ergodic theory. This gives us a
parametric statistical model where the probability has the form of a Gibbs
distribution. In this respect, this approach generalizes the seminal and
profound work of Schneidman and collaborators. A minimal presentation of the
formalism is reviewed here, while a general algorithmic estimation method is
proposed, yielding fast convergent implementations. It is also made explicit how
several spike observables (entropy, rate, synchronizations, correlations) are
given in closed form from the parametric estimation. This paradigm not only
allows us to estimate the spike statistics, given a design choice, but also
to compare different models, thus answering comparative questions about the
neural code such as: "are correlations (or time synchrony, or a given set of
spike patterns, ...) significant with respect to rate coding only?" A numerical
validation of the method is proposed and the perspectives regarding spike-train
code analysis are also discussed.
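The starting point of the method, estimating pattern probabilities from empirical raster plots, can be illustrated with a simple empirical estimator of spike-"word" frequencies and their entropy; the thermodynamic-formalism machinery and Gibbs-distribution fitting of the paper go well beyond this sketch.

    import numpy as np
    from collections import Counter

    def spike_word_stats(raster, R):
        """Empirical distribution and entropy (bits) of spike 'words':
        blocks of R consecutive time bins over all neurons, counted
        from a binary raster of shape (n_neurons, n_bins)."""
        n_bins = raster.shape[1]
        words = Counter(raster[:, t:t + R].tobytes()
                        for t in range(n_bins - R + 1))
        total = sum(words.values())
        probs = np.array([c / total for c in words.values()])
        return words, -np.sum(probs * np.log2(probs))

    # Toy raster (hypothetical data): 5 neurons, 1000 bins, ~10% rate.
    rng = np.random.default_rng(1)
    raster = (rng.random((5, 1000)) < 0.1).astype(np.uint8)
    _, H = spike_word_stats(raster, R=2)
    print(H)   # empirical block entropy in bits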
Nonlinear time-series analysis revisited
In 1980 and 1981, two pioneering papers laid the foundation for what became
known as nonlinear time-series analysis: the analysis of observed
data---typically univariate---via dynamical systems theory. Based on the
concept of state-space reconstruction, this set of methods allows us to compute
characteristic quantities such as Lyapunov exponents and fractal dimensions, to
predict the future course of the time series, and even to reconstruct the
equations of motion in some cases. In practice, however, there are a number of
issues that restrict the power of this approach: whether the signal accurately
and thoroughly samples the dynamics, for instance, and whether it contains
noise. Moreover, the numerical algorithms that we use to instantiate these
ideas are not perfect; they involve approximations, scale parameters, and
finite-precision arithmetic, among other things. Even so, nonlinear time-series
analysis has been used to great advantage on thousands of real and synthetic
data sets from a wide variety of systems ranging from roulette wheels to lasers
to the human heart. Even in cases where the data do not meet the mathematical
or algorithmic requirements to assure full topological conjugacy, the results
of nonlinear time-series analysis can be helpful in understanding,
characterizing, and predicting dynamical systems.
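The core construction behind state-space reconstruction is the time-delay embedding: a scalar series x(t) is lifted to vectors (x(t), x(t+tau), ..., x(t+(m-1)tau)), which, under Takens' theorem and for suitable m and tau, reproduce the topology of the original attractor. A minimal sketch:

    import numpy as np

    def delay_embed(x, dim, tau):
        """Time-delay embedding of a univariate series x: row i is
        (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
        n = len(x) - (dim - 1) * tau
        if n <= 0:
            raise ValueError("series too short for this (dim, tau)")
        return np.column_stack([x[i * tau: i * tau + n]
                                for i in range(dim)])

    # e.g. reconstruct from a toy sine wave:
    t = np.linspace(0, 20 * np.pi, 2000)
    points = delay_embed(np.sin(t), dim=3, tau=25)
    print(points.shape)   # (1950, 3)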
Approximations of Algorithmic and Structural Complexity Validate Cognitive-behavioural Experimental Results
We apply methods for estimating the algorithmic complexity of sequences to
behavioural sequences from three landmark studies of animal behaviour of
increasing sophistication: foraging communication by ants, flight
patterns of fruit flies, and tactical deception and competition strategies in
rodents. In each case, we demonstrate that approximations of Logical Depth and
Kolmogorov-Chaitin complexity capture and validate previously reported results,
in contrast to other measures such as Shannon entropy, compression, or ad hoc scores.
Our method is practically useful when dealing with short sequences, such as
those often encountered in cognitive-behavioural research. Our analysis
supports and reveals non-random behaviour (by both the Logical Depth and
K-complexity measures) in flies even in the absence of external stimuli, and
confirms the "stochastic" behaviour of transgenic rats when faced with an
opponent they cannot defeat by counter-prediction. The
method constitutes a formal approach for testing hypotheses about the
mechanisms underlying animal behaviour.
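For contrast with the paper's algorithmic-complexity estimates, the compression baseline mentioned in the abstract is easy to state: the length of a losslessly compressed string upper-bounds its Kolmogorov complexity, but the bound is loose for the short sequences typical of behavioural data. A sketch of that baseline (not the paper's method):

    import random
    import zlib

    def compression_ratio(seq: str) -> float:
        """Compressed length / raw length: a crude upper bound on
        per-symbol Kolmogorov complexity, unreliable on short inputs."""
        raw = seq.encode()
        return len(zlib.compress(raw, 9)) / len(raw)

    random.seed(0)
    periodic = "ab" * 100
    scrambled = "".join(random.choice("ab") for _ in range(200))
    print(compression_ratio(periodic))    # small: highly regular
    print(compression_ratio(scrambled))   # larger: near-incompressible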
Microbiome profiling by Illumina sequencing of combinatorial sequence-tagged PCR products
We developed a low-cost, high-throughput microbiome profiling method that
uses combinatorial sequence tags attached to PCR primers that amplify the rRNA
V6 region. Amplified PCR products are sequenced using an Illumina paired-end
protocol to generate millions of overlapping reads. Combinatorial sequence
tagging can be used to examine hundreds of samples with far fewer primers than
is required when sequence tags are incorporated at only a single end. The
number of reads generated permitted saturating or near-saturating analysis of
samples of the vaginal microbiome. The large number of reads allowed an
in-depth analysis of errors, and we found that PCR-induced errors composed the
vast majority of non-organism-derived species variants, an observation that
has significant implications for sequence clustering of similar high-throughput
data. We show that the short reads are sufficient to assign organisms to the
genus or species level in most cases. We suggest that this method will be
useful for the deep sequencing of any short nucleotide region that is
taxonomically informative; these include the V3 and V5 regions of the bacterial
16S rRNA genes and the eukaryotic V9 region that is gaining popularity for
sampling protist diversity.
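The combinatorial-tagging arithmetic is what drives the cost saving: with m forward-tagged and n reverse-tagged primers, the (forward, reverse) tag pair identifies one of m x n samples using only m + n primers. A minimal demultiplexing sketch with hypothetical 4-nt tags:

    # Tag sequences below are hypothetical; real tags are chosen for
    # balanced base composition and mutual edit distance.
    FWD_TAGS = {"ACGT": "F1", "TGCA": "F2"}   # forward tag -> index
    REV_TAGS = {"GATC": "R1", "CTAG": "R2"}   # reverse tag -> index
    TAG_LEN = 4

    def assign_sample(read1: str, read2: str):
        """Map a read pair to its sample via the (forward, reverse)
        tag combination; None if either tag is unrecognised."""
        f = FWD_TAGS.get(read1[:TAG_LEN])
        r = REV_TAGS.get(read2[:TAG_LEN])
        return (f, r) if f and r else None

    print(assign_sample("ACGTGGAATTCC", "CTAGCCTTAAGG"))  # ('F1', 'R2')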