Simplifying the mosaic description of DNA sequences
By using the Jensen-Shannon divergence, genomic DNA can be divided into
compositionally distinct domains through a standard recursive segmentation
procedure. Each domain, while significantly different from its neighbours, may
however share compositional similarity with one or more distant
(non-neighbouring) domains. We thus obtain a coarse-grained description of
the given DNA string in terms of a smaller set of distinct domain labels. This
yields a minimal domain description of a given DNA sequence, significantly
reducing its organizational complexity. This procedure gives a new means of
evaluating genomic complexity as one examines organisms ranging from bacteria
to human. The mosaic organization of DNA sequences could have originated from
the insertion of fragments of one genome (the parasite) inside another (the
host), and we present numerical experiments that are suggestive of this
scenario.
Comment: 16 pages, 1 figure. Accepted for publication in Phys. Rev.
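The recursive procedure can be sketched in a few lines, using a plain entropy-based Jensen-Shannon divergence over base composition. The divergence threshold and minimum segment length below are illustrative placeholders, not the published significance criterion:

```python
# Sketch of recursive Jensen-Shannon segmentation of a DNA string.
# Threshold and min_len are illustrative, not the published criterion.
from collections import Counter
from math import log2

def entropy(counts, total):
    """Shannon entropy (bits) of a base-composition Counter."""
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def js_divergence(seq, i):
    """JS divergence between the compositions of seq[:i] and seq[i:]."""
    left, right = Counter(seq[:i]), Counter(seq[i:])
    whole = left + right
    n, nl, nr = len(seq), i, len(seq) - i
    return (entropy(whole, n)
            - (nl / n) * entropy(left, nl)
            - (nr / n) * entropy(right, nr))

def segment(seq, threshold=0.05, min_len=10):
    """Recursively split at the point of maximal divergence."""
    if len(seq) <= 2 * min_len:
        return [seq]
    cut, best = max(((i, js_divergence(seq, i))
                     for i in range(min_len, len(seq) - min_len)),
                    key=lambda t: t[1])
    if best < threshold:
        return [seq]
    return segment(seq[:cut], threshold, min_len) + segment(seq[cut:], threshold, min_len)

domains = segment("A" * 20 + "CG" * 10)   # splits into two homogeneous domains
```

Grouping the resulting segments into a smaller set of compositionally similar labels (the coarse-graining step described above) would then operate on the output of `segment`.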
Finite-sample frequency distributions originating from an equiprobability distribution
Given an equidistribution of probabilities p(i) = 1/N, i = 1..N, what is the
expected rank-ordered frequency distribution f(i), i = 1..N, when an ensemble
of M events is drawn?
Comment: 4 pages, 4 figures
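The question can be explored numerically; a minimal Monte Carlo sketch (the trial count and seed are arbitrary choices, not from the paper):

```python
# Monte Carlo estimate of the expected rank-ordered frequencies f(i)
# from M draws over N equiprobable outcomes.
import random
from collections import Counter

def rank_ordered_frequencies(N, M, trials=2000, seed=0):
    rng = random.Random(seed)
    mean_f = [0.0] * N
    for _ in range(trials):
        counts = Counter(rng.randrange(N) for _ in range(M))
        # sort observed counts by rank, pad unseen outcomes with zeros
        freqs = sorted(counts.values(), reverse=True) + [0] * (N - len(counts))
        for i, f in enumerate(freqs):
            mean_f[i] += f / trials
    return mean_f

f = rank_ordered_frequencies(N=10, M=50)
# Even for p(i) = 1/N the sampled ranks are not flat: f(1) > M/N > f(N).
```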
New stopping criteria for segmenting DNA sequences
We propose a solution to the stopping-criterion problem in segmenting inhomogeneous
DNA sequences with complex statistical patterns. This new stopping criterion is
based on Bayesian Information Criterion (BIC) in the model selection framework.
When this stopping criterion is applied to a left telomere sequence of yeast
Saccharomyces cerevisiae and the complete genome sequence of bacterium
Escherichia coli, borders of biologically meaningful units were identified
(e.g. subtelomeric units, replication origin, and replication terminus), and a
more reasonable number of domains was obtained. We also introduce a measure
called segmentation strength which can be used to control the delineation of
large domains. The relationship between the average domain size and the
threshold of segmentation strength is determined for several genome sequences.
Comment: 4 pages, 4 figures. Physical Review Letters, to appear
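The model-selection idea behind such a stopping rule can be sketched as a BIC comparison, assuming independent multinomial base models for one segment versus two; this is an illustration of the principle, not the authors' exact implementation:

```python
# BIC-based stopping rule for one candidate split: accept the split only
# if it lowers BIC = -2 log L + k log n. Illustrative sketch only.
from collections import Counter
from math import log

def log_likelihood(seq):
    """Multinomial log-likelihood of a sequence under its own composition."""
    n = len(seq)
    return sum(c * log(c / n) for c in Counter(seq).values())

def bic(loglik, n_params, n):
    return -2.0 * loglik + n_params * log(n)

def split_improves_bic(seq, cut, alphabet_size=4):
    n = len(seq)
    k = alphabet_size - 1                      # free parameters per segment
    one = bic(log_likelihood(seq), k, n)
    two = bic(log_likelihood(seq[:cut]) + log_likelihood(seq[cut:]),
              2 * k + 1, n)                    # +1 for the cut position
    return two < one

print(split_improves_bic("A" * 200 + "CG" * 100, 200))  # True
print(split_improves_bic("ACGT" * 100, 200))            # False
```

A segmentation-strength measure, as introduced above, would then threshold how much the split lowers the criterion rather than merely whether it does.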
Phase Transition in a Random Fragmentation Problem with Applications to Computer Science
We study a fragmentation problem where an initial object of size x is broken
into m random pieces provided x>x_0 where x_0 is an atomic cut-off.
Subsequently the fragmentation process continues for each of those daughter
pieces whose sizes are bigger than x_0. The process stops when all the
fragments have sizes smaller than x_0. We show that the fluctuation of the
total number of splitting events, characterized by the variance, generically
undergoes a nontrivial phase transition as one tunes the branching number m
through a critical value m=m_c. For m<m_c, the fluctuations are Gaussian,
whereas for m>m_c they are anomalously large and non-Gaussian. We apply this
general result to analyze two different search algorithms in computer science.
Comment: 5 pages RevTeX, 3 figures (.eps)
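The process is straightforward to simulate; a minimal sketch for branching number m = 2 (the cut-off and sample count are arbitrary):

```python
# Recursive random fragmentation: an object of size x splits into m
# uniform random pieces while x > x0; we count splitting events.
import random

def count_splits(x, m, x0, rng):
    if x <= x0:
        return 0
    # break [0, x] at m-1 uniform random points into m pieces
    cuts = sorted(rng.uniform(0, x) for _ in range(m - 1))
    pieces = [b - a for a, b in zip([0.0] + cuts, cuts + [x])]
    return 1 + sum(count_splits(p, m, x0, rng) for p in pieces)

rng = random.Random(1)
events = [count_splits(1.0, m=2, x0=0.01, rng=rng) for _ in range(200)]
mean_events = sum(events) / len(events)
```

Since every final fragment has size at most x0 and the sizes sum to x, a run from x = 1 with x0 = 0.01 always produces at least 100 fragments, i.e. at least 99 splits; the quantity of interest in the paper is the variance of this count as m is tuned.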
Effect of extreme data loss on long-range correlated and anti-correlated signals quantified by detrended fluctuation analysis
We investigate how extreme loss of data affects the scaling behavior of
long-range power-law correlated and anti-correlated signals applying the DFA
method. We introduce a segmentation approach to generate surrogate signals by
randomly removing data segments from stationary signals with different types of
correlations. These surrogate signals are characterized by: (i) the DFA scaling
exponent of the original correlated signal, (ii) the percentage of
the data removed, (iii) the average length of the removed (or remaining)
data segments, and (iv) the functional form of the distribution of the length
of the removed (or remaining) data segments. We find that the {\it global}
scaling exponent of positively correlated signals remains practically unchanged
even for extreme data loss of up to 90%. In contrast, the global scaling of
anti-correlated signals changes to uncorrelated behavior even when a very small
fraction of the data is lost. These observations are confirmed on the examples
of human gait and commodity price fluctuations. We systematically study the
{\it local} scaling behavior of signals with missing data to reveal deviations
across scales. We find that for anti-correlated signals even 10% of data loss
leads to deviations in the local scaling at large scales from the original
anti-correlated towards uncorrelated behavior. In contrast, positively
correlated signals show no observable changes in the local scaling for up to
65% of data loss, while for larger percentage, the local scaling shows
overestimated regions (with higher local exponent) at small scales, followed by
underestimated regions (with lower local exponent) at large scales. Finally, we
investigate how the scaling is affected by the statistics of the remaining data
segments in comparison to those of the removed segments.
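The setup can be sketched with a simple DFA-1 estimator plus random segment removal. Window sizes, removal fraction, and segment length below are illustrative; white noise is used here only to keep the example self-checking (generating long-range correlated surrogates, e.g. by Fourier filtering, is omitted):

```python
# Minimal DFA-1 sketch plus random segment removal (surrogate construction).
import numpy as np

def dfa_exponent(x, scales):
    y = np.cumsum(x - np.mean(x))            # integrated profile
    F = []
    for s in scales:
        n = len(y) // s
        windows = y[:n * s].reshape(n, s)
        t = np.arange(s)
        resid = []
        for w in windows:
            coef = np.polyfit(t, w, 1)       # linear detrend per window
            resid.append(np.mean((w - np.polyval(coef, t)) ** 2))
        F.append(np.sqrt(np.mean(resid)))
    slope, _ = np.polyfit(np.log(scales), np.log(F), 1)
    return slope

def remove_segments(x, fraction, seg_len, rng):
    """Randomly cut out segments of length seg_len and stitch the rest."""
    keep = np.ones(len(x), bool)
    target, removed = int(fraction * len(x)), 0
    while removed < target:
        i = rng.integers(0, len(x) - seg_len)
        removed += np.count_nonzero(keep[i:i + seg_len])
        keep[i:i + seg_len] = False
    return x[keep]

rng = np.random.default_rng(0)
white = rng.standard_normal(2 ** 14)
scales = [16, 32, 64, 128, 256]
a_full = dfa_exponent(white, scales)                          # ~0.5
a_cut = dfa_exponent(remove_segments(white, 0.5, 50, rng), scales)
```

For white (uncorrelated) noise both exponents stay near 0.5; the asymmetry reported above appears when the same surrogate construction is applied to correlated versus anti-correlated inputs.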
Heuristic Segmentation of a Nonstationary Time Series
Many phenomena, both natural and human-influenced, give rise to signals whose
statistical properties change under time translation, i.e., are nonstationary.
For some practical purposes, a nonstationary time series can be seen as a
concatenation of stationary segments. Using a segmentation algorithm, it has
been reported that for heartbeat data and Internet traffic fluctuations the
distribution of durations of these stationary segments decays with a power-law
tail. A potential technical difficulty that has not been thoroughly
investigated is that a nonstationary time series with a (scale-free) power law
distribution of stationary segments is harder to segment than other
nonstationary time series because of the wider range of possible segment sizes.
Here, we investigate the validity of a heuristic segmentation algorithm
recently proposed by Bernaola-Galvan et al. by systematically analyzing
surrogate time series with different statistical properties. We find that if a
given nonstationary time series has stationary periods whose size is
distributed as a power law, the algorithm can split the time series into a set
of stationary segments with the correct statistical properties. We also find
that the estimated power law exponent of the distribution of stationary-segment
sizes is affected by (i) the minimum segment size, and (ii) the ratio of the
standard deviation of the segment mean values to the standard deviation of the
fluctuations within a segment. Furthermore, we determine that
the performance of the algorithm is generally not affected by uncorrelated
noise spikes or by weak long-range temporal correlations of the fluctuations
within segments.
Comment: 23 pages, 14 figures
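The flavour of such a heuristic can be sketched with a mean-based split criterion in the spirit of Bernaola-Galvan et al.: split where Student's t between the left and right means is maximal, and recurse while that maximum exceeds a threshold. The threshold and minimum segment size below are illustrative, not the significance levels studied in the paper:

```python
# Mean-based heuristic segmentation sketch (Bernaola-Galvan style).
import numpy as np

def max_t_split(x, min_len=20):
    """Position and value of the maximal t-statistic between the two sides."""
    best_i, best_t = None, 0.0
    for i in range(min_len, len(x) - min_len):
        l, r = x[:i], x[i:]
        sp = np.sqrt(((len(l) - 1) * l.var(ddof=1)
                      + (len(r) - 1) * r.var(ddof=1)) / (len(x) - 2))
        t = abs(l.mean() - r.mean()) / (sp * np.sqrt(1 / len(l) + 1 / len(r)))
        if t > best_t:
            best_i, best_t = i, t
    return best_i, best_t

def segment(x, t_threshold=5.0, min_len=20):
    if len(x) < 2 * min_len + 1:
        return [x]
    i, t = max_t_split(x, min_len)
    if t < t_threshold:
        return [x]
    return segment(x[:i], t_threshold, min_len) + segment(x[i:], t_threshold, min_len)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])
parts = segment(x)   # recovers a boundary near sample 300
```

The two effects quantified above map directly onto this sketch: `min_len` is the minimum segment size, and the mean shift (here 3) relative to the within-segment noise (here 1) is the ratio that controls detectability.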
OcculterCut: A comprehensive survey of AT-rich regions in fungal genomes.
We present a novel method to measure the local GC-content bias in genomes and a survey of published fungal species. The method, implemented as "OcculterCut" (https://sourceforge.net/projects/occultercut), identified species containing distinct AT-rich regions. In most fungal taxa, AT-rich regions are a signature of repeat-induced point mutation (RIP), which targets repetitive DNA and decreases GC-content through the conversion of cytosine to thymine bases. RIP has in turn been identified as a driver of fungal genome evolution, as RIP mutations can also occur in single-copy genes neighbouring repeat-rich regions. Over time RIP perpetuates 'two speeds' of gene evolution in the GC-equilibrated and AT-rich regions of fungal genomes. In this study, genomes showing evidence of this process are found to be common, particularly among the Pezizomycotina. Further analysis highlighted differences in amino acid composition and putative functions of genes from these regions, supporting the hypothesis that these regions play an important role in fungal evolution. OcculterCut can also be used to identify genes undergoing RIP-assisted diversifying selection, such as small, secreted effector proteins that mediate host-microbe disease interactions.
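The quantity underlying such a survey is simply local GC content; a toy sketch of measuring it over sliding windows (window and step sizes are arbitrary, and this is not the OcculterCut algorithm itself):

```python
# Toy local GC-content profile: AT-rich regions show up as a low-GC mode.
def gc_content(seq, window=1000, step=500):
    out = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        out.append((w.count("G") + w.count("C")) / window)
    return out

genome = "AT" * 2000 + "GC" * 2000   # toy sequence with two GC regimes
profile = gc_content(genome)          # bimodal: values near 0.0 and 1.0
```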
Scale Invariance in the Nonstationarity of Physiological Signals
We introduce a segmentation algorithm to probe temporal organization of
heterogeneities in human heartbeat interval time series. We find that the
lengths of segments with different local values of heart rates follow a
power-law distribution. This scale-invariant structure is not a simple
consequence of the long-range correlations present in the data. We also find
that the differences in mean heart rates between consecutive segments display a
common functional form, but with different parameters for healthy individuals
and for patients with heart failure. This finding may provide insight into
the way heart rate variability is reduced in cardiac disease.
Comment: 13 pages, 5 figures, corrected typos
Stable Distributions in Stochastic Fragmentation
We investigate a class of stochastic fragmentation processes involving stable
and unstable fragments. We solve analytically for the fragment length density
and find that a generic algebraic divergence characterizes its small-size tail.
Furthermore, the entire range of acceptable values of decay exponent consistent
with the length conservation can be realized. We show that the stochastic
fragmentation process is non-self-averaging as moments exhibit significant
sample-to-sample fluctuations. Additionally, we find that the distributions of
the moments and of extremal characteristics possess an infinite set of
progressively weaker singularities.
Comment: 11 pages, 5 figures
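One toy realization of a stochastic fragmentation process with stable and unstable fragments: each piece freezes with probability p, otherwise it splits again. The stabilization rule and all parameters here are assumptions for illustration, not the model solved in the paper; the sample-to-sample spread of the second moment hints at the non-self-averaging behaviour described above:

```python
# Toy stochastic fragmentation with stable (frozen) and unstable fragments.
# The freeze-with-probability-p rule is an illustrative assumption.
import random

def fragment(x, p, x0, rng, out):
    if x <= x0 or rng.random() < p:
        out.append(x)                 # fragment becomes stable
        return
    u = rng.uniform(0, 1)             # unstable: split into two pieces
    fragment(u * x, p, x0, rng, out)
    fragment((1 - u) * x, p, x0, rng, out)

rng = random.Random(3)
second_moments = []
for _ in range(500):
    frags = []
    fragment(1.0, p=0.3, x0=1e-4, rng=rng, out=frags)
    second_moments.append(sum(f * f for f in frags))
# second_moments varies strongly from sample to sample.
```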
WordCluster: detecting clusters of DNA words and genomic elements
Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds.
Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method as a web server connected to a MySQL backend, which also determines co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.
Conclusions: WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method as a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features such as the detection of co-localization with gene regions and an annotation enrichment tool for functional analysis of overlapping genes.
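The distance-based idea can be sketched as follows, assuming a simple Poisson null model for copy placement; the function and significance cutoff are illustrative, not the published calculation:

```python
# Sketch of distance-based cluster detection: under random (Poisson)
# placement, gaps between consecutive copies are roughly exponential,
# so improbably short gaps mark candidate clusters.
import math

def significant_gaps(positions, genome_len, alpha=0.01):
    """Return consecutive position pairs whose gap is improbably short."""
    rate = len(positions) / genome_len       # expected copies per base
    clusters = []
    for a, b in zip(positions, positions[1:]):
        gap = b - a
        p = 1.0 - math.exp(-rate * gap)      # P(gap <= observed) under the null
        if p < alpha:
            clusters.append((a, b))
    return clusters

positions = [100, 105, 110, 50_000, 120_000]
print(significant_gaps(positions, genome_len=200_000))
# -> [(100, 105), (105, 110)]
```

Merging overlapping significant pairs into maximal runs would then yield the reported clusters.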