40 research outputs found

    Simplifying the mosaic description of DNA sequences

    By using the Jensen-Shannon divergence, genomic DNA can be divided into compositionally distinct domains through a standard recursive segmentation procedure. Each domain, while significantly different from its neighbours, may however share compositional similarity with one or more distant (non-neighbouring) domains. We thus obtain a coarse-grained description of the given DNA string in terms of a smaller set of distinct domain labels. This yields a minimal domain description of a given DNA sequence, significantly reducing its organizational complexity. This procedure gives a new means of evaluating genomic complexity as one examines organisms ranging from bacteria to human. The mosaic organization of DNA sequences could have originated from the insertion of fragments of one genome (the parasite) inside another (the host), and we present numerical experiments that are suggestive of this scenario. Comment: 16 pages, 1 figure, Accepted for publication in Phys. Rev.
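
    A minimal sketch of a single segmentation step, assuming the usual length-weighted Jensen-Shannon divergence between the two halves of a candidate cut. The function names are illustrative, and the significance test that decides whether a cut is accepted (and the recursion over sub-domains) is left out.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two subsequences."""
    n = len(left) + len(right)
    return (shannon_entropy(left + right)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))


def best_split(seq, min_len=1):
    """Return (cut position, divergence) for the cut that maximizes the divergence."""
    best_i, best_d = None, -1.0
    for i in range(min_len, len(seq) - min_len + 1):
        d = js_divergence(seq[:i], seq[i:])
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d
```

    A recursive procedure would apply `best_split` again to each half for as long as the best cut remains significant under whatever criterion is chosen.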

    Finite-sample frequency distributions originating from an equiprobability distribution

    Given an equidistribution of probabilities p(i)=1/N, i=1..N, what is the expected corresponding rank-ordered frequency distribution f(i), i=1..N, if an ensemble of M events is drawn? Comment: 4 pages, 4 figures
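
    The question can be explored numerically. The sketch below is a plain Monte Carlo estimate, not the paper's analytical result: it draws M events from N equiprobable outcomes, sorts the observed frequencies in decreasing order, and averages each rank over many trials. The function name, trial count, and seed are illustrative assumptions.

```python
import random


def mean_rank_ordered_frequencies(n_outcomes, n_events, n_trials=1000, seed=0):
    """Estimate the expected rank-ordered frequencies f(1) >= ... >= f(N)."""
    rng = random.Random(seed)
    totals = [0.0] * n_outcomes
    for _ in range(n_trials):
        counts = [0] * n_outcomes
        for _ in range(n_events):
            counts[rng.randrange(n_outcomes)] += 1  # each outcome has p = 1/N
        counts.sort(reverse=True)                   # rank 1 = most frequent
        for rank, c in enumerate(counts):
            totals[rank] += c / n_events
    return [t / n_trials for t in totals]
```

    For finite M the averaged frequencies are not flat even though the underlying probabilities are: the top rank sits above 1/N and the bottom rank below it.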

    New stopping criteria for segmenting DNA sequences

    We propose a solution to the stopping-criterion problem in segmenting inhomogeneous DNA sequences with complex statistical patterns. This new stopping criterion is based on the Bayesian Information Criterion (BIC) in the model selection framework. When this stopping criterion is applied to a left telomere sequence of the yeast Saccharomyces cerevisiae and the complete genome sequence of the bacterium Escherichia coli, borders of biologically meaningful units were identified (e.g. subtelomeric units, replication origin, and replication terminus), and a more reasonable number of domains was obtained. We also introduce a measure called segmentation strength, which can be used to control the delineation of large domains. The relationship between the average domain size and the threshold of segmentation strength is determined for several genome sequences. Comment: 4 pages, 4 figures, Physical Review Letters, to appear

    Phase Transition in a Random Fragmentation Problem with Applications to Computer Science

    We study a fragmentation problem where an initial object of size x is broken into m random pieces provided x>x_0, where x_0 is an atomic cut-off. Subsequently the fragmentation process continues for each of those daughter pieces whose sizes are bigger than x_0. The process stops when all the fragments have sizes smaller than x_0. We show that the fluctuation of the total number of splitting events, characterized by the variance, generically undergoes a nontrivial phase transition as one tunes the branching number m through a critical value m=m_c. For m<m_c, the fluctuations are Gaussian, whereas for m>m_c they are anomalously large and non-Gaussian. We apply this general result to analyze two different search algorithms in computer science. Comment: 5 pages RevTeX, 3 figures (.eps)
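
    The process can be simulated directly. The sketch below assumes the m pieces come from m-1 independent uniform cut points, which is one possible reading of "random pieces" and may not match the paper's distribution. It counts splitting events and reports their sample mean and variance. All names are illustrative.

```python
import random
import statistics


def count_splits(x, x0, m, rng):
    """Count splitting events until every fragment has size <= x0."""
    stack = [x]
    splits = 0
    while stack:
        size = stack.pop()
        if size <= x0:
            continue  # atomic fragment: no further splitting
        splits += 1
        # Assumption: m - 1 uniform cut points define the m daughter pieces.
        cuts = sorted(rng.random() for _ in range(m - 1))
        edges = [0.0] + cuts + [1.0]
        stack.extend(size * (b - a) for a, b in zip(edges, edges[1:]))
    return splits


def split_count_stats(x, x0, m, n_samples=500, seed=0):
    """Sample mean and variance of the number of splitting events."""
    rng = random.Random(seed)
    counts = [count_splits(x, x0, m, rng) for _ in range(n_samples)]
    return statistics.mean(counts), statistics.pvariance(counts)
```

    Comparing how the variance grows with x for different values of m is one way to probe the change in fluctuation behaviour numerically.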

    Effect of extreme data loss on long-range correlated and anti-correlated signals quantified by detrended fluctuation analysis

    We investigate how extreme loss of data affects the scaling behavior of long-range power-law correlated and anti-correlated signals applying the DFA method. We introduce a segmentation approach to generate surrogate signals by randomly removing data segments from stationary signals with different types of correlations. These surrogate signals are characterized by: (i) the DFA scaling exponent α of the original correlated signal, (ii) the percentage p of the data removed, (iii) the average length μ of the removed (or remaining) data segments, and (iv) the functional form of the distribution of the length of the removed (or remaining) data segments. We find that the global scaling exponent of positively correlated signals remains practically unchanged even for extreme data loss of up to 90%. In contrast, the global scaling of anti-correlated signals changes to uncorrelated behavior even when a very small fraction of the data is lost. These observations are confirmed on the examples of human gait and commodity price fluctuations. We systematically study the local scaling behavior of signals with missing data to reveal deviations across scales. We find that for anti-correlated signals even 10% of data loss leads to deviations in the local scaling at large scales from the original anti-correlated towards uncorrelated behavior. In contrast, positively correlated signals show no observable changes in the local scaling for up to 65% of data loss, while for larger percentages, the local scaling shows overestimated regions (with higher local exponent) at small scales, followed by underestimated regions (with lower local exponent) at large scales. Finally, we investigate how the scaling is affected by the statistics of the remaining data segments in comparison to the removed segments.

    Heuristic Segmentation of a Nonstationary Time Series

    Many phenomena, both natural and human-influenced, give rise to signals whose statistical properties change under time translation, i.e., are nonstationary. For some practical purposes, a nonstationary time series can be seen as a concatenation of stationary segments. Using a segmentation algorithm, it has been reported that for heart beat data and Internet traffic fluctuations the distribution of durations of these stationary segments decays with a power-law tail. A potential technical difficulty that has not been thoroughly investigated is that a nonstationary time series with a (scale-free) power-law distribution of stationary segments is harder to segment than other nonstationary time series because of the wider range of possible segment sizes. Here, we investigate the validity of a heuristic segmentation algorithm recently proposed by Bernaola-Galvan et al. by systematically analyzing surrogate time series with different statistical properties. We find that if a given nonstationary time series has stationary periods whose size is distributed as a power law, the algorithm can split the time series into a set of stationary segments with the correct statistical properties. We also find that the estimated power-law exponent of the distribution of stationary-segment sizes is affected by (i) the minimum segment size, and (ii) the ratio of the standard deviation of the mean values of the segments to the standard deviation of the fluctuations within a segment. Furthermore, we determine that the performance of the algorithm is generally not affected by uncorrelated noise spikes or by weak long-range temporal correlations of the fluctuations within segments. Comment: 23 pages, 14 figures
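
    The abstract does not spell out the algorithm, so the following is only a generic mean-shift scan and not necessarily the cited method. It assumes a pooled-variance Student's t statistic, computed at each candidate cut, measures the difference between the mean to the left and the mean to the right. The significance test that would accept or reject the best cut is omitted, and the names and minimum segment length are illustrative.

```python
import math
import statistics


def t_statistic(left, right):
    """Pooled-variance t statistic for the difference of the two segment means."""
    n1, n2 = len(left), len(right)
    pooled = ((n1 - 1) * statistics.variance(left)
              + (n2 - 1) * statistics.variance(right)) / (n1 + n2 - 2)
    s_d = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    diff = abs(statistics.mean(left) - statistics.mean(right))
    if s_d == 0.0:
        return math.inf if diff > 0 else 0.0
    return diff / s_d


def best_cut(series, min_len=2):
    """Return (cut index, t) for the cut with the largest t statistic."""
    best_i, best_t = None, 0.0
    for i in range(min_len, len(series) - min_len + 1):
        t = t_statistic(series[:i], series[i:])
        if t > best_t:
            best_i, best_t = i, t
    return best_i, best_t
```

    The minimum segment length `min_len` is the kind of parameter the abstract reports as influencing the estimated power-law exponent.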

    OcculterCut: A comprehensive survey of AT-rich regions in fungal genomes.

    We present a novel method to measure the local GC-content bias in genomes and a survey of published fungal species. The method, enacted as "OcculterCut" (https://sourceforge.net/projects/occultercut), identified species containing distinct AT-rich regions. In most fungal taxa, AT-rich regions are a signature of repeat-induced point mutation (RIP), which targets repetitive DNA and decreases GC-content through the conversion of cytosine to thymine bases. RIP has in turn been identified as a driver of fungal genome evolution, as RIP mutations can also occur in single-copy genes neighbouring repeat-rich regions. Over time RIP perpetuates 'two speeds' of gene evolution in the GC-equilibrated and AT-rich regions of fungal genomes. In this study, genomes showing evidence of this process are found to be common, particularly among the Pezizomycotina. Further analysis highlighted differences in amino acid composition and putative functions of genes from these regions, supporting the hypothesis that these regions play an important role in fungal evolution. OcculterCut can also be used to identify genes undergoing RIP-assisted diversifying selection, such as small, secreted effector proteins that mediate host-microbe disease interactions.

    Scale Invariance in the Nonstationarity of Physiological Signals

    We introduce a segmentation algorithm to probe temporal organization of heterogeneities in human heartbeat interval time series. We find that the lengths of segments with different local values of heart rates follow a power-law distribution. This scale-invariant structure is not a simple consequence of the long-range correlations present in the data. We also find that the differences in mean heart rates between consecutive segments display a common functional form, but with different parameters for healthy individuals and for patients with heart failure. This finding may provide insight into the way heart rate variability is reduced in cardiac disease. Comment: 13 pages, 5 figures, corrected typo

    Stable Distributions in Stochastic Fragmentation

    We investigate a class of stochastic fragmentation processes involving stable and unstable fragments. We solve analytically for the fragment length density and find that a generic algebraic divergence characterizes its small-size tail. Furthermore, the entire range of acceptable values of decay exponent consistent with the length conservation can be realized. We show that the stochastic fragmentation process is non-self-averaging as moments exhibit significant sample-to-sample fluctuations. Additionally, we find that the distributions of the moments and of extremal characteristics possess an infinite set of progressively weaker singularities. Comment: 11 pages, 5 figures

    WordCluster: detecting clusters of DNA words and genomic elements

    Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method as a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions: WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method as a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features like the detection of co-localization with gene regions and the annotation enrichment tool for functional analysis of overlapped genes.
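
    The distance-based idea can be illustrated with a deliberately simplified sketch. It groups consecutive copies of a word whenever the gap between them does not exceed `max_gap`, a hypothetical fixed parameter used here only for illustration. It omits the statistical significance assignment that the abstract describes, and it is not the WordCluster implementation.

```python
def find_occurrences(sequence, word):
    """Start positions of all (possibly overlapping) copies of word in sequence."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def cluster_by_gap(positions, max_gap):
    """Group sorted positions into clusters of consecutive copies at most max_gap apart."""
    clusters = []
    for pos in positions:
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough to the previous copy
        else:
            clusters.append([pos])     # start a new cluster
    return clusters
```

    A fuller treatment would derive the distance threshold from the observed distribution of gaps and attach a p-value to each cluster rather than fixing `max_gap` by hand.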