Simplifying the mosaic description of DNA sequences
By using the Jensen-Shannon divergence, genomic DNA can be divided into
compositionally distinct domains through a standard recursive segmentation
procedure. Each domain, while significantly different from its neighbours, may
however share compositional similarity with one or more distant
(non-neighbouring) domains. We thus obtain a coarse-grained description of
the given DNA string in terms of a smaller set of distinct domain labels. This
yields a minimal domain description of a given DNA sequence, significantly
reducing its organizational complexity. This procedure gives a new means of
evaluating genomic complexity as one examines organisms ranging from bacteria
to human. The mosaic organization of DNA sequences could have originated from
the insertion of fragments of one genome (the parasite) inside another (the
host), and we present numerical experiments that are suggestive of this
scenario.
Comment: 16 pages, 1 figure. Accepted for publication in Phys. Rev.
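The recursive procedure can be sketched in a few lines, using a plain entropy-based Jensen-Shannon divergence over base composition. The divergence threshold and minimum segment length below are illustrative placeholders, not the published significance criterion:

```python
# Sketch of recursive Jensen-Shannon segmentation of a DNA string.
# Threshold and min_len are illustrative, not the published criterion.
from collections import Counter
from math import log2

def entropy(counts, total):
    """Shannon entropy (bits) of a base-composition Counter."""
    return -sum(c / total * log2(c / total) for c in counts.values() if c)

def js_divergence(seq, i):
    """JS divergence between the compositions of seq[:i] and seq[i:]."""
    left, right = Counter(seq[:i]), Counter(seq[i:])
    whole = left + right
    n, nl, nr = len(seq), i, len(seq) - i
    return (entropy(whole, n)
            - (nl / n) * entropy(left, nl)
            - (nr / n) * entropy(right, nr))

def segment(seq, threshold=0.05, min_len=10):
    """Recursively split at the point of maximal divergence."""
    if len(seq) <= 2 * min_len:
        return [seq]
    cut, best = max(((i, js_divergence(seq, i))
                     for i in range(min_len, len(seq) - min_len)),
                    key=lambda t: t[1])
    if best < threshold:
        return [seq]
    return segment(seq[:cut], threshold, min_len) + segment(seq[cut:], threshold, min_len)

domains = segment("A" * 20 + "CG" * 10)   # splits into two homogeneous domains
```

Grouping the resulting segments into a smaller set of compositionally similar labels (the coarse-graining step described above) would then operate on the output of `segment`.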
Finite-sample frequency distributions originating from an equiprobability distribution
Given an equidistribution of probabilities p(i) = 1/N, i = 1..N, what is the
expected rank-ordered frequency distribution f(i), i = 1..N, when an ensemble
of M events is drawn?
Comment: 4 pages, 4 figures
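The question can be explored numerically; a minimal Monte Carlo sketch (the trial count and seed are arbitrary choices, not from the paper):

```python
# Monte Carlo estimate of the expected rank-ordered frequencies f(i)
# from M draws over N equiprobable outcomes.
import random
from collections import Counter

def rank_ordered_frequencies(N, M, trials=2000, seed=0):
    rng = random.Random(seed)
    mean_f = [0.0] * N
    for _ in range(trials):
        counts = Counter(rng.randrange(N) for _ in range(M))
        # sort observed counts by rank, pad unseen outcomes with zeros
        freqs = sorted(counts.values(), reverse=True) + [0] * (N - len(counts))
        for i, f in enumerate(freqs):
            mean_f[i] += f / trials
    return mean_f

f = rank_ordered_frequencies(N=10, M=50)
# Even for p(i) = 1/N the sampled ranks are not flat: f(1) > M/N > f(N).
```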
New stopping criteria for segmenting DNA sequences
We propose a solution to the stopping-criterion problem in segmenting inhomogeneous
DNA sequences with complex statistical patterns. This new stopping criterion is
based on Bayesian Information Criterion (BIC) in the model selection framework.
When this stopping criterion is applied to a left telomere sequence of yeast
Saccharomyces cerevisiae and the complete genome sequence of bacterium
Escherichia coli, borders of biologically meaningful units were identified
(e.g. subtelomeric units, replication origin, and replication terminus), and a
more reasonable number of domains was obtained. We also introduce a measure
called segmentation strength which can be used to control the delineation of
large domains. The relationship between the average domain size and the
threshold of segmentation strength is determined for several genome sequences.
Comment: 4 pages, 4 figures. Physical Review Letters, to appear
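The model-selection idea behind such a stopping rule can be sketched as a BIC comparison, assuming independent multinomial base models for one segment versus two; this is an illustration of the principle, not the authors' exact implementation:

```python
# BIC-based stopping rule for one candidate split: accept the split only
# if it lowers BIC = -2 log L + k log n. Illustrative sketch only.
from collections import Counter
from math import log

def log_likelihood(seq):
    """Multinomial log-likelihood of a sequence under its own composition."""
    n = len(seq)
    return sum(c * log(c / n) for c in Counter(seq).values())

def bic(loglik, n_params, n):
    return -2.0 * loglik + n_params * log(n)

def split_improves_bic(seq, cut, alphabet_size=4):
    n = len(seq)
    k = alphabet_size - 1                      # free parameters per segment
    one = bic(log_likelihood(seq), k, n)
    two = bic(log_likelihood(seq[:cut]) + log_likelihood(seq[cut:]),
              2 * k + 1, n)                    # +1 for the cut position
    return two < one

print(split_improves_bic("A" * 200 + "CG" * 100, 200))  # True
print(split_improves_bic("ACGT" * 100, 200))            # False
```

A segmentation-strength measure, as introduced above, would then threshold how much the split lowers the criterion rather than merely whether it does.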
Phase Transition in a Random Fragmentation Problem with Applications to Computer Science
We study a fragmentation problem where an initial object of size x is broken
into m random pieces provided x>x_0 where x_0 is an atomic cut-off.
Subsequently the fragmentation process continues for each of those daughter
pieces whose sizes are bigger than x_0. The process stops when all the
fragments have sizes smaller than x_0. We show that the fluctuation of the
total number of splitting events, characterized by the variance, generically
undergoes a nontrivial phase transition as one tunes the branching number m
through a critical value m=m_c. For m<m_c, the fluctuations are Gaussian,
whereas for m>m_c they are anomalously large and non-Gaussian. We apply this
general result to analyze two different search algorithms in computer science.
Comment: 5 pages RevTeX, 3 figures (.eps)
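The process is straightforward to simulate; a minimal sketch for branching number m = 2 (the cut-off and sample count are arbitrary):

```python
# Recursive random fragmentation: an object of size x splits into m
# uniform random pieces while x > x0; we count splitting events.
import random

def count_splits(x, m, x0, rng):
    if x <= x0:
        return 0
    # break [0, x] at m-1 uniform random points into m pieces
    cuts = sorted(rng.uniform(0, x) for _ in range(m - 1))
    pieces = [b - a for a, b in zip([0.0] + cuts, cuts + [x])]
    return 1 + sum(count_splits(p, m, x0, rng) for p in pieces)

rng = random.Random(1)
events = [count_splits(1.0, m=2, x0=0.01, rng=rng) for _ in range(200)]
mean_events = sum(events) / len(events)
```

Since every final fragment has size at most x0 and the sizes sum to x, a run from x = 1 with x0 = 0.01 always produces at least 100 fragments, i.e. at least 99 splits; the quantity of interest in the paper is the variance of this count as m is tuned.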
Effect of extreme data loss on long-range correlated and anti-correlated signals quantified by detrended fluctuation analysis
We investigate how extreme loss of data affects the scaling behavior of
long-range power-law correlated and anti-correlated signals applying the DFA
method. We introduce a segmentation approach to generate surrogate signals by
randomly removing data segments from stationary signals with different types of
correlations. These surrogate signals are characterized by: (i) the DFA scaling
exponent of the original correlated signal, (ii) the percentage of
the data removed, (iii) the average length of the removed (or remaining)
data segments, and (iv) the functional form of the distribution of the length
of the removed (or remaining) data segments. We find that the {\it global}
scaling exponent of positively correlated signals remains practically unchanged
even for extreme data loss of up to 90%. In contrast, the global scaling of
anti-correlated signals changes to uncorrelated behavior even when a very small
fraction of the data is lost. These observations are confirmed on the examples
of human gait and commodity price fluctuations. We systematically study the
{\it local} scaling behavior of signals with missing data to reveal deviations
across scales. We find that for anti-correlated signals even 10% of data loss
leads to deviations in the local scaling at large scales from the original
anti-correlated towards uncorrelated behavior. In contrast, positively
correlated signals show no observable changes in the local scaling for up to
65% of data loss, while for larger percentage, the local scaling shows
overestimated regions (with higher local exponent) at small scales, followed by
underestimated regions (with lower local exponent) at large scales. Finally, we
investigate how the scaling is affected by the statistics of the remaining data
segments in comparison to those of the removed segments.
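The setup can be sketched with a simple DFA-1 estimator plus random segment removal. Window sizes, removal fraction, and segment length below are illustrative; white noise is used here only to keep the example self-checking (generating long-range correlated surrogates, e.g. by Fourier filtering, is omitted):

```python
# Minimal DFA-1 sketch plus random segment removal (surrogate construction).
import numpy as np

def dfa_exponent(x, scales):
    y = np.cumsum(x - np.mean(x))            # integrated profile
    F = []
    for s in scales:
        n = len(y) // s
        windows = y[:n * s].reshape(n, s)
        t = np.arange(s)
        resid = []
        for w in windows:
            coef = np.polyfit(t, w, 1)       # linear detrend per window
            resid.append(np.mean((w - np.polyval(coef, t)) ** 2))
        F.append(np.sqrt(np.mean(resid)))
    slope, _ = np.polyfit(np.log(scales), np.log(F), 1)
    return slope

def remove_segments(x, fraction, seg_len, rng):
    """Randomly cut out segments of length seg_len and stitch the rest."""
    keep = np.ones(len(x), bool)
    target, removed = int(fraction * len(x)), 0
    while removed < target:
        i = rng.integers(0, len(x) - seg_len)
        removed += np.count_nonzero(keep[i:i + seg_len])
        keep[i:i + seg_len] = False
    return x[keep]

rng = np.random.default_rng(0)
white = rng.standard_normal(2 ** 14)
scales = [16, 32, 64, 128, 256]
a_full = dfa_exponent(white, scales)                          # ~0.5
a_cut = dfa_exponent(remove_segments(white, 0.5, 50, rng), scales)
```

For white (uncorrelated) noise both exponents stay near 0.5; the asymmetry reported above appears when the same surrogate construction is applied to correlated versus anti-correlated inputs.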
Heuristic Segmentation of a Nonstationary Time Series
Many phenomena, both natural and human-influenced, give rise to signals whose
statistical properties change under time translation, i.e., are nonstationary.
For some practical purposes, a nonstationary time series can be seen as a
concatenation of stationary segments. Using a segmentation algorithm, it has
been reported that for heartbeat data and Internet traffic fluctuations the
distribution of durations of these stationary segments decays with a power-law
tail. A potential technical difficulty that has not been thoroughly
investigated is that a nonstationary time series with a (scale-free) power law
distribution of stationary segments is harder to segment than other
nonstationary time series because of the wider range of possible segment sizes.
Here, we investigate the validity of a heuristic segmentation algorithm
recently proposed by Bernaola-Galvan et al. by systematically analyzing
surrogate time series with different statistical properties. We find that if a
given nonstationary time series has stationary periods whose size is
distributed as a power law, the algorithm can split the time series into a set
of stationary segments with the correct statistical properties. We also find
that the estimated power law exponent of the distribution of stationary-segment
sizes is affected by (i) the minimum segment size, and (ii) the ratio of the
standard deviation of the segment mean values to the standard deviation of the
fluctuations within a segment. Furthermore, we determine that
the performance of the algorithm is generally not affected by uncorrelated
noise spikes or by weak long-range temporal correlations of the fluctuations
within segments.
Comment: 23 pages, 14 figures
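The flavour of such a heuristic can be sketched with a mean-based split criterion in the spirit of Bernaola-Galvan et al.: split where Student's t between the left and right means is maximal, and recurse while that maximum exceeds a threshold. The threshold and minimum segment size below are illustrative, not the significance levels studied in the paper:

```python
# Mean-based heuristic segmentation sketch (Bernaola-Galvan style).
import numpy as np

def max_t_split(x, min_len=20):
    """Position and value of the maximal t-statistic between the two sides."""
    best_i, best_t = None, 0.0
    for i in range(min_len, len(x) - min_len):
        l, r = x[:i], x[i:]
        sp = np.sqrt(((len(l) - 1) * l.var(ddof=1)
                      + (len(r) - 1) * r.var(ddof=1)) / (len(x) - 2))
        t = abs(l.mean() - r.mean()) / (sp * np.sqrt(1 / len(l) + 1 / len(r)))
        if t > best_t:
            best_i, best_t = i, t
    return best_i, best_t

def segment(x, t_threshold=5.0, min_len=20):
    if len(x) < 2 * min_len + 1:
        return [x]
    i, t = max_t_split(x, min_len)
    if t < t_threshold:
        return [x]
    return segment(x[:i], t_threshold, min_len) + segment(x[i:], t_threshold, min_len)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])
parts = segment(x)   # recovers a boundary near sample 300
```

The two effects quantified above map directly onto this sketch: `min_len` is the minimum segment size, and the mean shift (here 3) relative to the within-segment noise (here 1) is the ratio that controls detectability.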
OcculterCut: A comprehensive survey of AT-rich regions in fungal genomes.
We present a novel method to measure the local GC-content bias in genomes and a survey of published fungal species. The method, implemented as "OcculterCut" (https://sourceforge.net/projects/occultercut), identified species containing distinct AT-rich regions. In most fungal taxa, AT-rich regions are a signature of repeat-induced point mutation (RIP), which targets repetitive DNA and decreases GC-content through the conversion of cytosine to thymine bases. RIP has in turn been identified as a driver of fungal genome evolution, as RIP mutations can also occur in single-copy genes neighbouring repeat-rich regions. Over time RIP perpetuates 'two speeds' of gene evolution in the GC-equilibrated and AT-rich regions of fungal genomes. In this study, genomes showing evidence of this process are found to be common, particularly among the Pezizomycotina. Further analysis highlighted differences in amino acid composition and putative functions of genes from these regions, supporting the hypothesis that these regions play an important role in fungal evolution. OcculterCut can also be used to identify genes undergoing RIP-assisted diversifying selection, such as small, secreted effector proteins that mediate host-microbe disease interactions.
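The quantity underlying such a survey is simply local GC content; a toy sketch of measuring it over sliding windows (window and step sizes are arbitrary, and this is not the OcculterCut algorithm itself):

```python
# Toy local GC-content profile: AT-rich regions show up as a low-GC mode.
def gc_content(seq, window=1000, step=500):
    out = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i + window]
        out.append((w.count("G") + w.count("C")) / window)
    return out

genome = "AT" * 2000 + "GC" * 2000   # toy sequence with two GC regimes
profile = gc_content(genome)          # bimodal: values near 0.0 and 1.0
```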
Scale Invariance in the Nonstationarity of Physiological Signals
We introduce a segmentation algorithm to probe temporal organization of
heterogeneities in human heartbeat interval time series. We find that the
lengths of segments with different local values of heart rates follow a
power-law distribution. This scale-invariant structure is not a simple
consequence of the long-range correlations present in the data. We also find
that the differences in mean heart rates between consecutive segments display a
common functional form, but with different parameters for healthy individuals
and for patients with heart failure. This finding may provide insight into
the way heart rate variability is reduced in cardiac disease.
Comment: 13 pages, 5 figures, corrected typos
Stable Distributions in Stochastic Fragmentation
We investigate a class of stochastic fragmentation processes involving stable
and unstable fragments. We solve analytically for the fragment length density
and find that a generic algebraic divergence characterizes its small-size tail.
Furthermore, the entire range of acceptable values of decay exponent consistent
with the length conservation can be realized. We show that the stochastic
fragmentation process is non-self-averaging as moments exhibit significant
sample-to-sample fluctuations. Additionally, we find that the distributions of
the moments and of extremal characteristics possess an infinite set of
progressively weaker singularities.
Comment: 11 pages, 5 figures
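One toy realization of a stochastic fragmentation process with stable and unstable fragments: each piece freezes with probability p, otherwise it splits again. The stabilization rule and all parameters here are assumptions for illustration, not the model solved in the paper; the sample-to-sample spread of the second moment hints at the non-self-averaging behaviour described above:

```python
# Toy stochastic fragmentation with stable (frozen) and unstable fragments.
# The freeze-with-probability-p rule is an illustrative assumption.
import random

def fragment(x, p, x0, rng, out):
    if x <= x0 or rng.random() < p:
        out.append(x)                 # fragment becomes stable
        return
    u = rng.uniform(0, 1)             # unstable: split into two pieces
    fragment(u * x, p, x0, rng, out)
    fragment((1 - u) * x, p, x0, rng, out)

rng = random.Random(3)
second_moments = []
for _ in range(500):
    frags = []
    fragment(1.0, p=0.3, x0=1e-4, rng=rng, out=frags)
    second_moments.append(sum(f * f for f in frags))
# second_moments varies strongly from sample to sample.
```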
WordCluster: detecting clusters of DNA words and genomic elements
Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds.
Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method as a web server connected to a MySQL backend, which also determines co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.
Conclusions: WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method as a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features such as the detection of co-localization with gene regions and an annotation enrichment tool for functional analysis of overlapping genes.
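The distance-based idea can be sketched as follows, assuming a simple Poisson null model for copy placement; the function and significance cutoff are illustrative, not the published calculation:

```python
# Sketch of distance-based cluster detection: under random (Poisson)
# placement, gaps between consecutive copies are roughly exponential,
# so improbably short gaps mark candidate clusters.
import math

def significant_gaps(positions, genome_len, alpha=0.01):
    """Return consecutive position pairs whose gap is improbably short."""
    rate = len(positions) / genome_len       # expected copies per base
    clusters = []
    for a, b in zip(positions, positions[1:]):
        gap = b - a
        p = 1.0 - math.exp(-rate * gap)      # P(gap <= observed) under the null
        if p < alpha:
            clusters.append((a, b))
    return clusters

positions = [100, 105, 110, 50_000, 120_000]
print(significant_gaps(positions, genome_len=200_000))
# -> [(100, 105), (105, 110)]
```

Merging overlapping significant pairs into maximal runs would then yield the reported clusters.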