
59 research outputs found

Oxygen content of transmembrane proteins over macroevolutionary time scales

Author: Acquisti C.
Collins S.
Kleffe J.
Publication date: 01/01/2006

Simultaneous identification of long similar substrings in large sets of sequences

Author: A Lefebvre
Burghardt Wittig
E Check
Friedrich Möller
J Kleffe
Jürgen Kleffe
M Abouelhoda
M Hiller
M Höhl
PE Warburton
R Sorek
S Burkhardt
S Kurtz
S Kurtz
S Mielordt
T Hamborg
W Kent
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length in which every overlapping window of a given size contains at most a specified number of errors. Such alignments are not optimal in the usual sense but are faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1. Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in the case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled only by tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.
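The window criterion described in this abstract can be illustrated with a minimal sketch. This is not ClustDB's implementation: it assumes that only mismatches count as errors (no gaps), extends a seed to the right only, and the function name `extend_seed` and its parameters are hypothetical.

```python
from collections import deque


def extend_seed(a, b, i, j, seed_len, window, max_errors):
    """Extend an exact seed a[i:i+seed_len] == b[j:j+seed_len] to the right.

    Assumption: only mismatches are counted as errors (no insertions or
    deletions). Extension stops before any run of `window` consecutive
    aligned columns would contain more than `max_errors` mismatches.
    Returns the length of the extended match, measured from the seed start.
    """
    recent = deque()  # offsets of mismatches inside the current window
    length = seed_len
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            recent.append(length)
        # Forget mismatches that have left the window ending at `length`.
        while recent and recent[0] <= length - window:
            recent.popleft()
        if len(recent) > max_errors:
            break
        length += 1
    # Trim so that the reported match does not end on a mismatch.
    while length > seed_len and a[i + length - 1] != b[j + length - 1]:
        length -= 1
    return length
```

Because the error budget applies to every window rather than to the alignment as a whole, a dense cluster of mismatches ends the extension even when the overall error rate is low, which is the behaviour the abstract associates with detecting contaminating inserts.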

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

String Matching and 1d Lattice Gases

Author: A. D. Barbour
A. Dembo
B. Prum
D. Achlioptas
D. E. Knuth
E. Rivals
F. Gürsey
G. E. Uhlenbeck
G. Reinert
H. Harborth
H. S. Wilf
I. Fudos
I. Z. Fisher
J. Kleffe
Jane F. Gentleman
L. Goldstein
L. J. Guibas
L. J. Guibas
L. J. Guibas
M. Mézard
M. Régnier
M. Régnier
M. S. Waterman
M. X. Geske
Muhittin Mungan
O. Chrysaphinou
O. Chrysaphinou
O. Chrysaphinou
P. Pevzner
R. Monasson
S. B. Boyer
S. Karlin
S. Kirkpatrick
S. Robin
S. Robin
S. Robin
S. Schbath
W. Feller
Y. Fu
Publication venue: Springer Science and Business Media LLC
Publication date: 25/08/2005

We calculate the probability distributions for the number of occurrences $n$ of a given $l$-letter word in a random string of $k$ letters. Analytical expressions for the distribution are known for the asymptotic regimes (i) $k \gg r^l \gg 1$ (Gaussian) and (ii) $k, l \to \infty$ such that $k/r^l$ is finite (Compound Poisson). However, it is known that these distributions do not work well in the intermediate regime $k \gtrsim r^l \gtrsim 1$. We show that the problem of calculating the string matching probability can be cast as determining the configurational partition function of a 1d lattice gas with interacting particles, so that the matching probability becomes the grand-partition sum of the lattice gas, with the number of particles corresponding to the number of matches. We perform a virial expansion of the effective equation of state and obtain the probability distribution. Our result reproduces the behavior of the distribution in all regimes. We are also able to show analytically how the limiting distributions arise. Our analysis builds on the fact that the effective interactions between the particles consist of a relatively strong core of size $l$, the word length, followed by a weak, exponentially decaying tail. We find that the asymptotic regimes correspond to the case where the tail of the interactions can be neglected, while in the intermediate regime the tail needs to be kept in the analysis. Our results are readily generalized to the case where the random strings are generated by more complicated stochastic processes, such as a non-uniform letter probability distribution or Markov chains. We show that in these cases the tails of the effective interactions can be made even more dominant, thus rendering the asymptotic approximations less accurate in such a regime.

Comment: 44 pages and 8 figures. Major revision of previous version. The lattice gas analogy has been worked out in full, including virial expansion and equation of state. This constitutes the main part of the paper now. Connections with existing work are made and references should be up to date now. To be submitted for publication.
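For very short strings, the distribution discussed in this abstract can be checked by direct enumeration. The sketch below is only an illustration under stated assumptions, not the lattice gas method of the paper: letters are taken to be independent and uniformly distributed, occurrences may overlap, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product


def count_overlapping(text, word):
    """Count occurrences of `word` in `text`, allowing overlaps."""
    return sum(text.startswith(word, s) for s in range(len(text) - len(word) + 1))


def occurrence_distribution(word, k, alphabet):
    """Exact P(n occurrences) over all len(alphabet)**k equally likely strings.

    Brute force: only feasible for small k and small alphabets.
    """
    total = len(alphabet) ** k
    counts = Counter(
        count_overlapping("".join(s), word) for s in product(alphabet, repeat=k)
    )
    return {n: c / total for n, c in sorted(counts.items())}
```

Comparing `occurrence_distribution("aa", 3, "ab")` with `occurrence_distribution("ab", 3, "ab")` shows that two words of the same length have the same mean count but different distributions, because a self-overlapping word such as "aa" can occur in clumps.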

arXiv.org e-Print Archive

Crossref

EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data

Author: A Christoffels
A Kalyanaraman
A Masoudi-Nejad
B Lee
C Wei
D Karolchik
E Eyras
E Kim
ER Mardis
Ernesto Picardi
Flavio Mignone
G Pertea
G Pesole
GD Schuler
Graziano Pesole
J Burke
J Forment
J Harrow
J Kleffe
J Kleffe
J Parkinson
JP Wang
L Florea
M Arumugam
M de la Bastide
M Stanke
MB Gerstein
MS Boguski
R Apweiler
RT Miller
S Djebali
S Hazelhurst
SF Altschul
SH Nagaraj
SH Nagaraj
T Castrignano
TD Wu
WJ Kent
X Huang
Y Lee
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. To improve cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping. Methods: EasyCluster uses the well-known GMAP program to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts by building genomic and EST local databases and running GMAP. Subsequently, it parses the results, creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and grouping together ESTs that share at least one splice site. Results: The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets, including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family, have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant, for which no Unigene clusters are yet available, as well as an evaluation of alternative splicing in this plant species.
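The two grouping steps named in the Methods can be sketched as follows. This is a hedged illustration, not EasyCluster's code: the input layout (EST id mapped to strand, start, end and a set of splice-site coordinates) is hypothetical, the GMAP mapping is assumed to have been done already, and linking ESTs transitively through shared splice sites is an assumption of this sketch.

```python
from collections import defaultdict


def pseudo_clusters(alignments):
    """Group ESTs whose genomic spans overlap on the same strand.

    `alignments` maps an EST id to (strand, start, end, splice_sites),
    a hypothetical layout chosen for this example.
    """
    by_strand = defaultdict(list)
    for est, (strand, start, end, _) in alignments.items():
        by_strand[strand].append((start, end, est))
    clusters = []
    for spans in by_strand.values():
        spans.sort()
        current, current_end = [], None
        for start, end, est in spans:
            # A span starting beyond the running end opens a new cluster.
            if current and start > current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(est)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters


def refine_by_splice_sites(cluster, alignments):
    """Split a pseudo-cluster into groups of ESTs linked by shared splice sites.

    ESTs without any shared splice site end up in singleton groups.
    """
    parent = {est: est for est in cluster}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for est in cluster:
        for site in alignments[est][3]:
            if site in first_seen:
                parent[find(est)] = find(first_seen[site])
            else:
                first_seen[site] = est
    groups = defaultdict(list)
    for est in cluster:
        groups[find(est)].append(est)
    return sorted(sorted(g) for g in groups.values())
```

In this sketch, two ESTs that overlap on the genome but share no splice site land in the same pseudo-cluster and are separated again by the refinement step.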

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Università di Bari

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

The Information Coded in the Yeast Response Elements Accounts for Most of the Topological Properties of Its Transcriptional Regulation Network

Author: A Vazques
A Wagner
A Wagner
AHY Tong
AL Barabasi
AL Barabasi
Alkan Kabakçıoğlu
AS Perelson
Ayşe Erzan
B Alberts
B Bollobas
B Kınıkoğlu
CE Shannon
CT Harbison
D Balcan
DJ Lockhart
DJ Watts
Duygu Balcan
G Caldarelli
Gustavo Stolovitzky
J Ihmels
J Ihmels
J Kleffe
J Watson
M Kellis
M Molloy
M Mungan
MC Teixeira
Muhittin Mungan
N Geard
N Guelzim
NM Luscombe
R Albert
R Dobrin
R Milo
R Pastor-Satorras
RV Sole
S Bergmann
S Carmi
S Huang
S Kauffman
S Kullback
S Zhou
SA Kauffman
SH Strogatz
SN Dorogovstsev
T Reil
TI Lee
V Colizza
V Colizza
V van Noort
W Banzhaf
Y Almirantis
Publication venue: Public Library of Science
Publication date: 27/05/2006

The regulation of gene expression in a cell relies to a major extent on transcription factors, proteins which recognize and bind the DNA at specific binding sites (response elements) within promoter regions associated with each gene. We present an information theoretic approach to modeling transcriptional regulatory networks, in terms of a simple “sequence-matching” rule and the statistics of the occurrence of binding sequences of given specificity in random promoter regions. The crucial biological input is the distribution of the amount of information coded in these cognate response elements and the length distribution of the promoter regions. We provide an analysis of the transcriptional regulatory network of the yeast Saccharomyces cerevisiae, which we extract from the available databases, with respect to the degree distributions, clustering coefficient, degree correlations, rich-club coefficient and the k-core structure. We find that these topological features are in remarkable agreement with those predicted by our model, on the basis of the amount of information coded in the interaction between the transcription factors and response elements.

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

Koç University Digital Collections

List of Texts

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

Author: A Bairoch
A Denise
AC Camproux
Anne-Claude Camproux
AP Godbole
B Prum
C Gautier
DA Benson
DL Antzoulakos
E Rocha
G Churchill
G Nuel
G Nuel
G Nuel
G Nuel
G Nuel
G Nuelg
G Reinert
G Reinert
GD Stormo
Gregory Nuel
J Becq
J Do
J Fu
J Kleffe
J Martin
J Van Helden
JAD Aston
JC Fu
JC Fu
JE Hopcroft
JM Claverie
Juliette Martin
JW Fickett
K Liolios
L Regad
Leslie Regad
M Crochemore
M Reignier
M Thomas-Chollier
MC Frith
ME Lladser
MX Geske
MY Leung
N Hulo
P Nicodème
P Nicolas
P Pevzner
P Ribeca
R Cowan
S Karlin
S Sourice
T Erhardsson
V Boeva
V Boeva
V Stefanov
V Stefanov
VT Stefanov
YM Chang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. Results: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence. Conclusions: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures, for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
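The idea of embedding a pattern count in a Markov chain through a finite automaton can be sketched for the simplest case. This is not any of the paper's Algorithms 1 to 3: it assumes a single-word pattern, a homogeneous first-order Markov model, and a plain convolution over sequences (the step that the paper's Algorithm 1 is said to avoid). All function names are hypothetical.

```python
from collections import defaultdict


def build_dfa(word, alphabet):
    """delta[state][letter] = length of the longest prefix of `word`
    that is a suffix of the text read so far (overlaps allowed)."""
    m = len(word)
    delta = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            s = word[:state] + c
            t = min(m, len(s))
            while t > 0 and s[-t:] != word[:t]:
                t -= 1
            row[c] = t
        delta.append(row)
    return delta


def pattern_count_pmf(word, length, alphabet, initial, transition):
    """Exact distribution of the number of occurrences of `word` in one
    sequence of `length` letters from a first-order Markov model."""
    if length == 0:
        return {0: 1.0}
    delta, m = build_dfa(word, alphabet), len(word)
    probs = defaultdict(float)  # (last letter, automaton state, count) -> prob
    for c in alphabet:
        s = delta[0][c]
        probs[(c, s, int(s == m))] += initial[c]
    for _ in range(length - 1):
        nxt = defaultdict(float)
        for (prev, s, n), p in probs.items():
            for c in alphabet:
                s2 = delta[s][c]
                nxt[(c, s2, n + int(s2 == m))] += p * transition[prev][c]
        probs = nxt
    pmf = defaultdict(float)
    for (_, _, n), p in probs.items():
        pmf[n] += p
    return dict(pmf)


def pattern_count_in_set(word, lengths, alphabet, initial, transition):
    """Distribution of the total count over independent sequences,
    obtained here by convolving the per-sequence distributions."""
    total = {0: 1.0}
    for length in lengths:
        pmf = pattern_count_pmf(word, length, alphabet, initial, transition)
        out = defaultdict(float)
        for a, pa in total.items():
            for b, pb in pmf.items():
                out[a + b] += pa * pb
        total = dict(out)
    return total
```

Comparing `pattern_count_in_set(word, [3, 3], ...)` with `pattern_count_pmf(word, 6, ...)` gives a small-scale view of the edge effect mentioned in the Conclusions: the concatenated sequence admits occurrences that span the junction between the two original sequences, so its expected count is higher.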

HAL Evry

Crossref

Springer - Publisher Connector

Directory of Open Access Journals