Search CORE

208 research outputs found

SEARCHPATTOOL: a new method for mining the most specific frequent patterns for binding sites with application to prokaryotic DNA sequences

Author: A Brazma
A Califano
B Brejova
DR Cavener
E Eskin
Fathi Elloumi
FP Roth
G Pavesi
G Thijs
GZ Hertz
H Salgado
I Jonassen
I Rigoutsos
J Van Helden
M Burset
M Tompa
Martha Nason
PA Pevzner
R Agrawal
S Sinha
TL Bailey
Y Makita
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: Computational methods to predict transcription factor binding sites (TFBS) based on exhaustive algorithms are guaranteed to find the best patterns but are often limited to short ones or impose some constraints on the pattern type. Many patterns for binding sites in prokaryotic species are not well characterized but are known to be large, between 16 and 30 base pairs (bp), and to contain at least 2 conserved bases. The length of prokaryotic species promoters (about 400 bp) and our interest in studying a small set of genes that could be a cluster of co-regulated genes from microarray experiments led to the development of a new exhaustive algorithm targeting these large patterns. Results: We present Searchpattool, a new method to search for and select the most specific (conservative) frequent patterns. This method does not impose restrictions on the density or the structure of the pattern. The best patterns (motifs) are selected using several statistics, including a new application of a z-score based on the number of matching sequences. We compared Searchpattool against other well-known algorithms on a Bacillus subtilis group of 14 input sequences and found that, in our experiments, Searchpattool always performed the best based on performance scores. Conclusion: Searchpattool is a new method for pattern discovery relative to transcription factor binding sites for species or genes with short promoters. It outputs the most specific significant patterns and helps the biologist to choose the best candidates.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

String Matching and 1d Lattice Gases

Author: A. D. Barbour
A. Dembo
B. Prum
D. Achlioptas
D. E. Knuth
E. Rivals
F. Gürsey
G. E. Uhlenbeck
G. Reinert
H. Harborth
H. S. Wilf
I. Fudos
I. Z. Fisher
J. Kleffe
Jane F. Gentleman
L. Goldstein
L. J. Guibas
M. Mézard
M. Régnier
M. S. Waterman
M. X. Geske
Muhittin Mungan
O. Chrysaphinou
P. Pevzner
R. Monasson
S. B. Boyer
S. Karlin
S. Kirkpatrick
S. Robin
S. Schbath
W. Feller
Y. Fu
Publication venue: Springer Science and Business Media LLC
Publication date: 25/08/2005

We calculate the probability distributions for the number of occurrences $n$ of a given $l$-letter word in a random string of $k$ letters. Analytical expressions for the distribution are known for the asymptotic regimes (i) $k \gg r^l \gg 1$ (Gaussian) and (ii) $k, l \to \infty$ such that $k/r^l$ is finite (compound Poisson). However, it is known that these distributions do not work well in the intermediate regime $k \gtrsim r^l \gtrsim 1$. We show that the problem of calculating the string matching probability can be cast as determining the configurational partition function of a 1d lattice gas with interacting particles, so that the matching probability becomes the grand-partition sum of the lattice gas, with the number of particles corresponding to the number of matches. We perform a virial expansion of the effective equation of state and obtain the probability distribution. Our result reproduces the behavior of the distribution in all regimes. We are also able to show analytically how the limiting distributions arise. Our analysis builds on the fact that the effective interactions between the particles consist of a relatively strong core of size $l$, the word length, followed by a weak, exponentially decaying tail. We find that the asymptotic regimes correspond to the case where the tail of the interactions can be neglected, while in the intermediate regime it needs to be kept in the analysis. Our results are readily generalized to the case where the random strings are generated by more complicated stochastic processes such as a non-uniform letter probability distribution or Markov chains. We show that in these cases the tails of the effective interactions can be made even more dominant, thus rendering the asymptotic approximations less accurate in such a regime.

Comment: 44 pages and 8 figures. Major revision of previous version. The lattice gas analogy has been worked out in full, including virial expansion and equation of state. This constitutes the main part of the paper now. Connections with existing work are made and references should be up to date now. To be submitted for publication.
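The quantity studied in the abstract above can be illustrated numerically. The following is a minimal Python sketch, not the lattice-gas calculation itself: it counts (possibly overlapping) occurrences of a fixed word in uniformly random strings and compares the empirical mean with the elementary expectation $(k - l + 1)/r^l$. It assumes that $r$ denotes the alphabet size, which the abstract does not state explicitly, and all function names are hypothetical.

```python
import random


def count_occurrences(text, word):
    """Count occurrences of word in text, allowing overlapping matches."""
    l = len(word)
    return sum(1 for i in range(len(text) - l + 1) if text[i:i + l] == word)


def expected_occurrences(k, l, r):
    """Mean number of matches of a fixed l-letter word in a uniformly
    random k-letter string over an alphabet of r letters (assumed meaning of r)."""
    return max(0, k - l + 1) / r ** l


def empirical_distribution(word, k, alphabet, trials, seed=0):
    """Estimate P(n occurrences) by sampling random strings of length k."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(k))
        n = count_occurrences(text, word)
        counts[n] = counts.get(n, 0) + 1
    return {n: c / trials for n, c in sorted(counts.items())}


if __name__ == "__main__":
    dist = empirical_distribution("ab", k=20, alphabet="ab", trials=2000)
    mean = sum(n * p for n, p in dist.items())
    print("empirical mean:", mean)
    print("expected mean: ", expected_occurrences(20, 2, 2))
```

Sampling of this kind only probes small $k$ and $l$; it is meant as a sanity check against which an analytical distribution could be compared, under the uniform-letter assumption.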

arXiv.org e-Print Archive

Crossref

A fast algorithm for the multiple genome rearrangement problem with weighted reversals and transpositions

Author: A Bergeron
A Caprara
B Bourque
B Moret
D Bader
D Sankoff
E Tannier
Enno Ohlebusch
G Fritzsch
J Tang
M Bader
M Bernt
M Blanchette
M Cosner
Martin Bader
Mohamed I Abouelhoda
N Eriksen
P Pevzner
S Hannenhalli
S Wu
T Hartman
T Liu
V Bafna
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Due to recent progress in genome sequencing, more and more data for phylogenetic reconstruction based on rearrangement distances between genomes are becoming available. However, this phylogenetic reconstruction is a very challenging task. For the simplest distance measures (the breakpoint distance and the reversal distance), the problem is NP-hard even if one considers only three genomes. Results: In this paper, we present a new heuristic algorithm that directly constructs a phylogenetic tree w.r.t. the weighted reversal and transposition distance. Experimental results on previously published datasets show that constructing phylogenetic trees in this way results in better trees than constructing the trees w.r.t. the reversal distance and then recalculating the weight of the trees with the weighted reversal and transposition distance. An implementation of the algorithm can be obtained from the authors. Conclusion: The possibility of creating phylogenetic trees directly w.r.t. the weighted reversal and transposition distance results in biologically more realistic scenarios. Our algorithm can solve today's most challenging biological datasets in a reasonable amount of time.
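To make the notion of a breakpoint concrete, here is a small Python sketch for the pairwise case only. It counts the breakpoints of a signed permutation relative to the identity and applies a single reversal; it is not the heuristic tree-construction algorithm of the paper, and the function names are hypothetical.

```python
def breakpoints(perm):
    """Number of breakpoints of a signed permutation of 1..n relative to
    the identity: adjacent pairs (a, b) in 0, perm, n+1 with b - a != 1."""
    n = len(perm)
    extended = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


def apply_reversal(perm, i, j):
    """Reverse the segment perm[i:j] and flip the sign of every element in it."""
    segment = [-x for x in reversed(perm[i:j])]
    return list(perm[:i]) + segment + list(perm[j:])


if __name__ == "__main__":
    genome = apply_reversal([1, 2, 3, 4], 1, 3)
    print(genome, breakpoints(genome))
```

Because one reversal changes only two adjacencies, the breakpoint count divided by two gives a simple lower bound on the number of reversals needed to sort a signed permutation.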

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Assembly complexity of prokaryotic genomes using short reads

Author: A Guénoche
AR Rubinov
B Bollobás
B Haubold
C Smith
Carl Kingsford
D Gusfield
DH Huson
DR Zerbino
Dvan den Broek
E Myers
EW Myers
I Simon
J Butler
J Parkhill
JAA Quitzau
JC Dohm
JP Hutchinson
M Antoniotti
M Margulies
Michael C Schatz
Mihai Pop
MJ Chaisson
MS Waterman
N de Bruijn
N Whiteford
OG Troyanskaya
P Medvedev
PA Pevzner
R Barrangou
R Idury
S Batzoglou
T van Aardenne-Ehrenfest
TD Harris
WR Jeck
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract. Background: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions: Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.
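For readers unfamiliar with the data structure named in the abstract above, the following Python sketch builds a basic de Bruijn graph from a set of reads: nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix. It ignores edge multiplicities, reverse complements, and sequencing errors, all of which a real assembler has to handle; the function names are hypothetical and this is not the analysis pipeline of the paper.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it
    in at least one k-mer of the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


def branching_nodes(graph):
    """Nodes with more than one successor; these mark places where a walk
    through the graph is ambiguous, as happens at repeats."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    g = de_bruijn_graph(["ACGA", "CGT"], k=3)
    print(g)
    print(branching_nodes(g))
```

Branching nodes give a rough feel for why repeats longer than the read length limit what can be reconstructed uniquely.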

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Viral population estimation using pyrosequencing

Author: A Dempster
A Rambaut
AMN Tsibris
B Gaschen
Baback Gharizadeh
C Wang
Chunlin Wang
D O'Meara
DC Douek
E Domingo
E Halperin
EH Simpson
ES Lander
Glenn Tesler
GS Gottlieb
GW Tyson
H Fakhrai-Rad
I Malet
IM Rouzine
J Kececioglu
JE Hopcroft
JF Simons
K Chen
KJ Metzner
L Bacheler
L Doukhan
L Excoffier
Lior Pachter
LR Ford
M Breitbart
M Eigen
M Margulies
M Stephens
MA Nowak
MJ Gonzales
ML Collins
ML Sogin
Mostafa Ronaghi
MT Tammi
N Beerenwinkel
Nicholas Eriksson
Niko Beerenwinkel
P Jenkins
PA Pevzner
R Schmid
R Shankarappa
Robert W. Shafer
RP Dilworth
S Huse
S-Y Rhee
Soo-Yon Rhee
VA Johnson
Yumi Mitsuya
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2008

The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.

Comment: 23 pages, 13 figures

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

Repository for Publications and Research Data

Crossref

Directory of Open Access Journals

PubMed Central

Caltech Authors

Economic Ideas and Institutional Change: Evidence from Soviet Economic Discourse 1987-1991

Author: Popkova
A Makasheva
A Ryvkin
A T Denzau
Aleksandr A Radaev
Aleksandr B Veber
Aleksandr L Shapiro
Aleksandr Levikov
Aleksandr V Kollontai
Aleksei M Emel'
Anatolii N Komin
Avansy I Dolgi
B Rakitskii
Bertin
Boris Pinsker
Boris S Pinsker
Demontazha Gmk Ne Bylo I Net
Dennis D Kondrashev
Douglass North
Efim L Manevich
Est'
Evgenii Iasin
Gary S Becker
Gavriil Popov
Gennadii V Gorlanov
Geoffrey M Hodgson
I Lakatos
Iakov Pevzner
Irina M Osadchaia
Iurii V Borozdin
J A Schumpeter
J Zweynert
Joachim Zweynert
Joseph Schull
Jurii N Pakhomov
Kaisin A Khubiev
Kristian Gerner
L Abalkin
L I Abalkin
Larisa I Piiasheva
Leonid I Abalkin
Lev M Gatovskii
Medvedev
Murray Yanowitch
N Gnatovskaia
Neil
Nikolai I Petrakov
Nikolai P Shmelev
Oscar J Bandelin
Pekka Sutela
Peter P Cherkasov
Petr P Cherkasov
Pozdniakov
Rostislav I Kapeliushnikov
Rynok V Ekonomicheskoi Sisteme Sotsializma
S Ershov
S Voblenko
Soltan S Dzarasov
Sotsializm I Sobstvennost
Stefano Fiori
Stepan V Mochernyi
Streit
T Zaslavskaya
Thomas S Kuhn
Thorstein B Veblen
Vadim A Medvedev
Vasilii Selyunin
Viktor A Efremov
Viktor B Studentsov
Viktor I Kuznetsov
Viktor L Sheinis
Viktor P Loginov
Vladimir P Shkredov
Vladimir S Pan'
Vsevolod V Kulikov
Warren J Samuels
Publication venue: Elsevier BV
Publication date: 01/01/2005

Crossref

Evaluating deterministic motif significance measures in protein databases

Author: A Apostolico
A Ben-Hur
A Brazma
A Field
A Krogh
B Matthews
C Nevill-Manning
E Eskin
E Gasteiger
E Koonin
E Ukkonen
G Bejerano
G Hertz
G Stolovitzky
G van den Eijkel
I Jonassen
I Rigoutsos
J Buhler
J Han
J Huang
J Yang
JH Zar
K Blekas
M Bramer
M Tompa
NM Abramson
P Baldi
P Pevzner
P Smyth
P Tan
Paulo J Azevedo
Pedro Gabriel Ferreira
PG Ferreira
R Agrawal
R Hart
S Henikoff
S Jensen
S Lonardi
T Attwood
T Wu
V Guralnik
V Neduva
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, thereby reducing the analysis work. Spotting the most interesting and relevant motifs then depends on the choice of the right measures. The combined use of several measures may provide more robust results. However, caution has to be taken in order to avoid spurious evaluations. Results: From the set of conducted experiments, it was verified that several of the selected significance measures show very similar behavior in a wide range of situations, therefore providing redundant information. Some measures have proved to be more appropriate for ranking highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears to be a very important feature to consider for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, for instance when there is significantly less positive data than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables easy identification of high-scoring motifs. Conclusion: In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. Measures were applied in different testing conditions, where relations were identified. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

LOCAS – A Low Coverage Assembly Tool for Resequencing Projects

Author: A Doring
AR Quinlan
B Langmead
C Nusbaum
D Hernandez
D Weigel
Daniel H. Huson
DC Richter
Detlef Weigel
DR Zerbino
EW Myers
H Li
I Birol
JD Kececioglu
JO Korbel
JT Simpson
Juliane D. Klein
K Schneeberger
Korbinian Schneeberger
LE Palmer
M Pop
MC Wendl
MJ Chaisson
PA Pevzner
R Li
RM Durbin
S Ossowski
SL Salzberg
SM Rumble
SQ Le
Stephan Ossowski
T Rausch
Ying Xu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Motivation: Next Generation Sequencing (NGS) is a frequently applied approach to detect sequence variations between highly related genomes. Recent large-scale re-sequencing studies such as the Human 1000 Genomes Project utilize NGS data of low coverage to afford sequencing of hundreds of individuals. Here, SNPs and micro-indels can be detected by applying an alignment-consensus approach. However, computational methods capable of discovering other variations, such as novel insertions or highly diverged sequence, from low coverage NGS data are still lacking. Results: We present LOCAS, a new NGS assembler particularly designed for low coverage assembly of eukaryotic genomes using a mismatch-sensitive overlap-layout-consensus approach. LOCAS assembles homologous regions in a homology-guided manner, while it performs de novo assemblies of insertions and highly polymorphic target regions subsequent to an alignment-consensus approach. LOCAS has been evaluated in homology-guided assembly scenarios with low sequence coverage of Arabidopsis thaliana strains sequenced as part of the Arabidopsis 1001 Genomes Project. While assembling the same amount of long insertions as state-of-the-art NGS assemblers, LOCAS showed the best results regarding contig size, error rate and runtime. Conclusion: LOCAS produces excellent results for homology-guided assembly of eukaryotic genomes with short reads and low sequencing depth, and therefore appears to be the assembly tool of choice for the detection of novel sequenc

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

MPG.PuRe

ScholarBank@NUS

Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Author: A Delcher
A Smit
AC Darling
B Ma
C Kemena
CN Dewey
Darren P. Martin
DR Bentley
E Ohlebusch
EJ Vallender
FP Preparata
G Bejerano
G Bourque
Hachiya Tsuyoshi
I Tabus
JT Simpson
K Liolios
K Mathee
Kris Popendorf
LB Kish
M Blanchette
M Brudno
M Farach
P Pevzner
Pearson
R Rivest
RA Gibbs
RH Waterston
S Quinlan
S Schwartz
SF Altschul
T Hachiya
T Hubbard
TF Smith
W Miller
Y Osana
Yasubumi Sakakibara
Yasunori Osana
Publication venue: Public Library of Science
Publication date: 24/09/2010

BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net
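The idea of a spaced seed mentioned in the abstract above can be sketched in a few lines of Python. A pattern such as `1101` marks which positions of a window must match (`1`) and which may differ (`0`); windows from two sequences that agree on the `1` positions become candidate anchors. This is only an illustration of the general technique under that assumption: it does not reproduce Murasaki's adaptive hash function generation or its parallel execution, and the function names are hypothetical.

```python
from collections import defaultdict


def spaced_seed_key(window, pattern):
    """Keep only the characters of window at positions where pattern has '1'."""
    return "".join(c for c, p in zip(window, pattern) if p == "1")


def index_by_seed(sequence, pattern):
    """Map each spaced-seed key to the start positions of windows producing it."""
    width = len(pattern)
    index = defaultdict(list)
    for i in range(len(sequence) - width + 1):
        index[spaced_seed_key(sequence[i:i + width], pattern)].append(i)
    return dict(index)


def candidate_anchors(seq_a, seq_b, pattern):
    """Pairs (i, j) such that the windows at seq_a[i] and seq_b[j] agree
    on every '1' position of the pattern."""
    index_b = index_by_seed(seq_b, pattern)
    width = len(pattern)
    pairs = []
    for i in range(len(seq_a) - width + 1):
        key = spaced_seed_key(seq_a[i:i + width], pattern)
        pairs.extend((i, j) for j in index_b.get(key, []))
    return pairs


if __name__ == "__main__":
    print(candidate_anchors("ACGT", "ACTT", "1101"))
    print(candidate_anchors("ACGT", "ACTT", "1111"))
```

Indexing one sequence and scanning the other keeps the work roughly proportional to the sequence lengths plus the number of hits, which is the basic reason hash-based anchoring scales better than all-against-all alignment.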

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

GibbsST: a Gibbs sampling method for motif discovery with enhanced resistance to local optima

Author: A Price
AF Neuwald
B Akache
BE Morrow
C Narasimhan
CE Lawrence
CJ Geyer
E Beitz
E Marinari
GE Crooks
I Rigoutsos
J Buhler
J Hu
J Kim
J Kostrowicki
J Zhu
JD Lieb
JO Nehlin
K Blekas
K Hukushima
Kazuhito Shida
M Friberg
M Ishikawa
M Kschischo
M Matsumoto
MC Frith
PA Pevzner
R Elble
R Siddharthan
RV Pappu
S Geman
S Kirkpatrick
S Sinha
T Miyake
TL Bailey
TM Chen
W Kerler
XS Liu
Y Li
Publication venue: BioMed Central
Publication date: 01/11/2006

BACKGROUND: Computational discovery of transcription factor binding sites (TFBS) is a challenging but important problem of bioinformatics. In this study, improvement of a Gibbs sampling-based technique for TFBS discovery is attempted through an approach that is widely known, but which has never been investigated before: reduction of the effect of local optima. RESULTS: To alleviate the vulnerability of Gibbs sampling to local optima trapping, we propose to combine a thermodynamic method, called simulated tempering, with Gibbs sampling. The resultant algorithm, GibbsST, is then validated using synthetic data and actual promoter sequences extracted from Saccharomyces cerevisiae. It is noteworthy that the marked improvement of the efficiency presented in this paper is attributable solely to the improvement of the search method. CONCLUSION: Simulated tempering is a powerful solution for local optima problems found in pattern discovery. Extended application of simulated tempering to various bioinformatic problems is promising as a robust solution against local optima problems.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central