23 research outputs found

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Author: A Barbour
A Christoffels
CJ Burden
Conrad J Burden
J Burke
JE Carpenter
L Florea
M Kimura
Miriam R Kantorovitz
MR Kantorovitz
MS Waterman
OM Melko
RA Lippert
S Vinga
SF Altschul
Sylvain Forêt
TJ Wu
W Hide
WJ Conover
WJ Kent
WR Pearson
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The number of k-words shared between two sequences is a simple and efficient alignment-free measure for sequence comparison. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2) is extremely fast: its run time is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2) optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2) to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2) statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results, obtained with randomly generated sequences, are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2) statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate those expected in real sequences. Because of the linear run time and the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences.
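As an illustration of the exact-match statistic described in the abstract above, here is a minimal Python sketch. It assumes the common formulation of D(2) as the sum, over all k-words, of the product of that word's occurrence counts in the two sequences. The function names `kmer_counts` and `d2` are hypothetical and are not taken from the paper; approximate word matches and the choice of optimal word size are not covered.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-word in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Exact-match D2: number of (position in A, position in B) pairs
    whose k-words are identical. Assumed formulation, see lead-in."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Each word contributes count_in_A * count_in_B matching pairs.
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "CGTACG", 3))
```

Counting the words of each sequence once and then combining the two tables keeps the work roughly proportional to the total sequence length (for a fixed k), which is in the spirit of the linear run time the abstract mentions.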

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University

Identifying Cis-Regulatory Sequences by Word Profile Similarity

Author: A Ivan
A Nasiadka
A Sosinsky
AG Nazina
AP Lifanov
BP Berman
BY Chan
C Zhang
D Bachtrog
DL Halligan
DS Johnson
E Emberly
EA Glazov
EE Hare
EH Davidson
F Poulin
Garmay Leung
H Janssens
I Abnizova
L Li
M Klingler
Michael B. Eisen
MR Kantorovitz
MS Halfon
N Pierstorff
N Rajewsky
Nicholas James Provart
S Prabhakar
S Sinha
XY Li
YH Grad
Publication venue: Public Library of Science
Publication date: 01/09/2009

Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples. We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila. Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Erroneous attribution of relevant transcription factor binding sites despite successful prediction of cis-regulatory modules

Author: A Ochoa-Espinosa
A Siepel
A Visel
AA Philippakis
B Estrada
B Morgenstern
BP Berman
Elizabeth R Brennan
GA Maston
J Su
J Zeitlinger
JP Noonan
L Li
M Haeussler
M Markstein
Marc S Halfon
MD Schroeder
MR Kantorovitz
MS Halfon
N Bray
N Negre
P Van Loo
Qianqian Zhu
R Niwa
S Kahana
T Sandmann
T Vavouri
W Krivan
WJ Kent
WW Wasserman
XY Li
YH Grad
YH Liu
Yiyun Zhou
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Cis-regulatory modules are bound by transcription factors to regulate gene expression. Characterizing these DNA sequences is central to understanding gene regulatory networks and gaining insight into mechanisms of transcriptional regulation, but genome-scale regulatory module discovery remains a challenge. One popular approach is to scan the genome for clusters of transcription factor binding sites, especially those conserved in related species. When such approaches are successful, it is typically assumed that the activity of the modules is mediated by the identified binding sites and their cognate transcription factors. However, the validity of this assumption is often not assessed. Results: We successfully predicted five new cis-regulatory modules by combining binding site identification with sequence conservation and compared these to unsuccessful predictions from a related approach not utilizing sequence conservation. Despite greatly improved predictive success, the positive set had degrees of sequence and binding site conservation similar to those of the negative set. We explored the reasons for this by mutagenizing putative binding sites in three cis-regulatory modules. A large proportion of the tested sites had little or no demonstrable role in mediating regulatory element activity. Examination of loss-of-function mutants also showed that some transcription factors supposedly binding to the modules are not required for their function. Conclusions: Our results raise important questions about interpreting regulatory module predictions obtained by finding clusters of conserved binding sites. Attribution of function to these sites and their cognate transcription factors may be incorrect even when modules are successfully identified. Our study underscores the importance of empirical validation of computational results even when these results are in line with expectation.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CSMET: Comparative Genomic Motif Detection via Multi-Resolution Phylogenetic Shadowing

Author: A Sandelin
A Siepel
AC Siepel
AM Moses
BE Engelhardt
C Bergman
C Boutilier
CM Bergman
D Boffelli
DA Papatsenko
EH Margulies
EP Xing
Eric P. Xing
GE Crooks
GJ Olsen
I Dubchak
J Felsenstein
J Pedersen
JD McAuliffe
M Blanchette
M Hasegawa
M Tompa
MC Frith
Mladen Kolar
MR Kantorovitz
MZ Ludwig
Pradipta Ray
PV Benos
R Siddharthan
RG Cowell
S Sinha
SB Montgomery
Suyash Shringarpure
T Wang
TH Jukes
Uwe Ohler
W Huang
Publication venue: Public Library of Science
Publication date: 01/06/2008

Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, is a common event during genome evolution. Conventional probabilistic phylogenetic shadowing methods model the evolution of genomes only at the nucleotide level, and lack the ability to capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search of non-conserved motifs across evolutionarily related taxa remains a difficult challenge, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent; existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret based on phylogenetic principles. We propose a new method: Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a context-dependent probabilistic graphical model that allows aligned sites from different taxa in a multiple alignment to be modeled by either a background or an appropriate motif phylogeny conditioning on the functional specifications of each taxon. The functional specifications themselves are the output of a phylogeny which models the evolution not of individual nucleotides, but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. Combining this method with a hidden Markov model that autocorrelates evolutionary rates on successive sites in the genome, CSMET offers a principled way to take into consideration lineage-specific evolution of TFBSs during motif detection, and a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Integrating Diverse Datasets Improves Developmental Enhancer Prediction

Author: A Arvey
A Barski
A Ben-Hur
A He
A Miquelajauregui
A Rada-Iglesias
A Siepel
A Visel
A Woolfe
A Woznica
AI Su
AP Boyle
AR Quinlan
AS Nord
BW Busser
C Cheng
C Jin
C Leslie
CE Grant
CM Koch
CT Ong
CY McLean
D Lee
D May
D Wang
Dennis Kostka
DM McGaughey
DS Johnson
DU Gorkin
E Birney
E Seuntjens
G Cuellar-Partida
GE Zentner
Genevieve D. Erwin
GM Burzynski
H Lahdesmaki
HH He
I Dunham
J Banerji
J Cotney
J Ernst
JA Capra
JA Wamstad
John A. Capra
JP Noonan
K Koshiba-Takeuchi
K Lindblad-Toh
KA Aldinger
Karl K. Murphy
Katherine S. Pollard
KJ Won
KS Pollard
KY Yip
L Narlikar
L Taher
LA Hindorff
LA Pennacchio
M Bulger
M Kloft
M Levine
M Wilson
MA Nobrega
MA White
MJ Blow
MM El-Kasti
MM Hoffman
MP Creyghton
MR Kantorovitz
N Oksenberg
N Rajagopal
Nadav Ahituv
ND Heintzman
NE Renthal
Nir Oksenberg
NJ Sakabe
PG Giresi
Q Li
Q Weng
R Andersson
R O'Rahilly
R Pique-Regi
RE Thurman
Rebecca M. Truty
RP Zinzen
RS Smith
S Bonn
S Ghisletti
S Lomvardas
S Prabhakar
S Salzberg
S Sonnenburg
SD Gillies
SJ Sholtis
SL Paige
T Casci
T Kume
TG Dietterich
TS Mikkelsen
UA Orom
Uwe Ohler
VW Zhou
Z Wang
Publication venue: Public Library of Science (PLoS)
Publication date: 27/09/2013

Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic gene regulatory enhancers from three developmental transcription factor loci. 
Our genome-wide developmental enhancer predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further investigate questions in developmental biology. © 2014 Erwin et al

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

D-Scholarship@Pitt

FigShare

Clustering of reads with alignment-free measures and quality values

Author: A Solovyov
Andrea Leoni
B Ewing
BE Blaisdell
CA Albers
D Medini
DR Zerbino
E Bao
E Birney
G Reinert
GE Sims
H Li
J Göke
J Qi
K Song
L Gao
L Wan
M Comin
Matteo Comin
Michele Schimd
MO Carneiro
MR Kantorovitz
Q Dai
R Jothi
RA Lippert
S Vinga
SF Altschul
W Qu
Publication venue: Springer Science and Business Media LLC
Publication date

Crossref

A Machine Learning Approach for Identifying Novel Cell Type–Specific Transcriptional Regulators of Myogenesis

Author: A Carmena
A Dastjerdi
A Erives
A Ivan
A Nose
A Paululat
A Siepel
A Subramanian
A Visel
A Woolfe
AA Philippakis
AC Groth
AG Nazina
AK Holloway
Alan M. Michelson
AM Michelson
B Estrada
B Hanczar
BL Black
BP Berman
Brian W. Busser
BW Busser
C Bourgouin
C Chang
C Jiang
C Klämbt
CA Berkes
CI Swanson
CT Ong
DN Arnosti
DT Odom
E Davidson
EE Hare
EN Olson
FC Wardle
G Hon
G Junion
G Leung
G Ranganayakulu
GE Crawford
GG Loots
H Brohmann
H Rouault
HP Shih
I Abnizova
I Costello
I Guyon
I Ovcharenko
I Reim
Ivan Ovcharenko
J Bischof
J Crocker
J Enriquez
J Ernst
J Shawe-Taylor
J Zeitlinger
JA Pederson
James W. Posakony
JD Pederson
JM Claycomb
JS Jakobsen
JW Mahaffey
K Jagla
K Robasky
K Senger
L Dubois
L Li
L Narlikar
Leila Taher
M Capovilla
M Frasch
M Ludwig
M Markstein
M Porsch
M Ruiz-Gomez
M Schwaiger
MA Beer
MB Noyes
MD Biggin
MF Berger
MI Arnone
MJ Blow
MK Baylies
MK Gross
Molly J. Bloom
MR Kantorovitz
MS Halfon
MV Taylor
N Negre
N Reeves
OL Griffith
P Tomancak
PJ Clyne
R Bodmer
R Galant
RG Ramsay
RJ Bryson-Richardson
RP Zinzen
S Barolo
S Knirr
S MacArthur
S Mahony
SA Ness
SB Carroll
SD Weatherbee
SJ Raudys
SM Gallo
SY Kim
T Jagla
T Sandmann
Terese Tansey
TL Bailey
U Grossniklaus
V Matys
V Tixier
Y Benjamini
YH Liu
Yongsok Kim
Z Han
Publication venue: Public Library of Science
Publication date: 08/03/2012

Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in patterns largely similar to those of their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes, and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers.
In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type-specific developmental gene expression patterns.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare