
Fine-Tuning Enhancer Models to Predict Transcriptional Targets across Multiple Genomes

Author: A Ochoa-Espinosa
A Siepel
A Stark
AA Philippakis
AM Moses
AP Lifanov
B Adryan
BA Hassan
Bassem A. Hassan
BY Chan
CM Frith
D Karolchik
DC King
DM Schroeder
E Emberly
E Segal
EH Davidson
G Thijs
Guillaume Bourque
GZ Hertz
IE Boyle
J van Helden
Jacques van Helden
JE Ostrin
JM Stuart
LW Chang
M Blanchette
M Brudno
M Markstein
M Pritsker
M Rebeiz
M Tompa
MC Bergman
MS Halfon
N Rajewsky
NV Taverner
O Johansson
Olivier Sand
PB Berman
PI zur Lage
R Siddharthan
S Aerts
S Kurtz
S Sinha
SB Montgomery
SM Gallo
SR Eddy
Stein Aerts
T Zhang
TL Bailey
WJ Kent
WW Wasserman
Y Sun
Publication venue: Public Library of Science
Publication date: 01/01/2007

Networks of regulatory relations between transcription factors (TFs) and their target genes (TGs), implemented through TF binding sites (TFBSs), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high-accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes. In particular, for CRM modelling, we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs, and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.
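
The abstract above starts from a position weight matrix (PWM) to list candidate binding sites. As a rough illustration of that starting point only, the sketch below builds a log-odds PWM from a count matrix and scans a sequence for windows above a score threshold. The count matrix, the uniform background, the pseudocount, and the threshold are all hypothetical values chosen for the example; this is not the benchmark pipeline or the phyloPWM construction described in the paper.

```python
import math

# Hypothetical count matrix for a 4-bp motif with consensus GATA
# (one dict per motif position; counts are invented for illustration).
COUNTS = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(counts, background, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = build_pwm(COUNTS, BACKGROUND)
hits = scan("CCGATACC", pwm, threshold=5.0)
```

In practice the threshold is the main tuning knob: a lower value recovers weaker sites at the cost of more false positives, which is one reason the abstract stresses the choice of scoring strategy.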

Lirias

HAL AMU

arXiv.org e-Print Archive

DI-fusion

Formation of regulatory modules by local sequence duplication

Author: A Stark
A Tanay
AL Halpern
AM Moses
Amos Tanay
Armita Nourmohammad
B Ondek
BP Berman
CM Bergman
CT Harbison
D Gruen
D Stanojevic
DN Arnosti
DS Fields
E Segal
EE Hare
EH Davidson
G Badis
G Benson
G Leung
GD Stormo
I Abnizova
J Berg
J Monod
JM Hancock
K Thornton
L Li
M Kimura
M Levine
M Lynch
M Lässig
M Markstein
M Pachkov
M Ptashne
MC King
MD Vinces
Michael Lässig
MM Kulkarni
MS Halfon
MV Katti
MZ Ludwig
N Rajewsky
NE Buchler
O Berg
PW Messer
R Durbin
RJ Britten
RW Lusk
S Kullback
S Mukherjee
S Sinha
S Small
SJ Maerkl
SM Gallo
SW Doniger
V Boeva
V Mustonen
Z Wunderlich
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2011

Turnover of regulatory sequence and function is an important part of molecular evolution. But what are the modes of sequence evolution leading to rapid formation and loss of regulatory sites? Here, we show that a large fraction of neighboring transcription factor binding sites in the fly genome have formed from a common sequence origin by local duplications. This mode of evolution is found to produce regulatory information: duplications can seed new sites in the neighborhood of existing sites. Duplicate seeds evolve subsequently by point mutations, often towards binding a different factor than their ancestral neighbor sites. These results are based on a statistical analysis of 346 cis-regulatory modules in the Drosophila melanogaster genome, and a comparison set of intergenic regulatory sequence in Saccharomyces cerevisiae. In fly regulatory modules, pairs of binding sites show significantly enhanced sequence similarity up to distances of about 50 bp. We analyze these data in terms of an evolutionary model with two distinct modes of site formation: (i) evolution from independent sequence origin and (ii) divergent evolution following duplication of a common ancestor sequence. Our results suggest that pervasive formation of binding sites by local sequence duplications distinguishes the complex regulatory architecture of higher eukaryotes from the simpler architecture of unicellular organisms
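
The observation above, that nearby site pairs are more similar than distant ones, can be illustrated with a toy comparison. The sketch below is only that: it assumes equal-length site sequences, uses plain Hamming similarity, and splits pairs at a 50 bp cutoff taken from the abstract. The function names and the example sites are hypothetical, and the paper's actual statistical analysis and evolutionary model are more elaborate than this.

```python
from itertools import combinations


def hamming_similarity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_similarity_by_distance(sites, cutoff=50):
    """Compare mean similarity of site pairs within and beyond `cutoff` bp.

    `sites` is a list of (start_position, sequence) tuples from one module.
    Returns (mean_near, mean_far); a mean is None if no pair falls in that bin.
    """
    near, far = [], []
    for (pos_a, seq_a), (pos_b, seq_b) in combinations(sites, 2):
        similarity = hamming_similarity(seq_a, seq_b)
        if abs(pos_a - pos_b) <= cutoff:
            near.append(similarity)
        else:
            far.append(similarity)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(near), mean(far)


# Invented example: two neighbouring sites and one distant site.
example_sites = [(0, "ACGT"), (10, "ACGA"), (200, "TTTT")]
near_mean, far_mean = mean_similarity_by_distance(example_sites)
```

A real analysis would also need a null expectation for similarity between unrelated sites, since two sites bound by the same factor are similar regardless of how they arose.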

Kölner UniversitätsPublikationsServer

Alignment and Prediction of cis-Regulatory Modules Based on a Probabilistic Model of Evolution

Author: A Bais
A Halpern
A Lifanov
A Moses
A Siepel
B Berman
B Knudsen
C Bergman
C Dewey
D Halligan
D Karolchik
D Pollard
D Raijman
E Berezikov
E Birney
E Blackwood
E Davidson
E Dermitzakis
F Gao
G Lunter
G Stormo
G Wray
I Holmes
I Miklos
J Berg
J Stone
J Thorne
J Warner
K Wong
M Brudno
M Frith
M Hasegawa
M Ludwig
M Noyes
O Hallikas
P Andolfatto
P Keightley
P Kheradpour
P Ray
P Tomancak
R Cartwright
R Durrett
R Satija
R Siddharthan
R Waterston
S Aerts
S Doniger
S Gallo
S MacArthur
S Sinha
Saurabh Sinha
V Mustonen
W Huang
W Wasserman
W Wong
Wyeth W. Wasserman
X Li
Xin He
Xu Ling
Z Hu
Publication venue: Public Library of Science
Publication date: 01/03/2009

Cross-species comparison has emerged as a powerful paradigm for predicting cis-regulatory modules (CRMs) and understanding their evolution. The comparison requires reliable sequence alignment, which remains a challenging task for less conserved noncoding sequences. Furthermore, the existing models of DNA sequence evolution generally do not explicitly treat the special properties of CRM sequences. To address these limitations, we propose a model of CRM evolution that captures different modes of evolution of functional transcription factor binding sites (TFBSs) and the background sequences. A particularly novel aspect of our work is a probabilistic model of gains and losses of TFBSs, a process being recognized as an important part of regulatory sequence evolution. We present a computational framework that uses this model to solve the problems of CRM alignment and prediction. Our alignment method is similar to existing methods of statistical alignment but uses the conserved binding sites to improve alignment. Our CRM prediction method deals with the inherent uncertainties of binding site annotations and sequence alignment in a probabilistic framework. In simulated as well as real data, we demonstrate that our program is able to improve both alignment and prediction of CRM sequences over several state-of-the-art methods. Finally, we used alignments produced by our program to study binding site conservation in genome-wide binding data of key transcription factors in the Drosophila blastoderm, with two intriguing results: (i) the factor-bound sequences are under strong evolutionary constraints even if their neighboring genes are not expressed in the blastoderm and (ii) binding sites in distal bound sequences (relative to transcription start sites) tend to be more conserved than those in proximal regions. Our approach is implemented as software, EMMA (Evolutionary Model-based cis-regulatory Module Analysis), ready to be applied in a broad biological context

Multigenome DNA sequence conservation identifies Hox cis-regulatory elements

Author: De Buysscher Tristan
DeModena John A.
Kuntz Steven G.
Schwarz Erich M.
Shizuya Hiroaki
Sternberg Paul W.
Trout Diane
Wold Barbara J.
Publication venue: Cold Spring Harbor Laboratory Press
Publication date: 01/12/2008

To learn how well ungapped sequence comparisons of multiple species can predict cis-regulatory elements in Caenorhabditis elegans, we made such predictions across the large, complex ceh-13/lin-39 locus and tested them transgenically. We also examined how prediction quality varied with different genomes and parameters in our comparisons. Specifically, we sequenced ∼0.5% of the C. brenneri and C. sp. 3 PS1010 genomes, and compared five Caenorhabditis genomes (C. elegans, C. briggsae, C. brenneri, C. remanei, and C. sp. 3 PS1010) to find regulatory elements in 22.8 kb of noncoding sequence from the ceh-13/lin-39 Hox subcluster. We developed the MUSSA program to find ungapped DNA sequences with N-way transitive conservation, applied it to the ceh-13/lin-39 locus, and transgenically assayed 21 regions with both high and low degrees of conservation. This identified 10 functional regulatory elements whose activities matched known ceh-13/lin-39 expression, with 100% specificity and a 77% recovery rate. One element was so well conserved that a similar mouse Hox cluster sequence recapitulated the native nematode expression pattern when tested in worms. Our findings suggest that ungapped sequence comparisons can predict regulatory elements genome-wide
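
The idea of ungapped, N-way transitive conservation mentioned above can be sketched as follows: a set of windows, one per sequence, counts as a hit only if every pair of windows in the set matches at enough positions. This is a brute-force illustration under that reading of the abstract, not MUSSA's implementation. The function names, window width, match threshold, and example sequences are all hypothetical.

```python
from itertools import combinations, product


def count_matches(a, b):
    """Number of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b))


def windows(sequence, width):
    """All (start, window) pairs of the given width, with no gaps allowed."""
    return [(i, sequence[i:i + width]) for i in range(len(sequence) - width + 1)]


def transitive_hits(sequences, width, min_matches):
    """Find window positions conserved across all sequences.

    A combination of windows (one per sequence) is reported only if every
    pair within it has at least `min_matches` identical positions.
    Brute force: cost grows as the product of sequence lengths, so this is
    only usable on very short toy inputs.
    """
    per_sequence = [windows(seq, width) for seq in sequences]
    hits = []
    for combo in product(*per_sequence):
        if all(count_matches(a[1], b[1]) >= min_matches
               for a, b in combinations(combo, 2)):
            hits.append(tuple(start for start, _ in combo))
    return hits


# Invented toy sequences sharing the 7-mer GATTACA at different offsets.
toy = ["TTGATTACA", "GATTACACC", "CGATTACAG"]
conserved = transitive_hits(toy, width=7, min_matches=7)
```

Lowering `min_matches` below `width` tolerates mismatches inside a window while still forbidding gaps, which is the trade-off between window size and threshold that the abstract says was explored.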

Caltech Authors

Functional Characterization of Transcription Factor Motifs Using Cross-species Comparison across Large Evolutionary Distances

Author: A Louvi
A Rogulja-Ortmann
A Stark
AC Edwards
AE Kel
AI Su
AS Adler
AV Morozov
B van Steensel
BD McCabe
BP Berman
Brian James
C Kwong
C Rushlow
C van Waveren
CM Bergman
CT Harbison
CW Whitfield
D Karolchik
D Porcelli
D Vlieghe
DE Newburger
E Kurant
E Segal
EJ Ward
Evgeny M. Zdobnov
F Casares
GD Stormo
Gene E. Robinson
HG Roider
HM Berman
Hugh M. Robertson
J DeZazzo
J Pinnell
J Wang
J Zeitlinger
J Zhu
JA Lynch
Jaebum Kim
JB Warner
JD Gibson
JD Storey
JG Gindhart Jr
JH Werren
John H. Werren
Joshua D. Gibson
JR Desjarlais
JZ Parrish
KM Bhat
LA Pennacchio
LD Ward
LF Sempere
LW Chang
M Ashburner
M Blanchette
M Boden
M Delorenzi
M Kanehisa
M Kellis
MA Crosby
MB Noyes
MC Frith
ME Fortini
MS Halfon
N Rajewsky
Oliver Niehuis
P Kheradpour
PK Sorger
R Garesse
R Gordân
RC Scarpulla
RD Finn
Ryan Cunningham
S Grossmann
S Robin
S Roy
S Sinha
SA Ramsey
Saurabh Sinha
Stefan Wyder
T Berleth
TE Creighton
TL Bailey
U Keich
V Matys
W Huang da
WC Xiong
WJ Nelson
WW Wasserman
Wyeth W. Wasserman
X Xie
X Zhou
XY Li
Y Haraguchi
Z Huang
Publication venue: Public Library of Science
Publication date: 01/01/2010

We address the problem of finding statistically significant associations between cis-regulatory motifs and functional gene sets, in order to understand the biological roles of transcription factors. We develop a computational framework for this task, whose features include a new statistical score for motif scanning, the use of different scores for predicting targets of different motifs, and new ways to deal with redundancies among significant motif–function associations. This framework is applied to the recently sequenced genome of the jewel wasp, Nasonia vitripennis, making use of the existing knowledge of motifs and gene annotations in another insect genome, that of the fruitfly. The framework uses cross-species comparison to improve the specificity of its predictions, and does so without relying upon non-coding sequence alignment. It is therefore well suited for comparative genomics across large evolutionary divergences, where existing alignment-based methods are not applicable. We also apply the framework to find motifs associated with socially regulated gene sets in the honeybee, Apis mellifera, using comparisons with Nasonia, a solitary species, to identify honeybee-specific associations
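
For readers unfamiliar with motif-to-function association testing, a common baseline is a one-sided hypergeometric test on the overlap between a motif's predicted targets and a functional gene set. The sketch below shows that baseline only. It is a standard test swapped in for illustration, not the new statistical score or the redundancy handling that the abstract describes, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(genome_size, set_size, n_targets, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    genome_size: total number of genes considered
    set_size:    genes annotated with the function of interest
    n_targets:   genes predicted as targets of the motif
    overlap:     predicted targets that are also in the functional set
    """
    total = comb(genome_size, n_targets)
    upper = min(set_size, n_targets)
    tail = sum(
        comb(set_size, i) * comb(genome_size - set_size, n_targets - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Invented numbers: 10 genes, 5 in the set, 5 predicted targets, all 5 overlap.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
```

When many motifs are tested against many gene sets, the resulting p-values need a multiple-testing correction before any association is called significant.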

CiteSeerX

Archive ouverte UNIGE

Assessing Computational Methods of Cis-Regulatory Module Prediction

Author: A Bruhat
A Siepel
A Sosinsky
A Visel
AB Rose
AG Clark
AL Halpern
AM Moses
B Prud'homme
B Shi
BK Peterson
BP Berman
BY Chan
Christina Leslie
CM Bergman
D Kolbe
D Papatsenko
DA Kleinjan
DC King
DE Schones
DM Jeziorska
DS Johnson
E Birney
E Davidson
E Emberly
E Segal
E Wingender
G Bejerano
GM Euskirchen
H Wang
H Weintraub
JB Warner
Jing Su
JL Kabat
JR Stone
JS Jakobsen
KH Surinya
KJ Won
L Li
LP Lim
M Bieda
M Blanchette
M Brudno
M Hasegawa
MC Frith
MD Schroeder
MD Wilson
MS Halfon
MZ Ludwig
N Bray
N Ghanem
N Gompel
N Pierstorff
ND Heintzman
O Hallikas
O Johansson
OV Kel-Margoulis
P Van Loo
PC FitzGerald
PJ Sabo
Q Zhou
R Godbout
RP Zinzen
S Aerts
S Batzoglou
S Karlin
S MacArthur
S Richards
S Sinha
Sarah A. Teichmann
SC Parker
SE Celniker
T Sandmann
T Strachan
T Waleev
Thomas A. Down
TL Bailey
TM Williams
V Ferretti
V Gotea
W Krivan
WW Wasserman
X He
XY Li
Publication venue: Public Library of Science
Publication date: 01/01/2010

Computational methods attempting to identify instances of cis-regulatory modules (CRMs) in the genome face a challenging problem of searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remains unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on species and composition of the sequences in question. When discriminating CRMs from non-coding regions, those methods considering evolutionary conservation have a stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypical clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods

CiteSeerX

MotEvo: integrated Bayesian probabilistic methods for inferring regulatory sites and motifs on multiple alignments of DNA sequences

Author: Arnold Phil
Erb Ionas
Molina Nacho
Pachkov Mikhail
van Nimwegen Erik
Publication venue
Publication date: 02/08/2017

Motivation: Probabilistic approaches for inferring transcription factor binding sites (TFBSs) and regulatory motifs from DNA sequences have been developed for over two decades. Previous work has shown that prediction accuracy can be significantly improved by incorporating features such as the competition of multiple transcription factors (TFs) for binding to nearby sites, the tendency of TFBSs for co-regulated TFs to cluster and form cis-regulatory modules, and explicit evolutionary modeling of conservation of TFBSs across orthologous sequences. However, currently available tools only incorporate some of these features, and significant methodological hurdles hampered their synthesis into a single consistent probabilistic framework. Results: We present MotEvo, an integrated suite of Bayesian probabilistic methods for the prediction of TFBSs and inference of regulatory motifs from multiple alignments of phylogenetically related DNA sequences, which incorporates all features just mentioned. In addition, MotEvo incorporates a novel model for detecting unknown functional elements that are under evolutionary constraint, and a new robust model for treating gain and loss of TFBSs along a phylogeny. Rigorous benchmarking tests on ChIP-seq datasets show that MotEvo's novel features significantly improve the accuracy of TFBS prediction, motif inference and enhancer prediction. Availability: Source code, a user manual and files with several example applications are available at www.swissregulon.unibas.ch. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

RERO DOC Digital Library

A Machine Learning Approach for Identifying Novel Cell Type–Specific Transcriptional Regulators of Myogenesis

Author: A Carmena
A Dastjerdi
A Erives
A Ivan
A Nose
A Paululat
A Siepel
A Subramanian
A Visel
A Woolfe
AA Philippakis
AC Groth
AG Nazina
AK Holloway
Alan M. Michelson
AM Michelson
B Estrada
B Hanczar
BL Black
BP Berman
Brian W. Busser
BW Busser
C Bourgouin
C Chang
C Jiang
C Klämbt
CA Berkes
CI Swanson
CT Ong
DN Arnosti
DT Odom
E Davidson
EE Hare
EN Olson
FC Wardle
G Hon
G Junion
G Leung
G Ranganayakulu
GE Crawford
GG Loots
H Brohmann
H Rouault
HP Shih
I Abnizova
I Costello
I Guyon
I Ovcharenko
I Reim
Ivan Ovcharenko
J Bischof
J Crocker
J Enriquez
J Ernst
J Shawe-Taylor
J Zeitlinger
JA Pederson
James W. Posakony
JD Pederson
JM Claycomb
JS Jakobsen
JW Mahaffey
K Jagla
K Robasky
K Senger
L Dubois
L Li
L Narlikar
Leila Taher
M Capovilla
M Frasch
M Ludwig
M Markstein
M Porsch
M Ruiz-Gomez
M Schwaiger
MA Beer
MB Noyes
MD Biggin
MF Berger
MI Arnone
MJ Blow
MK Baylies
MK Gross
Molly J. Bloom
MR Kantorovitz
MS Halfon
MV Taylor
N Negre
N Reeves
OL Griffith
P Tomancak
PJ Clyne
R Bodmer
R Galant
RG Ramsay
RJ Bryson-Richardson
RP Zinzen
S Barolo
S Knirr
S MacArthur
S Mahony
SA Ness
SB Carroll
SD Weatherbee
SJ Raudys
SM Gallo
SY Kim
T Jagla
T Sandmann
Terese Tansey
TL Bailey
U Grossniklaus
V Matys
V Tixier
Y Benjamini
YH Liu
Yongsok Kim
Z Han
Publication venue: Public Library of Science
Publication date: 08/03/2012

Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in largely similar patterns as their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes, and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers.
In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type-specific developmental gene expression patterns.

CiteSeerX