Metamotifs--a generative model for building families of nucleotide position weight matrices.

Author: Down Thomas A
Hubbard Tim Jp
Piipari Matias
Publication venue: BMC Bioinformatics
Publication date: 25/06/2010

BACKGROUND: Development of high-throughput methods for measuring DNA interactions of transcription factors, together with computational advances in short motif inference algorithms, is expanding our understanding of transcription factor binding site motifs. The consequent growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data, and in sensitively detecting regulatory motifs similar to previously known ones in novel sequence.

RESULTS: We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif', describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input PWM motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.

CONCLUSIONS: We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase motif inference sensitivity. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with them. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks, especially those related to de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.

RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are
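
As a rough illustration of what a probabilistic model over PWM columns can look like, the sketch below scores a PWM against a family model. It assumes that each metamotif column is a Dirichlet distribution over nucleotide probability vectors; that form is an assumption made here for illustration and is not stated in the abstract. The function names `dirichlet_log_density` and `metamotif_log_density` are hypothetical and are not part of NestedMICA.

```python
import math

def dirichlet_log_density(column, alpha):
    """Log density of a probability vector `column` under Dirichlet(alpha)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(p) for a, p in zip(alpha, column))

def metamotif_log_density(pwm, metamotif):
    """Sum of per-column log densities (columns treated as independent).

    `pwm` is a list of columns, each a probability vector over A, C, G, T.
    `metamotif` is a list of Dirichlet parameter vectors, one per column.
    Both are assumed to be aligned and of equal length.
    """
    if len(pwm) != len(metamotif):
        raise ValueError("pwm and metamotif must have the same number of columns")
    return sum(dirichlet_log_density(c, a) for c, a in zip(pwm, metamotif))

# Hypothetical two-column family that prefers A in both columns.
family = [[10.0, 2.0, 2.0, 2.0], [10.0, 2.0, 2.0, 2.0]]
member = [[0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]]
outsider = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]

member_score = metamotif_log_density(member, family)
outsider_score = metamotif_log_density(outsider, family)
```

Under these assumptions, a PWM that follows the family pattern receives a higher log density than one that does not, which is the property a family prior or a family-based classifier would rely on.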

Apollo (Cambridge)

King's Research Portal

GRISOTTO: A greedy approach to improve combinatorial algorithms for motif discovery with prior knowledge

Author: A Valouev
Alexandra M Carvalho
AM Carvalho
AP Fejes
Arlindo L Oliveira
C Deremble
C Lee
CT Harbison
D Ucar
E Segal
E Valen
F Daenen
G Paillard
G Paillard
G Pavesi
GC Yuan
I Lafontaine
I Lafontaine
I Lafontaine
IV Kulakovskiy
JV Ponomarenko
KD MacIsaac
L Marsan
L Narlikar
L Narlikar
M Hu
M Kellis
MF Sagot
N Pisanti
R Gordân
R Gordân
R Gordân
R Pudimat
R Siddharthan
RA O'Flanagan
RG Beiko
S Sinha
T Wang
TL Bailey
TL Bailey
V Matys
WW Wasserman
X Chen
Y Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Position-specific priors (PSPs) have been used with success to boost EM and Gibbs sampler-based motif discovery algorithms. PSP information has been computed from different sources, including orthologous conservation, DNA duplex stability, and nucleosome positioning. Prior information has not yet been used in the context of combinatorial algorithms. Moreover, priors have been used only independently, and the gain of combining priors from different sources has not yet been studied.

Results: We extend RISOTTO, a combinatorial algorithm for motif discovery, by post-processing its output with a greedy procedure that uses prior information. PSPs from different sources are combined into a scoring criterion that guides the greedy search procedure. The resulting method, called GRISOTTO, was evaluated over 156 yeast TF ChIP-chip sequence-sets commonly used to benchmark prior-based motif discovery algorithms. Results show that GRISOTTO is at least as accurate as twelve other state-of-the-art approaches for the same task, even without combining priors. Furthermore, by considering combined priors, GRISOTTO is considerably more accurate than the state-of-the-art approaches for the same task. We also show that PSPs improve GRISOTTO's ability to retrieve motifs from mouse ChIP-seq data, indicating that the proposed algorithm can be applied to data from a different technology and for a higher eukaryote.

Conclusions: The conclusions of this work are twofold. First, post-processing the output of combinatorial algorithms by incorporating prior information leads to a very efficient and effective motif discovery method. Second, combining priors from different sources is even more beneficial than considering them separately.
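
The abstract does not say how priors from different sources are combined, so the following is only a minimal sketch of one simple option: multiply the per-position prior tracks elementwise and renormalise. The function `combine_priors` and the two example tracks are hypothetical and do not reproduce GRISOTTO's scoring criterion.

```python
def combine_priors(tracks):
    """Combine per-position prior tracks by elementwise product, then renormalise.

    `tracks` is a list of equal-length lists of non-negative weights, one list
    per prior source. The product rule is an illustrative assumption.
    """
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all prior tracks must cover the same positions")
    combined = [1.0] * length
    for track in tracks:
        combined = [c * p for c, p in zip(combined, track)]
    total = sum(combined)
    if total == 0:
        raise ValueError("combined prior is zero everywhere")
    return [c / total for c in combined]

# Hypothetical prior tracks over four candidate positions.
conservation = [0.1, 0.4, 0.4, 0.1]
nucleosome = [0.25, 0.5, 0.125, 0.125]
combined = combine_priors([conservation, nucleosome])
```

With a product rule, a position has to be supported by every source to keep a high weight; a weighted average would be a more permissive alternative.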

PriorsEditor: a tool for the creation and use of positional priors in motif discovery

Author: Bailey
Bellora
Duret
F. Drablos
K. Klepper
Kolbe
Lahdesmaki
Matys
Narlikar
Ravasi
Segal
Stormo
Wasserman
Publication venue: Oxford University Press

Summary: Computational methods designed to discover transcription factor binding sites in DNA sequences often make many false predictions. One way to improve accuracy in motif discovery is to rely on positional priors to focus the search on parts of a sequence that are considered more likely to contain functional binding sites. We present here a program called PriorsEditor that can be used to create such positional priors tracks based on a combination of several features, including phylogenetic conservation, nucleosome occupancy, histone modifications, physical properties of the DNA helix and many more.
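
To make the idea of focusing a search with a positional prior concrete, here is a minimal sketch that weights a PWM scan by a prior track. It assumes exactly one site per sequence and a uniform background, and the names `window_likelihood_ratio` and `site_posteriors` are hypothetical; none of this is PriorsEditor's own interface.

```python
BACKGROUND = 0.25  # assumed uniform background probability per base

def window_likelihood_ratio(window, pwm):
    """Likelihood of `window` under the PWM divided by its background likelihood."""
    ratio = 1.0
    for base, column in zip(window, pwm):
        ratio *= column[base] / BACKGROUND
    return ratio

def site_posteriors(sequence, pwm, prior):
    """Posterior over site start positions, assuming exactly one site.

    `prior` holds one non-negative weight per possible start position.
    """
    width = len(pwm)
    n_starts = len(sequence) - width + 1
    if len(prior) != n_starts:
        raise ValueError("prior must have one entry per start position")
    weights = [
        prior[i] * window_likelihood_ratio(sequence[i:i + width], pwm)
        for i in range(n_starts)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-column motif that prefers "TA".
pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
sequence = "TAGGTA"  # two equally good matches, at starts 0 and 4

flat = site_posteriors(sequence, pwm, [0.2] * 5)
focused = site_posteriors(sequence, pwm, [0.05, 0.05, 0.05, 0.05, 0.8])
```

With a flat prior the two matches are indistinguishable; the focused prior shifts the posterior towards the match at start 4, which is how a priors track can suppress a spurious hit.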

Public Library of Science (PLOS)

Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources

Author: A Ambesi-Impiombato
A Bernard
A Beyer
A Sandelin
A Sandelin
A Siepel
AFA Smit
Alistair G. Rust
B Ren
CE Lawrence
CL Warren
CP Robert
CT Harbison
D GuhaThakurta
D Husmeier
D Husmeier
David Jones
DB Gordon
DJ Reiss
DJ Wilkinson
DT Holloway
DT Holloway
E Blanco
E Segal
E Segal
E Wingender
EH Davidson
G Chen
G Thijs
G Thijs
GD Stormo
GE Crawford
H Huang
H Lähdesmäki
H Steck
Harri Lähdesmäki
Ilya Shmulevich
IV Bajić
J Taylor
JD Hughes
JM Claverie
K Quandt
K Thomas
KD MacIsaac
KP Murphy
L Hertzberg
L Narlikar
L Narlikar
L Narlikar
L Zhang
M Eisenstein
M Kellis
M Levine
M Tompa
MA Beer
MC Frith
MF Berger
MJL de Hoon
ML Bulyk
N Friedman
N Rajewsky
ND Heintzman
O Hallikas
OV Kel-Margoulis
Q Zhou
R Siddharthan
R Staden
S Cawley
S Mukherjee
S Sinha
S Sinha
SB Montgomery
SJ Maerkl
SP Brooks
ST Jensen
T Chen
T Fawcett
T Reguly
TD Wu
TI Lee
TL Bailey
TL Bailey
VD Marinescu
W Pan
WJ Kent
WP Lehrach
WW Wasserman
X Liu
X Xie
XS Liu
Y Barash
Y Barash
Y Qi
Y Tamada
Publication venue: Public Library of Science
Publication date: 01/03/2008

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, a web tool, source code and supplementary data are available at: http://www.probtf.org

High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints

Author: Guo Yuchun
Mahony Shaun
Publication venue: Public Library of Science (PLoS)
Publication date: 01/03/2012

An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal key aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, genome wide event finding and motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1. 
The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.

DSpace@MIT

Proceedings of the 2008 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference

Author: A Churbanov
A Churbanov
A Fujita
A Gyenesei
A Hijikata
A Rawat
A Shipra
AA Ptitsyn
AA Ptitsyn
AA Ptitsyn
AW Schreiber
B Roux
CA Bottoms
CB Giles
D Quest
D Sean
D Wilkins
Dawn Wilkins
ES Chen
G Gamberoni
H Hong
H Liu
H Meng
H Xu
HM Bovelstad
I Fishel
I Medina
James C Fuscoe
Jonathan D Wren
JS Yuan
JS Zielinski
JW Fan
K Thomson
L Guo
L Hertzberg
L Narlikar
L Shi
LK Schnackenberg
LL Elo
M Chae
M Landry
M Mete
M Mete
M Pirooznia
MA Hibbs
MD Dyer
MF Burkart
MG Dozmorov
MG Dozmorov
MK Das
N Mei
ND Mukhopadhyay
O Uzuner
P Li
P Minguez
QH Zhu
R Loganantharaj
RL Frank
RS Wang
S Gao
S Martin
S Sonnenburg
S Winters-Hilt
S Winters-Hilt
S Winters-Hilt
S Winters-Hilt
S Winters-Hilt
S Yuan
SB Montgomery
SM Bridges
Stephen Winters-Hilt
Susan Bridges
T Huan
T Lee
V Kulkarni
V Nagarajan
VI Torvik
WK Lim
WS Sanders
X Chen
Y Ding
Y Gusev
Y Huang
Y Lin
Yuriy Gusev
Z Su
Z Yu
Publication venue: BioMed Central
Publication date: 01/01/2008

Transcription factor site dependencies in human, mouse and rat genomes

Author: A Di Cara
A Gyenesei
A Sandelin
A Sandelin
A Tomovic
A Tomovic
AG Jegga
AH Brivanlou
AJ Walhout
Andrija Tomovic
AV Morozov
B Lenhard
C Kunsch
CC Liu
D Choi
D GuhaThakurta
DC King
DE Schones
DH Crouch
Edward J Oakeley
G Caretti
G Robertson
G Zhao
H Klein
H Wang
IJ Donaldson
IJ Donaldson
J Carabana
J Karlseder
L Narlikar
L Narlikar
M Blanchette
M Defrance
Michael Stadler
O Puig
PR van Ginkel
R Sharan
R Sharan
S Impey
S Mahony
SJ Ho Sui
SM Kielbasa
T Mahmoudi
V Ferretti
W Thompson
WB Alkema
WW Wasserman
X Yan
X Zhang
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: It is known that transcription factors frequently act together to regulate gene expression in eukaryotes. In this paper we describe a computational analysis of transcription factor site dependencies in human, mouse and rat genomes.

Results: Our approach for quantifying tendencies of transcription factor binding sites to co-occur is based on a binding site scoring function that incorporates dependencies between positions and on information about the structural class of each transcription factor (major/minor groove binder); it also considers the possible implications of varying GC content of the sequences. Significant tendencies (dependencies) have been detected by non-parametric statistical methodology (permutation tests). The results have been evaluated in several ways: reports from literature (many of the significant dependencies between transcription factors have previously been confirmed experimentally); dependencies between transcription factors are not biased due to similarities in their DNA-binding sites; the number of dependent transcription factors that belong to the same functional and structural class is significantly higher than would be expected by chance; supporting evidence from GO clustering of target genes. Based on dependencies between two transcription factor binding sites (second-order dependencies), it is possible to construct higher-order dependencies (networks). Moreover, results about transcription factor binding site dependencies can be used for prediction of groups of dependent transcription factors on a given promoter sequence. Our results, as well as a scanning tool for predicting groups of dependent transcription factor binding sites, are available on the Internet.

Conclusion: We show that the computational analysis of transcription factor site dependencies is a valuable complement to experimental approaches for discovering transcription regulatory interactions and networks. Scanning promoter sequences with dependent groups of transcription factor binding sites improves the quality of transcription factor predictions.
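
The permutation-test idea mentioned in the results can be sketched as follows. This is a generic co-occurrence test, not the authors' scoring function: it assumes each promoter is already labelled with the presence (1) or absence (0) of a site for factor A and for factor B, and the function name and example data are hypothetical.

```python
import random

def cooccurrence_p_value(has_a, has_b, n_permutations=999, seed=0):
    """One-sided permutation p-value for co-occurrence of two site types.

    The statistic is the number of promoters containing both sites. The
    labels for B are shuffled across promoters to build the null distribution.
    """
    rng = random.Random(seed)
    observed = sum(a and b for a, b in zip(has_a, has_b))
    shuffled = list(has_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        permuted = sum(a and b for a, b in zip(has_a, shuffled))
        if permuted >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical presence/absence labels over 40 promoters.
together = [1] * 10 + [0] * 30
apart = [0] * 10 + [1] * 10 + [0] * 20

p_together = cooccurrence_p_value(together, together)
p_apart = cooccurrence_p_value(together, apart)
```

Shuffling labels keeps the number of sites per factor fixed, so the test asks only whether the overlap is larger than chance placement would give. It does not correct for GC content or motif similarity, which the abstract reports handling separately.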

The Novartis Repository

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

Author: A Bernal
A Culotta
A Feelders
AE Kel
AL Berger
AY Ng
C Burge
CM Bishop
D Cai
D Grossman
D Heckerman
D Klein
E Redhead
E Segal
F Pernkopf
G Yeo
GD Stormo
H Wallach
H Wettig
HE Peckham
I Ben-Gal
Ivo Grosse
J Cerquides
J Davis
J Goodman
J Grau
J Keilwagen
Jan Grau
Jens Keilwagen
L Narlikar
M Arita
M Meila-Predoviciu
M Tompa
M Zhang
MI Jordan
NK Kim
O Schulte
O Yakhnenko
P Grünwald
R Castelo
R Castelo
R Greiner
R Staden
S Chen
S Sonnenburg
SL Salzberg
Stefan Posch
T Fawcett
TH Kim
TM Chen
WL Buntine
Y Barash
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.

Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly-used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.

Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. The derived prior is implemented in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.
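
For readers unfamiliar with the product-Dirichlet prior that the paper generalizes, the sketch below shows its most common use on a position weight matrix: each column gets its own Dirichlet prior, whose parameters act as pseudocounts. It illustrates only the standard posterior-mean estimate, not the generalized prior derived in the paper, and `posterior_mean_column` is a hypothetical helper rather than a Jstacs function.

```python
def posterior_mean_column(counts, pseudocounts):
    """Posterior mean of one PWM column under a Dirichlet prior.

    `counts` are observed nucleotide counts (A, C, G, T) at one position;
    `pseudocounts` are the Dirichlet parameters for that position.
    """
    total = sum(counts) + sum(pseudocounts)
    return [(n + a) / total for n, a in zip(counts, pseudocounts)]

def posterior_mean_pwm(count_matrix, pseudocounts):
    """Apply the same Dirichlet prior independently to every column."""
    return [posterior_mean_column(column, pseudocounts) for column in count_matrix]

# Hypothetical counts from four aligned binding sites, two positions wide.
count_matrix = [[3, 1, 0, 0], [0, 0, 4, 0]]
pwm = posterior_mean_pwm(count_matrix, [1.0, 1.0, 1.0, 1.0])
```

Because the pseudocounts enter the estimate directly, two studies that use different pseudocounts are effectively starting from different prior information, which is the comparability problem the abstract raises.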