On Weight Matrix and Free Energy Models for Sequence Motif Detection
The problem of motif detection can be formulated as the construction of a
discriminant function to separate sequences of a specific pattern from
background. In computational biology, motif detection is used to predict DNA
binding sites of a transcription factor (TF), mostly based on the weight matrix
(WM) model or the Gibbs free energy (FE) model. However, despite their wide
application, theoretical analysis of these two models and their predictions is
still lacking. We derive asymptotic error rates of prediction procedures based
on these models under different data generation assumptions. This allows a
theoretical comparison between the WM-based and the FE-based predictions in
terms of asymptotic efficiency. Applications of the theoretical results are
demonstrated with empirical studies on ChIP-seq data and protein binding
microarray data. We find that, irrespective of underlying data generation
mechanisms, the FE approach shows higher or comparable predictive power
relative to the WM approach when the number of observed binding sites used for
constructing a discriminant decision is not too small. Comment: 23 pages, 1 figure and 4 tables
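To make the two discriminant functions in this abstract concrete, here is a minimal Python sketch. It assumes textbook forms that the abstract itself does not spell out: a log-odds weight matrix score against a uniform background, and a free-energy occupancy of the form 1 / (1 + exp(E - mu)) with energies in units of kT. The function names, the toy binding sites, and the way the energy matrix is derived from the weight matrix are all hypothetical and are not taken from the paper.

```python
import math

BASES = "ACGT"

def weight_matrix(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned binding sites."""
    width = len(sites[0])
    wm = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[j]] += 1
        total = sum(counts.values())
        wm.append({b: counts[b] / total for b in BASES})
    return wm

def wm_score(wm, seq, background=0.25):
    """Log-odds score of seq under the weight matrix versus a uniform background."""
    return sum(math.log(col[b] / background) for col, b in zip(wm, seq))

def fe_binding_probability(energy_matrix, seq, mu=0.0):
    """Occupancy 1 / (1 + exp(E(seq) - mu)); E is a sum of per-position energies in kT."""
    energy = sum(col[b] for col, b in zip(energy_matrix, seq))
    return 1.0 / (1.0 + math.exp(energy - mu))

# Toy aligned sites (hypothetical data, for illustration only).
sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
wm = weight_matrix(sites)

# Hypothetical energy matrix: mismatch energy relative to the most frequent base,
# so the consensus sequence has energy 0 at every position.
energy_matrix = [
    {b: -math.log(col[b] / max(col.values())) for b in BASES} for col in wm
]

for seq in ("TGACTC", "TGAGTA", "AAAAAA"):
    print(seq, round(wm_score(wm, seq), 3),
          round(fe_binding_probability(energy_matrix, seq), 3))
```

The weight matrix score is linear in the per-position terms, whereas the occupancy saturates for low-energy sequences; a classifier would threshold either quantity to separate sites from background.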
Formation of regulatory modules by local sequence duplication
Turnover of regulatory sequence and function is an important part of
molecular evolution. But what are the modes of sequence evolution leading to
rapid formation and loss of regulatory sites? Here, we show that a large
fraction of neighboring transcription factor binding sites in the fly genome
have formed from a common sequence origin by local duplications. This mode of
evolution is found to produce regulatory information: duplications can seed new
sites in the neighborhood of existing sites. Duplicate seeds evolve
subsequently by point mutations, often towards binding a different factor than
their ancestral neighbor sites. These results are based on a statistical
analysis of 346 cis-regulatory modules in the Drosophila melanogaster genome,
and a comparison set of intergenic regulatory sequence in Saccharomyces
cerevisiae. In fly regulatory modules, pairs of binding sites show
significantly enhanced sequence similarity up to distances of about 50 bp. We
analyze these data in terms of an evolutionary model with two distinct modes of
site formation: (i) evolution from independent sequence origin and (ii)
divergent evolution following duplication of a common ancestor sequence. Our
results suggest that pervasive formation of binding sites by local sequence
duplications distinguishes the complex regulatory architecture of higher
eukaryotes from the simpler architecture of unicellular organisms.
Real sequence effects on the search dynamics of transcription factors on DNA
Recent experiments show that transcription factors (TFs) indeed use the
facilitated diffusion mechanism to locate their target sequences on DNA in
living bacterial cells: TFs alternate between sliding motion along DNA and
relocation events through the cytoplasm. From simulations and theoretical
analysis we study the TF-sliding motion for a large section of the DNA-sequence
of a common E. coli strain, based on the two-state TF-model with a fast-sliding
search state and a recognition state enabling target detection. For the
probability of detecting the target before dissociating from DNA, the TF-search
times self-consistently depend heavily on whether or not an auxiliary operator
(an accessible sequence similar to the main operator) is present in the genome
section. Importantly, within our model the extent to which the interconversion
rates between search and recognition states depend on the underlying nucleotide
sequence is varied. A moderate dependence maximises the capability to
distinguish between the main operator and similar sequences. Moreover, these
auxiliary operators serve as starting points for DNA looping with the main
operator, yielding a spectrum of target detection times spanning several orders
of magnitude. Auxiliary operators are shown to act as funnels facilitating
target detection by TFs. Comment: 26 pages, 7 figures
Beyond position weight matrices: nucleotide correlations in transcription factor binding sites and their description
The identification of transcription factor binding sites (TFBSs) on genomic
DNA is of crucial importance for understanding and predicting regulatory
elements in gene networks. TFBS motifs are commonly described by Position
Weight Matrices (PWMs), in which each DNA base pair independently contributes
to the transcription factor (TF) binding, despite mounting evidence of
interdependence between base-pair positions. The recent availability of
genome-wide data on TF-bound DNA regions offers the possibility to revisit this
question in detail for TF binding in vivo. Here, we use available fly and
mouse ChIP-seq data, and show that the independent model generally does not
reproduce the observed statistics of TFBS, generalizing previous observations.
We further show that TFBS description and predictability can be systematically
improved by taking into account pairwise correlations in the TFBS via the
principle of maximum entropy. The resulting pairwise interaction model is
formally equivalent to the disordered Potts models of statistical mechanics and
it generalizes previous approaches to interdependent positions. Its structure
allows for co-variation of two or more base pairs, as well as secondary motifs.
Although models consisting of mixtures of PWMs also have this last feature, we
show that pairwise interaction models outperform them. The significant pairwise
interactions are found to be sparse and to occur predominantly between consecutive
base pairs. Finally, the use of a pairwise interaction model for the
identification of TFBSs is shown to give significantly different predictions
than a model based on independent positions.
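For orientation, the generic form of a pairwise maximum-entropy (Potts-type) model over a site of length L can be written as below. The symbols h_i, J_ij, L and Z are notation assumed here for illustration and are not taken from the abstract, and sign conventions vary between authors.

```latex
P(s_1,\dots,s_L) \;=\; \frac{1}{Z}\,
\exp\!\Big( \sum_{i=1}^{L} h_i(s_i) \;+\; \sum_{1 \le i < j \le L} J_{ij}(s_i, s_j) \Big),
\qquad s_i \in \{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\},

Z \;=\; \sum_{s'_1,\dots,s'_L}
\exp\!\Big( \sum_{i=1}^{L} h_i(s'_i) \;+\; \sum_{1 \le i < j \le L} J_{ij}(s'_i, s'_j) \Big).
```

Setting every J_ij to zero makes the distribution factorize over positions, which recovers an independent-position (PWM-like) description; a sparse set of nonzero J_ij between neighbouring positions would correspond to the kind of interactions the abstract reports.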
TherMos: Estimating protein-DNA binding energies from in vivo binding profiles
Accurately characterizing transcription factor (TF)-DNA affinity is a central goal of regulatory genomics. Although thermodynamics provides the most natural language for describing the continuous range of TF-DNA affinity, traditional motif discovery algorithms focus instead on classification paradigms that aim to discriminate 'bound' and 'unbound' sequences. Moreover, these algorithms do not directly model the distribution of tags in ChIP-seq data. Here, we present a new algorithm named Thermodynamic Modeling of ChIP-seq (TherMos), which directly estimates a position-specific binding energy matrix (PSEM) from ChIP-seq/exo tag profiles. In cross-validation tests on seven genome-wide TF-DNA binding profiles, one of which we generated via ChIP-seq on a complex developing tissue, TherMos predicted quantitative TF-DNA binding with greater accuracy than five well-known algorithms. We experimentally validated TherMos binding energy models for Klf4 and Esrrb, using a novel protocol to measure PSEMs in vitro. Strikingly, our measurements revealed strong nonadditivity at multiple positions within the two PSEMs. Among the algorithms tested, only TherMos was able to model the entire binding energy landscape of Klf4 and Esrrb. Our study reveals new insights into the energetics of TF-DNA binding in vivo and provides an accurate first-principles approach to binding energy inference from ChIP-seq and ChIP-exo data. © 2013 The Author(s).