
Variable structure motifs for transcription factor binding sites

By John E. Reid, Kenneth J. Evans, Nigel Dyer, Lorenz Wernisch and Sascha Ott


Background: Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.

Results: We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.

Conclusions: We have presented a new gapped PWM model for variable length DNA binding sites that is neither too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
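
As a rough illustration of what a gapped PWM for variable length sites could look like, the Python sketch below scores a candidate site against a PWM in which one column may be skipped. It is not the authors' algorithm: the single optional column, the uniform background, the function names and the toy probabilities are all assumptions made for this example.

```python
import math

def log_odds(column, base, background=0.25):
    """Log2 odds of one base under a PWM column versus a uniform background."""
    return math.log2(column[base] / background)

def score_fixed(pwm, site):
    """Score a site whose length equals the number of PWM columns."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(log_odds(col, base) for col, base in zip(pwm, site))

def score_gapped(pwm, optional_index, site):
    """Score a site under a PWM whose column `optional_index` may be absent.

    Returns (score, used_optional_column). Assumption: exactly one
    optional column, so sites are either full length or one base shorter.
    """
    if len(site) == len(pwm):
        return score_fixed(pwm, site), True
    if len(site) == len(pwm) - 1:
        reduced = pwm[:optional_index] + pwm[optional_index + 1:]
        return score_fixed(reduced, site), False
    raise ValueError("site length is incompatible with this gapped PWM")

# Hypothetical three-column PWM; the middle column is the optional one.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    uniform,
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

long_score, long_used = score_gapped(pwm, 1, "ACT")
short_score, short_used = score_gapped(pwm, 1, "AT")
```

In this toy setting the two and three base sites share the same flanking columns, so a single set of parameters describes both binding configurations, which is the kind of interpretability the abstract refers to.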

Topics: QH426
Publisher: BioMed Central Ltd.
Year: 2010
OAI identifier: oai:wrap.warwick.ac.uk:3006

