Use of ChIP-Seq data for the design of a multiple promoter-alignment method

By Ionas Erb, Juan R. González-Vallinas, Giovanni Bussotti, Enrique Blanco, Eduardo Eyras and Cédric Notredame


We address the challenge of regulatory sequence alignment with a new method, Pro-Coffee, a multiple aligner specifically designed for homologous promoter regions. Pro-Coffee uses a dinucleotide substitution matrix estimated on alignments of functional binding sites from TRANSFAC. We designed a validation framework using several thousand families of orthologous promoters and used this dataset to evaluate how accurately each method identifies true human orthologs among their paralogs. Whereas other methods achieve on average 73.5% accuracy, and 77.6% when trained on that same dataset, Pro-Coffee reaches 80.4%. We then applied a novel validation procedure based on multi-species ChIP-seq data, testing trained and untrained methods for their capacity to correctly align experimentally detected binding sites. Whereas the average number of correctly aligned sites for two transcription factors is 284 for default methods and 316 for trained methods, Pro-Coffee achieves 331, 16.5% above the default average. We find a high correlation between a method's performance when classifying orthologs and its ability to correctly align proven binding sites. Not only does this have interesting biological consequences, it also allows us to conclude that any method trained on the ortholog dataset will produce functionally more informative alignments.
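To make the notion of a dinucleotide substitution matrix concrete, the following Python sketch estimates log-odds scores for pairs of dinucleotides from gap-free, pairwise-aligned binding sites. It is an illustration under stated assumptions, not the procedure used for Pro-Coffee: the abstract does not say how the matrix was estimated, and the function name `dinucleotide_log_odds`, the pseudocount, and the log-odds formulation are all hypothetical choices made here. Input sequences are assumed to be uppercase strings over A, C, G and T.

```python
import math
from collections import Counter
from itertools import product

# All 16 dinucleotides over the DNA alphabet.
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_log_odds(aligned_pairs, pseudocount=1.0):
    """Estimate symmetric log-odds scores for pairs of dinucleotides.

    aligned_pairs: iterable of (seq_a, seq_b), equal-length gap-free strings.
    Returns a dict mapping (dinuc_x, dinuc_y) to a log2 odds score.
    This is an illustrative estimator, not the published Pro-Coffee matrix.
    """
    pair_counts = Counter()
    single_counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned sequences must have equal length")
        # Slide a window of width 2 over both sequences in parallel.
        for i in range(len(a) - 1):
            x, y = a[i:i + 2], b[i:i + 2]
            # Count both orders so the resulting matrix is symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            single_counts[x] += 1
            single_counts[y] += 1

    # Pseudocounts keep unseen pairs from producing log(0).
    total_pairs = sum(pair_counts.values()) + pseudocount * len(DINUCS) ** 2
    total_single = sum(single_counts.values()) + pseudocount * len(DINUCS)

    scores = {}
    for x in DINUCS:
        p_x = (single_counts[x] + pseudocount) / total_single
        for y in DINUCS:
            p_y = (single_counts[y] + pseudocount) / total_single
            q_xy = (pair_counts[(x, y)] + pseudocount) / total_pairs
            # Observed aligned frequency relative to chance co-occurrence.
            scores[(x, y)] = math.log2(q_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Toy aligned sites, invented for this example only.
    sites = [("ACGT", "ACGT"), ("ACGT", "ACGA"), ("TTGA", "TTGA")]
    matrix = dinucleotide_log_odds(sites)
    print(matrix[("AC", "AC")], matrix[("AC", "TT")])
```

Scoring dinucleotides rather than single bases lets the matrix reflect dependencies between neighbouring positions. With a toy input as small as the one above, the pseudocounts dominate, so only the relative ordering of the scores is meaningful.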

Topics: Methods Online
Publisher: Oxford University Press
Provided by: PubMed Central