    Rank-statistics based enrichment-site prediction algorithm developed for chromatin immunoprecipitation on chip experiments

    Background: High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. Tiling arrays are increasingly used in chromatin immunoprecipitation (IP) experiments (ChIP on chip). ChIP on chip facilitates the generation of genome-wide maps of in-vivo interactions between DNA-associated proteins including transcription factors and DNA. Analysis of the hybridization of an immunoprecipitated sample to a tiling array facilitates the identification of ChIP-enriched segments of the genome. These enriched segments are putative targets of antibody assayable regulatory elements. The enrichment response is not ubiquitous across the genome. Typically 5 to 10% of tiled probes manifest some significant enrichment. Depending upon the factor being studied, this response can drop to less than 1%. The detection and assessment of significance for interactions that emanate from non-canonical and/or un-annotated regions of the genome is especially challenging. This is the motivation behind the proposed algorithm. Results: We have proposed a novel rank and replicate statistics-based methodology for identifying and ascribing statistical confidence to regions of ChIP-enrichment. The algorithm is optimized for identification of sites that manifest low levels of enrichment but are true positives, as validated by alternative biochemical experiments. Although the method is described here in the context of ChIP on chip experiments, it can be generalized to any treatment-control experimental design. The results of the algorithm show a high degree of concordance with independent biochemical validation methods. The sensitivity and specificity of the algorithm have been characterized via quantitative PCR and independent computational approaches. 
Conclusion: The algorithm ranks all enrichment sites based on their intra-replicate ranks and inter-replicate rank consistency. Following the ranking, the method allows segmentation of sites based on a meta p-value, a composite array signal enrichment criterion, or a composite of these two measures. The sensitivities obtained subsequent to the segmentation of data using a meta p-value of 10^-5, an array signal enrichment of 0.2 and a composite of these two values are 88%, 87% and 95%, respectively.
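
As a rough illustration of the rank-and-replicate idea in the abstract above, the following Python sketch ranks probes within each replicate, converts ranks to empirical p-values, and combines them across replicates into a meta p-value. The abstract does not say how the meta p-value is computed; Fisher's method is swapped in here as an assumption, and all function names and data are hypothetical.

```python
import math

def rank_pvalues(scores):
    """Empirical upper-tail p-value per probe from its rank within one replicate.

    Rank 1 is the most enriched probe; ties are broken by input order.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    pvals = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        pvals[idx] = rank / n
    return pvals

def fisher_meta_pvalue(pvals):
    """Combine p-values with Fisher's method (assumed, not from the paper).

    The statistic -2 * sum(log p) follows a chi-square distribution with 2k
    degrees of freedom under independence; for even degrees of freedom the
    survival function has the closed form used below.
    """
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # equals statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def meta_pvalues(replicates):
    """Meta p-value per probe across replicates (one list of scores per replicate)."""
    per_rep = [rank_pvalues(r) for r in replicates]
    n_probes = len(replicates[0])
    return [fisher_meta_pvalue([p[i] for p in per_rep]) for i in range(n_probes)]

# Toy data: probe 1 is consistently the most enriched in all three replicates.
replicates = [
    [0.1, 2.0, 0.3, 0.2],
    [0.0, 1.5, 0.4, 0.1],
    [0.2, 1.8, 0.1, 0.3],
]
meta = meta_pvalues(replicates)
```

Sites could then be segmented by thresholding `meta`, alone or together with a signal-enrichment cutoff, which mirrors the segmentation options described in the conclusion.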

    Genomic characterization of Gli-activator targets in sonic hedgehog-mediated neural patterning

    Sonic hedgehog (Shh) acts as a morphogen to mediate the specification of distinct cell identities in the ventral neural tube through a Gli-mediated (Gli1-3) transcriptional network. Identifying Gli targets in a systematic fashion is central to the understanding of the action of Shh. We examined this issue in differentiating neural progenitors in mouse. An epitope-tagged Gli-activator protein was used to directly isolate cis-regulatory sequences by chromatin immunoprecipitation (ChIP). ChIP products were then used to screen custom genomic tiling arrays of putative Hedgehog (Hh) targets predicted from transcriptional profiling studies, surveying 50-150 kb of non-transcribed sequence for each candidate. In addition to identifying expected Gli-target sites, the data predicted a number of unreported direct targets of Shh action. Transgenic analysis of binding regions in Nkx2.2, Nkx2.1 (Titf1) and Rab34 established these as direct Hh targets. These data also facilitated the generation of an algorithm that improved in silico predictions of Hh target genes. Together, these approaches provide significant new insights into both tissue-specific and general transcriptional targets in a crucial Shh-mediated patterning process

    Discovering Motifs in Ranked Lists of DNA Sequences

    Computational methods for discovery of sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs that are inferred from ChIP–chip (chromatin immuno-precipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency, in many of the existing methods, to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights on transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic TF binding enhancement to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex that promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked. 
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim
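
To make the notion of rank imbalance concrete, here is a small Python sketch of a minimum hypergeometric score over a ranked list. The abstract does not name the statistic it uses, so this particular score, the function names, and the toy input are assumptions for illustration. Each ranked sequence is marked 1 if it contains the candidate motif and 0 otherwise; the score is the smallest hypergeometric tail probability over all possible top-n cutoffs.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n items are drawn without replacement from N, B of which are hits."""
    upper = min(n, B)
    hits = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return hits / comb(N, n)

def min_hypergeom_score(occurrence):
    """Smallest hypergeometric tail over all top-n cutoffs of a ranked 0/1 list.

    Returns (score, cutoff). The score is not itself a p-value because the
    cutoff is chosen after looking at the data; a multiple-cutoff correction
    would still be needed and is not shown here.
    """
    N = len(occurrence)
    B = sum(occurrence)
    best, best_n, seen = 1.0, 0, 0
    for n in range(1, N + 1):
        seen += occurrence[n - 1]
        p = hypergeom_tail(N, B, n, seen)
        if p < best:
            best, best_n = p, n
    return best, best_n

# Motif occurrences concentrated at the top of the ranking.
score, cutoff = min_hypergeom_score([1, 1, 1, 0, 0, 0, 0, 0])
```

Because the cutoff is chosen by the data, this avoids fixing the target/background partition in advance, which is the first of the four challenges listed in the abstract.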

    Genome-Wide Analysis of Binding Sites and Direct Target Genes of the Orphan Nuclear Receptor NR2F1/COUP-TFI

    Identification of bona fide direct nuclear receptor gene targets has been challenging but essential for understanding regulation of organismal physiological processes. We describe a methodology to identify transcription factor binding sites and target genes in vivo by intersecting microarray data, computational binding site queries, and evolutionary conservation. We provide detailed experimental validation of each step and, as a proof of principle, utilize the methodology to identify novel direct targets of the orphan nuclear receptor NR2F1 (COUP-TFI). The first step involved validation of microarray gene expression profiles obtained from wild-type and COUP-TFI(-/-) inner ear tissues. Secondly, we developed a bioinformatic tool to search for COUP-TFI DNA binding sites in genomes, using a classification-type Hidden Markov Model trained with 49 published COUP-TF response elements. We next obtained a ranked list of candidate in vivo direct COUP-TFI targets by integrating the microarray and bioinformatics analyses according to the degree of binding site evolutionary conservation and microarray statistical significance. Lastly, as proof-of-concept, 5 specific genes were validated for direct regulation. For example, the fatty acid binding protein 7 (Fabp7) gene is a direct COUP-TFI target in vivo because: i) we identified 2 conserved COUP-TFI binding sites in the Fabp7 promoter; ii) Fabp7 transcript and protein levels are significantly reduced in COUP-TFI(-/-) tissues and in MEFs; iii) chromatin immunoprecipitation demonstrates that COUP-TFI is recruited to the Fabp7 promoter in vitro and in vivo; and iv) it is associated with active chromatin having increased H3K9 acetylation and enrichment for CBP and SRC-1 binding in the newborn brain. We have developed and validated a methodology to identify in vivo direct nuclear receptor target genes.
This bioinformatics tool can be modified to scan for response elements of transcription factors, cis-regulatory modules, or any flexible DNA pattern.

    Transcriptional regulation of neurogenesis by the proneural factor Ascl1

    Master's dissertation in Bioinformatics. This project aims to provide a better understanding of the transcriptional regulation of neurogenesis by the proneural factor Ascl1. The first genome-wide characterization of the Ascl1 transcriptional program in the embryonic mouse brain was performed by ChIP-chip. However, the restriction to proximal promoter regions, which excludes genes bound by Ascl1 at distal enhancers, and the need to validate the model with a more robust experimental approach prompted the use of ChIP-seq. Genome-wide mapping of Ascl1 binding sites at higher resolution reveals 3054 high-confidence binding regions in the ventral telencephalon. The chromatin states of genomic regions associated with Ascl1 recruitment were also characterised, showing that these regions bear marks of distal enhancers as well as of proximal promoter regions. Further integration of expression profiling data from Ascl1 loss-of-function (LoF) experiments identifies 643 target genes. Results from functional annotation of these targets corroborate previous findings, showing that Ascl1 coordinates neurogenesis by regulating a large number of target genes with a wide variety of biological functions, associated with different stages of neurogenesis. Additional investigations should address how Ascl1 coordinates this complex transcriptional program along the neuronal lineage. This could explore a possible crosstalk with the Notch program, taking advantage of the 105 regulatory regions identified where Ascl1 is co-recruited by RBPJ, as assessed by ChIP-seq.

    On the detection and refinement of transcription factor binding sites using ChIP-Seq data

    Coupling chromatin immunoprecipitation (ChIP) with recently developed massively parallel sequencing technologies has enabled genome-wide detection of protein–DNA interactions with unprecedented sensitivity and specificity. This new technology, ChIP-Seq, presents opportunities for in-depth analysis of transcription regulation. In this study, we explore the value of using ChIP-Seq data to better detect and refine transcription factor binding sites (TFBS). We introduce a novel computational algorithm named Hybrid Motif Sampler (HMS), specifically designed for TFBS motif discovery in ChIP-Seq data. We propose a Bayesian model that incorporates sequencing depth information to aid motif identification. Our model also allows intra-motif dependency to describe more accurately the underlying motif pattern. Our algorithm combines stochastic sampling and deterministic ‘greedy’ search steps into a novel hybrid iterative scheme. This combination accelerates the computation process. Simulation studies demonstrate favorable performance of HMS compared to other existing methods. When applying HMS to real ChIP-Seq datasets, we find that (i) the accuracy of existing TFBS motif patterns can be significantly improved; and (ii) there is significant intra-motif dependency inside all the TFBS motifs we tested; modeling these dependencies further improves the accuracy of these TFBS motif patterns. These findings may offer new biological insights into the mechanisms of transcription factor regulation

    Inferring Condition-Specific Targets of Human TF-TF Complexes Using ChIP-seq Data

    Background: Transcription factors (TFs) often interact with one another to form TF complexes that bind DNA and regulate gene expression. Many databases have been created to describe known TF complexes identified by either mammalian two-hybrid experiments or data mining. Lately, a wealth of ChIP-seq data on human TFs under different experimental conditions has become available, making it possible to investigate condition-specific (cell type and/or physiologic state) TF complexes and their target genes. Results: Here, we developed a systematic pipeline to infer Condition-Specific Targets of human TF-TF complexes (called the CST pipeline) by integrating ChIP-seq data and TF motifs. In total, we predicted 2,392 TF complexes and 13,504 high-confidence or 127,994 low-confidence regulatory interactions amongst TF complexes and their target genes. We validated our predictions by (i) comparing predicted TF complexes to external TF complex databases, (ii) validating selected target genes of TF complexes using ChIP-qPCR and RT-PCR experiments, and (iii) analysing target genes of select TF complexes using gene ontology enrichment to demonstrate the accuracy of our work. Finally, the predicted results above were integrated and employed to construct a CST database. Conclusions: We developed a methodology to construct the CST database, which contributes to the analysis of transcriptional regulation and the identification of novel TF-TF complex formation under a given condition. This database also allows users to visualize condition-specific TF regulatory networks through a user-friendly web interface.

    High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

    Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments.
Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.

    Transcription factor binding specificity and occupancy : elucidation, modelling and evaluation

    The major contributions of this thesis are addressing the need for an objective quality evaluation of a transcription factor binding model, demonstrating the value of the tools developed to this end and elucidating how in vitro and in vivo information can be utilized to improve TF binding specificity models. Accurate elucidation of TF binding specificity remains an ongoing challenge in gene regulatory research. Several in vitro and in vivo experimental techniques have been developed followed by a proliferation of algorithms, and ultimately, the binding models. This increase led to a choice problem for the end users: which tools to use, and which is the most accurate model for a given TF? Therefore, the first section of this thesis investigates the motif assessment problem: how scoring functions, choice and processing of benchmark data, and statistics used in evaluation affect motif ranking. This analysis revealed that TF motif quality assessment requires a systematic comparative analysis, and that scoring functions used have a TF-specific effect on motif ranking. These results advised the design of a Motif Assessment and Ranking Suite MARS, supported by PBM and ChIP-seq benchmark data and an extensive collection of PWM motifs. MARS implements consistency, enrichment, and scoring and classification-based motif evaluation algorithms. Transcription factor binding is also influenced and determined by contextual factors: chromatin accessibility, competition or cooperation with other TFs, cell line or condition specificity, binding locality (e.g. proximity to transcription start sites) and the shape of the binding site (DNA-shape). In vitro techniques do not capture such context; therefore, this thesis also combines PBM and DNase-seq data using a comparative k-mer enrichment approach that compares open chromatin with genome-wide prevalence, achieving a modest performance improvement when benchmarked on ChIP-seq data. 
Finally, since statistical and probabilistic methods cannot capture all the information that determines binding, a machine learning approach (XGBoost) was implemented to investigate how the features contribute to TF specificity and occupancy. This combinatorial approach improves the predictive ability of TF specificity models, with the most predictive feature being chromatin accessibility, while DNA-shape and conservation information both significantly improve on the baseline model of k-mer and DNase data. The results and the tools introduced in this thesis are useful for systematic comparative analysis (via MARS) and a combinatorial approach to modelling TF binding specificity, including appropriate feature engineering practices for machine learning modelling.
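
The comparative k-mer enrichment idea mentioned in this abstract (open chromatin versus genome-wide prevalence) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thesis abstract does not give its scoring formula, so the smoothed log2 frequency ratio, the pseudocount, the function names, and the toy sequences below are all hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a collection of DNA sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def kmer_log_enrichment(foreground, background, k, pseudocount=1.0):
    """log2 ratio of smoothed k-mer frequencies, foreground over background.

    Positive values mean the k-mer is relatively more frequent in the
    foreground (e.g. open-chromatin regions) than in the background
    (e.g. a genome-wide sample). The pseudocount is an assumed smoothing
    choice that avoids division by zero for unseen k-mers.
    """
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    vocab = set(fg) | set(bg)
    fg_denom = sum(fg.values()) + pseudocount * len(vocab)
    bg_denom = sum(bg.values()) + pseudocount * len(vocab)
    scores = {}
    for kmer in vocab:
        f = (fg[kmer] + pseudocount) / fg_denom
        b = (bg[kmer] + pseudocount) / bg_denom
        scores[kmer] = math.log2(f / b)
    return scores

# Toy sequences standing in for open-chromatin and genome-wide samples.
open_chromatin = ["ACGTACGT", "TACGTA"]
genome_sample = ["AAAAAAAA", "CCCCACGT"]
scores = kmer_log_enrichment(open_chromatin, genome_sample, k=4)
```

Scores of this kind could serve as one feature alongside accessibility, DNA shape, and conservation in a downstream classifier, in the spirit of the combinatorial approach the abstract describes.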