16 research outputs found

    GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments

    Get PDF
    Summary: Accurate prediction of transcription factor binding motifs that are enriched in a collection of sequences remains a computational challenge. Here we report on GimmeMotifs, a pipeline that incorporates an ensemble of computational tools to predict motifs de novo from ChIP-sequencing (ChIP-seq) data. Similar redundant motifs are compared using the weighted information content (WIC) similarity score and clustered using an iterative procedure. A comprehensive output report is generated with several different evaluation metrics to compare and evaluate the results. Benchmarks show that the method performs well on human and mouse ChIP-seq datasets. GimmeMotifs consists of a suite of command-line scripts that can be easily implemented in a ChIP-seq analysis pipeline

    Transcription Factor-DNA Binding Via Machine Learning Ensembles

    Full text link
    We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWM's) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over the (a) motif scanning method and (b) the coexpression-based association method. Top motif outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.Comment: 33 page

    Explicit equilibrium modeling of transcription-factor binding and gene regulation

    Get PDF
    We have developed a computational model that predicts the probability of transcription factor binding to any site in the genome. GOMER (generalizable occupancy model of expression regulation) calculates binding probabilities on the basis of position weight matrices, and incorporates the effects of cooperativity and competition by explicit calculation of coupled binding equilibria. GOMER can be used to test hypotheses regarding gene regulation that build upon this physically principled prediction of protein-DNA binding

    Characterization of genome-wide p53-binding sites upon stress response

    Get PDF
    The tumor suppressor p53 is a sequence-specific transcription factor, which regulates the expression of target genes involved in different stress responses. To understand p53's essential transcriptional functions, unbiased analysis of its DNA-binding repertoire is pivotal. In a genome-wide tiling ChIP-on-chip approach, we have identified and characterized 1546 binding sites of p53 upon Actinomycin D treatment. Among those binding sites were known as well as novel p53 target sites, which included regulatory regions of potentially novel transcripts. Using this collection of genome-wide binding sites, a new high-confidence algorithm was developed, p53scan, to identify the p53 consensus-binding motif. Strikingly, this motif was present in the majority of all bound sequences with 83% of all binding sites containing the motif. In the surrounding sequences of the binding sites, several motifs for potential regulatory cobinders were identified. Finally, we show that the majority of the genome-wide p53 target sites can also be bound by overexpressed p63 and p73 in vivo, suggesting that they can possibly play an important role at p53 binding sites. This emphasizes the possible interplay of p53 and its family members in the context of target gene binding. Our study greatly expands the known, experimentally validated p53 binding site repertoire and serves as a valuable knowledgebase for future research

    Molecular determinants of caste differentiation in the highly eusocial honeybee Apis mellifera

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In honeybees, differential feeding of female larvae promotes the occurrence of two different phenotypes, a queen and a worker, from identical genotypes, through incremental alterations, which affect general growth, and character state alterations that result in the presence or absence of specific structures. Although previous studies revealed a link between incremental alterations and differential expression of physiometabolic genes, the molecular changes accompanying character state alterations remain unknown.</p> <p>Results</p> <p>By using cDNA microarray analyses of >6,000 <it>Apis mellifera </it>ESTs, we found 240 differentially expressed genes (DEGs) between developing queens and workers. Many genes recorded as up-regulated in prospective workers appear to be unique to <it>A. mellifera</it>, suggesting that the workers' developmental pathway involves the participation of novel genes. Workers up-regulate more developmental genes than queens, whereas queens up-regulate a greater proportion of physiometabolic genes, including genes coding for metabolic enzymes and genes whose products are known to regulate the rate of mass-transforming processes and the general growth of the organism (e.g., <it>tor</it>). Many DEGs are likely to be involved in processes favoring the development of caste-biased structures, like brain, legs and ovaries, as well as genes that code for cytoskeleton constituents. Treatment of developing worker larvae with juvenile hormone (JH) revealed 52 JH responsive genes, specifically during the critical period of caste development. Using Gibbs sampling and Expectation Maximization algorithms, we discovered eight overrepresented <it>cis</it>-elements from four gene groups. Graph theory and complex networks concepts were adopted to attain powerful graphical representations of the interrelation between <it>cis</it>-elements and genes and objectively quantify the degree of relationship between these entities.</p> <p>Conclusion</p> <p>We suggest that clusters of functionally related DEGs are co-regulated during caste development in honeybees. This network of interactions is activated by nutrition-driven stimuli in early larval stages. Our data are consistent with the hypothesis that JH is a key component of the developmental determination of queen-like characters. Finally, we propose a conceptual model of caste differentiation in <it>A. mellifera </it>based on gene-regulatory networks.</p

    Transcription factor-DNA binding via machine learning ensembles

    Full text link
    The network of interactions between transcription factors (TFs) and their regulatory gene targets governs many of the behaviors and responses of cells. Construction of a transcriptional regulatory network involves three interrelated problems, defined for any regulator: finding (1) its target genes, (2) its binding motif and (3) its DNA binding sites. Many tools have been developed in the last decade to solve these problems. However, performance of algorithms for these has not been consistent for all transcription factors. Because machine learning algorithms have shown advantages in integrating information of different types, we investigate a machine-based approach to integrating predictions from an ensemble of commonly used motif exploration algorithms.Published versio

    The four hexamerin genes in the honey bee: structure, molecular evolution and function deduced from expression patterns in queens, workers and drones

    Get PDF
    Background: Hexamerins are hemocyanin-derived proteins that have lost the ability to bind copper ions and transport oxygen; instead, they became storage proteins. The current study aimed to broaden our knowledge on the hexamerin genes found in the honey bee genome by exploring their structural characteristics, expression profiles, evolution, and functions in the life cycle of workers, drones and queens. Results: The hexamerin genes of the honey bee (hex 70a, hex 70b, hex 70c and hex 110) diverge considerably in structure, so that the overall amino acid identity shared among their deduced protein subunits varies from 30 to 42%. Bioinformatics search for motifs in the respective upstream control regions (UCRs) revealed six overrepresented motifs including a potential binding site for Ultraspiracle (Usp), a target of juvenile hormone (JH). The expression of these genes was induced by topical application of JH on worker larvae. The four genes are highly transcribed by the larval fat body, although with significant differences in transcript levels, but only hex 110 and hex 70a are re-induced in the adult fat body in a caste-and sex-specific fashion, workers showing the highest expression. Transcripts for hex 110, hex 70a and hex70b were detected in developing ovaries and testes, and hex 110 was highly transcribed in the ovaries of egg-laying queens. A phylogenetic analysis revealed that HEX 110 is located at the most basal position among the holometabola hexamerins, and like HEX 70a and HEX 70c, it shares potential orthology relationship with hexamerins from other hymenopteran species. Conclusions: Striking differences were found in the structure and developmental expression of the four hexamerin genes in the honey bee. The presence of a potential binding site for Usp in the respective 5' UCRs, and the results of experiments on JH level manipulation in vivo support the hypothesis of regulation by JH. Transcript levels and patterns in the fat body and gonads suggest that, in addition to their primary role in supplying amino acids for metamorphosis, hexamerins serve as storage proteins for gonad development, egg production, and to support foraging activity. A phylogenetic analysis including the four deduced hexamerins and related proteins revealed a complex pattern of evolution, with independent radiation in insect orders.Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (FAPESP)[05/03926-5; 08/00541-3

    Transcription factor motif quality assessment requires systematic comparative analysis

    Get PDF
    Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. We also demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis

    Transcription factor motif quality assessment requires systematic comparative analysis [version 2; referees: 2 approved]

    Get PDF
    Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. We also demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis
    corecore