1,671 research outputs found

    Predicting Combinatorial Binding of Transcription Factors to Regulatory Elements in the Human Genome by Association Rule Mining

    Get PDF
    Cis-acting transcriptional regulatory elements in mammalian genomes typically contain specific combinations of binding sites for various transcription factors. Although some cisregulatory elements have been well studied, the combinations of transcription factors that regulate normal expression levels for the vast majority of the 20,000 genes in the human genome are unknown. We hypothesized that it should be possible to discover transcription factor combinations that regulate gene expression in concert by identifying over-represented combinations of sequence motifs that occur together in the genome. In order to detect combinations of transcription factor binding motifs, we developed a data mining approach based on the use of association rules, which are typically used in market basket analysis. We scored each segment of the genome for the presence or absence of each of 83 transcription factor binding motifs, then used association rule mining algorithms to mine this dataset, thus identifying frequently occurring pairs of distinct motifs within a segment. Results: Support for most pairs of transcription factor binding motifs was highly correlated across different chromosomes although pair significance varied. Known true positive motif pairs showed higher association rule support, confidence, and significance than background. Our subsets of high-confidence, high-significance mined pairs of transcription factors showed enrichment for co-citation in PubMed abstracts relative to all pairs, and the predicted associations were often readily verifiable in the literature. Conclusion: Functional elements in the genome where transcription factors bind to regulate expression in a combinatorial manner are more likely to be predicted by identifying statistically and biologically significant combinations of transcription factor binding motifs than by simply scanning the genome for the occurrence of binding sites for a single transcription factor.NIAAA Alcohol Training GrantNational Science FoundationCellular and Molecular Biolog

    Modular combinatorial binding among human trans-acting factors reveals direct and indirect factor binding

    Get PDF
    Background The combinatorial binding of trans-acting factors (TFs) to the DNA is critical to the spatial and temporal specificity of gene regulation. For certain regulatory regions, more than one regulatory module (set of TFs that bind together) are combined to achieve context-specific gene regulation. However, previous approaches are limited to either pairwise TF co-association analysis or assuming that only one module is used in each regulatory region. Results We present a new computational approach that models the modular organization of TF combinatorial binding. Our method learns compact and coherent regulatory modules from in vivo binding data using a topic model. We found that the binding of 115 TFs in K562 cells can be organized into 49 interpretable modules. Furthermore, we found that tens of thousands of regulatory regions use multiple modules, a structure that cannot be observed with previous hard clustering based methods. The modules discovered recapitulate many published protein-protein physical interactions, have consistent functional annotations of chromatin states, and uncover context specific co-binding such as gene proximal binding of NFY + FOS + SP and distal binding of NFY + FOS + USF. For certain TFs, the co-binding partners of direct binding (motif present) differs from those of indirect binding (motif absent); the distinct set of co-binding partners can predict whether the TF binds directly or indirectly with up to 95% accuracy. Joint analysis across two cell types reveals both cell-type-specific and shared regulatory modules. Conclusions Our results provide comprehensive cell-type-specific combinatorial binding maps and suggest a modular organization of combinatorial binding. Keywords Computational genomics Transcription factor Combinatorial binding Direct and indirect binding Topic modelNational Institutes of Health (U.S.) (grant 1U01HG007037-01

    Associative Pattern Recognition for Biological Regulation Data

    Get PDF
    In the last decade, bioinformatics data has been accumulated at an unprecedented rate, thanks to the advancement in sequencing technologies. Such rapid development poses both challenges and promising research topics. In this dissertation, we propose a series of associative pattern recognition algorithms in biological regulation studies. In particular, we emphasize efficiently recognizing associative patterns between genes, transcription factors, histone modifications and functional labels using heterogeneous data sources (numeric, sequences, time series data and textual labels). In protein-DNA associative pattern recognition, we introduce an efficient algorithm for affinity test by searching for over-represented DNA sequences using a hash function and modulo addition calculation. This substantially improves the efficiency of \textit{next generation sequencing} data analysis. In gene regulatory network inference, we propose a framework for refining weak networks based on transcription factor binding sites, thus improved the precision of predicted edges by up to 52%. In histone modification code analysis, we propose an approach to genome-wide combinatorial pattern recognition for histone code to function associative pattern recognition, and achieved improvement by up to 38.1%38.1\%. We also propose a novel shape based modification pattern analysis approach, using this to successfully predict sub-classes of genes in flowering-time category. We also propose a combination to combination associative pattern recognition, and achieved better performance compared against multi-label classification and bidirectional associative memory methods. Our proposed approaches recognize associative patterns from different types of data efficiently, and provides a useful toolbox for biological regulation analysis. This dissertation presents a road-map to associative patterns recognition at genome wide level

    Simplified Method to Predict Mutual Interactions of Human Transcription Factors Based on Their Primary Structure

    Get PDF
    Background: Physical interactions between transcription factors (TFs) are necessary for forming regulatory protein complexes and thus play a crucial role in gene regulation. Currently, knowledge about the mechanisms of these TF interactions is incomplete and the number of known TF interactions is limited. Computational prediction of such interactions can help identify potential new TF interactions as well as contribute to better understanding the complex machinery involved in gene regulation. Methodology: We propose here such a method for the prediction of TF interactions. The method uses only the primary sequence information of the interacting TFs, resulting in a much greater simplicity of the prediction algorithm. Through an advanced feature selection process, we determined a subset of 97 model features that constitute the optimized model in the subset we considered. The model, based on quadratic discriminant analysis, achieves a prediction accuracy of 85.39 % on a blind set of interactions. This result is achieved despite the selection for the negative data set of only those TF from the same type of proteins, i.e. TFs that function in the same cellular compartment (nucleus) and in the same type of molecular process (transcription initiation). Such selection poses significant challenges for developing models with high specificity, but at the same time better reflects real-world problems. Conclusions: The performance of our predictor compares well to those of much more complex approaches for predicting TF and general protein-protein interactions, particularly when taking the reduced complexity of model utilisation into account

    Predicting eukaryotic transcriptional cooperativity by Bayesian network integration of genome-wide data

    Get PDF
    Transcriptional cooperativity among several transcription factors (TFs) is believed to be the main mechanism of complexity and precision in transcriptional regulatory programs. Here, we present a Bayesian network framework to reconstruct a high-confidence whole-genome map of transcriptional cooperativity in Saccharomyces cerevisiae by integrating a comprehensive list of 15 genomic features. We design a Bayesian network structure to capture the dominant correlations among features and TF cooperativity, and introduce a supervised learning framework with a well-constructed gold-standard dataset. This framework allows us to assess the predictive power of each genomic feature, validate the superior performance of our Bayesian network compared to alternative methods, and integrate genomic features for optimal TF cooperativity prediction. Data integration reveals 159 high-confidence predicted cooperative relationships among 105 TFs, most of which are subsequently validated by literature search. The existing and predicted transcriptional cooperativities can be grouped into three categories based on the combination patterns of the genomic features, providing further biological insights into the different types of TF cooperativity. Our methodology is the first supervised learning approach for predicting transcriptional cooperativity, compares favorably to alternative unsupervised methodologies, and can be applied to other genomic data integration tasks where high-quality gold-standard positive data are scarce

    Predicting eukaryotic transcriptional cooperativity by Bayesian network integration of genome-wide data

    Get PDF
    Transcriptional cooperativity among several transcription factors (TFs) is believed to be the main mechanism of complexity and precision in transcriptional regulatory programs. Here, we present a Bayesian network framework to reconstruct a high-confidence whole-genome map of transcriptional cooperativity in Saccharomyces cerevisiae by integrating a comprehensive list of 15 genomic features. We design a Bayesian network structure to capture the dominant correlations among features and TF cooperativity, and introduce a supervised learning framework with a well-constructed gold-standard dataset. This framework allows us to assess the predictive power of each genomic feature, validate the superior performance of our Bayesian network compared to alternative methods, and integrate genomic features for optimal TF cooperativity prediction. Data integration reveals 159 high-confidence predicted cooperative relationships among 105 TFs, most of which are subsequently validated by literature search. The existing and predicted transcriptional cooperativities can be grouped into three categories based on the combination patterns of the genomic features, providing further biological insights into the different types of TF cooperativity. Our methodology is the first supervised learning approach for predicting transcriptional cooperativity, compares favorably to alternative unsupervised methodologies, and can be applied to other genomic data integration tasks where high-quality gold-standard positive data are scarce

    Hematopoietic gene promoters subjected to a group-combinatorial study of DNA samples: identification of a megakaryocytic selective DNA signature

    Get PDF
    Identification of common sub-sequences for a group of functionally related DNA sequences can shed light on the role of such elements in cell-specific gene expression. In the megakaryocytic lineage, no one single unique transcription factor was described as linage specific, raising the possibility that a cluster of gene promoter sequences presents a unique signature. Here, the megakaryocytic gene promoter group, which consists of both human and mouse 5′ non-coding regions, served as a case study. A methodology for group-combinatorial search has been implemented as a customized software platform. It extracts the longest common sequences for a group of related DNA sequences and allows for single gaps of varying length, as well as double- and multiple-gap sequences. The results point to common DNA sequences in a group of genes that is selectively expressed in megakaryocytes, and which does not appear in a large group of control, random and specific sequences. This suggests a role for a combination of these sequences in cell-specific gene expression in the megakaryocytic lineage. The data also point to an intrinsic cross-species difference in the organization of 5′ non-coding sequences within the mammalian genomes. This methodology may be used for the identification of regulatory sequences in other lineages

    Hematopoietic gene promoters subjected to a group-combinatorial study of DNA samples: identification of a megakaryocytic selective DNA signature

    Get PDF
    Identification of common sub-sequences for a group of functionally related DNA sequences can shed light on the role of such elements in cell-specific gene expression. In the megakaryocytic lineage, no one single unique transcription factor was described as linage specific, raising the possibility that a cluster of gene promoter sequences presents a unique signature. Here, the megakaryocytic gene promoter group, which consists of both human and mouse 5′ non-coding regions, served as a case study. A methodology for group-combinatorial search has been implemented as a customized software platform. It extracts the longest common sequences for a group of related DNA sequences and allows for single gaps of varying length, as well as double- and multiple-gap sequences. The results point to common DNA sequences in a group of genes that is selectively expressed in megakaryocytes, and which does not appear in a large group of control, random and specific sequences. This suggests a role for a combination of these sequences in cell-specific gene expression in the megakaryocytic lineage. The data also point to an intrinsic cross-species difference in the organization of 5′ non-coding sequences within the mammalian genomes. This methodology may be used for the identification of regulatory sequences in other lineages
    corecore