9,967 research outputs found

    Transcription Factor-DNA Binding Via Machine Learning Ensembles

    Full text link
    We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWM's) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over the (a) motif scanning method and (b) the coexpression-based association method. Top motif outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.Comment: 33 page

    CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

    Get PDF
    We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification

    Enhanced maps of transcription factor binding sites improve regulatory networks learned from accessible chromatin data

    Get PDF
    Determining where transcription factors (TFs) bind in genomes provides insight into which transcriptional programs are active across organs, tissue types, and environmental conditions. Recent advances in high-throughput profiling of regulatory DNA have yielded large amounts of information about chromatin accessibility. Interpreting the functional significance of these data sets requires knowledge of which regulators are likely to bind these regions. This can be achieved by using information about TF-binding preferences, or motifs, to identify TF-binding events that are likely to be functional. Although different approaches exist to map motifs to DNA sequences, a systematic evaluation of these tools in plants is missing. Here, we compare four motif-mapping tools widely used in the Arabidopsis (Arabidopsis thaliana) research community and evaluate their performance using chromatin immunoprecipitation data sets for 40 TFs. Downstream gene regulatory network (GRN) reconstruction was found to be sensitive to the motif mapper used. We further show that the low recall of Find Individual Motif Occurrences, one of the most frequently used motif-mapping tools, can be overcome by using an Ensemble approach, which combines results from different mapping tools. Several examples are provided demonstrating how the Ensemble approach extends our view on transcriptional control for TFs active in different biological processes. Finally, a protocol is presented to effectively derive more complete cell type-specific GRNs through the integrative analysis of open chromatin regions, known binding site information, and expression data sets. This approach will pave the way to increase our understanding of GRNs in different cellular conditions
    • …
    corecore