
    Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction

    Prediction of transcription factor binding sites is an important challenge in genome analysis. The advent of next generation genome sequencing technologies makes the development of effective computational approaches particularly imperative. We have developed a novel training-based methodology intended for prokaryotic transcription factor binding site prediction. Our methodology extends existing models by taking into account base interdependencies between neighbouring positions using conditional probabilities and includes genomic background weighting. This has been tested against other existing and novel methodologies including position-specific weight matrices, first-order Hidden Markov Models and joint probability models. We have also tested the use of gapped and ungapped alignments and the inclusion or exclusion of background weighting. We show that our best method enhances binding site prediction for all of the 22 Escherichia coli transcription factors with at least 20 known binding sites, with many showing substantial improvements. We highlight the advantage of using block alignments of binding sites over gapped alignments to capture neighbouring position interdependencies. We also show that combining these methods with ChIP-on-chip data has the potential to further improve binding site prediction. Finally we have developed the ungapped likelihood under positional background platform: a user friendly website that gives access to the prediction method devised in this work
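The idea of modelling interdependencies between neighbouring positions with conditional probabilities can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `train_first_order` and `log_odds`, the pseudocount, and the use of a simple zero-order background in place of the paper's genomic background weighting are all assumptions made for illustration. It assumes ungapped (block) alignments, i.e. all training sites have the same length.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_first_order(sites, pseudocount=1.0):
    """Estimate P(base at position 0) and P(base at i | base at i-1)
    from ungapped, equal-length binding sites (hypothetical helper)."""
    length = len(sites[0])
    first = Counter(s[0] for s in sites)
    total = len(sites) + 4 * pseudocount
    start = {b: (first[b] + pseudocount) / total for b in BASES}
    cond = []
    for i in range(1, length):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        prev = Counter(s[i - 1] for s in sites)
        # Conditional probability of base b at position i given base a at i-1.
        cond.append({
            (a, b): (pairs[(a, b)] + pseudocount) / (prev[a] + 4 * pseudocount)
            for a in BASES for b in BASES
        })
    return start, cond

def log_odds(seq, model, background):
    """Log-odds of seq under the neighbour-dependent model versus an
    assumed zero-order background given as a dict of base frequencies."""
    start, cond = model
    score = math.log(start[seq[0]] / background[seq[0]])
    for i in range(1, len(seq)):
        p = cond[i - 1][(seq[i - 1], seq[i])]
        score += math.log(p / background[seq[i]])
    return score
```

Replacing each conditional table with a per-position marginal frequency table would reduce this sketch to an ordinary position-specific weight matrix, which is the baseline the abstract compares against.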

    On Weight Matrix and Free Energy Models for Sequence Motif Detection

    The problem of motif detection can be formulated as the construction of a discriminant function to separate sequences of a specific pattern from background. In computational biology, motif detection is used to predict DNA binding sites of a transcription factor (TF), mostly based on the weight matrix (WM) model or the Gibbs free energy (FE) model. However, despite their wide application, theoretical analysis of these two models and their predictions is still lacking. We derive asymptotic error rates of prediction procedures based on these models under different data generation assumptions. This allows a theoretical comparison between the WM-based and the FE-based predictions in terms of asymptotic efficiency. Applications of the theoretical results are demonstrated with empirical studies on ChIP-seq data and protein binding microarray data. We find that, irrespective of underlying data generation mechanisms, the FE approach shows higher or comparable predictive power relative to the WM approach when the number of observed binding sites used for constructing a discriminant decision is not too small. Comment: 23 pages, 1 figure and 4 tables
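To make the two discriminant functions concrete, here is a minimal sketch of how a weight matrix score and a free energy binding probability are commonly written. It is an illustration under assumptions, not the paper's procedure: the function names, the dictionary-based matrices, and the parameter `mu` (a chemical potential in units of kT) are hypothetical, and the FE form shown is the usual logistic function of total binding energy.

```python
import math

def wm_score(seq, pwm, background):
    """Weight matrix score: sum of per-position log-odds.
    pwm is a list of dicts mapping base -> probability (assumed layout)."""
    return sum(math.log(pwm[i][b] / background[b]) for i, b in enumerate(seq))

def fe_binding_probability(seq, energy, mu):
    """Free energy model: occupancy as a logistic function of the
    summed per-position binding energy (in kT) relative to mu."""
    e = sum(energy[i][b] for i, b in enumerate(seq))
    return 1.0 / (1.0 + math.exp(e - mu))
```

The WM score is linear in the per-position terms, whereas the FE occupancy saturates for strongly bound sequences; that difference is one reason the two approaches can rank candidate sites differently.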

    Simple SVM based whole-genome segmentation

    We present a support vector machine (SVM) based framework for DNA segmentation into binary classes. Two applications are explored: transcription start site prediction and transcription factor binding prediction. Experiments demonstrate our approach has significantly better performance than other methods on both tasks

    Novel Sequence-Based Method for Identifying Transcription Factor Binding Sites in Prokaryotic Genomes

    Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the human microbiome project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be experimentally probed. We have developed a method that primarily uses available sequence data to determine prokaryotic transcription factor binding specificities. The prototypical prokaryotic transcription factor (TF) contains a helix-turn-helix (HTH) fold and binds DNA as a homodimer, which leads to palindromic motif specificity. The connection between the TF and its promoter is based on the autoregulation phenomenon observed in E. coli. Approximately 55% of the TFs analyzed were estimated to be autoregulated. Our preliminary analysis using RegulonDB indicates that this value increases to 79% if one considers the neighboring operons. Given the TF family of interest, it is necessary to find the relevant TF proteins and their associated genomes. Due to the scale-free network topology of prokaryotic systems, many of the transcriptional regulators regulate only one or a few operons. Within a single genome, there would not be enough sequence-based signal to determine the binding site using standard computational methods. Therefore, multiple bacterial genomes are used to overcome this lack of signal within a single genome. We use a distance-based criterion to define the operon boundaries and their respective promoters. Several TF-DNA crystal structures are then used to determine the residues that interact with the DNA. These key residues are the basis for the TF comparison metric, the assumption being that similar residues should impart similar DNA binding specificities. After defining the sets of TF clusters using this metric, their respective promoters are used as input to a motif finding procedure. 
This method has so far been tested on the LacI and TetR TF families with successful results. On external validation sets, the specificity of prediction is ~80%. These results are important in developing methods to define the DNA binding preferences of the TF protein residues, known as the “recognition code”. This “recognition code” would allow computational design and prediction of novel DNA-binding specificities, enabling protein-engineering and synthetic biology applications
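The palindromic specificity mentioned above means that a site reads the same as its reverse complement. A small sketch of how one might check this for a candidate site follows; the helper names are hypothetical and the check is only an illustration, not part of the method described in the abstract.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def palindrome_mismatches(site):
    """Count positions where a site differs from its reverse complement.
    Zero means a perfect palindrome; each broken base pair of the
    symmetry is counted twice, once from each half of the site."""
    site = site.upper()
    return sum(a != b for a, b in zip(site, reverse_complement(site)))
```

A homodimer-bound operator would be expected to give a low mismatch count, while an arbitrary sequence of the same length usually would not.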

    TFBSTools: an R/bioconductor package for transcription factor binding site analysis.

    Summary: The ability to efficiently investigate transcription factor binding sites (TFBSs) genome-wide is central to computational studies of gene regulation. TFBSTools is an R/Bioconductor package for the analysis and manipulation of TFBSs and their associated transcription factor profile matrices. TFBSTools provides a toolkit for handling TFBS profile matrices, scanning sequences and alignments including whole genomes, and querying the JASPAR database. The functionality of the package can be easily extended to include advanced statistical analysis, data visualization and data integration. Availability and implementation: The package is implemented in R and available under the GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/TFBSTools/). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online

    Differences in transcription between free-living and CO2-activated third-stage larvae of Haemonchus contortus

    Background: The disease caused by Haemonchus contortus, a blood-feeding nematode of small ruminants, is of major economic importance worldwide. The infective third-stage larva (L3) of this gastric nematode is enclosed in a cuticle (sheath) and, once ingested with herbage by the host, undergoes an exsheathment process that marks the transition from the free-living (L3) to the parasitic (xL3) stage. This study explored changes in gene transcription associated with this transition and predicted, based on comparative analysis, functional roles for key transcripts in the metabolic pathways linked to larval development. Results: Totals of 101,305 (L3) and 105,553 (xL3) expressed sequence tags (ESTs) were determined using 454 sequencing technology, and then assembled and annotated; the most abundant transcripts encoded transthyretin-like, calcium-binding EF-hand, NAD(P)-binding and nucleotide-binding proteins as well as homologues of Ancylostoma-secreted proteins (ASPs). Using an in silico-subtractive analysis, 560 and 685 sequences were shown to be uniquely represented in the L3 and xL3 stages, respectively; the transcripts encoded ribosomal proteins, collagens and elongation factors (in L3), and mainly peptidases and other enzymes of amino acid catabolism (in xL3). Caenorhabditis elegans orthologues of transcripts that were uniquely transcribed in each L3 and xL3 were predicted to interact with a total of 535 other genes, all of which were involved in embryonic development. Conclusion: The present study indicated that some key transcriptional alterations taking place during the transition from the L3 to the xL3 stage of H. contortus involve genes predicted to be linked to the development of neuronal tissue (L3 and xL3), formation of the cuticle (L3) and digestion of host haemoglobin (xL3). 
Future efforts using next-generation sequencing and bioinformatic technologies should provide the efficiency and depth of coverage required for the determination of the complete transcriptomes of different developmental stages and/or tissues of H. contortus as well as the genome of this important parasitic nematode. Such advances should lead to a significantly improved understanding of the molecular biology of H. contortus and, from an applied perspective, to novel methods of intervention

    Predicting variation of DNA shape preferences in protein-DNA interaction in cancer cells with a new biophysical model

    DNA shape readout is an important mechanism of target site recognition by transcription factors, in addition to sequence readout. Several models of transcription factor-DNA binding that consider DNA shape have been developed in recent years. We present a new biophysical model of protein-DNA interaction that considers DNA shape features and is based on the neighbour dinucleotide dependency model BayesPI2. The parameters of the new model are restricted to a subspace spanned by the 2-mer DNA shape features, which allows a biophysical interpretation of the new parameters as position-dependent preferences towards certain values of the features. Using the new model, we explore the variation of DNA shape preferences in several transcription factors across cancer cell lines and cellular conditions. We find evidence of DNA shape variations at FOXA1 binding sites in MCF7 cells after treatment with steroids. The new model is useful for elucidating finer details of transcription factor-DNA interaction. It may be used to improve the prediction of cancer mutation effects in the future