Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction
Prediction of transcription factor binding sites is an important challenge in genome analysis. The advent of next generation genome sequencing technologies makes the development of effective computational approaches particularly imperative. We have developed a novel training-based methodology intended for prokaryotic transcription factor binding site prediction. Our methodology extends existing models by taking into account base interdependencies between neighbouring positions using conditional probabilities and includes genomic background weighting. This has been tested against other existing and novel methodologies including position-specific weight matrices, first-order Hidden Markov Models and joint probability models. We have also tested the use of gapped and ungapped alignments and the inclusion or exclusion of background weighting. We show that our best method enhances binding site prediction for all of the 22 Escherichia coli transcription factors with at least 20 known binding sites, with many showing substantial improvements. We highlight the advantage of using block alignments of binding sites over gapped alignments to capture neighbouring position interdependencies. We also show that combining these methods with ChIP-on-chip data has the potential to further improve binding site prediction. Finally, we have developed the ungapped likelihood under positional background platform: a user-friendly website that gives access to the prediction method devised in this work.
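The contrast between a position-specific weight matrix and a model with neighbouring-base dependencies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the pseudocount of 1, the uniform background and the toy sites are all assumptions made for the example. It trains both models on an ungapped block of sites and scores candidates as log-odds against the background.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_pwm(sites, pseudo=1.0):
    """Per-position base probabilities from an ungapped block of equal-length sites."""
    total = len(sites) + 4 * pseudo
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        pwm.append({b: (counts[b] + pseudo) / total for b in BASES})
    return pwm

def train_first_order(sites, pseudo=1.0):
    """P(base at i | base at i-1), estimated separately for each position i >= 1."""
    cond = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        table = {}
        for prev in BASES:
            total = sum(pairs[(prev, b)] for b in BASES) + 4 * pseudo
            for b in BASES:
                table[(prev, b)] = (pairs[(prev, b)] + pseudo) / total
        cond.append(table)
    return cond

def score_pwm(seq, pwm, background):
    """Log-odds score assuming independent positions."""
    return sum(math.log(pwm[i][b] / background[b]) for i, b in enumerate(seq))

def score_first_order(seq, pwm, cond, background):
    """Log-odds score where each base is conditioned on its left neighbour."""
    score = math.log(pwm[0][seq[0]] / background[seq[0]])
    for i in range(1, len(seq)):
        p = cond[i - 1][(seq[i - 1], seq[i])]
        score += math.log(p / background[seq[i]])
    return score

# Hypothetical toy data: two site variants whose positions co-vary.
SITES = ["ACGT", "ACGT", "TGCA", "TGCA"]
BACKGROUND = {b: 0.25 for b in BASES}  # assumed uniform background
PWM = train_pwm(SITES)
COND = train_first_order(SITES)
```

With these toy sites, the chimeric sequence `ACCA` uses only bases seen at each position, so the weight matrix cannot distinguish it from the real site `ACGT`, whereas the conditional model penalises the unseen `C` after `C` step.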
On Weight Matrix and Free Energy Models for Sequence Motif Detection
The problem of motif detection can be formulated as the construction of a
discriminant function to separate sequences of a specific pattern from
background. In computational biology, motif detection is used to predict DNA
binding sites of a transcription factor (TF), mostly based on the weight matrix
(WM) model or the Gibbs free energy (FE) model. However, despite the wide
applications, theoretical analysis of these two models and their predictions is
still lacking. We derive asymptotic error rates of prediction procedures based
on these models under different data generation assumptions. This allows a
theoretical comparison between the WM-based and the FE-based predictions in
terms of asymptotic efficiency. Applications of the theoretical results are
demonstrated with empirical studies on ChIP-seq data and protein binding
microarray data. We find that, irrespective of underlying data generation
mechanisms, the FE approach shows higher or comparable predictive power
relative to the WM approach when the number of observed binding sites used for
constructing a discriminant decision is not too small.
Comment: 23 pages, 1 figure and 4 tables
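As a rough illustration of what a free energy model computes, the sketch below uses one common form, in which the probability that a site is bound follows a logistic function of its binding energy relative to a chemical potential. The abstract does not give the paper's exact formulation, so the functional form, the names `binding_energy` and `occupancy`, and the toy energy matrix (energies in units of kT) are assumptions for illustration only.

```python
import math

def binding_energy(seq, energy):
    """Additive binding energy: sum of per-position base contributions (units of kT)."""
    return sum(energy[i][b] for i, b in enumerate(seq))

def occupancy(seq, energy, mu):
    """Probability that the site is bound, given chemical potential mu (units of kT)."""
    return 1.0 / (1.0 + math.exp(binding_energy(seq, energy) - mu))

# Hypothetical 3-bp site: the consensus "GCA" has zero energy,
# and every mismatch costs 2 kT.
CONSENSUS = "GCA"
ENERGY = [{b: (0.0 if b == c else 2.0) for b in "ACGT"} for c in CONSENSUS]
```

When mu is far below every site energy, the logarithm of the occupancy is approximately mu minus the energy, that is, linear in the additive score, which is the regime in which this form behaves like a weight matrix. At higher mu the strongest sites saturate and the two models diverge.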
Simple SVM based whole-genome segmentation
We present a support vector machine (SVM) based framework for DNA segmentation into binary classes. Two applications are explored: transcription start site prediction and transcription factor binding prediction. Experiments demonstrate our approach has significantly better performance than other methods on both tasks
Novel Sequence-Based Method for Identifying Transcription Factor Binding Sites in Prokaryotic Genomes
Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the human microbiome project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be experimentally probed. We have developed a method that will primarily use available sequence data in order to determine prokaryotic transcription factor binding specificities. The prototypical prokaryotic transcription factor (TF) contains a helix-turn-helix (HTH) fold and binds DNA as a homodimer, leading to a palindromic motif specificity. The connection between the TF and its promoter is based on the autoregulation phenomenon noticed in E. coli. Approximately 55% of the TFs analyzed were estimated to be autoregulated. Our preliminary analysis using RegulonDB indicates that this value increases to 79% if one considers the neighboring operons. Given the TF family of interest, it is necessary to find the relevant TF proteins and their associated genomes. Due to the scale-free network topology of prokaryotic systems, many of the transcriptional regulators regulate only one or a few operons. Within a single genome, there would not be enough sequence-based signal to determine the binding site using standard computational methods. Therefore, multiple bacterial genomes are used to overcome this lack of signal within a single genome. We use a distance-based criterion to define the operon boundaries and their respective promoters. Several TF-DNA crystal structures are then used to determine the residues that interact with the DNA. These key residues are the basis for the TF comparison metric, the assumption being that similar residues should impart similar DNA binding specificities. After defining the sets of TF clusters using this metric, their respective promoters are used as input to a motif finding procedure.
This method has so far been tested on the LacI and TetR TF families with successful results. On external validation sets, the specificity of prediction is ~80%. These results are important in developing methods to define the DNA binding preferences of the TF protein residues, known as the "recognition code". This "recognition code" would allow computational design and prediction of novel DNA-binding specificities, enabling protein-engineering and synthetic biology applications
TFBSTools: an R/bioconductor package for transcription factor binding site analysis.
Summary: The ability to efficiently investigate transcription factor binding sites (TFBSs) genome-wide is central to computational studies of gene regulation. TFBSTools is an R/Bioconductor package for the analysis and manipulation of TFBSs and their associated transcription factor profile matrices. TFBSTools provides a toolkit for handling TFBS profile matrices, scanning sequences and alignments including whole genomes, and querying the JASPAR database. The functionality of the package can be easily extended to include advanced statistical analysis, data visualization and data integration. Availability and implementation: The package is implemented in R and available under GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/TFBSTools/). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online
Differences in transcription between free-living and CO_2-activated third-stage larvae of Haemonchus contortus
Background:
The disease caused by Haemonchus contortus, a blood-feeding nematode of small ruminants, is of major economic importance worldwide. The infective third-stage larva (L3) of this gastric nematode is enclosed in a cuticle (sheath) and, once ingested with herbage by the host, undergoes an exsheathment process that marks the transition from the free-living (L3) to the parasitic (xL3) stage. This study explored changes in gene transcription associated with this transition and predicted, based on comparative analysis, functional roles for key transcripts in the metabolic pathways linked to larval development.
Results:
Totals of 101,305 (L3) and 105,553 (xL3) expressed sequence tags (ESTs) were determined using 454 sequencing technology, and then assembled and annotated; the most abundant transcripts encoded transthyretin-like, calcium-binding EF-hand, NAD(P)-binding and nucleotide-binding proteins as well as homologues of Ancylostoma-secreted proteins (ASPs). Using an in silico-subtractive analysis, 560 and 685 sequences were shown to be uniquely represented in the L3 and xL3 stages, respectively; the transcripts encoded ribosomal proteins, collagens and elongation factors (in L3), and mainly peptidases and other enzymes of amino acid catabolism (in xL3). Caenorhabditis elegans orthologues of transcripts that were uniquely transcribed in each L3 and xL3 were predicted to interact with a total of 535 other genes, all of which were involved in embryonic development.
Conclusion:
The present study indicated that some key transcriptional alterations taking place during the transition from the L3 to the xL3 stage of H. contortus involve genes predicted to be linked to the development of neuronal tissue (L3 and xL3), formation of the cuticle (L3) and digestion of host haemoglobin (xL3). Future efforts using next-generation sequencing and bioinformatic technologies should provide the efficiency and depth of coverage required for the determination of the complete transcriptomes of different developmental stages and/or tissues of H. contortus as well as the genome of this important parasitic nematode. Such advances should lead to a significantly improved understanding of the molecular biology of H. contortus and, from an applied perspective, to novel methods of intervention
Predicting variation of DNA shape preferences in protein-DNA interaction in cancer cells with a new biophysical model
DNA shape readout is an important mechanism of target site recognition by
transcription factors, in addition to the sequence readout. Several models of
transcription factor-DNA binding which consider DNA shape have been developed
in recent years. We present a new biophysical model of protein-DNA interaction
by considering the DNA shape features, which is based on a neighbour
dinucleotide dependency model BayesPI2. The parameters of the new model are
restricted to a subspace spanned by the 2-mer DNA shape features, which
allows a biophysical interpretation of the new parameters as
position-dependent preferences towards certain values of the features. Using
the new model, we explore the variation of DNA shape preferences in several
transcription factors across cancer cell lines and cellular conditions. We find
evidence of DNA shape variations at FOXA1 binding sites in MCF7 cells after
treatment with steroids. The new model is useful for elucidating finer details
of transcription factor-DNA interaction. It may be used to improve the
prediction of cancer mutation effects in the future
Bind-n-Seq: high-throughput analysis of in vitro protein-DNA interactions using massively parallel sequencing.
Transcription factor-DNA interactions are some of the most important processes in biology because they directly control hereditary information. The targets of most transcription factors are unknown. In this report, we introduce Bind-n-Seq, a new high-throughput method for analyzing protein-DNA interactions in vitro, with several advantages over current methods. The procedure has three steps: (i) binding proteins to randomized oligonucleotide DNA targets, (ii) sequencing the bound oligonucleotides with massively parallel technology, and (iii) finding motifs among the sequences. De novo binding motifs determined by this method for the DNA-binding domains of two well-characterized zinc-finger proteins were similar to those described previously. Furthermore, calculations of the relative affinity of the proteins for specific DNA sequences correlated significantly with previous studies (R² = 0.9). These results present Bind-n-Seq as a highly rapid and parallel method for determining in vitro binding sites and relative affinities
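Step (iii) of the procedure above, finding motifs among the bound sequences, can be approximated very simply by asking which k-mers are over-represented in bound reads relative to a background pool. The sketch below is a stand-in for illustration, not the motif-finding procedure used in the paper, which the abstract does not describe. The function names, the pseudocount and the reads are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def enrichment(bound_reads, background_reads, k, pseudo=1.0):
    """Ratio of smoothed k-mer frequency in bound reads to that in background reads."""
    bound = kmer_counts(bound_reads, k)
    back = kmer_counts(background_reads, k)
    n_kmers = 4 ** k  # number of possible DNA k-mers, used for smoothing
    bound_total = sum(bound.values()) + pseudo * n_kmers
    back_total = sum(back.values()) + pseudo * n_kmers
    ratios = {}
    for kmer in set(bound) | set(back):
        f_bound = (bound[kmer] + pseudo) / bound_total
        f_back = (back[kmer] + pseudo) / back_total
        ratios[kmer] = f_bound / f_back
    return ratios

# Hypothetical reads: each bound read carries the same embedded site.
BOUND = ["TTGCGTGGGCGAA", "ACGCGTGGGCGTT", "GGGCGTGGGCGCA"]
BACKGROUND = ["TTACGATCAGTAA", "ACGTTAGCATGTT", "GGCATCGATTGCA"]
RATIOS = enrichment(BOUND, BACKGROUND, k=5)
```

Sorting `RATIOS` by value gives a ranked list of candidate k-mers. Under the same assumptions, the ratio for a specific sequence could also serve as a crude proxy for relative affinity, though a real analysis would need to account for sequencing depth and library bias.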