Motif kernel generated by genetic programming improves remote homology and fold detection
BACKGROUND: Protein remote homology detection is a central problem in computational biology. Most recent methods train support vector machines to discriminate between related and unrelated sequences, and these studies have introduced several types of kernels. One successful approach is to base a kernel on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences. RESULTS: We introduce the GPkernel, which is a motif kernel based on discrete sequence motifs where the motifs are evolved using genetic programming. All proteins can be grouped according to evolutionary relations and structure, and the method uses this inherent structure to create groups of motifs that discriminate between different families of evolutionary origin. When tested on two SCOP benchmarks, the superfamily and fold recognition problems, the GPkernel gives significantly better results compared to related methods of remote homology detection. CONCLUSION: The GPkernel gives particularly good results on the more difficult fold recognition problem compared to the other methods. This is mainly because the method creates motif sets that describe similarities among subgroups of both the related and unrelated proteins. This rich set of motifs gives a better description of the similarities and differences between different folds than do previous motif-based methods.
A ChIP-Seq Benchmark Shows That Sequence Conservation Mainly Improves Detection of Strong Transcription Factor Binding Sites
Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial. Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods. Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites.
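For orientation, the "simple motif-scanning" baseline mentioned in this abstract is commonly realised as a position weight matrix (PWM) scan. The sketch below is an illustration only and is not one of the five benchmarked methods: the count matrix, pseudocount, uniform background, and threshold are all hypothetical values chosen for the example, and the input is assumed to contain only uppercase A, C, G, T.

```python
import math

# Hypothetical 4-position count matrix (illustrative values, not from the paper).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}


def log_odds_matrix(counts, background=0.25, pseudocount=0.5):
    """Convert base counts to per-position log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        column = {
            b: math.log2((counts[b][i] + pseudocount) / total / background)
            for b in "ACGT"
        }
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Slide the matrix along the sequence; report windows scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = log_odds_matrix(COUNTS)
example_hits = scan("TTACGTGG", pwm, threshold=4.0)
```

A conservation-based predictor would additionally weight or filter such hits by cross-species alignment scores; that step is omitted here because the abstract does not specify how the benchmarked methods do it.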
Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements
Background: Transcription factor binding to DNA requires both an appropriate binding element and suitably open chromatin, which together help to define regulatory elements within the genome. Current methods of identifying regulatory elements, such as promoters or enhancers, typically rely on sequence conservation, existing gene annotations or specific marks, such as histone modifications and p300 binding methods, each of which has its own biases. Results: Herein we show that an approach based on clustering of transcription factor peaks from high-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) can be used to evaluate markers for regulatory elements. We used 67 data sets for 54 unique transcription factors distributed over two cell lines to create regulatory element clusters. By integrating the clusters from our approach with histone modifications and data for open chromatin, we identified general methylation of lysine 4 on histone H3 (H3K4me) as the most specific marker for transcription factor clusters. Clusters mapping to annotated genes showed distinct patterns in cluster composition related to gene expression and histone modifications. Clusters mapping to intergenic regions fall into two groups: those directly involved in transcription, including miRNAs and long noncoding RNAs, and those facilitating transcription by long-range interactions. The latter clusters were specifically enriched with H3K4me1, but less with acetylation of lysine 27 on histone H3 or p300 binding. Conclusion: By integrating genome-wide data of transcription factor binding and chromatin structure and using our data-driven approach, we pinpointed the chromatin marks that best explain transcription factor association with different regulatory elements. Our results also indicate that a modest selection of transcription factors may be sufficient to map most regulatory elements in the human genome.
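As a rough illustration of what clustering transcription factor peaks can look like, the sketch below groups peak positions that lie close together on the same chromosome and records which factors contribute to each cluster. This is a generic single-linkage grouping written for illustration, not the procedure used in the paper; the tuple layout, the `max_gap` parameter, and its default value are assumptions.

```python
from collections import defaultdict


def cluster_peaks(peaks, max_gap=100):
    """Group peaks on the same chromosome whose neighbours are within max_gap bp.

    peaks: iterable of (chromosome, position, factor_name) tuples (assumed layout).
    Returns a list of dicts with chrom, start, end and the set of factors seen.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, factor in peaks:
        by_chrom[chrom].append((pos, factor))

    clusters = []
    for chrom in sorted(by_chrom):
        current = None
        for pos, factor in sorted(by_chrom[chrom]):
            if current is not None and pos - current["end"] <= max_gap:
                # Close enough to the previous peak: extend the open cluster.
                current["end"] = pos
                current["factors"].add(factor)
            else:
                # Start a new cluster at this peak.
                current = {"chrom": chrom, "start": pos, "end": pos,
                           "factors": {factor}}
                clusters.append(current)
    return clusters


example_peaks = [
    ("chr1", 1000, "TF_A"),
    ("chr1", 1050, "TF_B"),
    ("chr1", 5000, "TF_A"),
    ("chr2", 1020, "TF_C"),
]
example_clusters = cluster_peaks(example_peaks)
```

Clusters produced this way could then be intersected with histone-mark or open-chromatin intervals to ask which marks co-occur with multi-factor clusters, which is the kind of integration the abstract describes.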
Protein Remote Homology Detection using Motifs made with Genetic Programming
A central problem in computational biology is the classification of related proteins into functional and structural classes based on their amino acid sequences. Several methods exist to detect related sequences when the level of sequence similarity is high, but for very low levels of sequence similarity the problem remains unsolved. Most recent methods use a discriminative approach and train support vector machines to distinguish related sequences from unrelated sequences. One successful approach is to base a kernel function for a support vector machine on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences. We introduce a motif kernel based on discrete sequence motifs where the motifs are synthesised using genetic programming. The motifs are evolved to discriminate between different families of evolutionary origin. The motif matches in the sequence data sets are then used to compute kernels for support vector machine classifiers that are trained to discriminate between related and unrelated sequences. When tested on two updated benchmarks, the method yields significantly better results compared to several other proven methods of remote homology detection. The superiority of the kernel is especially visible on the problem of classifying sequences to the correct fold. A rich set of motifs made specifically for each SCOP superfamily makes it possible to classify more sequences correctly than with previous motif-based methods.
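The idea of a kernel built on shared motif occurrences can be sketched in a few lines. The example below is a minimal illustration, assuming motifs are written as regular expressions and that each motif contributes a 0/1 match feature; the motif list and sequences are made up, and the genetic programming step that evolves the motifs is not shown.

```python
import math
import re

# Hypothetical motifs written as regular expressions (not taken from the paper).
MOTIFS = ["G.GK[ST]", "C..C", "H.{3}H", "W[FY]"]


def motif_features(sequence, motifs):
    """Return a 0/1 vector: 1 where the motif matches anywhere in the sequence."""
    return [1 if re.search(m, sequence) else 0 for m in motifs]


def motif_kernel(seq_a, seq_b, motifs):
    """Count motifs matched by both sequences (dot product of match vectors)."""
    fa = motif_features(seq_a, motifs)
    fb = motif_features(seq_b, motifs)
    return sum(x * y for x, y in zip(fa, fb))


def normalized_motif_kernel(seq_a, seq_b, motifs):
    """Cosine-normalised kernel; returns 0.0 if either sequence matches no motif."""
    kaa = motif_kernel(seq_a, seq_a, motifs)
    kbb = motif_kernel(seq_b, seq_b, motifs)
    if kaa == 0 or kbb == 0:
        return 0.0
    return motif_kernel(seq_a, seq_b, motifs) / math.sqrt(kaa * kbb)


seq1 = "MAGSGKSTLLCAACQ"   # matches G.GK[ST] and C..C
seq2 = "TTGAGKTALWFPP"     # matches G.GK[ST] and W[FY]
shared = motif_kernel(seq1, seq2, MOTIFS)                 # 1 shared motif
similarity = normalized_motif_kernel(seq1, seq2, MOTIFS)  # 1 / sqrt(2 * 2) = 0.5
```

A matrix of such pairwise values over a training set is what a support vector machine with a precomputed kernel would consume. Because the kernel is a dot product of explicit feature vectors, it is positive semidefinite by construction.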
The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding
Background: Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way and with an acceptable false discovery rate.
Results: We present the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of improved peak definition in combination with known characteristics of ChIP-Seq data.
Conclusions: Triform outperforms several existing methods in the identification of representative peak profiles in curated benchmark data sets. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represents a particular challenge in many ChIP-Seq experiments. The Triform algorithm has been implemented in R, and is available via http://tare.medisin.ntnu.no/triform.
Keywords: ChIP-Seq, Peak finding, Benchmark, Repeat