12 research outputs found

Motif kernel generated by genetic programming improves remote homology and fold detection

Author: Hestnes Arne JH
Håndstad Tony
Sætrom Pål
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Protein remote homology detection is a central problem in computational biology. Most recent methods train support vector machines to discriminate between related and unrelated sequences and these studies have introduced several types of kernels. One successful approach is to base a kernel on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences. RESULTS: We introduce the GPkernel, which is a motif kernel based on discrete sequence motifs where the motifs are evolved using genetic programming. All proteins can be grouped according to evolutionary relations and structure, and the method uses this inherent structure to create groups of motifs that discriminate between different families of evolutionary origin. When tested on two SCOP benchmarks, the superfamily and fold recognition problems, the GPkernel gives significantly better results compared to related methods of remote homology detection. CONCLUSION: The GPkernel gives particularly good results on the more difficult fold recognition problem compared to the other methods. This is mainly because the method creates motif sets that describe similarities among subgroups of both the related and unrelated proteins. This rich set of motifs gives a better description of the similarities and differences between different folds than do previous motif-based methods.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

NORA - Norwegian Open Research Archives

A ChIP-Seq Benchmark Shows That Sequence Conservation Mainly Improves Detection of Strong Transcription Factor Binding Sites

Author: A Moses
A Siepel
A Stark
BT Naughton
D Boffelli
D Karolchik
DT Odom
E Birney
Finn Drabløs
G Badis
G Sandve
J Bryne
J Ernst
J Hawkins
JA Hanley
K Klepper
L Elnitski
M Rye
M Tompa
Morten Beck Rye
P D'haeseleer
P Kheradpour
PJ Park
Pål Sætrom
R Jothi
Sridhar Hannenhalli
T Vavouri
Tony Håndstad
V Matys
WW Wasserman
X Xie
Y Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2011

Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial. Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods. Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

NORA - Norwegian Open Research Archives

Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements

Author: A Barski
A Kanhere
A Marson
A Pekowska
A Rada-Iglesias
A Visel
AP Boyle
B Li
BE Bernstein
CM Koch
CZ Zang
D Karolchik
DS Johnson
E Birney
E Lieberman-Aiden
Finn Drabløs
G Hon
GA Wray
GE Zentner
H Xu
H Yu
J Ernst
J Kim
JE Phillips
JM Vaquerizas
KJ Gaulton
KJ Won
KL MacQuarrie
L Ooi
LA Pennacchio
M Blanchette
M Bulger
M Gupta
M Guttman
MA Nobrega
MB Rye
MC Tsai
MH Kagey
Morten Rye
MP Creyghton
ND Heintzman
O Wallerman
PJ Farnham
PJ Park
PV Kharchenko
Pål Sætrom
Q Zhou
R Jothi
S Cuddapah
S Roy
T Kouzarides
T Li
T Ravasi
TH Kim
TK Kim
Tony Håndstad
TS Mikkelsen
V Gotea
W Niu
X Chen
Y Zhang
Z Wang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Transcription factor binding to DNA requires both an appropriate binding element and suitably open chromatin, which together help to define regulatory elements within the genome. Current methods of identifying regulatory elements, such as promoters or enhancers, typically rely on sequence conservation, existing gene annotations or specific marks, such as histone modifications and p300 binding methods, each of which has its own biases. Results: Herein we show that an approach based on clustering of transcription factor peaks from high-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) can be used to evaluate markers for regulatory elements. We used 67 data sets for 54 unique transcription factors distributed over two cell lines to create regulatory element clusters. By integrating the clusters from our approach with histone modifications and data for open chromatin, we identified general methylation of lysine 4 on histone H3 (H3K4me) as the most specific marker for transcription factor clusters. Clusters mapping to annotated genes showed distinct patterns in cluster composition related to gene expression and histone modifications. Clusters mapping to intergenic regions fall into two groups either directly involved in transcription, including miRNAs and long noncoding RNAs, or facilitating transcription by long-range interactions. The latter clusters were specifically enriched with H3K4me1, but less with acetylation of lysine 27 on histone 3 or p300 binding. Conclusion: By integrating genomewide data of transcription factor binding and chromatin structure and using our data-driven approach, we pinpointed the chromatin marks that best explain transcription factor association with different regulatory elements. Our results also indicate that a modest selection of transcription factors may be sufficient to map most regulatory elements in the human genome.
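
The abstract above does not say how peaks are grouped into clusters. As a rough illustration only, the sketch below merges peaks from several factors when neighbouring peak positions on the same chromosome lie within a fixed distance. The `max_gap` threshold, the `(chromosome, position, factor)` tuple layout and the function name are assumptions made for this example, not details from the paper.

```python
from typing import Dict, List, Tuple

# Assumed peak layout: (chromosome, peak position, factor name).
Peak = Tuple[str, int, str]


def cluster_peaks(peaks: List[Peak], max_gap: int = 100) -> List[Dict]:
    """Group peaks on one chromosome whose neighbours are at most max_gap apart.

    max_gap is an illustrative threshold, not a value from the paper.
    """
    clusters: List[Dict] = []
    for chrom, pos, factor in sorted(peaks):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and pos - last["end"] <= max_gap:
            # Close enough to the previous peak: extend the current cluster.
            last["end"] = pos
            last["factors"].add(factor)
        else:
            # Otherwise start a new cluster at this peak.
            clusters.append(
                {"chrom": chrom, "start": pos, "end": pos, "factors": {factor}}
            )
    return clusters


# Toy input with made-up coordinates and factor names.
toy_peaks = [
    ("chr1", 100, "TF_A"),
    ("chr1", 150, "TF_B"),
    ("chr1", 1000, "TF_A"),
    ("chr2", 120, "TF_C"),
]
toy_clusters = cluster_peaks(toy_peaks)
```

Each resulting cluster records which factors contributed peaks, which is the kind of composition information that could then be compared against histone marks or open-chromatin data.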

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

NORA - Norwegian Open Research Archives

Protein Remote Homology Detection using Motifs made with Genetic Programming

Author: Håndstad Tony
Publication venue: Institutt for datateknikk og informasjonsvitenskap
Publication date: 01/01/2006

A central problem in computational biology is the classification of related proteins into functional and structural classes based on their amino acid sequences. Several methods exist to detect related sequences when the level of sequence similarity is high, but for very low levels of sequence similarity the problem remains an unsolved challenge. Most recent methods use a discriminative approach and train support vector machines to distinguish related sequences from unrelated sequences. One successful approach is to base a kernel function for a support vector machine on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences. We introduce a motif kernel based on discrete sequence motifs where the motifs are synthesised using genetic programming. The motifs are evolved to discriminate between different families of evolutionary origin. The motif matches in the sequence data sets are then used to compute kernels for support vector machine classifiers that are trained to discriminate between related and unrelated sequences. When tested on two updated benchmarks, the method yields significantly better results compared to several other proven methods of remote homology detection. The superiority of the kernel is especially visible on the problem of classifying sequences to the correct fold. A rich set of motifs made specifically for each SCOP superfamily makes it possible to classify more sequences correctly than with previous motif-based methods.
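
To make the idea of a kernel built on shared motif occurrences concrete, here is a minimal sketch. It is not the GPkernel itself. It assumes motifs can be written as regular expressions, uses binary match indicators, and normalises by the geometric mean of the match counts. The motif strings and toy sequences are invented for illustration, and the genetic programming step that evolves the motifs is not shown.

```python
import re
from typing import List


def motif_features(seq: str, motifs: List[str]) -> List[int]:
    """Return 1 for each motif (a regular expression) found in seq, else 0."""
    return [1 if re.search(m, seq) else 0 for m in motifs]


def motif_kernel(a: str, b: str, motifs: List[str]) -> float:
    """Normalised number of motifs that match both sequences."""
    fa, fb = motif_features(a, motifs), motif_features(b, motifs)
    shared = sum(x * y for x, y in zip(fa, fb))
    norm = (sum(fa) * sum(fb)) ** 0.5
    # Sequences with no motif matches get similarity 0 by convention here.
    return shared / norm if norm else 0.0


# Hypothetical motifs and sequences, chosen only for this example.
MOTIFS = ["C..C", "G[KR]T", "W.{2,4}W", "HH"]
SEQS = ["MACAACGKTLL", "QCLLCWAAWHH", "MGRTPPHH"]

# Gram matrix of pairwise kernel values.
GRAM = [[motif_kernel(a, b, MOTIFS) for b in SEQS] for a in SEQS]
```

A Gram matrix of this kind could, for instance, be passed to a support vector machine implementation that accepts precomputed kernels, such as scikit-learn's `SVC(kernel="precomputed")`.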

NORA - Norwegian Open Research Archives

Protein Remote Homology Detection using Motifs made with Genetic Programming

Author: Håndstad Tony
Publication venue: Institutt for datateknikk og informasjonsvitenskap
Publication date: 01/01/2006

NORA - Norwegian Open Research Archives

The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding

Author: Drabløs Finn
Håndstad Tony
Kornacker K
Rye Morten Beck
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

Background: Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way and with an acceptable false discovery rate. Results: We present the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of improved peak definition in combination with known characteristics of ChIP-Seq data. Conclusions: Triform outperforms several existing methods in the identification of representative peak profiles in curated benchmark data sets. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represents a particular challenge in many ChIP-Seq experiments. The Triform algorithm has been implemented in R, and is available via http://tare.medisin.ntnu.no/triform. Keywords: ChIP-Seq, Peak finding, Benchmark, Repeat

Springer - Publisher Connector

NORA - Norwegian Open Research Archives

The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding

Author: Drabløs Finn
Håndstad Tony
Kornacker Karl
Rye Morten
Publication venue: Springer Science and Business Media LLC
Publication date: 01/07/2012

Background: Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way and with an acceptable false discovery rate. Results: We present the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of improved peak definition in combination with known characteristics of ChIP-Seq data. Conclusions: Triform outperforms several existing methods in the identification of representative peak profiles in curated benchmark data sets. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represents a particular challenge in many ChIP-Seq experiments. The Triform algorithm has been implemented in R, and is available via http://tare.medisin.ntnu.no/triform.

Directory of Open Access Journals

Developing a genetic analysis system for clinical purposes

Author: Eike Morten C.
Grünfeld Thomas
Håndstad Tony
Skorve Espen
Publication venue: BCS Learning and Development Limited
Publication date: 01/01/2017

VBN

A ChIP-Seq Benchmark Shows That Sequence Conservation Mainly Improves Detection of Strong Transcription Factor Binding Sites

Author: Drabløs Finn
Håndstad Tony
Rye Morten Beck
Sætrom Pål
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2011

Background: Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial. Results: Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods. Conclusions: Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites.

NORA - Norwegian Open Research Archives