34 research outputs found

    TFpredict and SABINE: Sequence-Based Prediction of Structural and Functional Characteristics of Transcription Factors

    <div><p>One of the key mechanisms of transcriptional control is the specific interaction between transcription factors (TFs) and <i>cis</i>-regulatory elements in gene promoters. The elucidation of these specific protein-DNA interactions is crucial to gain insights into the complex regulatory mechanisms and networks underlying the adaptation of organisms to dynamically changing environmental conditions. As experimental techniques for determining TF binding sites are expensive and mostly performed for selected TFs only, accurate computational approaches are needed to analyze transcriptional regulation in eukaryotes on a genome-wide level. We implemented a four-step classification workflow which, for a given protein sequence, (1) discriminates TFs from other proteins, (2) determines the structural superclass of TFs, (3) identifies the DNA-binding domains of TFs and (4) predicts their <i>cis</i>-acting DNA motif. While existing tools were extended and adapted for performing the latter two prediction steps, the first two steps are based on a novel numeric sequence representation which allows for combining existing knowledge from a BLAST scan with robust machine learning-based classification. By evaluation on a set of experimentally confirmed TFs and non-TFs, we demonstrate that our new protein sequence representation facilitates more reliable identification and structural classification of TFs than previously proposed sequence-derived features. The algorithms underlying our proposed methodology are implemented in the two complementary tools TFpredict and SABINE. The online and stand-alone versions of TFpredict and SABINE are freely available to academics at <a href="http://www.cogsys.cs.uni-tuebingen.de/software/TFpredict/" target="_blank">http://www.cogsys.cs.uni-tuebingen.de/software/TFpredict/</a> and <a href="http://www.cogsys.cs.uni-tuebingen.de/software/SABINE/" target="_blank">http://www.cogsys.cs.uni-tuebingen.de/software/SABINE/</a>.</p></div>

    Accessing planetary plasma datasets via the TAP and PDAP protocols

    There are many challenges to achieving interoperability and data sharing across heterogeneous systems. Systems and data are implemented and stored across multiple platforms and specifications, which has created rigid point-to-point integrations. To make data discovery interoperable when querying planetary science data centers, common standards and specifications are needed to search and retrieve data from the disparate sources. Data producers are given a common construct for exposing their data, without having to compromise their internal implementation, so that users and systems can easily discover, search, and consume it. TAP (Table Access Protocol) and PDAP (Planetary Data Access Protocol) are protocols to access, distribute, and retrieve planetary data. They provide an interoperable and flexible environment to search, aggregate, and retrieve data. We will present the prototype of an interoperable system for planetary plasma datasets, based on the Planetary Science Resource Data Model designed by EuroPlaNet IDIS (Integrated and Distributed Information Service) and accessed via the TAP and PDAP protocols.
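As a rough illustration of what a TAP request looks like, the sketch below builds the URL for a synchronous ADQL query. The parameter names `REQUEST`, `LANG`, and `QUERY` and the `/sync` resource come from the TAP standard; the service address, the table name `plasma.datasets`, and the column `target_name` are hypothetical placeholders, not the actual IDIS prototype.

```python
from urllib.parse import urlencode


def build_tap_sync_url(base_url: str, adql: str) -> str:
    """Return the URL of a synchronous TAP query for the given ADQL string."""
    params = {
        "REQUEST": "doQuery",  # standard TAP request type
        "LANG": "ADQL",        # query language understood by TAP services
        "QUERY": adql,
    }
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"


# Hypothetical service address and table/column names, for illustration only.
TAP_BASE = "http://tap.example.org/tap"
ADQL = "SELECT TOP 10 * FROM plasma.datasets WHERE target_name = 'Saturn'"

url = build_tap_sync_url(TAP_BASE, ADQL)
print(url)
```

The function only assembles the request; sending it and parsing the returned table is left to whichever HTTP and VOTable tooling the client already uses.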

    Calculation of BLAST bit score percentile features.

    <p>The protein sequence is aligned to TF and non-TF sequences in a non-redundant sequence database, which does not contain the input sequence itself. Next, the bit scores of all TFs and non-TFs among the BLAST hits are extracted from the BLAST result. The bit score distributions observed for TFs and non-TFs, respectively, are represented based on the minimum <i>p<sub>0</sub></i>, the lower quartile <i>p<sub>25</sub></i>, the median <i>p<sub>50</sub></i>, the upper quartile <i>p<sub>75</sub></i> and the maximum <i>p<sub>100</sub></i>. The bit score feature representation is then obtained by concatenation of the components calculated for the TF and non-TF class. In addition to binary classification tasks, this feature representation is also applicable to multiclass problems, such as the prediction of TF superclasses. For this purpose, the feature vector components capturing the bit score distributions of each superclass were concatenated.</p>
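The feature construction described in this caption can be sketched as follows. This is a minimal illustration, not the TFpredict implementation: the linear-interpolation percentile, the zero fill for classes without BLAST hits, and the example bit scores are all assumptions.

```python
from typing import Dict, List, Sequence

PERCENTILES = (0, 25, 50, 75, 100)  # p0, p25, p50, p75, p100 as in the caption


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Percentile q (0..100) of an ascending sequence, by linear interpolation."""
    if not sorted_vals:
        return 0.0  # assumption: a class without hits contributes zeros
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def bit_score_features(scores_by_class: Dict[str, List[float]],
                       class_order: Sequence[str]) -> List[float]:
    """Concatenate the five percentiles of each class's bit scores."""
    features: List[float] = []
    for label in class_order:
        scores = sorted(scores_by_class.get(label, []))
        features.extend(percentile(scores, q) for q in PERCENTILES)
    return features


# Made-up bit scores of BLAST hits, grouped by the class of the hit sequence.
hits = {"TF": [310.0, 120.0, 95.0, 250.0, 60.0], "non-TF": [40.0, 55.0, 35.0]}
features = bit_score_features(hits, ["TF", "non-TF"])
print(features)
```

For the multiclass case, the same function applies with one entry per superclass in `class_order`, giving five components per superclass.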

    Performance comparison against previous approaches.

    <p>(<b>A</b>) The classification performance achieved by our novel sequence feature representation in conjunction with SVM classifiers was compared to two other SVM-based approaches, which were previously published by Zheng <i>et al.</i> and Kumar <i>et al.</i>, respectively. The Kumar method employs SVMs trained on PSSM profile features (orange curve in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082238#pone-0082238-g003" target="_blank">Figure 3C</a>) and the Zheng method corresponds to SVMs incorporating functional domain features (orange curve in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082238#pone-0082238-g003" target="_blank">Figure 3D</a>). The prediction accuracy was assessed in terms of the area under the threshold-averaged ROC curves obtained from stratified 4×4-fold nested cross-validation. The bar plot beside the ROC curves depicts the area under the curve that was observed for each of the three approaches. (<b>B</b>) Similar plots as in (A), showing the results of ROC evaluation for the task of predicting the structural superclasses of TFs. Kumar's method, which was originally devised for the prediction of DNA-binding proteins, was extended to facilitate the discrimination of multiple superclasses. The corresponding ROC curve is identical to the orange curve in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082238#pone-0082238-g004" target="_blank">Figure 4C</a>. The method by Zheng <i>et al.</i>, which was originally designed for the specific detection of 4 superclasses, was extended to the five-class problem evaluated here. As described in more detail in the Methods section, the extended Zheng method is based on a metaclassifier that integrates the prediction outcomes of 15 binary SVMs based on an Error-Correcting Output Code (ECOC).</p>

    Bioinformatics pipeline for the structural and functional annotation of transcription factors.

    <p>First, the input protein sequence is aligned to a non-redundant protein database using the BLAST heuristic. The bit score distributions of the TFs and non-TFs among the BLAST hits are represented by means of percentiles. These percentiles are incorporated into SVM classifiers for the discrimination of TFs from non-TFs (Step 1). If a given protein sequence was classified as a TF, another SVM is applied to predict its structural superclass (Step 2). The tool InterProScan is used to predict the functional domains of the TF and the DNA-binding domains among these are identified based on the associated GO terms (Step 3). Finally, the tool SABINE infers a DNA motif using an SVR-based algorithm (see Methods section) that takes the structural superclass and DNA-binding domains of the TF as input (Step 4).</p>

    Evaluation of classifiers and feature types for TF/non-TF discrimination.

    <p>(<b>A</b>) Each of the shown curves corresponds to one of five supervised machine learning methods trained on our novel bit score percentile features, which were employed to distinguish TFs from other proteins. The individual curves obtained for each of the four cross-validation folds were averaged based on the class discrimination cutoffs. Averaged ROC curves were computed in an analogous manner for (<b>B</b>) <i>k</i>-mer features, (<b>C</b>) PSSM profile features, (<b>D</b>) functional domain features and (<b>E</b>) pseudo amino acid features. The sensitivity and specificity achieved by the naive BLAST-based approach correspond to a single point in ROC space marked by an asterisk.</p>
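A classifier that outputs hard labels rather than scores, such as the naive BLAST-based approach mentioned here, yields one sensitivity/specificity pair and therefore a single point in ROC space. The sketch below shows that mapping; the labels and predictions are made up for illustration.

```python
from typing import Sequence, Tuple


def roc_point(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return 1.0 - specificity, sensitivity


# Made-up example: 1 = TF, 0 = non-TF.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
fpr, tpr = roc_point(y_true, y_pred)
print(fpr, tpr)
```

The point lies above a score-based ROC curve exactly when the hard-label classifier beats that curve at the same false positive rate.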

    Evaluation of classifiers and feature types for superclass prediction.

    <p>The classification performance of representative and widely used machine learning methods incorporating different features for superclass prediction was assessed by means of threshold-averaged ROC curves obtained from stratified 4×4-fold nested cross-validation. The differently colored curves correspond to distinct classification methods (see legend). For each classifier, the area under the curve (AUC) is denoted. ROC curves were obtained from classifiers incorporating (<b>A</b>) our novel bit score percentile features, (<b>B</b>) <i>k</i>-mer features, (<b>C</b>) PSSM profile features, (<b>D</b>) functional domain features and (<b>E</b>) pseudo amino acid features.</p>
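One plausible reading of "stratified 4×4-fold nested cross-validation" is four outer folds for performance estimation, each with four inner folds on its training part for model selection. The sketch below generates such index splits under that assumption. It deals samples round-robin per class without shuffling, which a real evaluation would normally add, and the label list is made up.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def stratified_folds(indices: Sequence[int], labels: Sequence[str],
                     k: int) -> List[List[int]]:
    """Split indices into k folds, dealing each class's samples round-robin."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for members in by_class.values():
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def nested_cv_splits(labels: Sequence[str], outer_k: int = 4,
                     inner_k: int = 4) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """Yield (outer_train, outer_test, inner_splits) for each outer fold."""
    all_idx = list(range(len(labels)))
    for outer_test in stratified_folds(all_idx, labels, outer_k):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner: List[Split] = []
        for inner_val in stratified_folds(outer_train, labels, inner_k):
            val = set(inner_val)
            inner.append(([i for i in outer_train if i not in val], inner_val))
        yield outer_train, outer_test, inner


labels = ["TF"] * 16 + ["non-TF"] * 16  # made-up, balanced toy data
splits = list(nested_cv_splits(labels))
print(len(splits), [len(test) for _, test, _ in splits])
```

Hyperparameters would be tuned on the inner splits only, so that each outer test fold stays untouched until the final evaluation.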

    Evaluation of different features for DNA motif prediction.

    <p>(<b>A</b>) The deviation between the predicted and annotated DNA motifs (i.e., PFM transfer error) was assessed based on the average [0, 1]-distance (see Methods section) by 4-fold stratified cross-validation. The curves indicate the average PFM transfer error observed for different features depending on the minimum PFM similarity (i.e., best match threshold) predicted for the training set TFs, whose PFMs were merged to generate the predicted PFM. (<b>B</b>) The relative frequency with which a DNA motif could be predicted for a given TF (i.e., PFM transfer rate) was concurrently determined for varying best match thresholds. The shown curves correspond to the PFM transfer rate observed for different features, which were incorporated into the SVR models used for PFM similarity estimation.</p>

    Exhaustive Error-Correcting Output Code for TF superclass prediction.

    <p>The table shows the code used for the construction of a 5-class ECOC classifier which integrates the prediction outcomes of 15 binary SVM classifiers. Each column corresponds to a two-class SVM, which treats structural classes assigned to 1 as positives and classes assigned to 0 as negatives. The rows correspond to the 5 superclasses. Each entry (bit) in the table equals the binary prediction outcome expected from a certain SVM classifier for a query protein of a specific superclass.</p>
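For k classes, an exhaustive code has 2^(k-1) - 1 columns, which gives the 15 binary classifiers mentioned here for k = 5. The sketch below builds one such code matrix and decodes a vector of binary predictions by minimum Hamming distance. The column order and the choice to fix the first class to 1 are assumptions and may differ from the table in the source; the decoding rule is the usual one for ECOC but is not confirmed by the caption.

```python
from itertools import product
from typing import List, Sequence


def exhaustive_ecoc(n_classes: int) -> List[List[int]]:
    """Return the code matrix: one row per class, one column per binary task."""
    columns = []
    for rest in product((0, 1), repeat=n_classes - 1):
        column = (1,) + rest  # fix the first class to 1 to skip complements
        if all(column):
            continue          # an all-ones column separates nothing
        columns.append(column)
    return [list(row) for row in zip(*columns)]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(code: List[List[int]], outputs: Sequence[int]) -> int:
    """Return the index of the class whose code word is closest to outputs."""
    distances = [hamming(row, outputs) for row in code]
    return distances.index(min(distances))


code = exhaustive_ecoc(5)

# Simulate three wrong binary predictions for a protein of class index 3.
noisy = list(code[3])
for position in (0, 5, 11):
    noisy[position] ^= 1
predicted = decode(code, noisy)
print(len(code), len(code[0]), predicted)
```

Any two rows of this matrix differ in 8 of the 15 positions, so up to three erroneous binary classifiers can be tolerated before the decoded class changes.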