
    Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

    We present a framework for discriminative sequence classification in which the learner works directly in the high-dimensional predictor space of all subsequences in the training set. This is made possible by a new coordinate-descent algorithm coupled with a bound on the magnitude of the gradient, which allows discriminative subsequences to be selected quickly. We characterize the loss functions to which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Applying our algorithm to protein remote homology detection and remote fold recognition yields performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem.
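As a rough illustration of the idea in the abstract above, the following sketch runs greedy coordinate descent on the binomial log-likelihood loss over subsequence counts. It is not the authors' implementation: the function names, the fixed step size, the iteration count and the cap `max_k` on subsequence length are all assumptions made for this example, and the best coordinate is found by exhaustive enumeration rather than by the gradient bound the abstract describes.

```python
import math
from collections import Counter

def kmer_counts(seq, max_k=3):
    """Count every contiguous subsequence of length 1..max_k (max_k is an assumed cap)."""
    return Counter(seq[i:i + k] for k in range(1, max_k + 1)
                   for i in range(len(seq) - k + 1))

def fit_greedy_cd(seqs, labels, iters=50, step=0.1, max_k=3):
    """Greedy coordinate descent on the binomial log-likelihood loss.

    labels are +1 / -1. Each iteration updates only the subsequence whose
    gradient has the largest magnitude. The gradient is computed here by
    exhaustive enumeration, not by a bound-based search.
    """
    feats = [kmer_counts(s, max_k) for s in seqs]
    weights = {}
    for _ in range(iters):
        # Margin y * f(x) of each training sequence under the current model.
        margins = [y * sum(weights.get(f, 0.0) * c for f, c in x.items())
                   for x, y in zip(feats, labels)]
        grad = Counter()
        for x, y, m in zip(feats, labels, margins):
            coeff = -y / (1.0 + math.exp(m))  # derivative of log(1 + exp(-m)) w.r.t. f(x)
            for f, c in x.items():
                grad[f] += coeff * c
        best = max(grad, key=lambda f: abs(grad[f]))
        weights[best] = weights.get(best, 0.0) - step * grad[best]
    return weights

def predict(weights, seq, max_k=3):
    """Sign of the weighted sum of subsequence counts."""
    score = sum(weights.get(f, 0.0) * c
                for f, c in kmer_counts(seq, max_k).items())
    return 1 if score >= 0 else -1
```

The returned `weights` dictionary is a list of weighted subsequences, which is the kind of interpretable model the abstract refers to.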

    The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

    One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in the biofuel, medicine and food industries because they act on a large range of substrates, cleaving ester bonds and synthesizing high-added-value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted to sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. 365 FAE-related sequences of fungal, bacterial and plant origin were collected and clustered into distinct groups using Self Organizing Maps followed by k-means clustering, based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequences. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs.

    LipocalinPred: a SVM-based method for prediction of lipocalins

    Background: Functional annotation of rapidly amassing nucleotide and protein sequences presents a challenging task for modern bioinformatics. This is particularly true for protein families sharing extremely low sequence identity, as for lipocalins, a family of proteins with varied functions and great diversity at the sequence level, yet conserved structures. Results: In the present study we propose an SVM-based method for identification of lipocalin protein sequences. The SVM models were trained with input features generated using amino acid, dipeptide and secondary structure compositions as well as PSSM profiles. The model derived using both PSSM and secondary structure emerged as the best model in the study. Apart from achieving a high prediction accuracy (>90% in leave-one-out), LipocalinPred correctly differentiates closely related fatty acid-binding proteins and triabins as non-lipocalins. Conclusion: The method offers a promising approach as a lipocalin prediction tool, complementing PROSITE, Pfam and homology modelling methods.
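The amino acid and dipeptide composition features mentioned in the abstract above can be sketched as follows. This is only an illustrative example, not the LipocalinPred code: the function names are hypothetical, the alphabet is assumed to be the 20 standard residues, and sequences are assumed to have at least two residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]

def aa_composition(seq):
    """Fraction of each standard amino acid in seq (20 values)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (400 values)."""
    total = len(seq) - 1  # number of adjacent pairs
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]
```

Concatenating the two lists gives a fixed-length 420-dimensional vector regardless of sequence length, which is what makes such features usable as SVM input.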

    Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure

    Recently, many methods have been proposed for classification and prediction problems in bioinformatics. One of these problems is protein structure prediction, for which machine learning approaches and new algorithms have been proposed. Among the machine learning approaches, Support Vector Machines (SVM) have attracted a lot of attention due to their high prediction accuracy. Since protein data consists of sequence and structural information, another widely used approach for modeling this structured data is to use graphs. Graph theory has been widely studied in computer science; however, it has only recently been applied to bioinformatics. In this work, we introduce new algorithms based on statistical methods, graph theory concepts and machine learning for the protein structure prediction problem. A new statistical method based on z-scores is introduced for seed selection in proteins. A new feature selection method based on finding common cliques in protein data is also introduced, which reduces noise in the data. We also introduce new binary classifiers for the prediction of structural transitions in proteins. These new binary classifiers achieve much higher accuracy than current traditional binary classifiers.

    A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models

    Background: Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone" we can observe that only some segments of sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM). Results: We use the SCOP database to perform our experiments by evaluating protein recognition within the same superfamily. Our results show that our methodology, when using SVM, performs significantly better than some of the state-of-the-art methods and is comparable to others. In addition, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function. Conclusions: The strategy of selecting only the most frequent patterns is effective for remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.

    A method for probabilistic mapping between protein structure and function taxonomies through cross training

    Background: Prediction of the function of proteins on the basis of structure, and vice versa, is a partially solved problem, largely in the domain of biophysics and biochemistry. This underlines the need for computational and bioinformatics approaches to solve the problem. Large and organized latent knowledge on protein classification exists in the form of independently created protein classification databases. By creating probabilistic maps between classes of structural classification databases (e.g. SCOP [1]) and classes of functional classification databases (e.g. PROSITE [2]), the structure and function of proteins could be probabilistically related. Results: We demonstrate that PROSITE and SCOP have significant semantic overlap, in spite of independent classification schemes. By training classifiers of SCOP using classes of PROSITE as attributes and vice versa, the accuracy of Support Vector Machine classifiers for both SCOP and PROSITE was improved. Novel attributes, 2-D elastic profiles and Blocks, were used to improve time complexity and accuracy. Many relationships were extracted between classes of SCOP and PROSITE using decision trees. Conclusion: We demonstrate that the presented approach can discover new probabilistic relationships between classes of different taxonomies and render a more accurate classification. Extensive mappings between existing protein classification databases can be created to link the large amount of organized data. Probabilistic maps were created between classes of SCOP and PROSITE, allowing predictions of structure using function, and vice versa. In our experiments, we also found that function is more strongly related to structure than structure is to function.

    Kernel Methods in Computational Biology (Introduction to Molecular Biology for Mathematicians, Workshop Report)

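    This article was digitized through the electronic library project of the National Institute of Informatics. 1. Introduction. One of the research goals of computational biology is to develop computational methods that analyze the large volumes of data produced by experimental research and automatically derive biologically useful hypotheses. Because biology produces a wide variety of data, finding a mathematical framework that can handle these data in an integrated way is also an important challenge. The data studied in computational biology include gene sequence data, chemical structure data and gene expression data; here we describe kernel methods, which make it possible to treat them in a unified manner. Kernel methods have been developed in the machine learning field over the past ten years and have been applied to numerous problems, including problems in biology. 2. Mercer kernels. A function K(x,y) from the Cartesian product of a set X to the real numbers is called a Mercer kernel if it is symmetric (K(x,y)=K(y,x)) and positive definite. If K(.,.) is a Mercer kernel, there exist a Hilbert space Φ and a function φ(x) from X to Φ such that K(x,y) is the inner product of φ(x) and φ(y). More rigorously, Mercer kernels can be put in correspondence with Hilbert spaces known as RKHS (reproducing kernel Hilbert spaces). An important property of an RKHS is that, even when it is infinite-dimensional, the minimization of a regularized function can under certain conditions be carried out by considering only a finite number of points. 3. Kernel methods. One major advantage of kernel methods is that various computations can be performed without mapping into the Hilbert space; this is called the kernel trick. As a simple example, the distance between two points in the Hilbert space can be obtained from a simple combination of kernel functions. As a more useful example, principal component analysis (PCA), one of the main methods of statistical analysis, can be performed with kernels without any computation in the Hilbert space. Kernel canonical correlation analysis (CCA) can be reduced to an eigenvalue problem and is useful for the integrated analysis of two kinds of data. The support vector machine (SVM) is a kernel-based (supervised) machine learning method: given positive and negative examples, it computes the hyperplane that separates them and maximizes the distance to the nearest point (the margin). In practice it is often impossible to separate the positive and negative examples completely, so a trade-off between classification error and distance is optimized. In an SVM, thanks to the kernel trick, the optimal separating hyperplane is expressed as a combination of kernels over a subset (in many cases a small one) of the positive and negative examples. 4. Kernel methods for protein data. To apply kernel methods to biological data, kernel functions for proteins and related data have been proposed. Kernel functions for sequences (strings) in particular have been well studied. A kernel function that maps strings into Euclidean space can be defined using the vector of occurrence frequencies of substrings of length k; this approach is called the spectrum kernel. The Fisher kernel, which defines a kernel function by extracting information from models such as the hidden Markov model (HMM), a probabilistic model widely used in sequence analysis, has also been proposed. Beyond sequence data, there are kernels for gene expression data, phylogenetic profiles and so on, as well as studies that combine diffusion kernels on graph structures with kernel CCA to extract correlations between metabolic pathways and expression data. Combinations of kernels have also been studied, for example the optimization of linear combinations of kernels by semidefinite programming.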
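Two of the ideas in the abstract above, the spectrum kernel and the kernel-trick distance, can be sketched in a few lines. This is a minimal example under stated assumptions: the function names are hypothetical, raw k-mer counts are used as the frequency vector, and the distance uses the identity ||φ(x) − φ(y)||² = K(x,x) − 2K(x,y) + K(y,y).

```python
import math
from collections import Counter

def spectrum(seq, k):
    """Counts of every length-k substring of seq (the k-spectrum)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of the k-spectrum count vectors of x and y."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(c * sy[w] for w, c in sx.items())

def kernel_distance(x, y, kernel):
    """Distance between phi(x) and phi(y), computed from kernel values only."""
    sq = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors
```

For example, `kernel_distance("ABAB", "BABA", lambda a, b: spectrum_kernel(a, b, k=2))` compares the two strings without ever building the full feature vectors over all possible 2-mers.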

    Predicting Class II MHC-Peptide binding: a kernel based approach using similarity scores

    BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to which parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable-length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy, with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when tested on data sets from MHCPEP [1], MHCBN [2], and MHCBench [3]. Evaluation by cross validation, when segregating binders and non-binders, produced an average A(ROC) of 0.824 for the MHCBench data sets (up from 0.756), and an average A(ROC) of 0.96 for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding prediction by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems.

    Directed acyclic graph kernels for structural RNA analysis

    Background: Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity. Results: We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering. Conclusion: Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.

    Peptide classification using optimal and information theoretic syntactic modeling

    We consider the problem of classifying peptides using the information residing in their syntactic representations. This problem, which has been studied for more than a decade, has typically been investigated using distance-based metrics that involve the edit operations required in the peptide comparisons. In this paper, we demonstrate that the Optimal and Information Theoretic (OIT) model of Oommen and Kashyap [22], applicable to syntactic pattern recognition, can be used to tackle the peptide classification problem. We advocate that one can model the differences between compared strings as a mutation model consisting of random substitutions, insertions and deletions obeying the OIT model. Thus, we show that the probability measure obtained from the OIT model can be perceived as a sequence similarity metric, with which a support vector machine (SVM)-based peptide classifier can be devised. The classifier we have built has been tested with eight different substitution matrices and on two different data sets, namely the HIV-1 Protease cleavage sites and the T-cell epitopes. The results show that the OIT model performs significantly better than a classifier that uses a Needleman-Wunsch sequence alignment score, that it is less sensitive to the substitution matrix than the other methods compared, and that, when combined with an SVM, it is among the best peptide classification methods available
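For reference, the Needleman-Wunsch alignment score that the abstract above uses as its baseline can be sketched as below. This illustrates only the baseline, not the OIT model. The function name and the flat match, mismatch and gap scores are assumptions for this example; the study itself uses substitution matrices.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,               # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]
```

Only two rows of the dynamic programming table are kept, which is enough when the score alone is needed and no traceback of the alignment is required.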