3,019 research outputs found

Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

Author: Ifrim Georgiana
Wiuf Carsten
Publication date: 03/08/2010

We present a framework for discriminative sequence classification where the learner works directly in the high dimensional predictor space of all subsequences in the training set. This is possible by employing a new coordinate-descent algorithm coupled with bounding the magnitude of the gradient, which allows discriminative subsequences to be selected quickly. We characterize the loss functions for which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Application of our algorithm to protein remote homology detection and remote fold recognition results in performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem.
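As a rough illustration of the idea in this abstract, the sketch below runs greedy coordinate descent on the logistic loss over subsequence counts: at each step it updates only the single subsequence whose gradient has the largest magnitude. It is not the authors' algorithm. It enumerates all subsequences up to a fixed length instead of using the gradient bound to prune the search, and the function names, the fixed step size `lr`, and the length cap `k_max` are all assumptions made for this example.

```python
import math
from collections import Counter

def kmer_counts(seq, k_max=3):
    """Count every contiguous subsequence of length 1..k_max in seq."""
    counts = Counter()
    for k in range(1, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def score(weights, seq, k_max=3):
    """Linear score of seq under a sparse {subsequence: weight} model."""
    return sum(weights.get(f, 0.0) * c for f, c in kmer_counts(seq, k_max).items())

def train(seqs, labels, steps=50, lr=0.5, k_max=3):
    """Greedy coordinate descent on the mean logistic loss; labels are +1 or -1."""
    feats = [kmer_counts(s, k_max) for s in seqs]
    vocab = sorted(set().union(*feats))
    w = {}
    for _ in range(steps):
        # Margin y * (w . x) of each training sequence under the current model.
        margins = [y * sum(w.get(f, 0.0) * c for f, c in x.items())
                   for x, y in zip(feats, labels)]
        # d/dw_f of log(1 + exp(-margin)) is -y * x_f / (1 + exp(margin)).
        coef = [-y / (1.0 + math.exp(m)) for y, m in zip(labels, margins)]
        grad = {f: sum(c * x[f] for c, x in zip(coef, feats)) / len(seqs)
                for f in vocab}
        # Update only the coordinate with the largest gradient magnitude.
        best = max(vocab, key=lambda f: abs(grad[f]))
        w[best] = w.get(best, 0.0) - lr * grad[best]
    return w

if __name__ == "__main__":
    toy_seqs = ["AAKAA", "KKAKA", "GGCGG", "CGGCC"]   # made-up toy data
    toy_labels = [1, 1, -1, -1]
    model = train(toy_seqs, toy_labels)
    for subseq, weight in sorted(model.items(), key=lambda kv: -abs(kv[1])):
        print(subseq, round(weight, 3))
```

The resulting model is a short list of weighted subsequences, which is the interpretability property the abstract highlights.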

arXiv.org e-Print Archive

CiteSeerX

The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

Author: D.B.R.K. Gupta Udatha
Gianni Panagiotou
Irene Kouskoumvekaki
Lisbeth Olsson
Publication date: 11/08/2010

One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in the biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high-added value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted to sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. A total of 365 FAE-related sequences of fungal, bacterial and plant origin were collected and clustered, using Self-Organizing Maps followed by k-means clustering, into distinct groups based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequence. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs.
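The clustering step mentioned in this abstract can be pictured with a plain k-means (Lloyd's algorithm) sketch over descriptor vectors. This is only an illustration under stated assumptions: the Self-Organizing Map stage and the SVM stage are not reproduced, the initial centers are supplied by the caller, and the two-dimensional points in the demo are made up, not the study's descriptors.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if it has none.
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
    return centers, clusters

if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    final_centers, groups = kmeans(toy_points, centers=[(0.0, 0.0), (10.0, 10.0)])
    print(final_centers)
    print(groups)
```

In practice each point would be a per-sequence descriptor vector, and the resulting cluster labels would serve as the classes that a classifier is trained to predict.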

Crossref

Chalmers Research

Nature Precedings

Online Research Database In Technology

Chalmers Publication Library

HKU Scholars Hub

LipocalinPred: a SVM-based method for prediction of lipocalins

Author: Dinesh Gupta
Jayashree Ramana
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Functional annotation of rapidly amassing nucleotide and protein sequences presents a challenging task for modern bioinformatics. This is particularly true for protein families sharing extremely low sequence identity, as for lipocalins, a family of proteins with varied functions and great diversity at the sequence level, yet conserved structures. Results: In the present study we propose an SVM-based method for identification of lipocalin protein sequences. The SVM models were trained with input features generated using amino acid, dipeptide and secondary structure compositions as well as PSSM profiles. The model derived using both PSSM and secondary structure emerged as the best model in the study. Apart from achieving a high prediction accuracy (>90% in leave-one-out), LipocalinPred correctly differentiates closely related fatty acid-binding proteins and triabins as non-lipocalins. Conclusion: The method offers a promising approach as a lipocalin prediction tool, complementing PROSITE, Pfam and homology modelling methods.
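Two of the feature types named in this abstract, amino acid composition and dipeptide composition, are simple enough to sketch directly. The code below is a minimal illustration, not the LipocalinPred implementation. The function names are made up for this example, only the 20 standard residues are counted, and the PSSM and secondary structure features are not reproduced.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values); seq must be non-empty."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among adjacent positions (400 values).

    Requires len(seq) >= 2. Overlapping pairs are counted, so "ACCA"
    contributes the pairs AC, CC and CA.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return [pairs.count(a + b) / total for a, b in product(AMINO_ACIDS, repeat=2)]

if __name__ == "__main__":
    toy = "ACCA"  # made-up toy sequence
    features = amino_acid_composition(toy) + dipeptide_composition(toy)
    print(len(features))  # 20 + 400 values per sequence
```

Concatenating the two lists gives a fixed-length vector per protein regardless of sequence length, which is what lets a standard SVM consume sequences of different sizes.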

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure

Author: Altun Gulsah
Publication venue: ScholarWorks @ Georgia State University
Publication date: 22/04/2008

Recently, many methods have been proposed for classification and prediction problems in bioinformatics. One of these problems is protein structure prediction, for which machine learning approaches and new algorithms have been proposed. Among the machine learning approaches, Support Vector Machines (SVM) have attracted a lot of attention due to their high prediction accuracy. Since protein data consists of sequence and structural information, another widely used approach for modeling this structured data is to use graphs. In computer science, graph theory has been widely studied; however, it has only recently been applied to bioinformatics. In this work, we introduce new algorithms based on statistical methods, graph theory concepts and machine learning for the protein structure prediction problem. A new statistical method based on z-scores is introduced for seed selection in proteins. A new method for feature selection based on finding common cliques in protein data is also introduced, which reduces noise in the data. We also introduce new binary classifiers for the prediction of structural transitions in proteins. These new binary classifiers achieve much higher accuracy than current traditional binary classifiers.

ScholarWorks @ Georgia State University

A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models

Author: Alessandra Carbone
Gerson Zaverucha
Juliana S Bernardes
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone" we can observe that only some segments of sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM). Results: We use the SCOP database to perform our experiments by evaluating protein recognition within the same superfamily. Our results show that our methodology, when using SVM, performs significantly better than some of the state-of-the-art methods and comparably to others. In addition, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function. Conclusions: The strategy of selecting only the most frequent patterns is effective for remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

HAL-Inserm

PubMed Central

A method for probabilistic mapping between protein structure and function taxonomies through cross training

Author: Gupta Kshitiz
Levchenko Andre
Sehgal Vivek
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Prediction of the function of proteins on the basis of structure, and vice versa, is a partially solved problem, largely in the domain of biophysics and biochemistry. This underscores the need for computational and bioinformatics approaches to the problem. Large and organized latent knowledge on protein classification exists in the form of independently created protein classification databases. By creating probabilistic maps between classes of structural classification databases (e.g. SCOP [1]) and classes of functional classification databases (e.g. PROSITE [2]), structure and function of proteins could be probabilistically related. Results: We demonstrate that PROSITE and SCOP have significant semantic overlap, in spite of independent classification schemes. By training classifiers of SCOP using classes of PROSITE as attributes and vice versa, the accuracy of Support Vector Machine classifiers for both SCOP and PROSITE was improved. Novel attributes, 2-D elastic profiles and Blocks, were used to improve time complexity and accuracy. Many relationships were extracted between classes of SCOP and PROSITE using decision trees. Conclusion: We demonstrate that the presented approach can discover new probabilistic relationships between classes of different taxonomies and render a more accurate classification. Extensive mappings between existing protein classification databases can be created to link the large amount of organized data. Probabilistic maps were created between classes of SCOP and PROSITE, allowing predictions of structure using function, and vice versa. In our experiments, we also found that functions are indeed more strongly related to structure than structure is to functions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Dspace at IIT Bombay

Kernel Methods in Computational Biology (Introduction to Molecular Biology for Mathematicians, Workshop Report)

Author: Vert Jean-Philippe
Akutsu Tatsuya (阿久津達也)
Publication venue: Bussei Kenkyu Kankokai (物性研究刊行会)
Publication date: 20/10/2003

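This article was digitized by the National Institute of Informatics as part of its electronic library project. 1. Introduction. One goal of computational biology research is to develop computational methods that analyze the large volumes of data generated by experimental research and automatically derive biologically useful hypotheses. Because biology produces many different kinds of data, finding a mathematical framework that can handle them in an integrated way is also an important problem. The data studied in computational biology include gene sequence data, chemical structure data, and gene expression data; here we describe kernel methods, which make it possible to treat these in a unified manner. Kernel methods have been developed in the machine learning field over the past ten years and have been applied to numerous problems, including ones in biology. 2. Mercer kernels. A function K(x,y) from the Cartesian product of a set X with itself to the real numbers is called a Mercer kernel if it is symmetric (K(x,y)=K(y,x)) and positive definite. If K(.,.) is a Mercer kernel, there exist a Hilbert space Φ and a function φ(x) from X to Φ such that K(x,y) is the inner product of φ(x) and φ(y). More rigorously, Mercer kernels can be put in correspondence with a Hilbert space called a reproducing kernel Hilbert space (RKHS). An important property of an RKHS is that, even when it is infinite-dimensional, the minimization of a regularized functional can, under certain conditions, be carried out by considering only a finite number of points. 3. Kernel methods. A major advantage of kernel methods is that various computations can be performed without mapping into the Hilbert space; this is called the kernel trick. As a simple example, the distance between two points in the Hilbert space can be obtained from a simple combination of kernel function values. As a more useful example, principal component analysis (PCA), one of the main techniques of statistical analysis, can also be carried out with kernels without computing in the Hilbert space. Kernel canonical correlation analysis (CCA) reduces to an eigenvalue problem and is useful for the integrated analysis of two kinds of data. The support vector machine (SVM) is a kernel-based (supervised) machine learning method: given positive and negative examples, it computes a hyperplane that separates them and maximizes the distance to the nearest point (the margin). In practice it is often impossible to separate the positive and negative examples perfectly, so a trade-off between classification error and distance is optimized. In an SVM, by the kernel trick, the optimal separating hyperplane is expressed as a combination of kernels over a subset (in many cases a small one) of the positive and negative examples. 4. Kernel methods for protein data. To apply kernel methods to biological data, kernel functions for proteins and related data have been proposed. Kernel functions for sequences (strings) in particular have been well studied. A kernel function can be defined by mapping a string to Euclidean space via the vector of occurrence frequencies of its substrings of length k; this approach is called the spectrum kernel. The Fisher kernel, which defines a kernel function by extracting information from probabilistic models such as hidden Markov models (HMMs) that are widely used in sequence analysis, has also been proposed. Beyond sequence data, there is research on kernels for gene expression data, phylogenetic profiles, and similar data, and on combining a diffusion kernel on graph structures with kernel CCA to extract correlations between metabolic pathways and expression data. Combinations of kernels have also been studied, including the optimization of linear combinations of kernels by semidefinite programming.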
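Two ideas from this abstract lend themselves to a short sketch: the spectrum kernel (an inner product of length-k substring count vectors) and the kernel-trick distance, which follows from expanding ||φ(x) − φ(y)||² into K(x,x) − 2K(x,y) + K(y,y). The code below is an illustrative sketch, not code from the report. The function names are made up here, and raw counts are used in place of normalized frequencies.

```python
import math
from collections import Counter

def spectrum(seq, k):
    """Counts of every length-k substring of seq (the explicit feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of the two k-mer count vectors."""
    cx, cy = spectrum(x, k), spectrum(y, k)
    return sum(count * cy[kmer] for kmer, count in cx.items())

def kernel_distance(x, y, kernel):
    """Feature-space distance computed from kernel evaluations only."""
    d2 = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative values from rounding

if __name__ == "__main__":
    a, b = "MKVLAAGIVKV", "MKVLSAGIVRV"  # made-up toy sequences
    print(spectrum_kernel(a, b, k=3))
    print(kernel_distance(a, b, lambda u, v: spectrum_kernel(u, v, k=3)))
```

`kernel_distance` never builds φ(x) itself, so it works unchanged with any Mercer kernel passed in as `kernel`, including ones whose feature space is infinite-dimensional.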

Kyoto University Research Information Repository

Predicting Class II MHC-Peptide binding: a kernel based approach using similarity scores

Author: Darren R Flower
Jesper Salomon
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to what parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when testing data sets from MHCPEP [1], MHCBN [2], and MHCBench [3]. Evaluation by cross validation, when segregating binders and non-binders, produced an average of 0.824 A(ROC) for the MHCBench data sets (up from 0.756), and an average of 0.96 A(ROC) for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding predictions by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems.

Crossref

Directory of Open Access Journals

PubMed Central

Aston Publications Explorer

Oxford University Research Archive

Directed acyclic graph kernels for structural RNA analysis

Author: Kengo Sato
Kiyoshi Asai
Toutai Mituyama
Yasubumi Sakakibara
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity. Results: We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering. Conclusion: Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Peptide classification using optimal and information theoretic syntactic modeling

Author: Aygün Ezra
Cataltepe Z
Oommen B. John
Publication venue: Elsevier BV
Publication date: 01/01/2010

We consider the problem of classifying peptides using the information residing in their syntactic representations. This problem, which has been studied for more than a decade, has typically been investigated using distance-based metrics that involve the edit operations required in the peptide comparisons. In this paper, we demonstrate that the Optimal and Information Theoretic (OIT) model of Oommen and Kashyap [22], applicable to syntactic pattern recognition, can be used to tackle the peptide classification problem. We advocate that one can model the differences between compared strings as a mutation model consisting of random substitutions, insertions and deletions obeying the OIT model. Thus, we show that the probability measure obtained from the OIT model can be perceived as a sequence similarity metric, with which a support vector machine (SVM)-based peptide classifier can be devised. The classifier we have built has been tested for eight different substitution matrices and for two different data sets, namely, the HIV-1 Protease cleavage sites and the T-cell epitopes. The results show that the OIT model performs significantly better than the one which uses a Needleman-Wunsch sequence alignment score, that it is less sensitive to the substitution matrix than the other methods compared, and that, when combined with an SVM, it is among the best peptide classification methods available.
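For context on the baseline this abstract compares against, the sketch below computes a Needleman-Wunsch global alignment score by dynamic programming. It is a generic textbook version, not the authors' setup. The flat match, mismatch and gap scores are assumptions chosen for this example, whereas the paper evaluates substitution matrices, and the OIT model itself is not reproduced.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b with a linear gap penalty."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(needleman_wunsch_score("SQNYPIVQ", "SQNYPVVQ"))  # made-up toy peptides
```

Only two rows of the dynamic programming table are kept, so memory grows with the length of `b` alone, which is enough when only the score is needed and not the alignment itself.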

Crossref

NORA - Norwegian Open Research Archives

Agder University Research Archive