6,043 research outputs found
Classification of protein interaction sentences via Gaussian processes
The increase in the availability of protein interaction studies in textual format, coupled with the demand for easier access to the key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for extracting small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine a classifier that has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the lack of a margin parameter, which requires costly tuning, along with the principled multiclass extensions enabled by the probabilistic framework, makes GPs an appealing alternative worthy of further adoption.
Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the
learner works directly in the high dimensional predictor space of all
subsequences in the training set. This is possible by employing a new
coordinate-descent algorithm coupled with bounding the magnitude of the
gradient for selecting discriminative subsequences fast. We characterize the
loss functions for which our generic learning algorithm can be applied and
present concrete implementations for logistic regression (binomial
log-likelihood loss) and support vector machines (squared hinge loss).
Application of our algorithm to protein remote homology detection and remote
fold recognition results in performance comparable to that of state-of-the-art
methods (e.g., kernel support vector machines). Unlike state-of-the-art
classifiers, the resulting classification models are simply lists of weighted
discriminative subsequences and can thus be interpreted and related to the
biological problem.
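
The idea of learning a sparse list of weighted subsequences can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it enumerates every contiguous substring up to a hypothetical `max_len` by brute force, where the abstract describes bounding the gradient magnitude to prune that search, and it uses a fixed step size chosen for illustration. The function names, the toy sequences, and the parameters `iters` and `step` are all assumptions.

```python
import math

def substrings(seq, max_len):
    """Return the set of contiguous substrings of seq with length <= max_len."""
    return {seq[i:i + n] for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}

def train(seqs, labels, max_len=3, iters=30, step=0.5):
    """Greedy coordinate descent on logistic loss; labels are +1 or -1.

    Assumption: candidates are enumerated exhaustively (no gradient bound).
    """
    feats = [substrings(s, max_len) for s in seqs]  # binary presence features
    candidates = sorted(set().union(*feats))
    weights = {}                                    # subsequence -> weight
    for _ in range(iters):
        # Margins y_i * f(x_i) under the current sparse model.
        margins = [y * sum(weights.get(f, 0.0) for f in fs)
                   for fs, y in zip(feats, labels)]
        # Derivative of log(1 + exp(-margin)) with respect to f(x_i).
        resid = [-y / (1.0 + math.exp(m)) for y, m in zip(labels, margins)]
        # Select the coordinate whose gradient has the largest magnitude.
        best, best_grad = None, 0.0
        for c in candidates:
            g = sum(r for r, fs in zip(resid, feats) if c in fs)
            if abs(g) > abs(best_grad):
                best, best_grad = c, g
        if best is None:
            break
        weights[best] = weights.get(best, 0.0) - step * best_grad
    return weights

def predict(weights, seq, max_len=3):
    """Classify by the sign of the summed weights of matching subsequences."""
    score = sum(weights.get(f, 0.0) for f in substrings(seq, max_len))
    return 1 if score > 0 else -1

# Toy data (hypothetical): positives share "AA", negatives share "TTC".
seqs = ["AAGT", "CAAG", "GTTC", "TTCG"]
labels = [1, 1, -1, -1]
model = train(seqs, labels)
print(sorted(model.items()))
```

The returned dictionary is the whole model, so it can be read directly as a list of weighted subsequences, which is the interpretability property the abstract highlights.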
A topological approach for protein classification
A protein's function and dynamics are closely related to its sequence and
structure. However, predicting protein function and dynamics from
sequence and structure remains a fundamental challenge in molecular biology.
Protein classification, which is typically done through measuring the
similarity between proteins based on protein sequence or physical
information, serves as a crucial step toward the understanding of protein
function and dynamics. Persistent homology is a new branch of algebraic
topology that has found success in topological data analysis across a
variety of disciplines, including molecular biology. The present work explores
the potential of using persistent homology as an independent tool for protein
classification. To this end, we propose a molecular topological fingerprint
based support vector machine (MTF-SVM) classifier. Specifically, we construct
machine learning feature vectors solely from protein topological fingerprints,
which are topological invariants generated during the filtration process. To
validate the present MTF-SVM approach, we consider four types of problems.
First, we study protein-drug binding by using the M2 channel protein of
influenza A virus. We achieve 96% accuracy in discriminating drug-bound and
unbound M2 channels. Additionally, we examine the use of MTF-SVM for the
classification of hemoglobin molecules in their relaxed and taut forms and
obtain about 80% accuracy. The identification of all alpha, all beta, and
alpha-beta protein domains is carried out in our next study using 900 proteins.
We achieve an 85% success rate in this identification. Finally, we apply the
present technique to 55 classification tasks of protein superfamilies over 1357
samples. An average accuracy of 82% is attained. The present study establishes
computational topology as an independent and effective alternative for protein
classification.
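
To make "topological invariants generated during the filtration process" more concrete, the following Python sketch computes only the simplest such invariant: the scales at which connected components merge (dimension-0 persistence) as a distance threshold grows over a point cloud. It is an illustration under stated assumptions, not the MTF-SVM feature set, which the abstract does not specify. The function names, the three summary features, and the toy coordinates are hypothetical, and the SVM training step is omitted.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the edge length
    that merges it into another component; one component never dies.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise edges in order of increasing length.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # this edge merges two components
            parent[ri] = rj
            deaths.append(length)
    return deaths

def fingerprint(points):
    """Hypothetical feature vector: longest bar, mean bar length, bar count."""
    deaths = h0_barcode(points)
    if not deaths:
        return [0.0, 0.0, 0.0]
    return [max(deaths), sum(deaths) / len(deaths), float(len(deaths))]

# Toy coordinates (hypothetical), standing in for atom positions.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
features = fingerprint(atoms)
print(features)
```

A fixed-length vector of this kind is what a classifier such as an SVM would consume; higher-dimensional invariants (loops, cavities) would need a dedicated persistent homology library.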