371 research outputs found
Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the
learner works directly in the high dimensional predictor space of all
subsequences in the training set. This is possible by employing a new
coordinate-descent algorithm coupled with bounding the magnitude of the
gradient for selecting discriminative subsequences fast. We characterize the
loss functions for which our generic learning algorithm can be applied and
present concrete implementations for logistic regression (binomial
log-likelihood loss) and support vector machines (squared hinge loss).
Application of our algorithm to protein remote homology detection and remote
fold recognition results in performance comparable to that of state-of-the-art
methods (e.g., kernel support vector machines). Unlike state-of-the-art
classifiers, the resulting classification models are simply lists of weighted
discriminative subsequences and can thus be interpreted and related to the
biological problem
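
The idea in the abstract above, a linear model over subsequence features trained one coordinate at a time, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`substrings`, `train`, `predict`), the fixed step size `lr`, and the cap `max_len` on subsequence length are hypothetical. The sketch scores every candidate subsequence exhaustively, whereas the paper bounds the gradient magnitude to prune that search. It uses the binomial log-likelihood (logistic) loss with labels in {+1, -1}.

```python
import math


def substrings(seq, max_len):
    """Return the set of contiguous substrings of seq with length 1..max_len."""
    return {seq[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}


def train(seqs, labels, max_len=3, steps=50, lr=0.5):
    """Greedy coordinate descent on logistic loss over substring-presence features.

    Assumption: exhaustive gradient evaluation and a fixed step size stand in
    for the bounded search and step selection described in the paper.
    """
    feats = [substrings(s, max_len) for s in seqs]
    weights = {}
    for _ in range(steps):
        # Margin y_i * f(x_i) of each training sequence under the current model.
        margins = [y * sum(weights.get(f, 0.0) for f in fs)
                   for fs, y in zip(feats, labels)]
        # Gradient of sum_i log(1 + exp(-margin_i)) with respect to each weight.
        grad = {}
        for fs, y, m in zip(feats, labels, margins):
            g = -y / (1.0 + math.exp(m))
            for f in fs:
                grad[f] = grad.get(f, 0.0) + g
        # Update only the coordinate with the largest gradient magnitude.
        best = max(grad, key=lambda f: abs(grad[f]))
        weights[best] = weights.get(best, 0.0) - lr * grad[best]
    return weights


def predict(weights, seq):
    """Classify by the sign of the summed weights of the substrings present."""
    score = sum(w for f, w in weights.items() if f in seq)
    return 1 if score >= 0 else -1


# Toy data (invented): positives share the motif "KR", negatives do not.
seqs = ["AKRA", "GKRG", "AAGA", "GGAG"]
labels = [1, 1, -1, -1]
model = train(seqs, labels)
```

The returned `model` is a plain dictionary mapping subsequences to weights, which mirrors the interpretability point made in the abstract: the classifier is a list of weighted discriminative subsequences.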
Remote Homology Detection of Protein Sequences
The classification of protein sequences using string kernels
provides valuable insights for protein function prediction. Almost
all string kernels are based on patterns that are not independent,
and therefore the associated scores are obtained using a set of
redundant features. In this talk we will discuss how a class of
patterns, called Irredundant, is specifically designed to address
this issue. Loosely speaking, the set of Irredundant patterns is the
smallest class of independent patterns that can describe all
patterns in a string. We present a classification method based on
the statistics of these patterns, named Irredundant Class. Results
on benchmark data show that Irredundant Class outperforms most of
the string kernel methods previously proposed, and it achieves
results as good as the current state-of-the-art methods with
fewer patterns. Unfortunately, we show that the information
carried by the irredundant patterns cannot be easily interpreted,
so alternative notions are needed
Probabilistic multiple kernel learning
The integration of multiple and possibly heterogeneous information sources for an overall decision-making process has been an open and unresolved research direction in computing science since its very beginning. This thesis attempts to address parts of that direction by proposing probabilistic data integration algorithms for multiclass decisions where an observation of interest is assigned to one of many categories based on a plurality of information channels
Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation
Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks: the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation. 
Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the time-intense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner
Two-Stage Fuzzy Multiple Kernel Learning Based on Hilbert-Schmidt Independence Criterion
© 1993-2012 IEEE. Multiple kernel learning (MKL) is a principled approach to kernel combination and selection for a variety of learning tasks, such as classification, clustering, and dimensionality reduction. In this paper, we develop a novel fuzzy multiple kernel learning model based on the Hilbert-Schmidt independence criterion (HSIC) for classification, which we call HSIC-FMKL. In this model, we first propose an HSIC Lasso-based MKL formulation, which not only has a clear statistical interpretation that minimum redundant kernels with maximum dependence on output labels are found and combined, but also enables the global optimal solution to be computed efficiently by solving a Lasso optimization problem. Since the traditional support vector machine (SVM) is sensitive to outliers or noise in the dataset, fuzzy SVM (FSVM) is used to select the prediction hypothesis once the optimal kernel has been obtained. The main advantage of FSVM is that we can associate a fuzzy membership with each data point such that these data points can have different effects on the training of the learning machine. We propose a new fuzzy membership function using a heuristic strategy based on the HSIC. The proposed HSIC-FMKL is a two-stage kernel learning approach and the HSIC is applied in both stages. We perform extensive experiments on real-world datasets from the UCI benchmark repository and the application domain of computational biology which validate the superiority of the proposed model in terms of prediction accuracy
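
The dependence measure named in the abstract above can be made concrete with a small sketch of the standard biased empirical HSIC estimate, tr(KHLH) / (n - 1)^2, where K and L are Gram matrices and H is the centering matrix. This is only the statistic, under stated assumptions; it is not the paper's HSIC Lasso formulation or its fuzzy membership function. The helper names (`gaussian_gram`, `label_gram`, `center`, `hsic`), the Gaussian bandwidth, and the toy data are hypothetical.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar inputs."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs]
            for a in xs]


def label_gram(ys):
    """Gram matrix of a linear kernel on labels: L[i][j] = y_i * y_j."""
    return [[a * b for b in ys] for a in ys]


def center(K):
    """Return H K H with H = I - (1/n) 1 1^T (double centering)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic(K, L):
    """Biased empirical HSIC, tr(K H L H) / (n - 1)^2, for symmetric K and L."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    # For symmetric matrices, tr(Kc Lc) equals the sum of elementwise products.
    trace = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Toy data (invented): one labelling follows the inputs, one alternates.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
K = gaussian_gram(xs)
aligned = hsic(K, label_gram([-1, -1, -1, 1, 1, 1]))
alternating = hsic(K, label_gram([1, -1, 1, -1, 1, -1]))
constant = hsic(K, label_gram([1, 1, 1, 1, 1, 1]))
```

On this toy data the kernel scores higher against the labelling that follows the inputs than against the alternating one, and a constant labelling gives a value of zero up to rounding. A score of this kind is one way candidate kernels could be ranked by their dependence on the output labels.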
An automated approach to remote protein homology classification.
The classification of protein structures into evolutionary superfamilies, for example in the CATH or SCOP domain structure databases, although performed with varying degrees of automation, has remained a largely subjective activity guided by expert knowledge. The huge expansion of the Protein Data Bank (PDB), partly due to the structural genomics initiatives, has posed significant challenges to maintaining the coverage of these structural classification resources. This is because the high degree of manual assessment currently involved has affected their ability to keep pace with high throughput structure determination. This thesis presents an evaluation of different methods used in remote homologue detection which was performed to identify the most powerful approaches currently available. The design and implementation of new protocols suitable for remote homologue detection was informed by an analysis of the extent to which different homologous superfamilies in CATH evolve in sequence, structure and function and characterisation of the mechanisms by which this occurs. This analysis revealed that relatives in some highly populated CATH superfamilies have diverged considerably in their structures. In diverse relatives, significant variations are observed in the secondary structure embellishments decorating the common structural core for the superfamily. There are also differences in the packing angles between secondary structures. Information on the variability observed in CATH superfamilies is collated in an established web resource, the Dictionary of Homologous Superfamilies, which has been expanded and improved in a number of ways. A new structural comparison algorithm, CATHEDRAL, is described. This was developed to cope with the structural variation observed across CATH superfamilies and to improve the automatic recognition of domain boundaries in multidomain structures. 
CATHEDRAL combines both secondary structure matching and accurate residue alignment in an iterative protocol for determining the location of previously observed folds in novel multi-domain structures. A rigorous benchmarking protocol is also described that assesses the performance of CATHEDRAL against other leading structural comparison methods. The optimisation and benchmarking of several other methods for detecting homology are subsequently presented. These include methods which exploit Hidden Markov Models (HMMs) to detect sequence similarity and methods that attempt to assess functional similarity. Finally an automated, machine learning approach to detecting homologous relationships between proteins is presented which combines information on sequence, structure and functional similarity. This was able to identify over 85% of the homologous relationships in the CATH classification at a 5% error rate. This thesis was gratefully supported by the Biotechnology and Biological Sciences Research Council