Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. This volume of structural data has made traditional methods,
such as manual inspection of protein structures, infeasible. Machine learning has been
widely applied in bioinformatics and has achieved considerable success in this area. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
under the multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea by classifying multi-class SCOP protein fold data. We have
compared our approach with PART and shown that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging and can be applied to other bioinformatics
problems similarly characterised by multi-class imbalanced data sets held in multiple data
sources.
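The multi-source idea above can be sketched as follows: train one base learner per data source and combine their votes, instead of joining the sources into a single feature table. This is a minimal illustration, not the paper's actual method; the `NearestCentroid` base learner and the toy two-source setup are assumptions made for the sketch.

```python
from collections import Counter

class NearestCentroid:
    """Minimal base learner: predicts the class whose mean vector is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            dim = len(rows[0])
            self.centroids[label] = [sum(r[d] for r in rows) / len(rows)
                                     for d in range(dim)]
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))

def ensemble_predict(models, views):
    """Majority vote over per-source models; `views` holds one feature
    vector per data source, preserving the independence of the sources."""
    votes = [m.predict(v) for m, v in zip(models, views)]
    return Counter(votes).most_common(1)[0][0]
```

Each source keeps its own feature space and model; only the predicted labels are integrated, which is one simple way to learn over multiple data types without merging them.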
Integrative machine learning approach for multi-class SCOP protein fold classification
Classification and prediction of protein structure has been a central research theme in structural bioinformatics. Due to the imbalanced distribution of proteins across the SCOP classification, most discriminative machine learning methods suffer from the well-known "false positives" problem when learning over these types of problems. We have devised eKISS, an ensemble machine learning system specifically designed to increase the coverage of positive examples when learning under multi-class imbalanced data sets. We have applied eKISS to classify 25 SCOP folds and show that our learning system improves over classical learning methods.
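The coverage idea can be caricatured with a deliberately simple combination rule: flag an example as a given class whenever any base classifier predicts it, which can only increase sensitivity for that class, at the cost of precision. eKISS's actual rule-integration scheme is more involved; this sketch, with its hypothetical `union_positive` helper, only conveys the intuition.

```python
def union_positive(per_classifier_preds, target_class):
    """Flag an example as positive for `target_class` if ANY base
    classifier predicts that class. A stand-in for coverage-boosting
    combination: per-class sensitivity never decreases versus any
    single member, though false positives may increase."""
    return [any(vote == target_class for vote in votes)
            for votes in zip(*per_classifier_preds)]
```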
Diversity and generalisation error in classification ensembles
Ensembles are important tools in machine learning because they are often more accurate
than single predictors. Although it has been shown that an accurate ensemble would
benefit from having both accurate and diverse predictors, some studies in the literature
could not support the influence that diversity has on the overall accuracy of an ensemble.
In this thesis we investigate the influence that diversity has on improving accuracy,
or equivalently on reducing the generalisation error.
Many diversity measures have been introduced in the literature; however, as outlined
in [1], the only one that had a strong negative correlation with generalisation error was
a diversity measure called ambiguity. The ambiguity measure was obtained by using the
bias-variance decomposition of classifiers along with the 0-1 loss. As a result, our first
set of experiments focuses on this type of diversity measure. We analyse the effect that
the ambiguity measure has on decreasing the generalisation error of forests created by
bootstrapping. We compare the effect of ambiguity under bootstrapping with and
without replacement, while varying the number of trees and the patterns or features
used to build each tree. Our results show that bootstrapping without replacement
yields lower test errors. A similar effect is seen with bigger ensembles or when providing
more data to the classifiers. We propose pruning approaches that involve ambiguity and
compare their effect on the generalisation error versus a pruning method that promotes
randomness. Our results show that there is no significant difference between the two types
of approaches.
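One common way to make a 0-1-loss ambiguity concrete is the average disagreement between each ensemble member and the plurality vote. The thesis derives its measure from a bias-variance decomposition, so treat the following as an illustrative stand-in rather than the exact definition.

```python
from collections import Counter

def ambiguity_01(member_preds):
    """Average 0-1 disagreement between each member and the plurality vote.
    `member_preds[m][i]` is member m's predicted label for instance i.
    Returns 0 exactly when all members agree on every instance."""
    n_instances = len(member_preds[0])
    total = 0.0
    for i in range(n_instances):
        votes = [preds[i] for preds in member_preds]
        majority = Counter(votes).most_common(1)[0][0]
        total += sum(v != majority for v in votes) / len(votes)
    return total / n_instances
```

Measures of this kind can then be correlated with test error across ensembles built with different bootstrap settings.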
Next, we define two new ambiguity measures derived from the cross entropy and hinge
loss. We analyse their properties and find that out of the three ambiguity measures defined
for classifiers (including the 0-1 loss introduced earlier), the only one that achieves all the
desired properties of a diversity measure is the one obtained from the cross entropy (being
always positive, and zero if and only if all the classifiers agree). We build ensembles
using bagging with varying sampling rates and find that there is a negative
correlation between generalisation error and diversity at high sampling rates; conversely,
generalisation error is positively correlated with diversity when the sampling rate is low
and the diversity is high. We use an evolutionary algorithm to maximise ambiguity
and we find that the evolved ensemble in general has lower generalisation error than the
initial ensemble. We define the term "ambiguous ensembles" as ensembles with high values
of ambiguity. Additionally, we investigate the effect of pruning on larger ensembles and
propose several pruning methods that prioritize ambiguity, as well as others that promote
less ambiguous ensembles. Our results show that the approaches that prefer ambiguous
ensembles reduce the generalisation error. Hence, our overall results support the influence
that diversity has on minimising generalisation error.
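One plausible construction of a cross-entropy ambiguity is the Jensen gap for log loss: the average member log loss minus the log loss of the averaged prediction. By Jensen's inequality it is non-negative, and it vanishes when the members assign identical probabilities to the true class. The thesis's exact definition may differ; this is a hedged sketch of the idea.

```python
import math

def cross_entropy_ambiguity(member_probs, true_labels):
    """Jensen-gap ambiguity for log loss.
    `member_probs[m][i]` is member m's class-probability vector for
    instance i; `true_labels[i]` indexes the true class.
    Non-negative; zero when all members give the true class the same
    probability on every instance."""
    n_members = len(member_probs)
    n_instances = len(true_labels)
    amb = 0.0
    for i, y in enumerate(true_labels):
        p_true = [probs[i][y] for probs in member_probs]
        avg_member_loss = sum(-math.log(p) for p in p_true) / n_members
        ensemble_loss = -math.log(sum(p_true) / n_members)
        amb += avg_member_loss - ensemble_loss  # >= 0 by Jensen
    return amb / n_instances
```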
Finally, we define diverse forests by building trees with different impurities. We choose
families of impurities characterised by different parameters and analyse
the effect that choosing different parameters has on generalisation performance. By
tuning the parameters we can define symmetric or asymmetric impurities. For
imbalanced datasets, the use of asymmetric impurities has proven beneficial for
predicting the minority class, which is usually the class of most interest. We contrast the behaviour
of forests using symmetric or asymmetric impurities with forests of trees built with
different impurities (different parameters). Our results do not show a significant difference
in performance.
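One parameterised family with this behaviour (used, for example, in work on asymmetric entropies) takes the two-class form p(1-p) / ((1-2w)p + w^2): it is maximised at p = w, and at w = 0.5 it reduces to a scaled symmetric Gini. Whether this is the exact family used in the thesis is an assumption; the sketch only shows how one parameter can shift an impurity's peak toward the minority class.

```python
def asymmetric_impurity(p, w=0.5):
    """Two-class impurity p(1-p) / ((1-2w)p + w^2) for p in (0, 1).
    The parameter w in (0, 1) places the maximum at p = w, so values of
    w below 0.5 make nodes mixing in the minority (positive) class look
    more impure, encouraging splits that isolate it. w = 0.5 recovers
    4 * p * (1 - p), i.e. a scaled symmetric Gini."""
    return p * (1 - p) / ((1 - 2 * w) * p + w * w)
```

Building each tree in a forest with a different `w` is one way to obtain the "forests of trees built with different impurities" contrasted above.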