Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. This volume of structural data has made traditional methods,
such as manual inspection of protein structures, infeasible. Machine learning has been
widely applied in bioinformatics and has achieved considerable success in this area. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
under the multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea by classifying multi-class SCOP protein fold data. We have
compared our approach with PART and shown that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging and can be applied to other bioinformatics
problems similarly characterised by multi-class imbalanced data sets held in multiple data
sources.
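The multi-source idea above can be sketched as follows: train one base learner per data source and combine their votes, instead of joining the sources into a single feature table. This is a minimal illustration, not the paper's actual method; the `NearestCentroid` base learner and the toy two-source setup are assumptions made for the sketch.

```python
from collections import Counter

class NearestCentroid:
    """Minimal base learner: predicts the class whose mean vector is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            dim = len(rows[0])
            self.centroids[label] = [sum(r[d] for r in rows) / len(rows)
                                     for d in range(dim)]
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))

def ensemble_predict(models, views):
    """Majority vote over per-source models; `views` holds one feature
    vector per data source, preserving the independence of the sources."""
    votes = [m.predict(v) for m, v in zip(models, views)]
    return Counter(votes).most_common(1)[0][0]
```

Each source keeps its own feature space and model; only the predicted labels are integrated, which is one simple way to learn over multiple data types without merging them.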
Integrative machine learning approach for multi-class SCOP protein fold classification
Classification and prediction of protein structure has been a central research theme in structural bioinformatics. Due to the imbalanced distribution of proteins across the SCOP classification, most discriminative machine learning methods suffer from the well-known "false positives" problem when learning over these types of problems. We have devised eKISS, an ensemble machine learning system specifically designed to increase the coverage of positive examples when learning under multi-class imbalanced data sets. We have applied eKISS to classify 25 SCOP folds and show that our learning system improves over classical learning methods.
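The coverage idea can be caricatured with a deliberately simple combination rule: flag an example as a given class whenever any base classifier predicts it, which can only increase sensitivity for that class, at the cost of precision. eKISS's actual rule-integration scheme is more involved; this sketch, with its hypothetical `union_positive` helper, only conveys the intuition.

```python
def union_positive(per_classifier_preds, target_class):
    """Flag an example as positive for `target_class` if ANY base
    classifier predicts that class. A stand-in for coverage-boosting
    combination: per-class sensitivity never decreases versus any
    single member, though false positives may increase."""
    return [any(vote == target_class for vote in votes)
            for votes in zip(*per_classifier_preds)]
```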
Diversity and generalisation error in classification ensembles
Ensembles are important tools in machine learning because they are often more accurate
than single predictors. Although it has been shown that an accurate ensemble would
benefit from having both accurate and diverse predictors, some studies in the literature
could not support the influence that diversity has on the overall accuracy of an ensemble.
In this thesis we investigate the influence that diversity has on improving accuracy,
or equivalently on reducing the generalisation error.
Many diversity measures have been introduced in the literature; however, as outlined
in [1], the only one that had a strong negative correlation with generalisation error was
a diversity measure called ambiguity. The ambiguity measure was obtained by using the
bias-variance decomposition of classifiers along with the 0-1 loss. As a result, our first
set of experiments focuses on this type of diversity measure. We analyse the effect that
the ambiguity measure has on decreasing the generalisation error of forests created by
bootstrapping. We compare the effect of ambiguity under bootstrapping with and
without replacement, while varying the number of trees and the patterns or features
used to build each tree. Our results show that bootstrapping without replacement
yields lower test errors. A similar effect is seen with bigger ensembles or when providing
more data to the classifiers. We propose pruning approaches that involve ambiguity and
compare their effect on the generalisation error versus a pruning method that promotes
randomness. Our results show that there is no significant difference between the two types
of approaches.
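One common way to make a 0-1-loss ambiguity concrete is the average disagreement between each ensemble member and the plurality vote. The thesis derives its measure from a bias-variance decomposition, so treat the following as an illustrative stand-in rather than the exact definition.

```python
from collections import Counter

def ambiguity_01(member_preds):
    """Average 0-1 disagreement between each member and the plurality vote.
    `member_preds[m][i]` is member m's predicted label for instance i.
    Returns 0 exactly when all members agree on every instance."""
    n_instances = len(member_preds[0])
    total = 0.0
    for i in range(n_instances):
        votes = [preds[i] for preds in member_preds]
        majority = Counter(votes).most_common(1)[0][0]
        total += sum(v != majority for v in votes) / len(votes)
    return total / n_instances
```

Measures of this kind can then be correlated with test error across ensembles built with different bootstrap settings.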
Next, we define two new ambiguity measures derived from the cross entropy and hinge
loss. We analyse their properties and find that out of the three ambiguity measures defined
for classifiers (including the 0-1 loss introduced earlier), the only one that achieves all the
desired properties of a diversity measure is the one obtained from the cross entropy (being
always positive, and zero if and only if all the classifiers agree). We build ensembles
using bagging with varying sampling rates and find that there is a negative
correlation between generalisation error and diversity at high sampling rates; conversely,
generalisation error is positively correlated with diversity when the sampling rate is low
and the diversity is high. We use an evolutionary algorithm to maximise ambiguity
and we find that the evolved ensemble in general has lower generalisation error than the
initial ensemble. We define the term "ambiguous ensembles" as ensembles with high values
of ambiguity. Additionally, we investigate the effect of pruning on larger ensembles and
propose several pruning methods that prioritize ambiguity, as well as others that promote
less ambiguous ensembles. Our results show that the approaches that prefer ambiguous
ensembles reduce the generalisation error. Hence, our overall results support the influence
that diversity has on minimising generalisation error.
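One plausible construction of a cross-entropy ambiguity is the Jensen gap for log loss: the average member log loss minus the log loss of the averaged prediction. By Jensen's inequality it is non-negative, and it vanishes when the members assign identical probabilities to the true class. The thesis's exact definition may differ; this is a hedged sketch of the idea.

```python
import math

def cross_entropy_ambiguity(member_probs, true_labels):
    """Jensen-gap ambiguity for log loss.
    `member_probs[m][i]` is member m's class-probability vector for
    instance i; `true_labels[i]` indexes the true class.
    Non-negative; zero when all members give the true class the same
    probability on every instance."""
    n_members = len(member_probs)
    n_instances = len(true_labels)
    amb = 0.0
    for i, y in enumerate(true_labels):
        p_true = [probs[i][y] for probs in member_probs]
        avg_member_loss = sum(-math.log(p) for p in p_true) / n_members
        ensemble_loss = -math.log(sum(p_true) / n_members)
        amb += avg_member_loss - ensemble_loss  # >= 0 by Jensen
    return amb / n_instances
```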
Finally, we define diverse forests by building trees with different impurities. We choose
families of impurities characterised by different parameters and analyse
the effect that choosing different parameters has on generalisation performance. By
tuning the parameters we can define symmetric or asymmetric impurities. For
imbalanced datasets, the use of asymmetric impurities has proven beneficial for
predicting the minority class, which is usually the class of most interest. We contrast the behaviour
of forests using symmetric or asymmetric impurities with forests of trees built with
different impurities (different parameters). Our results do not show a significant difference
in performance.
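One parameterised family with this behaviour (used, for example, in work on asymmetric entropies) takes the two-class form p(1-p) / ((1-2w)p + w^2): it is maximised at p = w, and at w = 0.5 it reduces to a scaled symmetric Gini. Whether this is the exact family used in the thesis is an assumption; the sketch only shows how one parameter can shift an impurity's peak toward the minority class.

```python
def asymmetric_impurity(p, w=0.5):
    """Two-class impurity p(1-p) / ((1-2w)p + w^2) for p in (0, 1).
    The parameter w in (0, 1) places the maximum at p = w, so values of
    w below 0.5 make nodes mixing in the minority (positive) class look
    more impure, encouraging splits that isolate it. w = 0.5 recovers
    4 * p * (1 - p), i.e. a scaled symmetric Gini."""
    return p * (1 - p) / ((1 - 2 * w) * p + w * w)
```

Building each tree in a forest with a different `w` is one way to obtain the "forests of trees built with different impurities" contrasted above.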