A Diversity-Accuracy Measure for Homogenous Ensemble Selection
Several selection methods in the literature are essentially based on an evaluation function that determines whether a model M contributes positively to boosting the performance of the whole ensemble. In this paper, we propose a method called DIversity and ACcuracy for Ensemble Selection (DIACES), which uses an evaluation function based on both diversity and accuracy. The method is applied to homogeneous ensembles composed of C4.5 decision trees and relies on a hill-climbing strategy, which allows selecting ensembles with the best compromise between maximum diversity and minimum error rate. Comparative studies show that, in most cases, the proposed method generates reduced-size ensembles with better performance than usual ensemble simplification methods.
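The general idea of hill-climbing ensemble selection with a combined diversity-accuracy criterion can be sketched as follows. This is an illustrative sketch only: the exact DIACES evaluation function is not reproduced here, so the score (a weighted sum of vote accuracy and pairwise disagreement, with a hypothetical weight `alpha`) and the forward-search strategy are assumptions for demonstration.

```python
import numpy as np

def disagreement(preds_a, preds_b):
    """Pairwise diversity: fraction of samples on which two members disagree."""
    return np.mean(preds_a != preds_b)

def ensemble_score(member_preds, y, alpha=0.5):
    """Illustrative criterion mixing accuracy and diversity (not the exact
    DIACES function): alpha * vote accuracy + (1 - alpha) * mean disagreement."""
    member_preds = np.asarray(member_preds)
    # Majority vote of the current sub-ensemble.
    votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, member_preds)
    acc = np.mean(votes == y)
    if len(member_preds) < 2:
        return acc
    pairs = [(i, j) for i in range(len(member_preds))
             for j in range(i + 1, len(member_preds))]
    div = np.mean([disagreement(member_preds[i], member_preds[j]) for i, j in pairs])
    return alpha * acc + (1 - alpha) * div

def hill_climb_select(all_preds, y, alpha=0.5):
    """Forward hill climbing: start from the most accurate member and greedily
    add the member that most improves the score, stopping at a local optimum."""
    remaining = list(range(len(all_preds)))
    best_start = max(remaining, key=lambda i: np.mean(all_preds[i] == y))
    selected = [best_start]
    remaining.remove(best_start)
    current = ensemble_score([all_preds[i] for i in selected], y, alpha)
    improved = True
    while improved and remaining:
        improved = False
        gains = [(ensemble_score([all_preds[i] for i in selected + [j]], y, alpha), j)
                 for j in remaining]
        best_gain, best_j = max(gains)
        if best_gain > current:
            selected.append(best_j)
            remaining.remove(best_j)
            current = best_gain
            improved = True
    return selected, current
```

A backward (pruning) variant would start from the full ensemble and greedily drop members instead; the paper's comparative results concern which criterion guides the climb, not the search direction alone.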
Machine learning ensemble method for discovering knowledge from big data
Big data, generated from various business, internet, and social media activities, has
become a major challenge for researchers in the fields of machine learning and data
mining, who must develop new methods and techniques for analysing big data effectively
and efficiently. Ensemble methods represent an attractive approach to mining large
datasets because of their accuracy and their ability to exploit the
divide-and-conquer mechanism in parallel computing environments.
This research proposes a machine learning ensemble framework and implements it
in a high-performance computing environment. It begins by identifying
and categorising the effects of partitioned data subset size on ensemble accuracy when
dealing with very large training datasets. An algorithm is then developed to ascertain
the patterns of the relationship between ensemble accuracy and the size of partitioned
data subsets. The research concludes with the development of a selective modelling
algorithm, an efficient alternative to static model selection methods for big
datasets.
The results show that maximising the size of partitioned data subsets does not
necessarily improve the performance of an ensemble of classifiers dealing with large
datasets. Identifying the patterns exhibited by the relationship between ensemble
accuracy and partitioned data subset size facilitates the determination of the best
subset size for partitioning huge training datasets. Finally, traditional model
selection is inefficient when large datasets are involved.
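The divide-and-conquer scheme described above can be sketched as follows. This is a minimal illustration, not the thesis framework: the nearest-centroid base learner is a stand-in (the actual framework is learner-agnostic and would run members in parallel on an HPC cluster), and the function names are hypothetical.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in base learner; any classifier could be trained per subset."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dist.argmin(axis=1)]

def partitioned_ensemble(X, y, n_parts, rng):
    """Disjointly partition the training data and fit one model per subset.
    Varying n_parts changes the subset size, the quantity studied above."""
    idx = rng.permutation(len(X))
    return [NearestCentroid().fit(X[part], y[part])
            for part in np.array_split(idx, n_parts)]

def majority_vote(models, X):
    """Combine the per-subset models by unweighted majority voting."""
    preds = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```

Sweeping `n_parts` over a grid and plotting ensemble accuracy against subset size is the kind of experiment the thesis uses to show that the largest subsets are not necessarily the best.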
High-dimensional and one-class classification
When dealing with high-dimensional data and, in particular, when the number of attributes p is large relative to the sample size n, several classification methods cannot be applied. Fisher's linear discriminant rule and the quadratic discriminant rule are infeasible, as the inverse of the involved covariance matrices cannot be computed. A recent approach to overcoming this problem is based on Random Projections (RPs), which have emerged as a powerful method for dimensionality reduction. In 2017, Cannings and Samworth introduced the RP method in the ensemble context to extend to the high-dimensional domain classification methods originally designed for low-dimensional data. Although the RP ensemble classifier improves classification accuracy, it may still include redundant information. Moreover, unlike other ensemble classifiers (e.g. Random Forest), it does not provide any insight into the actual classification importance of the input features. To account for these aspects, in the first part of this thesis, we investigate two new directions for the RP ensemble classifier. Firstly, combining the original idea of using the Multiplicative Binomial distribution as the reference model for describing and predicting ensemble accuracy with an important result on that distribution, we introduce a stepwise strategy for post-pruning (called the Ensemble Selection Algorithm). Secondly, we propose a criterion (called Variable Importance in Projection) that uses the feature coefficients in the best discriminant projections to measure variable importance in classification. In the second part, we face the new challenges posed by high-dimensional data in a recently emerging classification context: one-class classification. This is a special classification task where only one class is fully known (the target class), while the information on the others is completely missing. In particular, we address this task by using Gini's transvariation probability as a measure of typicality, aimed at identifying the best boundary around the target class.
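The basic random-projection ensemble idea behind this line of work can be sketched as follows. This is a simplified illustration, not the thesis's method: each member classifies in its own low-dimensional Gaussian random subspace with a nearest-centroid rule standing in for the base classifier, and the per-block projection selection of Cannings and Samworth is omitted. All function names are hypothetical.

```python
import numpy as np

def rp_ensemble_fit(X, y, n_proj=20, d=5, rng=None):
    """Fit an ensemble of classifiers, each in its own d-dimensional
    random projection of the p-dimensional input space (p may exceed n)."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = X.shape[1]
    classes = np.unique(y)
    members = []
    for _ in range(n_proj):
        P = rng.normal(size=(p, d)) / np.sqrt(d)   # Gaussian random projection
        Z = X @ P                                  # project the training data
        # Nearest-centroid rule in the projected space (stand-in base learner;
        # it needs no covariance inverse, so it remains feasible when p > n).
        centroids = np.array([Z[y == c].mean(axis=0) for c in classes])
        members.append((P, centroids))
    return classes, members

def rp_ensemble_predict(X, classes, members):
    """Aggregate the members' projected-space decisions by majority vote."""
    votes = []
    for P, centroids in members:
        Z = X @ P
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        votes.append(classes[dist.argmin(axis=1)])
    votes = np.stack(votes)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```

The two contributions described above plug into this picture: post-pruning discards redundant members from `members`, and a variable-importance criterion reads off which input features carry large coefficients in the most discriminant projections `P`.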