13 research outputs found

    An Approach to Classifier Ensemble Design Using a Genetic Algorithm

    The paper proposes a new evolutionary approach to classifier ensemble design. The approach is based on a genetic algorithm with a modified realization scheme, applied to optimizing the decomposition of the feature set into subsets, which define the ensemble's individual classifiers and provide high classification accuracy. During optimization, both the parameters of the individual classifiers and those of the ensemble as a whole are determined. Using the approach, ensembles were designed for several datasets from a machine learning repository and for one real medical dataset. Comparative testing shows the advantages of the proposed approach for the analysis of multivariate data with a large number of features.
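The genetic search described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' exact algorithm: a chromosome assigns each feature to one of k ensemble members, and a caller-supplied fitness function scores the resulting partition (in the paper, ensemble classification accuracy). Truncation selection and point mutation are assumed operator choices.

```python
import random

def evolve_partition(n_features, k, fitness, generations=50,
                     pop_size=20, mutation_rate=0.1, seed=0):
    """Evolve an assignment of features to k ensemble members."""
    rng = random.Random(seed)
    pop = [[rng.randrange(k) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half; the best chromosome always survives.
        elite = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        for parent in elite:
            child = parent[:]
            for i in range(n_features):
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(k)  # point mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

In a full implementation, the fitness call would train the ensemble members on their assigned feature subsets and return a validation accuracy.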

    Evolutionary Design of the Classifier Ensemble

    This paper presents two novel approaches to the evolutionary design of a classifier ensemble. The first poses the task of single-objective optimization of feature set partitioning, together with feature weighting, for the construction of the individual classifiers. The second approach deals with multi-objective optimization of the classifier ensemble design. The proposed approaches have been tested on two data sets from the machine learning repository and one real data set on transient ischemic attack. The experiments show the advantages of feature weighting, in terms of classification accuracy, when dealing with multivariate data sets, and the possibility of obtaining, in one run of the multi-objective genetic algorithm, non-dominated ensembles of different sizes, thereby skipping the tedious process of iteratively searching for the best ensemble of fixed size.

    Classification in high-dimensional feature spaces: Random subsample ensemble

    This paper presents the application of machine learning ensembles that randomly project the original high-dimensional feature space onto multiple lower-dimensional feature subspaces to classification problems with high-dimensional feature spaces. The motivation is to address challenges associated with algorithm scalability, data sparsity, and information loss due to the so-called curse of dimensionality. The original high-dimensional feature space is randomly projected onto a number of lower-dimensional feature subspaces. Each of these subspaces constitutes the domain of a classification subtask and is associated with a base learner within an ensemble machine-learner context. Such an ensemble conceptualization is called a random subsample ensemble. Simulation results on data sets with up to 20,000 features indicate that the random subsample ensemble classifier performs comparably to other benchmark machine learners in terms of prediction accuracy and CPU time. This finding establishes the feasibility of the ensemble and positions it to tackle classification problems with even higher-dimensional feature spaces.
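A minimal sketch of the idea follows. The nearest-centroid base learner and the majority vote are simplifying assumptions chosen for brevity, not the paper's benchmark learners: each ensemble member sees only a random subset of the feature indices.

```python
import random
from collections import Counter

def train_centroids(X, y, feats):
    """Per-class mean of the selected feature columns."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        proj = [row[f] for f in feats]
        if label not in sums:
            sums[label], counts[label] = [0.0] * len(feats), 0
        sums[label] = [s + v for s, v in zip(sums[label], proj)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_centroid(centroids, feats, row):
    proj = [row[f] for f in feats]
    return min(centroids, key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(proj, centroids[c])))

def random_subsample_ensemble(X, y, n_members=11, subspace_dim=3, seed=0):
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Each member is trained on a random lower-dimensional subspace.
        feats = rng.sample(range(len(X[0])), subspace_dim)
        members.append((feats, train_centroids(X, y, feats)))
    def predict(row):
        votes = Counter(nearest_centroid(c, f, row) for f, c in members)
        return votes.most_common(1)[0][0]
    return predict
```

Because every member works in a small subspace, each subtask sidesteps the dimensionality of the original space, and the vote restores accuracy.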

    Fusing diverse monitoring algorithms for robust change detection


    Weight-Selected Attribute Bagging for Credit Scoring

    Assessment of credit risk is of great importance in financial risk management. In this paper, we propose an improved attribute bagging method, weight-selected attribute bagging (WSAB), to evaluate credit risk. Weights of attributes are first computed using attribute evaluation methods such as the linear support vector machine (LSVM) and principal component analysis (PCA). Subsets of attributes are then constructed according to the weights of the attributes: the larger an attribute's weight, the larger the probability with which it is selected into an attribute subset. Next, training samples and test samples are projected onto each attribute subset. A scoring model is then constructed from each set of newly produced training samples. Finally, all scoring models vote on the test instances. An individual model that uses only selected attributes is more accurate because some of the redundant and uninformative attributes are eliminated. In addition, selecting attributes by probability guarantees the diversity of the scoring models. Experimental results on two credit benchmark databases show that the proposed method, WSAB, is outstanding in both prediction accuracy and stability, as compared to analogous methods.
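The weight-proportional attribute sampling at the core of WSAB can be sketched as below. The attribute weights are assumed to come from an external evaluator (the paper uses LSVM or PCA); here they are simply passed in, and a roulette-wheel draw makes higher-weight attributes proportionally more likely to enter each subset.

```python
import random

def sample_attribute_subset(weights, subset_size, rng):
    """Draw distinct attribute indices with probability
    proportional to their weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(subset_size):
        # Roulette-wheel draw over the attributes not yet chosen.
        total = sum(weights[i] for i in remaining)
        r = rng.uniform(0, total)
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if r <= acc:
                chosen.append(i)
                remaining.remove(i)
                break
    return sorted(chosen)
```

Each sampled subset would then train one scoring model, and the models vote on the test instances; sampling by probability, rather than always taking the top-weighted attributes, is what keeps the models diverse.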

    Tackling Distribution Shift - Detection and Mitigation

    One of the biggest challenges of employing supervised deep learning approaches is their inability to perform well beyond standardized datasets in real-world applications. Abrupt changes in the form of outliers, or overall changes in the data distribution after model deployment, result in a performance drop. To address the distributional shifts induced by such changes, we propose two methodologies: the first detects these shifts, and the second adapts the model to overcome the low predictive performance they cause. The former usually refers to anomaly detection, the process of finding patterns in the data that do not resemble the expected behavior. Understanding the behavior of data by capturing their distribution can help us find rare and uncommon samples without the need for annotated data. In this thesis, we exploit the ability of generative adversarial networks (GANs) to capture latent representations in order to design a model that differentiates expected behavior from deviated samples. Furthermore, we integrate self-supervision into generative adversarial networks to improve the predictive performance of our proposed anomaly detection model. In addition to shift detection, we propose an ensemble approach that adapts a model under varied distributional shifts using domain adaptation. In summary, this thesis focuses on detecting shifts under the umbrella of anomaly detection, as well as mitigating the effect of several distributional shifts by adapting deep learning models using Bayesian and information-theoretic approaches.

    Ensemble-based Supervised Learning for Predicting Diabetes Onset

    The research presented in this thesis aims to address the issue of undiagnosed diabetes cases. The current state of knowledge is that one in seventy people in the United Kingdom is living with undiagnosed diabetes, and only one in a hundred people can identify the main signs of diabetes. Some of the tools available for predicting diabetes are too simplistic and/or rely on superficial data for inference. On the positive side, the National Health Service (NHS) is improving data recording in this domain by offering a health check to adults aged 40 - 70. Data from such a programme can be used to mitigate the issue of superficial data, and also to develop a predictive tool that facilitates a change from the current reactive care to proactive care. This thesis presents a tool based on a machine learning ensemble for predicting diabetes onset. Ensembles often perform better than a single classifier, and accuracy and diversity have been highlighted as the two vital requirements for constructing good ensemble classifiers. Experiments in this thesis explore the relationship between the diversity of heterogeneous ensemble classifiers and the accuracy of predictions, through feature subset selection, in order to predict diabetes onset. Data from a national health check programme (similar to the NHS health check) were used. The aim is to predict diabetes onset better than other similar studies in the literature. For the experiments, predictions from five base classifiers (Sequential Minimal Optimisation (SMO), Radial Basis Function (RBF), Naïve Bayes (NB), Repeated Incremental Pruning to Produce Error Reduction (RIPPER) and the C4.5 decision tree), performing the same task, are exploited in all possible combinations to construct 26 ensemble models. The training data feature space was searched to select the best feature subset for each classifier.
    The selected subsets are used to train the classifiers, and their predictions are combined using the k-Nearest Neighbours algorithm as a meta-classifier. Results are analysed using four performance metrics (accuracy, sensitivity, specificity and AUC) to determine (i) whether ensembles always perform better than a single classifier; and (ii) the impact of diversity (from heterogeneous classifiers) and accuracy (through feature subset selection) on ensemble performance. At the base classification level, RBF produced better results than the other four classifiers, with 78% accuracy, 82% sensitivity, 73% specificity and 85% AUC. A comparative study shows that the RBF model is more accurate than 9 ensembles, more sensitive than 13 ensembles, more specific than 9 ensembles, and produced a better AUC than 25 ensembles. This means that ensembles do not always perform better than their constituent classifiers. Of the ensembles that performed better than RBF, the combination of C4.5, RIPPER and NB produced the highest results, with 83% accuracy, 87% sensitivity, 79% specificity, and 86% AUC. Compared to the RBF model, this is a 5.37% improvement in accuracy, which is significant (p = 0.0332). The experiments show how data from medical health examinations can be utilised to address the issue of undiagnosed cases of diabetes. Models constructed with such data would facilitate the much-desired shift from reactive to proactive care for individuals at high risk of diabetes. From the machine learning viewpoint, it was established that ensembles constructed from diverse and accurate base learners have the potential to produce significant improvements in accuracy compared to their individual constituent classifiers. In addition, the ensemble presented in this thesis is at least 1% and at most 23% more accurate than similar research studies in the literature, which validates the superiority of the method implemented.
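The combination step described in the thesis can be sketched roughly as follows. Base-classifier predictions form a meta-feature vector per sample, and a k-nearest-neighbour meta-classifier votes over those vectors; the base classifiers are abstracted here as already-made prediction lists, which is a simplifying assumption.

```python
from collections import Counter

def knn_stack_predict(train_meta, train_labels, test_meta, k=3):
    """train_meta: per-sample vectors of base-classifier predictions;
    a kNN meta-classifier classifies each test meta-vector."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    out = []
    for m in test_meta:
        # Find the k training samples whose base predictions most
        # resemble this one, then majority-vote their true labels.
        nearest = sorted(range(len(train_meta)),
                         key=lambda i: dist(train_meta[i], m))[:k]
        votes = Counter(train_labels[i] for i in nearest)
        out.append(votes.most_common(1)[0][0])
    return out
```

In practice the train/test meta-vectors would come from cross-validated predictions of the five base classifiers, each trained on its own selected feature subset.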

    Decimated Input Ensembles for Improved Generalization

    Using an ensemble of classifiers instead of a single classifier has been demonstrated to improve generalization performance on many difficult problems. However, for this improvement to take place, the classifiers in an ensemble must be made complementary. In this paper, we highlight the need to reduce the correlation among the component classifiers and investigate one method for correlation reduction: input decimation. Input decimation uses the discriminating features of the inputs to decouple classifiers: by presenting different parts of the feature set to each individual classifier, it generates a diverse pool of classifiers. Experimental results confirm that combining via input decimation improves generalization performance.

    Decimated Input Ensembles for Improved Generalization

    Recently, many researchers have demonstrated that using classifier ensembles (e.g., averaging the outputs of multiple classifiers before reaching a classification decision) leads to improved performance on many difficult generalization problems. However, in many domains there are serious impediments to such "turnkey" improvements in classification accuracy. Most notable among these is the deleterious effect of highly correlated classifiers on ensemble performance. One solution to this problem is to generate "new" training sets by sampling the original one. However, with a finite number of patterns, this reduces the number of training patterns each classifier sees, often resulting in considerably worse generalization performance for each individual classifier (particularly in high-dimensional data domains). Generally, this drop in individual classifier accuracy more than offsets any potential gains from combining, unless diversity among the classifiers is actively promoted. In this work, we introduce a method that: (1) reduces the correlation among the classifiers; (2) reduces the dimensionality of the data, thus lessening the impact of the 'curse of dimensionality'; and (3) improves the classification performance of the ensemble.
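A hypothetical sketch of the feature-selection step behind input decimation: for each class, keep the features whose values are most strongly associated with membership in that class, and train that class's base classifier only on them. Plain Pearson correlation against a 0/1 class indicator is an assumed scoring choice, not necessarily the authors' exact criterion.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; 0.0 for constant (zero-variance) inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)

def decimate(X, y, target_class, keep):
    """Indices of the `keep` features most correlated with the class."""
    indicator = [1.0 if label == target_class else 0.0 for label in y]
    scores = []
    for f in range(len(X[0])):
        col = [row[f] for row in X]
        scores.append((abs(pearson(col, indicator)), f))
    scores.sort(reverse=True)  # most discriminating features first
    return sorted(f for _, f in scores[:keep])
```

Because each classifier keeps a different, class-specific slice of the features, the members are decorrelated and each works in a lower-dimensional space, matching points (1) and (2) above.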