626 research outputs found

    ALEC: Active learning with ensemble of classifiers for clinical diagnosis of coronary artery disease

    Get PDF
    Invasive angiography is the reference standard for coronary artery disease (CAD) diagnosis but is expensive and associated with certain risks. Machine learning (ML) using clinical and noninvasive imaging parameters can be used for CAD diagnosis to avoid the side effects and cost of angiography. However, ML methods require labeled samples for efficient training. The labeled data scarcity and high labeling costs can be mitigated by active learning. This is achieved through selective query of challenging samples for labeling. To the best of our knowledge, active learning has not been used for CAD diagnosis yet. An Active Learning with Ensemble of Classifiers (ALEC) method is proposed for CAD diagnosis, consisting of four classifiers. Three of these classifiers determine whether a patient’s three main coronary arteries are stenotic or not. The fourth classifier predicts whether the patient has CAD or not. ALEC is first trained using labeled samples. For each unlabeled sample, if the outputs of the classifiers are consistent, the sample along with its predicted label is added to the pool of labeled samples. Inconsistent samples are manually labeled by medical experts before being added to the pool. The training is performed once more using the samples labeled so far. The interleaved phases of labeling and training are repeated until all samples are labeled. Compared with 19 other active learning algorithms, ALEC combined with a support vector machine classifier attained superior performance with 97.01% accuracy. Our method is justified mathematically as well. We also comprehensively analyze the CAD dataset used in this paper. As part of dataset analysis, features pairwise correlation is computed. The top 15 features contributing to CAD and stenosis of the three main coronary arteries are determined. The relationship between stenosis of the main arteries is presented using conditional probabilities. The effect of considering the number of stenotic arteries on sample discrimination is investigated. The discrimination power over dataset samples is visualized, assuming each of the three main coronary arteries as a sample label and considering the two remaining arteries as sample features

    An analysis on ensemble learning optimized medical image classification with deep convolutional neural networks

    Get PDF
    Novel and high-performance medical image classification pipelines are heavily utilizing ensemble learning strategies. The idea of ensemble learning is to assemble diverse models or multiple predictions and, thus, boost prediction performance. However, it is still an open question to what extent as well as which ensemble learning strategies are beneficial in deep learning based medical image classification pipelines. In this work, we proposed a reproducible medical image classification pipeline for analyzing the performance impact of the following ensemble learning techniques: Augmenting, Stacking, and Bagging. The pipeline consists of state-of-the-art preprocessing and image augmentation methods as well as 9 deep convolution neural network architectures. It was applied on four popular medical imaging datasets with varying complexity. Furthermore, 12 pooling functions for combining multiple predictions were analyzed, ranging from simple statistical functions like unweighted averaging up to more complex learning-based functions like support vector machines. Our results revealed that Stacking achieved the largest performance gain of up to 13% F1-score increase. Augmenting showed consistent improvement capabilities by up to 4% and is also applicable to single model based pipelines. Cross-validation based Bagging demonstrated significant performance gain close to Stacking, which resulted in an F1-score increase up to +11%. Furthermore, we demonstrated that simple statistical pooling functions are equal or often even better than more complex pooling functions. We concluded that the integration of ensemble learning techniques is a powerful method for any medical image classification pipeline to improve robustness and boost performance.Comment: Code: https://github.com/frankkramer-lab/ensmic ; Supplementary Material: https://doi.org/10.5281/zenodo.645791

    Cost sensitive meta-learning

    Get PDF
    Classification is one of the primary tasks of data mining and aims to assign a class label to unseen examples by using a model learned from a training dataset. Most of the accepted classifiers are designed to minimize the error rate but in practice data mining involves costs such as the cost of getting the data, and cost of making an error. Hence the following question arises:Among all the available classification algorithms, and in considering a specific type of data and cost, which is the best algorithm for my problem?It is well known to the machine learning community that there is no single algorithm that performs best for all domains. This observation motivates the need to develop an “algorithm selector” which is the work of automating the process of choosing between different algorithms given a specific domain of application. Thus, this research develops a new meta-learning system for recommending cost-sensitive classification methods. The system is based on the idea of applying machine learning to discover knowledge about the performance of different data mining algorithms. It includes components that repeatedly apply different classification methods on data sets and measuring their performance. The characteristics of the data sets, combined with the algorithm and the performance provide the training examples. A decision tree algorithm is applied on the training examples to induce the knowledge which can then be applied to recommend algorithms for new data sets, and then active learning is used to automate the ability to choose the most informative data set that should enter the learning process.This thesis makes contributions to both the fields of meta-learning, and cost sensitive learning in that it develops a new meta-learning approach for recommending cost-sensitive methods. Although, meta-learning is not new, the task of accelerating the learning process remains an open problem, and the thesis develops a novel active learning strategy based on clustering that gives the learner the ability to choose which data to learn from and accordingly, speed up the meta-learning process.Both the meta-learning system and use of active learning are implemented in the WEKA system and evaluated by applying them on different datasets and comparing the results with existing studies available in the literature. The results show that the meta-learning system developed produces better results than METAL, a well-known meta-learning system and that the use of clustering and active learning has a positive effect on accelerating the meta-learning process, where all tested datasets show a decrement of error rate prediction by 75 %

    Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

    Get PDF
    In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalance data learning is of great importance and challenge in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation.We propose a new ensemble learning framework—Diversified Ensemble Classifiers for Imbal-anced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reversely data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalance data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalance learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods DBEG-ensemble and DECIDL-DBEG are then designed to improve the power of imbalance learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle—active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalance learning, suggesting the DECIDL framework is very robust and flexible.Lastly, we apply the proposed learning methods to a real-world bioinformatics problem—protein methylation prediction. Extensive computational results show that the DECIDL method does perform very well for the imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem

    Predicting cervical cancer biopsy results using demographic and epidemiological parameters: a custom stacked ensemble machine learning approach

    Get PDF
    The human papillomavirus (HPV) is responsible for most cervical cancer cases worldwide. This gynecological carcinoma causes many deaths, even though it can be treated by removing malignant tissues at a preliminary stage. In many developing countries, patients do not undertake medical examinations due to the lack of awareness, hospital resources and high testing costs. Hence, it is vital to design a computer aided diagnostic method which can screen cervical cancer patients. In this research, we predict the probability risk of contracting this deadly disease using a custom stacked ensemble machine learning approach. The technique combines the results of several machine learning algorithms on multiple levels to produce reliable predictions. In the beginning, a deep exploratory analysis is conducted using univariate and multivariate statistics. Later, the one-way ANOVA, mutual information and Pearson’s correlation techniques are utilized for feature selection. Since the data was imbalanced, the Borderline-SMOTE technique was used to balance the data. The final stacked machine learning model obtained an accuracy, precision, recall, F1-score, area under curve (AUC) and average precision of 98%, 97%, 99%, 98%, 100% and 100%, respectively. To make the model explainable and interpretable to clinicians, explainable artificial intelligence algorithms such as Shapley additive values (SHAP), local interpretable model agnostic explanation (LIME), random forest and ELI5 have been effectively utilized. The optimistic results indicate the potential of automated frameworks to assist doctors and medical professionals in diagnosing and screening potential cervical cancer patients
    • …
    corecore