244 research outputs found

    Building well-performing classifier ensembles: model and decision level combination.

    Get PDF
    There is a continuing drive for better, more robust generalisation performance from classification systems, and prediction systems in general. Ensemble methods, or the combining of multiple classifiers, have become an accepted and successful tool for doing this, though the reasons for success are not always entirely understood. In this thesis, we review the multiple classifier literature and consider the properties an ensemble of classifiers - or collection of subsets - should have in order to be combined successfully. We find that the framework of Stochastic Discrimination provides a well-defined account of these properties, which are shown to be strongly encouraged in a number of the most popular/successful methods in the literature via differing algorithmic devices. This uncovers some interesting and basic links between these methods, and aids understanding of their success and operation in terms of a kernel induced on the training data, with form particularly well suited to classification. One property that is desirable in both the SD framework and in a regression context, the ambiguity decomposition of the error, is de-correlation of individuals. This motivates the introduction of the Negative Correlation Learning method, in which neural networks are trained in parallel in a way designed to encourage de-correlation of the individual networks. The training is controlled by a parameter λ governing the extent to which correlations are penalised. Theoretical analysis of the dynamics of training results in an exact expression for the interval in which we can choose λ while ensuring stability of the training, and a value λ∗ for which the training has some interesting optimality properties. These values depend only on the size N of the ensemble. Decision level combination methods often result in a difficult to interpret model, and NCL is no exception. However in some applications, there is a need for understandable decisions and interpretable models. In response to this, we depart from the standard decision level combination paradigm to introduce a number of model level combination methods. As decision trees are one of the most interpretable model structures used in classification, we chose to combine structure from multiple individual trees to build a single combined model. We show that extremely compact, well performing models can be built in this way. In particular, a generalisation of bottom-up pruning to a multiple-tree context produces good results in this regard. Finally, we develop a classification system for a real-world churn prediction problem, illustrating some of the concepts introduced in the thesis, and a number of more practical considerations which are of importance when developing a prediction system for a specific problem

    Building well-performing classifier ensembles : model and decision level combination

    Get PDF
    There is a continuing drive for better, more robust generalisation performance from classification systems, and prediction systems in general. Ensemble methods, or the combining of multiple classifiers, have become an accepted and successful tool for doing this, though the reasons for success are not always entirely understood. In this thesis, we review the multiple classifier literature and consider the properties an ensemble of classifiers - or collection of subsets - should have in order to be combined successfully. We find that the framework of Stochastic Discrimination provides a well-defined account of these properties, which are shown to be strongly encouraged in a number of the most popular/successful methods in the literature via differing algorithmic devices. This uncovers some interesting and basic links between these methods, and aids understanding of their success and operation in terms of a kernel induced on the training data, with form particularly well suited to classification. One property that is desirable in both the SD framework and in a regression context, the ambiguity decomposition of the error, is de-correlation of individuals. This motivates the introduction of the Negative Correlation Learning method, in which neural networks are trained in parallel in a way designed to encourage de-correlation of the individual networks. The training is controlled by a parameter λ governing the extent to which correlations are penalised. Theoretical analysis of the dynamics of training results in an exact expression for the interval in which we can choose λ while ensuring stability of the training, and a value λ∗ for which the training has some interesting optimality properties. These values depend only on the size N of the ensemble. Decision level combination methods often result in a difficult to interpret model, and NCL is no exception. However in some applications, there is a need for understandable decisions and interpretable models. In response to this, we depart from the standard decision level combination paradigm to introduce a number of model level combination methods. As decision trees are one of the most interpretable model structures used in classification, we chose to combine structure from multiple individual trees to build a single combined model. We show that extremely compact, well performing models can be built in this way. In particular, a generalisation of bottom-up pruning to a multiple-tree context produces good results in this regard. Finally, we develop a classification system for a real-world churn prediction problem, illustrating some of the concepts introduced in the thesis, and a number of more practical considerations which are of importance when developing a prediction system for a specific problem.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Application and Extension of Weighted Quantile Sum Regression for the Development of a Clinical Risk Prediction Tool

    Get PDF
    In clinical settings, the diagnosis of medical conditions is often aided by measurement of various serum biomarkers through the use of laboratory tests. These biomarkers provide information about different aspects of a patient’s health and the overall function of different organs. In this dissertation, we develop and validate a weighted composite index that aggregates the information from a variety of health biomarkers covering multiple organ systems. The index can be used for predicting all-cause mortality and could also be used as a holistic measure of overall physiological health status. We refer to it as the Health Status Metric (HSM). Validation analysis shows that the HSM is predictive of long-term mortality risk and exhibits a robust association with concurrent chronic conditions, recent hospital utilization, and self-rated health. We develop the HSM using Weighted Quantile Sum (WQS) regression (Gennings et al., 2013; Carrico, 2013), a novel penalized regression technique that imposes nonnegativity and unit-sum constraints on the coefficients used to weight index components. In this dissertation, we develop a number of extensions to the WQS regression technique and apply them to the construction of the HSM. We introduce a new guided approach for the standardization of index components which accounts for potential nonlinear relationships with the outcome of interest. An extended version of the WQS that accommodates interaction effects among index components is also developed and implemented. In addition, we demonstrate that ensemble learning methods borrowed from the field of machine learning can be used to improve the predictive power of the WQS index. Specifically, we show that the use of techniques such as weighted bagging, the random subspace method and stacked generalization in conjunction with the WQS model can produce an index with substantially enhanced predictive accuracy. Finally, practical applications of the HSM are explored. A comparative study is performed to evaluate the feasibility and effectiveness of a number of ‘real-time’ imputation strategies in potential software applications for computing the HSM. In addition, the efficacy of the HSM as a predictor of hospital readmission is assessed in a cohort of emergency department patients

    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    Get PDF
    The application of ensemble predictive models has been an important research area in predicting medical diagnostics, engineering diagnostics, and other related smart devices and related technologies. Most of the current predictive models are complex and not reliable despite numerous efforts in the past by the research community. The performance accuracy of the predictive models have not always been realised due to many factors such as complexity and class imbalance. Therefore there is a need to improve the predictive accuracy of current ensemble models and to enhance their applications and reliability and non-visual predictive tools. The research work presented in this thesis has adopted a pragmatic phased approach to propose and develop new ensemble models using multiple methods and validated the methods through rigorous testing and implementation in different phases. The first phase comprises of empirical investigations on standalone and ensemble algorithms that were carried out to ascertain their performance effects on complexity and simplicity of the classifiers. The second phase comprises of an improved ensemble model based on the integration of Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN) and AdaBoost algorithms. The third phase comprises of an extended model based on early stop concepts, AdaBoost algorithm, and statistical performance of the training samples to minimize overfitting performance of the proposed model. The fourth phase comprises of an enhanced analytical multivariate logistic regression predictive model developed to minimize the complexity and improve prediction accuracy of logistic regression model. To facilitate the practical application of the proposed models; an ensemble non-invasive analytical tool is proposed and developed. The tool links the gap between theoretical concepts and practical application of theories to predict breast cancer survivability. The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to a better algorithmic performance, (2) boosting by resampling performs slightly better than boosting by reweighting, (3) the prediction accuracy of the proposed ensemble EKF-RBFN-AdaBoost model performed better than several established ensemble models, (4) the proposed early stopped model converges faster and minimizes overfitting better compare with other models, (5) the proposed multivariate logistic regression concept minimizes the complexity models (6) the performance of the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancers and diabetics ailments. The research contributions to ensemble practice are: (1) the integration and development of EKF, RBFN and AdaBoost algorithms as an ensemble model, (2) the development and validation of ensemble model based on early stop concepts, AdaBoost, and statistical concepts of the training samples, (3) the development and validation of predictive logistic regression model based on breast cancer, and (4) the development and validation of a non-invasive breast cancer analytic tools based on the proposed and developed predictive models in this thesis. To validate prediction accuracy of ensemble models, in this thesis the proposed models were applied in modelling breast cancer survivability and diabetics’ diagnostic tasks. In comparison with other established models the simulation results of the models showed improved predictive accuracy. The research outlines the benefits of the proposed models, whilst proposes new directions for future work that could further extend and improve the proposed models discussed in this thesis

    Machine Learning based Protein Sequence to (un)Structure Mapping and Interaction Prediction

    Get PDF
    Proteins are the fundamental macromolecules within a cell that carry out most of the biological functions. The computational study of protein structure and its functions, using machine learning and data analytics, is elemental in advancing the life-science research due to the fast-growing biological data and the extensive complexities involved in their analyses towards discovering meaningful insights. Mapping of protein’s primary sequence is not only limited to its structure, we extend that to its disordered component known as Intrinsically Disordered Proteins or Regions in proteins (IDPs/IDRs), and hence the involved dynamics, which help us explain complex interaction within a cell that is otherwise obscured. The objective of this dissertation is to develop machine learning based effective tools to predict disordered protein, its properties and dynamics, and interaction paradigm by systematically mining and analyzing large-scale biological data. In this dissertation, we propose a robust framework to predict disordered proteins given only sequence information, using an optimized SVM with RBF kernel. Through appropriate reasoning, we highlight the structure-like behavior of IDPs in disease-associated complexes. Further, we develop a fast and effective predictor of Accessible Surface Area (ASA) of protein residues, a useful structural property that defines protein’s exposure to partners, using regularized regression with 3rd-degree polynomial kernel function and genetic algorithm. As a key outcome of this research, we then introduce a novel method to extract position specific energy (PSEE) of protein residues by modeling the pairwise thermodynamic interactions and hydrophobic effect. PSEE is found to be an effective feature in identifying the enthalpy-gain of the folded state of a protein and otherwise the neutral state of the unstructured proteins. Moreover, we study the peptide-protein transient interactions that involve the induced folding of short peptides through disorder-to-order conformational changes to bind to an appropriate partner. A suite of predictors is developed to identify the residue-patterns of Peptide-Recognition Domains from protein sequence that can recognize and bind to the peptide-motifs and phospho-peptides with post-translational-modifications (PTMs) of amino acid, responsible for critical human diseases, using the stacked generalization ensemble technique. The involved biologically relevant case-studies demonstrate possibilities of discovering new knowledge using the developed tools

    Machine Learning Methodologies for Interpretable Compound Activity Predictions

    Get PDF
    Machine learning (ML) models have gained attention for mining the pharmaceutical data that are currently generated at unprecedented rates and potentially accelerate the discovery of new drugs. The advent of deep learning (DL) has also raised expectations in pharmaceutical research. A central task in drug discovery is the initial search of compounds with desired biological activity. ML algorithms are able to find patterns in compound structures that are related to bioactivity, the so-called structure-activity relationships (SARs). ML-based predictions can complement biological testing to prioritize further experiments. Moreover, insights into model decisions are highly desired for further validation and identification of activity-relevant substructures. However, the interpretation of complex ML models remains essentially prohibitive. This thesis focuses on ML-based predictions of compound activity against multiple biological targets. Single-target and multi-target models are generated for relevant tasks including the prediction of profiling matrices from screening data and the discrimination between weak and strong inhibitors for more than a hundred kinases. Moreover, the relative performance of distinct modeling strategies is systematically analyzed under varying training conditions, and practical guidelines are reported. Since explainable model decisions are a clear requirement for the utility of ML bioactivity models in pharmaceutical research, methods for the interpretation and intuitive visualization of activity predictions from any ML or DL model are introduced. Taken together, this dissertation presents contributions that advance in the application and rationalization of ML models for biological activity and SAR predictions

    Machine Learning Methods for Septic Shock Prediction

    Get PDF
    Sepsis is an organ dysfunction life-threatening disease that is caused by a dysregulated body response to infection. Sepsis is difficult to detect at an early stage, and when not detected early, is difficult to treat and results in high mortality rates. Developing improved methods for identifying patients in high risk of suffering septic shock has been the focus of much research in recent years. Building on this body of literature, this dissertation develops an improved method for septic shock prediction. Using the data from the MMIC-III database, an ensemble classifier is trained to identify high-risk patients. A robust prediction model is built by obtaining a risk score from fitting the Cox Hazard model on multiple input features. The score is added to the list of features and the Random Forest ensemble classifier is trained to produce the model. The Cox Enhanced Random Forest (CERF) proposed method is evaluated by comparing its predictive accuracy to those of extant methods
    • 

    corecore