
    Improved Weighted Random Forest for Classification Problems

    Several studies have shown that combining machine learning models in an appropriate way can improve on the individual predictions made by the base models. The key to a well-performing ensemble model is the diversity of its base models. Among the most common techniques for introducing diversity into decision trees are bagging and random forest. Bagging enhances diversity by sampling with replacement to generate many training data sets, while random forest additionally selects a random subset of features. This has made the random forest a winning candidate for many machine learning applications. However, assuming equal weights for all base decision trees does not seem reasonable, as the randomization of sampling and input feature selection may lead to different levels of decision-making ability across base decision trees. Therefore, we propose several algorithms that modify the weighting strategy of the regular random forest to make better predictions. The designed weighting frameworks include an optimal weighted random forest based on accuracy, an optimal weighted random forest based on the area under the curve (AUC), a performance-based weighted random forest, and several stacking-based weighted random forest models. The numerical results show that the proposed models introduce significant improvements over the regular random forest.
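    The weighting idea described above can be sketched as follows: train a standard random forest, score each tree on a held-out validation set, and combine the trees' class probabilities using accuracy-based weights. This is a minimal illustration of one such scheme, not the paper's exact optimisation-based algorithms; the dataset and weighting rule are assumptions.

```python
# Sketch of an accuracy-weighted random forest (illustrative only;
# the paper's exact weighting algorithms may differ).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; any binary classification set would do.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Weight each base tree by its accuracy on the held-out validation set,
# then normalise the weights so they sum to one.
weights = np.array([accuracy_score(y_val, t.predict(X_val)) for t in rf.estimators_])
weights /= weights.sum()

# Weighted soft vote: combine per-tree class probabilities with the weights
# instead of the equal-weight average a regular random forest uses.
proba = sum(w * t.predict_proba(X_val) for w, t in zip(weights, rf.estimators_))
pred = rf.classes_[np.argmax(proba, axis=1)]
print(accuracy_score(y_val, pred))
```

    Replacing the equal-weight average with validation-based weights is the simplest member of the family the abstract describes; the AUC-based and stacking-based variants change only how `weights` is computed.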

    Ensemble deep learning: A review

    Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architectures are showing better performance than shallow or traditional classification models. Deep ensemble learning models combine the advantages of both deep learning and ensemble learning so that the final model has better generalization performance. This paper reviews state-of-the-art deep ensemble models and thus serves as an extensive summary for researchers. The ensemble models are broadly categorised into bagging, boosting, and stacking ensembles; negative-correlation-based deep ensemble models; explicit/implicit ensembles; homogeneous/heterogeneous ensembles; decision fusion strategies; and unsupervised, semi-supervised, reinforcement learning, online/incremental, and multilabel deep ensemble models. Applications of deep ensemble models in different domains are also briefly discussed. Finally, we conclude the paper with some future recommendations and research directions.
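    Stacking, one of the ensemble categories the review covers, can be shown in a toy form: base learners are fitted first, and a meta-learner is trained on their outputs. Shallow scikit-learn models stand in here for the deep base learners the review discusses; this sketch is an assumption for illustration, not an example from the paper.

```python
# Toy stacking ensemble: base learners feed a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("mlp", MLPClassifier(max_iter=500, random_state=0))],
    final_estimator=LogisticRegression(),  # meta-learner combines base outputs
)
score = stack.fit(X_tr, y_tr).score(X_te, y_te)
print(score)
```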

    Telemarketing outcome prediction using an Ensemble-based machine learning technique

    Business organisations often use telemarketing, a form of direct marketing, to reach a wide range of customers within a short time. However, such marketing strategies need to target an appropriate subset of customers to offer products/services, instead of contacting everyone, as people often get annoyed and disengaged when they receive pre-emptive communication. Machine learning techniques can aid in this scenario by selecting customers who are likely to respond positively to a telemarketing campaign. Business organisations can use their CRM-based customer information and embed machine learning techniques in the data analysis process to develop an automated decision-making system that recommends the set of customers to be contacted. A few works in the literature have used machine learning techniques to predict the outcome of telemarketing; however, the majority of them used a single classifier algorithm or only a balanced dataset. To address this issue, this article proposes an ensemble-based machine learning technique to predict the outcome of telemarketing, which works well even with an imbalanced dataset and achieves 90.29% accuracy.
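    The core idea, an ensemble that stays useful under class imbalance, can be sketched as below. The article's actual pipeline and dataset are not reproduced here; the synthetic 9:1 class split and the choice of base models are assumptions made for illustration.

```python
# Minimal sketch of an ensemble for an imbalanced binary outcome
# (e.g. telemarketing yes/no responses, where positives are rare).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# 9:1 class imbalance, mimicking the rarity of positive campaign responses.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Soft-voting ensemble; class_weight='balanced' counters the skew in each base model.
ens = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(class_weight="balanced", random_state=0)),
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ],
    voting="soft",
)
ens.fit(X_tr, y_tr)
bal_acc = balanced_accuracy_score(y_te, ens.predict(X_te))
print(bal_acc)
```

    Balanced accuracy is used here rather than plain accuracy because, on a 9:1 split, a classifier that always predicts the majority class already scores 90%.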

    Hypothesis-based machine learning for deep-water channel systems

    2020 Spring. Includes bibliographical references. Machine learning algorithms are readily being incorporated into petroleum industry workflows for use in well-log correlation, prediction of rock properties, and seismic data interpretation. However, there is a clear disconnect between sedimentology and data analytics in these workflows, because sedimentologic data is largely qualitative and descriptive. Sedimentology defines stratigraphic architecture and heterogeneity, which can greatly impact reservoir quality and connectivity, and thus hydrocarbon recovery. Deep-water channel systems are an example where predicting reservoir architecture is critical to mitigating risk in hydrocarbon exploration. Deep-water reservoirs are characterized by spatial and temporal variations in channel body stacking patterns, which are difficult to predict given the paucity of borehole data and the low-quality seismic data available in these remote locations. These stacking patterns have been shown to be a key variable that controls reservoir connectivity. In this study, the gap between sedimentology and data analytics is bridged using machine learning algorithms to predict stratigraphic architecture and heterogeneity in a deep-water slope channel system. The algorithms classify variables that capture channel stacking patterns (i.e., channel positions: axis, off-axis, and margin) from a database of outcrop statistics sourced from 68 stratigraphic measured sections from outcrops of the Upper Cretaceous Tres Pasos Formation at Laguna Figueroa in the Magallanes Basin, Chile. An initial hypothesis that channel position could be predicted from 1D descriptive sedimentologic data was tested with a series of machine learning algorithms and classification schemes. The results confirmed this hypothesis: complex algorithms (i.e., random forest, XGBoost, and neural networks) achieved accuracies above 80%, while less complex algorithms (i.e., decision trees) achieved lower accuracies between 60% and 70%.
However, certain classes were difficult for the machine learning algorithms to classify, such as the transitional off-axis class. Additionally, an interpretive classification scheme performed better (by around 10%-20% in some cases) than a geometric scheme that was devised to remove interpretation bias. Outcrop observations, however, reveal that the interpretive classification scheme may be an over-simplified approach and that more heterogeneity likely exists in each class, as revealed by the geometric scheme. A refined hypothesis was then developed: a hierarchical machine learning approach could lend deeper insight into the heterogeneity within sedimentologic classes that are difficult for an interpreter to discern by observation alone. This hierarchical analysis revealed distinct sub-classes in the margin channel position that highlight variations in margin depositional style. The conceptual impact of these varying margin styles on fluid flow and connectivity is shown.
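The two-stage idea, classify channel position first, then look for sub-classes within one position, can be sketched as below. The features here are synthetic stand-ins, not the study's actual outcrop statistics, and clustering with k-means is one plausible reading of "hierarchical analysis", not necessarily the thesis's method.

```python
# Hedged sketch: stage 1 classifies channel position; stage 2 clusters
# within the "margin" class to expose candidate sub-classes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stage 1: three channel positions (axis=0, off-axis=1, margin=2) on
# synthetic stand-ins for 1D descriptive sedimentologic variables.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Stage 2: cluster the samples labelled "margin" to look for distinct
# sub-classes, i.e. variations in margin depositional style.
margin = X[y == 2]
subclasses = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(margin)
print(np.bincount(subclasses))  # sizes of the two candidate margin styles
```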

    Machine Learning applied to credit risk assessment: Prediction of loan defaults

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Due to the recent financial crisis and the regulatory concerns of Basel II, credit risk assessment has become a very important topic in the field of financial risk management. Financial institutions need to take great care when dealing with consumer loans in order to avoid losses and opportunity costs. To this end, credit scoring systems have been used to make informed decisions on whether or not to grant credit to clients who apply for it. Many credit scoring models have been proposed, from statistical models to more complex artificial intelligence techniques; however, most previous work has focused on single classifiers. Ensemble learning is a powerful machine learning paradigm which has proven to be of great value in solving a variety of problems. This study compares the performance of the industry standard, logistic regression, to four ensemble methods, i.e. AdaBoost, Gradient Boosting, Random Forest, and Stacking, in identifying potential loan defaults. All the models were built on a real-world dataset with over one million customers from Lending Club, a financial institution based in the United States. The performance of the models was compared using the hold-out method as the evaluation design, with accuracy, AUC, type I error, and type II error as evaluation metrics. Experimental results reveal that the ensemble classifiers were able to outperform logistic regression on three key indicators, i.e. accuracy, type I error, and type II error. AdaBoost performed best among the classifiers considering a trade-off between all the metrics evaluated. The main contribution of this thesis is an experimental addition to the literature on the preferred models for predicting potential loan defaulters.
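    The evaluation design described above, a hold-out split scored on accuracy plus type I and type II errors, can be sketched as follows. The Lending Club data is not reproduced here; the synthetic dataset and the two-model comparison are assumptions standing in for the thesis's full five-model study.

```python
# Illustrative hold-out comparison of logistic regression vs. AdaBoost
# on a synthetic default-prediction task (class 1 = default).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

results = {}
for name, model in [("logit", LogisticRegression(max_iter=1000)),
                    ("adaboost", AdaBoostClassifier(random_state=1))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "type_I": fp / (fp + tn),   # false positive rate: good client refused
        "type_II": fn / (fn + tp),  # false negative rate: defaulter approved
    }
print(results)
```

    In credit scoring the two error types carry different costs: a type II error (granting credit to a defaulter) is usually far more expensive than a type I error, which is why the thesis reports them separately rather than relying on accuracy alone.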