435 research outputs found

    Tree Boosting Data Competitions with XGBoost

    The objective of this Master's thesis is to provide an understanding of how to approach a supervised learning predictive problem and to illustrate it using a statistical/machine learning algorithm, tree boosting. A review of tree methodology is given to trace its evolution, from Classification and Regression Trees through bagging and Random Forests to present-day tree boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is explained with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation, and feature engineering. All of these concepts are illustrated with a real dataset of videogame churn used in a datathon competition.
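    The tuning loop the thesis describes (choosing hyperparameter values by k-fold cross validation) can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and tree depth as the tuned hyperparameter; the sample sizes and candidate values are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a competition dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Compare two candidate values of a capacity-controlling hyperparameter
# (tree depth) by 5-fold cross validation, keeping the better one.
scores = {}
for depth in (1, 3):
    model = GradientBoostingClassifier(
        max_depth=depth, n_estimators=50, random_state=0
    )
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```

The same loop extends to any hyperparameter grid; the held-out folds, not the training fit, decide which setting wins, which is what guards against overfitting the choice.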

    Image Classification with Deep Learning in the Presence of Noisy Labels: A Survey

    Image classification systems recently made a giant leap with the advancement of deep neural networks. However, these systems require an excessive amount of labeled data to be adequately trained. Gathering a correctly annotated dataset is not always feasible due to several factors, such as the expense of the labeling process or the difficulty of correctly classifying data, even for experts. Because of these practical challenges, label noise is a common problem in real-world datasets, and numerous methods for training deep neural networks with label noise have been proposed in the literature. Although deep neural networks are known to be relatively robust to label noise, their tendency to overfit makes them vulnerable to memorizing even random noise. Therefore, it is crucial to account for label noise and to develop algorithms that counter its adverse effects so that deep neural networks can be trained effectively. Although an extensive survey of machine learning techniques under label noise exists, the literature lacks a comprehensive survey of methodologies centered explicitly on deep learning in the presence of noisy labels. This paper presents these algorithms while categorizing them into two subgroups: noise model based and noise model free methods. Algorithms in the first group aim to estimate the noise structure and use this information to avoid the adverse effects of noisy labels. By contrast, methods in the second group try to design inherently noise-robust algorithms using approaches such as robust losses, regularizers, or other learning paradigms.
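    The intuition behind "noise model free" robust losses can be illustrated numerically: cross-entropy is unbounded on confidently wrong predictions, while an MAE-style loss is bounded, limiting the damage any single mislabeled sample can do. A simplified binary-case sketch, not any specific method from the survey:

```python
import numpy as np

def cross_entropy(p_true):
    """Loss when the model assigns probability p_true to the labeled class."""
    return float(-np.log(p_true))

def mae_loss(p_true):
    """MAE between a one-hot label and the predicted distribution (binary
    case): |1 - p| + |0 - (1 - p)| = 2 * (1 - p), bounded above by 2."""
    return 2.0 * (1.0 - p_true)

# As the model grows confidently wrong (p_true -> 0), cross-entropy
# explodes while the MAE-style loss saturates at 2.
for p in (0.9, 0.5, 0.01):
    print(p, round(cross_entropy(p), 3), round(mae_loss(p), 3))
```

A mislabeled sample the model correctly "disbelieves" dominates the gradient under cross-entropy but not under the bounded loss, which is the robustness the survey's second subgroup exploits.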

    Utilizing Wearable Devices To Design Personal Thermal Comfort Model

    Apart from common environmental factors such as relative humidity and radiant and ambient temperatures, studies have confirmed that thermal comfort depends significantly on internal personal parameters such as metabolic rate, age, and health status. This manifests as a difference in comfort levels between people residing under the same roof, so no single comprehensive comfort model can satisfy everyone. Recent advances in state-of-the-art wearable technology have made it possible to acquire biometric information continuously. This work proposes to access and exploit these data to build a personal thermal comfort model. Relying on various supervised machine learning methods, a personal thermal comfort model will be produced and compared to a general model to demonstrate its superior performance.
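    Why a general model can underperform a personal one is easy to show with hypothetical data: if two occupants have different comfort thresholds on the same biometric feature, a pooled model fits neither well. A sketch with synthetic skin-temperature data (the feature, thresholds, and comfort labels here are invented for illustration, not taken from the study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def occupant(threshold, n=200):
    """Synthetic occupant: comfortable whenever skin temperature is
    below a personal threshold (hypothetical feature and labels)."""
    skin_temp = rng.uniform(28.0, 36.0, size=(n, 1))
    comfortable = (skin_temp[:, 0] < threshold).astype(int)
    return skin_temp, comfortable

Xa, ya = occupant(33.0)  # occupant A tolerates warmer skin temperatures
Xb, yb = occupant(30.0)  # occupant B does not

# One pooled "general" model vs. a model fit to occupant A alone.
general = LogisticRegression().fit(np.vstack([Xa, Xb]), np.hstack([ya, yb]))
personal_a = LogisticRegression().fit(Xa, ya)

acc_general = general.score(Xa, ya)
acc_personal = personal_a.score(Xa, ya)
print(round(acc_general, 3), round(acc_personal, 3))
```

The pooled model settles on a compromise boundary between the two thresholds and misclassifies part of occupant A's range, which is the gap a personal model closes.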

    A review of homogenous ensemble methods on the classification of breast cancer data

    In recent decades, emerging data mining technology has been introduced to assist humankind in generating relevant decisions. Data mining is a concept established by computer scientists to enable secure and reliable classification and deduction of data. In the medical field, data mining methods can assist in various diagnoses, including breast cancer. As the field has evolved, ensemble methods have been proposed to achieve better classification performance; these techniques combine multiple classifiers in a single model. This review of homogeneous ensemble methods for breast cancer classification is carried out to assess their overall performance. The results of the reviewed ensemble techniques, such as Random Forest and XGBoost, show that ensemble methods can outperform single-classifier methods. The reviewed ensemble methods have pros and cons and are useful for solving breast cancer classification problems. The methods are discussed thoroughly to examine their overall classification performance.
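    The core claim, an ensemble of trees outperforming a single tree, can be checked on scikit-learn's bundled breast cancer dataset; a minimal sketch, not the exact setup of any reviewed study:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Wisconsin breast cancer data shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy: one tree vs. a homogeneous ensemble
# of 100 trees (bagging + random feature subsets).
single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()
print(round(single, 3), round(forest, 3))
```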

    Profiling Instances in Noise Reduction

    The dependency on the quality of the training data has led to significant work in noise reduction for instance-based learning algorithms. This paper presents an empirical evaluation of current noise reduction techniques, not just from the perspective of their comparative performance, but from the perspective of investigating the types of instances that they focus on for removal. A novel instance profiling technique known as RDCL profiling allows the structure of a training set to be analysed at the instance level, categorising each instance by modelling its local competence properties. This profiling approach offers the opportunity to investigate the types of instances removed by the noise reduction techniques currently in use in instance-based learning. The paper also considers the effect of removing instances with specific profiles from a dataset and shows that a very simple approach of removing instances that are misclassified by the training set and cause other instances in the dataset to be misclassified is an effective noise reduction technique.
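    The simple rule the paper highlights, removing training instances that the training set itself misclassifies, can be approximated with a k-nearest-neighbour filter. A rough sketch on synthetic data with injected label noise (note that each point counts among its own neighbours here, so this is coarser than a true leave-one-out check, and it is not the paper's RDCL profiling):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with ~20% of labels flipped to simulate noise.
X, y = make_classification(n_samples=400, n_features=5, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Flag training instances whose neighbourhood disagrees with their label.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
keep = knn.predict(X_tr) == y_tr

# Accuracy of a 3-NN classifier before and after filtering.
before = KNeighborsClassifier(3).fit(X_tr, y_tr).score(X_te, y_te)
after = KNeighborsClassifier(3).fit(X_tr[keep], y_tr[keep]).score(X_te, y_te)
print(round(before, 3), round(after, 3), int((~keep).sum()))
```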

    Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies

    Background: All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis will actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore the use of a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences.
    Results: The average accuracy, based on leave-one-out (LOO) cross validation, of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprising 103 amyloidogenic (AM) and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented with the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%.
    Conclusions: This exploratory study indicates that both classification methods may be promising in providing straightforward predictions of the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and expanding the training set to include not only more derivatives but also more alignments would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.
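    The leave-one-out protocol used throughout the study is straightforward to reproduce; a minimal sketch with scikit-learn's GaussianNB on a bundled placeholder dataset (the study itself used immunoglobulin sequence features, which are not reproduced here):

```python
from sklearn.datasets import load_iris  # placeholder data, not sequence features
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Leave-one-out: fit on n-1 samples, test on the held-out one, repeat n
# times; the mean of the n single-sample scores is the LOO accuracy.
acc = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut()).mean()
print(round(acc, 3))
```

LOO makes the most of a small sample, which matters here given the limited number of qualifying sequences the authors note.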

    Learning with Noisy labels via Self-supervised Adversarial Noisy Masking

    Collecting large-scale datasets is crucial for training deep models, but annotating the data inevitably yields noisy labels, which pose challenges to deep learning algorithms. Previous efforts tend to mitigate this problem by identifying and removing noisy samples, or by correcting their labels according to statistical properties (e.g., loss values) among training samples. In this paper, we tackle the problem from a new perspective: delving into the deep feature maps, we empirically find that models trained on clean and mislabeled samples manifest distinguishable activation feature distributions. From this observation, a novel robust training approach termed adversarial noisy masking is proposed. The idea is to regularize deep features with a label-quality-guided masking scheme, which adaptively modulates the input data and the label simultaneously, preventing the model from overfitting noisy samples. Further, an auxiliary task is designed to reconstruct the input data; it naturally provides noise-free self-supervised signals that reinforce the generalization ability of deep models. The proposed method is simple and flexible; it is tested on both synthetic and real-world noisy datasets, where significant improvements are achieved over previous state-of-the-art methods.

    Credit Risk Scoring: A Stacking Generalization Approach

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Risk Analysis and Management. Credit risk regulation has been receiving tremendous attention as a result of the effects of the latest global financial crisis. Following the developments made in the Internal Ratings-Based approach under the Basel guidelines, banks are allowed to use internal risk measures as key drivers in assessing whether to grant a loan to an applicant. Credit scoring is a statistical approach used for evaluating potential loan applications in financial and banking institutions. When applying for a loan, an applicant must fill out an application form detailing their characteristics (e.g., income, marital status, and loan purpose), which serve as inputs to a credit scoring model that produces a score used to determine whether the loan should be granted. This enables faster, more consistent credit approvals and a reduction of bad debt. Currently, many machine learning and statistical approaches, such as logistic regression and tree-based algorithms, are used individually for credit scoring models, and newer techniques can outperform classic methods simply by combining models. This dissertation is an empirical study of banking loan default on a publicly available bank loan dataset, using ensemble-based techniques to increase model robustness and predictive power. The proposed ensemble method is based on stacked generalization, extending various preceding studies that used different techniques to further enhance predictive capability. The results show that combining different models provides a great deal of flexibility to credit scoring models.
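    Stacked generalization as described (level-0 learners whose out-of-fold predictions train a level-1 meta-learner) can be sketched with scikit-learn's StackingClassifier; the synthetic data and the particular base learners here are illustrative choices, not the dissertation's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a loan-default dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level-0 learners' out-of-fold predictions (cv=5) become the features
# of the level-1 meta-learner, avoiding leakage from refitting on the
# same rows the base models already saw.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
acc = stack.fit(X_tr, y_tr).score(X_te, y_te)
print(round(acc, 3))
```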

    Boosting bonsai trees for handwritten/printed text discrimination

    Boosting over decision stumps proved its efficiency in natural language processing, essentially with symbolic features, and its good properties (fast, few and non-critical parameters, insensitivity to overfitting) could be of great interest in the numeric world of pixel images. In this article we investigate the use of boosting over small decision trees in image classification, for the discrimination of handwritten and printed text. We then conducted experiments comparing it to the usual SVM-based classification, revealing convincing results with very close performance, but with faster predictions and far less black-box behaviour. These promising results encourage the use of this classifier in more complex recognition tasks such as multiclass problems.

    Evaluation of Machine Learning Algorithm on Drinking Water Quality for Better Sustainability

    Water has become intricately linked to the United Nations' seventeen sustainable development goals. Access to clean drinking water is crucial for health, a fundamental human right, and a component of successful health protection policies. Clean water is a significant health and development issue at the national, regional, and local levels. Investments in water supply and sanitation have been shown to produce a net economic advantage in some areas because they reduce adverse health effects and medical expenses by more than they cost to implement. However, numerous pollutants affect the quality of drinking water. This study evaluates the efficiency of machine learning (ML) techniques in predicting water quality. A machine learning classifier model is built to predict the quality of water using a real dataset. First, significant features are selected; in the case of the dataset used, all measured characteristics are chosen. The data are split into training and testing subsets. A set of existing ML algorithms is applied, and the results are compared in terms of precision, recall, F1 score, and ROC curve. The results show that the support vector machine and k-nearest neighbor perform better according to F1-score and ROC AUC values, while LASSO-LARS and stochastic gradient descent perform better based on recall values.
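    Comparing several classifier families on one split with precision, recall, and F1 can be sketched as follows; the data here are synthetic and the model choices illustrative, not the study's real drinking-water dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a water-quality dataset.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit each model on the same split and score it on the same metrics,
# so the comparison isolates the algorithm, not the data.
models = {
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
    "sgd": SGDClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (
        precision_score(y_te, pred),
        recall_score(y_te, pred),
        f1_score(y_te, pred),
    )
    print(name, [round(v, 3) for v in results[name]])
```

Which model "wins" depends on the metric, mirroring the study's finding that the F1/ROC-AUC ranking and the recall ranking disagree.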