435 research outputs found

    Tree Boosting Data Competitions with XGBoost

    The objective of this Master's thesis is to provide an understanding of how to approach a supervised learning predictive problem and to illustrate it using a statistical/machine learning algorithm, tree boosting. A review of tree methodology is given to trace its evolution, from Classification and Regression Trees through bagging and Random Forests to present-day tree boosting. The methodology is explained following the XGBoost implementation, which has achieved state-of-the-art results in several data competitions. A framework for applied predictive modelling is explained with its core concepts: objective function, regularization term, overfitting, hyperparameter tuning, k-fold cross validation, and feature engineering. All of these concepts are illustrated with a real dataset of videogame churn used in a datathon competition.
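    The tuning loop the thesis describes (choosing hyperparameter values by k-fold cross validation) can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and tree depth as the tuned hyperparameter; the sample sizes and candidate values are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a competition dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Compare two candidate values of a capacity-controlling hyperparameter
# (tree depth) by 5-fold cross validation, keeping the better one.
scores = {}
for depth in (1, 3):
    model = GradientBoostingClassifier(
        max_depth=depth, n_estimators=50, random_state=0
    )
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```

The same loop extends to any hyperparameter grid; the held-out folds, not the training fit, decide which setting wins, which is what guards against overfitting the choice.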

    Image Classification with Deep Learning in the Presence of Noisy Labels: A Survey

    Image classification systems recently made a giant leap with the advancement of deep neural networks. However, these systems require an excessive amount of labeled data to be adequately trained. Gathering a correctly annotated dataset is not always feasible due to several factors, such as the expense of the labeling process or the difficulty of correctly classifying data, even for experts. Because of these practical challenges, label noise is a common problem in real-world datasets, and numerous methods for training deep neural networks with label noise have been proposed in the literature. Although deep neural networks are known to be relatively robust to label noise, their tendency to overfit makes them vulnerable to memorizing even random noise. Therefore, it is crucial to account for label noise and to develop algorithms that counter its adverse effects so that deep neural networks can be trained effectively. Although an extensive survey of machine learning techniques under label noise exists, the literature lacks a comprehensive survey of methodologies centered explicitly on deep learning in the presence of noisy labels. This paper presents these algorithms while categorizing them into two subgroups: noise model based and noise model free methods. Algorithms in the first group aim to estimate the noise structure and use this information to avoid the adverse effects of noisy labels. By contrast, methods in the second group try to design inherently noise-robust algorithms using approaches such as robust losses, regularizers, or other learning paradigms.
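    The intuition behind "noise model free" robust losses can be illustrated numerically: cross-entropy is unbounded on confidently wrong predictions, while an MAE-style loss is bounded, limiting the damage any single mislabeled sample can do. A simplified binary-case sketch, not any specific method from the survey:

```python
import numpy as np

def cross_entropy(p_true):
    """Loss when the model assigns probability p_true to the labeled class."""
    return float(-np.log(p_true))

def mae_loss(p_true):
    """MAE between a one-hot label and the predicted distribution (binary
    case): |1 - p| + |0 - (1 - p)| = 2 * (1 - p), bounded above by 2."""
    return 2.0 * (1.0 - p_true)

# As the model grows confidently wrong (p_true -> 0), cross-entropy
# explodes while the MAE-style loss saturates at 2.
for p in (0.9, 0.5, 0.01):
    print(p, round(cross_entropy(p), 3), round(mae_loss(p), 3))
```

A mislabeled sample the model correctly "disbelieves" dominates the gradient under cross-entropy but not under the bounded loss, which is the robustness the survey's second subgroup exploits.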

    Utilizing Wearable Devices To Design Personal Thermal Comfort Model

    Apart from common environmental factors such as relative humidity and radiant and ambient temperatures, studies have confirmed that thermal comfort depends significantly on internal personal parameters such as metabolic rate, age, and health status. This manifests as a difference in comfort levels between people residing under the same roof, so no single comprehensive comfort model can satisfy everyone. Recent advances in state-of-the-art wearable technology have made it possible to acquire biometric information continuously. This work proposes to access and exploit these data to build a personal thermal comfort model. Relying on various supervised machine learning methods, a personal thermal comfort model will be produced and compared to a general model to demonstrate its superior performance.
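    Why a general model can underperform a personal one is easy to show with hypothetical data: if two occupants have different comfort thresholds on the same biometric feature, a pooled model fits neither well. A sketch with synthetic skin-temperature data (the feature, thresholds, and comfort labels here are invented for illustration, not taken from the study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def occupant(threshold, n=200):
    """Synthetic occupant: comfortable whenever skin temperature is
    below a personal threshold (hypothetical feature and labels)."""
    skin_temp = rng.uniform(28.0, 36.0, size=(n, 1))
    comfortable = (skin_temp[:, 0] < threshold).astype(int)
    return skin_temp, comfortable

Xa, ya = occupant(33.0)  # occupant A tolerates warmer skin temperatures
Xb, yb = occupant(30.0)  # occupant B does not

# One pooled "general" model vs. a model fit to occupant A alone.
general = LogisticRegression().fit(np.vstack([Xa, Xb]), np.hstack([ya, yb]))
personal_a = LogisticRegression().fit(Xa, ya)

acc_general = general.score(Xa, ya)
acc_personal = personal_a.score(Xa, ya)
print(round(acc_general, 3), round(acc_personal, 3))
```

The pooled model settles on a compromise boundary between the two thresholds and misclassifies part of occupant A's range, which is the gap a personal model closes.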

    A review of homogenous ensemble methods on the classification of breast cancer data

    In recent decades, emerging data mining technology has been introduced to assist humankind in generating relevant decisions. Data mining is a concept established by computer scientists to enable secure and reliable classification and deduction of data. In the medical field, data mining methods can assist in various diagnoses, including breast cancer. As the field has evolved, ensemble methods have been proposed to achieve better classification performance; these techniques combine multiple classifiers in a single model. This review of homogeneous ensemble methods for breast cancer classification is carried out to assess their overall performance. The results of the reviewed ensemble techniques, such as Random Forest and XGBoost, show that ensemble methods can outperform single-classifier methods. The reviewed ensemble methods have pros and cons and are useful for solving breast cancer classification problems. The methods are discussed thoroughly to examine their overall classification performance.
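    The core claim, an ensemble of trees outperforming a single tree, can be checked on scikit-learn's bundled breast cancer dataset; a minimal sketch, not the exact setup of any reviewed study:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Wisconsin breast cancer data shipped with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy: one tree vs. a homogeneous ensemble
# of 100 trees (bagging + random feature subsets).
single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()
print(round(single, 3), round(forest, 3))
```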

    Profiling Instances in Noise Reduction

    The dependency on the quality of the training data has led to significant work in noise reduction for instance-based learning algorithms. This paper presents an empirical evaluation of current noise reduction techniques, not just from the perspective of their comparative performance, but from the perspective of investigating the types of instances that they focus on for removal. A novel instance profiling technique known as RDCL profiling allows the structure of a training set to be analysed at the instance level, categorising each instance by modelling its local competence properties. This profiling approach offers the opportunity to investigate the types of instances removed by the noise reduction techniques currently in use in instance-based learning. The paper also considers the effect of removing instances with specific profiles from a dataset and shows that a very simple approach of removing instances that are misclassified by the training set and cause other instances in the dataset to be misclassified is an effective noise reduction technique.
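    The simple rule the paper highlights, removing training instances that the training set itself misclassifies, can be approximated with a k-nearest-neighbour filter. A rough sketch on synthetic data with injected label noise (note that each point counts among its own neighbours here, so this is coarser than a true leave-one-out check, and it is not the paper's RDCL profiling):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with ~20% of labels flipped to simulate noise.
X, y = make_classification(n_samples=400, n_features=5, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Flag training instances whose neighbourhood disagrees with their label.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
keep = knn.predict(X_tr) == y_tr

# Accuracy of a 3-NN classifier before and after filtering.
before = KNeighborsClassifier(3).fit(X_tr, y_tr).score(X_te, y_te)
after = KNeighborsClassifier(3).fit(X_tr[keep], y_tr[keep]).score(X_te, y_te)
print(round(before, 3), round(after, 3), int((~keep).sum()))
```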

    Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies

    Background: All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis will actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore the use of a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences.
    Results: The average accuracy, based on leave-one-out (LOO) cross validation, of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprising 103 amyloidogenic (AM) and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented with the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%.
    Conclusions: This exploratory study indicates that both classification methods may be promising in providing straightforward predictions of the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and expanding the training set to include not only more derivatives but also more alignments would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.
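    The leave-one-out protocol used throughout the study is straightforward to reproduce; a minimal sketch with scikit-learn's GaussianNB on a bundled placeholder dataset (the study itself used immunoglobulin sequence features, which are not reproduced here):

```python
from sklearn.datasets import load_iris  # placeholder data, not sequence features
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Leave-one-out: fit on n-1 samples, test on the held-out one, repeat n
# times; the mean of the n single-sample scores is the LOO accuracy.
acc = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut()).mean()
print(round(acc, 3))
```

LOO makes the most of a small sample, which matters here given the limited number of qualifying sequences the authors note.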

    Learning with Noisy labels via Self-supervised Adversarial Noisy Masking

    Collecting large-scale datasets is crucial for training deep models, but annotating the data inevitably yields noisy labels, which pose challenges to deep learning algorithms. Previous efforts tend to mitigate this problem by identifying and removing noisy samples, or by correcting their labels according to statistical properties (e.g., loss values) among training samples. In this paper, we tackle the problem from a new perspective: delving into the deep feature maps, we empirically find that models trained on clean and mislabeled samples manifest distinguishable activation feature distributions. From this observation, a novel robust training approach termed adversarial noisy masking is proposed. The idea is to regularize deep features with a label-quality-guided masking scheme, which adaptively modulates the input data and the label simultaneously, preventing the model from overfitting noisy samples. Further, an auxiliary task is designed to reconstruct the input data; it naturally provides noise-free self-supervised signals that reinforce the generalization ability of deep models. The proposed method is simple and flexible; it is tested on both synthetic and real-world noisy datasets, where significant improvements are achieved over previous state-of-the-art methods.

    Credit Risk Scoring: A Stacking Generalization Approach

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Risk Analysis and Management. Credit risk regulation has been receiving tremendous attention as a result of the effects of the latest global financial crisis. Following the developments made in the Internal Ratings-Based approach under the Basel guidelines, banks are allowed to use internal risk measures as key drivers in assessing whether to grant a loan to an applicant. Credit scoring is a statistical approach used for evaluating potential loan applications in financial and banking institutions. When applying for a loan, an applicant must fill out an application form detailing their characteristics (e.g., income, marital status, and loan purpose), which serve as inputs to a credit scoring model that produces a score used to determine whether the loan should be granted. This enables faster, more consistent credit approvals and a reduction of bad debt. Currently, many machine learning and statistical approaches, such as logistic regression and tree-based algorithms, are used individually for credit scoring models, and newer techniques can outperform classic methods simply by combining models. This dissertation is an empirical study of banking loan default on a publicly available bank loan dataset, using ensemble-based techniques to increase model robustness and predictive power. The proposed ensemble method is based on stacked generalization, extending various preceding studies that used different techniques to further enhance predictive capability. The results show that combining different models provides a great deal of flexibility to credit scoring models.
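    Stacked generalization as described (level-0 learners whose out-of-fold predictions train a level-1 meta-learner) can be sketched with scikit-learn's StackingClassifier; the synthetic data and the particular base learners here are illustrative choices, not the dissertation's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a loan-default dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level-0 learners' out-of-fold predictions (cv=5) become the features
# of the level-1 meta-learner, avoiding leakage from refitting on the
# same rows the base models already saw.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
acc = stack.fit(X_tr, y_tr).score(X_te, y_te)
print(round(acc, 3))
```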

    Boosting bonsai trees for handwritten/printed text discrimination

    Boosting over decision stumps proved its efficiency in natural language processing, essentially with symbolic features, and its good properties (fast, few and non-critical parameters, insensitivity to overfitting) could be of great interest in the numeric world of pixel images. In this article we investigate the use of boosting over small decision trees in image classification, for the discrimination of handwritten and printed text. We then conducted experiments comparing it to the usual SVM-based classification, revealing convincing results with very close performance, but with faster predictions and far less black-box behaviour. These promising results encourage the use of this classifier in more complex recognition tasks such as multiclass problems.

    Evaluation of Machine Learning Algorithm on Drinking Water Quality for Better Sustainability

    Water has become intricately linked to the United Nations' seventeen sustainable development goals. Access to clean drinking water is crucial for health, a fundamental human right, and a component of successful health protection policies. Clean water is a significant health and development issue at the national, regional, and local levels. Investments in water supply and sanitation have been shown to produce a net economic advantage in some areas because they reduce adverse health effects and medical expenses by more than they cost to implement. However, numerous pollutants affect the quality of drinking water. This study evaluates the efficiency of machine learning (ML) techniques in predicting water quality. A machine learning classifier model is built to predict the quality of water using a real dataset. First, significant features are selected; in the case of the dataset used, all measured characteristics are chosen. The data are split into training and testing subsets. A set of existing ML algorithms is applied, and the results are compared in terms of precision, recall, F1 score, and ROC curve. The results show that the support vector machine and k-nearest neighbor perform better according to F1-score and ROC AUC values, while LASSO-LARS and stochastic gradient descent perform better based on recall values.
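    Comparing several classifier families on one split with precision, recall, and F1 can be sketched as follows; the data here are synthetic and the model choices illustrative, not the study's real drinking-water dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a water-quality dataset.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit each model on the same split and score it on the same metrics,
# so the comparison isolates the algorithm, not the data.
models = {
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
    "sgd": SGDClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (
        precision_score(y_te, pred),
        recall_score(y_te, pred),
        f1_score(y_te, pred),
    )
    print(name, [round(v, 3) for v in results[name]])
```

Which model "wins" depends on the metric, mirroring the study's finding that the F1/ROC-AUC ranking and the recall ranking disagree.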