11 research outputs found

    Small margin ensembles can be robust to class-label noise

    Full text link
    This is the author’s version of a work that was accepted for publication in Neurocomputing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neurocomputing, Vol. 160 (2015), DOI 10.1016/j.neucom.2014.12.086.

    Subsampling is used to generate bagging ensembles that are accurate and robust to class-label noise. The effect of using smaller bootstrap samples to train the base learners is to make the ensemble more diverse. As a result, the classification margins tend to decrease. In spite of having small margins, these ensembles can be robust to class-label noise. The validity of these observations is illustrated in a wide range of synthetic and real-world classification tasks. In the problems investigated, subsampling significantly outperforms standard bagging for different amounts of class-label noise. By contrast, the effectiveness of subsampling in random forest is problem dependent. In these types of ensembles the best overall accuracy is obtained when the random trees are built on bootstrap samples of the same size as the original training data. Nevertheless, subsampling becomes more effective as the amount of class-label noise increases.

    The authors acknowledge financial support from Spanish Plan Nacional I+D+i Grant TIN2013-42351-P and from Comunidad de Madrid Grant S2013/ICE-2845 CASI-CAM-CM.
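
    The subsampling mechanism described in this abstract can be reproduced with off-the-shelf tools. Below is a minimal sketch, assuming scikit-learn's BaggingClassifier; the 20% sampling rate, the noise level, and the synthetic data are arbitrary illustrative choices, not values taken from the paper.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data with a simple decision boundary.
rng = np.random.RandomState(0)
X = rng.randn(2000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Inject 20% class-label noise into the training labels only.
flip = rng.rand(len(y_tr)) < 0.20
y_tr_noisy = np.where(flip, 1 - y_tr, y_tr)

# Standard bagging (bootstrap samples as large as the training set) versus
# subsampled bagging (each tree sees a small random fraction of the data).
standard = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             max_samples=1.0, random_state=0).fit(X_tr, y_tr_noisy)
subsampled = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                               max_samples=0.2, random_state=0).fit(X_tr, y_tr_noisy)

print("standard bagging, clean test accuracy:  ", standard.score(X_te, y_te))
print("subsampled bagging, clean test accuracy:", subsampled.score(X_te, y_te))
```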

    Label noise injection methods for model robustness assessment in fraud detection datasets

    Get PDF
    Internship report presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.

    Label noise is a common issue in real-life applications of machine learning for fraud detection. It can lead to sub-optimal decisions during the model-building phase and, ultimately, to poor model performance. A key factor in the impact of noisy data on model performance is the training algorithm and its robustness to label noise. In this work, we studied the robustness of models generated by two supervised tree-based algorithms, Random Forest and LightGBM, to different types of random and not-at-random artificial label-noise injection techniques, at different noise percentages, and using different datasets for both training and evaluation. We also observed the impact of label noise on the evaluation of model performance. Finally, we analyzed the importance of the different hyperparameters of both algorithms for their performance. We show that both algorithms are robust to random label noise at different noise percentages; however, they fail to separate the classes in the presence of not-at-random noise. We also show that, for random label noise, the correlation between model performance on the noisy validation set and on the test set decreases as the noise percentage increases, whereas for not-at-random noise there is no obvious correlation between the two sets. Finally, we identify which hyperparameters are the most relevant for the performance of Random Forest models in the presence of random label noise, and find that, in most cases, none of the studied LightGBM hyperparameters appears more relevant than the others for model performance.
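
    As an illustration of the simplest scheme discussed above (label noise injected completely at random), the following is a generic sketch; the function name, the 10% rate, and the assumption of binary 0/1 labels are illustrative choices and do not reproduce the specific injection methods evaluated in the report, which also include not-at-random schemes.

```python
import numpy as np

def inject_random_label_noise(y, noise_rate, rng=None):
    """Flip the labels of a random fraction `noise_rate` of a binary 0/1
    label vector, simulating label noise completely at random."""
    rng = np.random.default_rng(rng)
    y_noisy = np.asarray(y).copy()
    n_flip = int(round(noise_rate * len(y_noisy)))
    idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy

# Example: corrupt 10% of a toy label vector.
y = np.random.default_rng(0).integers(0, 2, size=1000)
y_noisy = inject_random_label_noise(y, noise_rate=0.10, rng=42)
print("fraction of labels changed:", (y != y_noisy).mean())
```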

    Estudio de conjuntos de clasificadores generados mediante el algoritmo class-switching

    Get PDF
    The main goal of this work is to analyze, design, and implement a technique that improves the performance of an ensemble of base classifiers. This technique, commonly known by its English name, class-switching, is based on introducing diversity into the class labels of the training samples used by each base classifier. To check whether performance improves, the error rate of the ensemble, when this technique is applied and different methods are used to combine the outputs of the base classifiers, is compared with the baseline error, i.e., the error obtained with a single base classifier. Using this technique in classifier ensembles is not new. In this respect, the changes proposed in this bachelor's thesis (TFG) with respect to other publications that use the class-switching method are the following: the algorithm is studied with two types of base classifiers different from those used previously, namely classifiers based on multilayer perceptrons (MLPs) and classifiers based on the k-NN algorithm; two different techniques for combining the outputs of the base classifiers are used, to study whether one improves on the other; and the datasets considered differ in number of samples and features, as well as in the number of classes and their class proportions.

    Ingeniería en Tecnologías de Telecomunicación
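
    The abstract does not specify which two output-combination techniques are compared, so the sketch below shows the two standard rules usually contrasted in this setting, majority (hard) voting over predicted labels and averaging (soft voting) of predicted class probabilities, only as plausible examples.

```python
import numpy as np

def combine_hard(label_votes):
    """Majority vote: label_votes has shape (n_classifiers, n_samples) and
    contains integer class labels; returns the most voted label per sample."""
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, label_votes)

def combine_soft(probabilities):
    """Probability averaging: probabilities has shape
    (n_classifiers, n_samples, n_classes); returns the class whose mean
    probability across classifiers is highest."""
    return probabilities.mean(axis=0).argmax(axis=1)

# Toy example with 3 classifiers, 4 samples, 2 classes.
votes = np.array([[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]])
print(combine_hard(votes))                                   # -> [0 1 1 0]
probs = np.random.default_rng(0).dirichlet([1, 1], size=(3, 4))
print(combine_soft(probs))
```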

    Ensemble Learning in the Presence of Noise

    Full text link
    Learning in the presence of noise is an important issue in machine learning. The design and implementation of effective strategies for automatic induction from noisy data is particularly important in real-world problems, where noise from defective collecting processes, data contamination or intrinsic fluctuations is ubiquitous. There are two general strategies to address this problem. One is to design a robust learning method. Another one is to identify noisy instances and eliminate or correct them. In this thesis we propose to use ensembles to mitigate the negative impact of mislabelled data in the learning process. In ensemble learning the predictions of individual learners are combined to obtain a final decision. Effective combinations take advantage of the complementarity of these base learners. In this manner the errors incurred by a learner can be compensated by the predictions of other learners in the combination. A first contribution of this work is the use of subsampling to build bootstrap ensembles, such as bagging and random forest, that are resilient to class label noise. By using lower sampling rates, the detrimental effect of mislabelled examples on the final ensemble decisions can be tempered. The reason is that each labelled instance is present in a smaller fraction of the training sets used to build individual learners. Ensembles can also be used as a noise detection procedure to improve the quality of the data used for training. In this strategy, one attempts to identify noisy instances and either correct (by switching their class label) or discard them. A particular example is identified as noise if a specified percentage (greater than 50%) of the learners disagree with the given label for this example. Using an extensive empirical evaluation we demonstrate the use of subsampling as an effective tool to detect and handle noise in classification problems.
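
    The noise-detection strategy summarized above (flag an instance when more than a given fraction of ensemble members disagree with its label) can be sketched generically as follows, assuming scikit-learn's BaggingClassifier; the 0.7 threshold is an arbitrary illustrative value for the "greater than 50%" disagreement level mentioned in the abstract.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def flag_noisy_instances(X, y, threshold=0.7, n_estimators=100, seed=0):
    """Flag training instances whose given label is contradicted by more than
    `threshold` of the members of a bagging ensemble trained on (X, y)."""
    ensemble = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=n_estimators,
                                 random_state=seed).fit(X, y)
    votes = np.array([est.predict(X) for est in ensemble.estimators_])
    disagreement = (votes != np.asarray(y)).mean(axis=0)  # fraction of learners that disagree
    return disagreement > threshold                       # boolean mask of suspected noise

# Suspected noisy instances can then be discarded (filtering) or have their
# labels switched to the ensemble prediction (cleaning) before retraining.
```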

    Considerando o ruído no aprendizado de modelos preditivos robustos para a filtragem colaborativa

    Get PDF
    In recommender systems, natural noise refers to the inconsistencies introduced by users, which degrade the overall performance of the recommender. Data-cleansing proposals have emerged that aim to identify and correct these inconsistencies; however, approaches that account for noise directly in the learning process achieve superior quality. Along this line, procedures have been proposed that modify the cost function so that its minimizer on noisy data coincides with the minimizer of the original cost function on noise-free data. These procedures, however, depend on prior knowledge of the noise distribution, and estimating it requires assumptions about the data that are not satisfied in collaborative filtering. In this work we propose to use such cost functions to build a predictive model that accounts for noise during learning. In addition, we present: (a) a class-noise generation heuristic for collaborative filtering problems; (b) a quantitative analysis of the noise present in the datasets; (c) a robustness analysis of predictive models. To validate the proposal, the three most representative datasets for the problem were selected and comparisons were made with state-of-the-art methods. Our results indicate that the proposal obtains superior prediction quality on all datasets and maintains competitive robustness even when compared with a model that knows the noise generator a priori. Finally, this opens a new direction for methods that incorporate noise into the learning process of predictive models for collaborative filtering.
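
    For reference, one well-known member of the family of noise-corrected cost functions alluded to above is the unbiased loss estimator of Natarajan et al. for binary class-conditional label noise with flip rates ρ+1 and ρ−1; it is shown here only to illustrate the idea that minimizing a modified loss on noisy labels can recover the clean-data solution, and it is not necessarily the exact construction used in this work.

```latex
\tilde{\ell}(t, y) \;=\;
\frac{(1-\rho_{-y})\,\ell(t, y) \;-\; \rho_{y}\,\ell(t, -y)}{1 - \rho_{+1} - \rho_{-1}},
\qquad
\mathbb{E}_{\tilde{y}}\!\left[\tilde{\ell}(t, \tilde{y})\right] = \ell(t, y),
```

    where ℓ is the original loss, t the prediction, y ∈ {−1, +1} the clean label, and ỹ the noisy label obtained by flipping y with probability ρ_y.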

    Ensemble learning in the presence of noise

    Full text link
    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 14-02-2019.

    The availability of large amounts of data from diverse sources greatly broadens the possibilities for an intelligent exploitation of information. Nevertheless, extracting knowledge from raw data is a complex task that requires the development of efficient and robust learning methods. One of the main difficulties in machine learning is the presence of noise in the data. In this thesis, we address the problem of machine learning in the presence of noise. For this purpose, we focus on the use of classifier ensembles. Our goal is to build collections of base learners whose combined outputs improve not only the accuracy but also the robustness of the predictions. A first contribution of this thesis is to exploit the subsampling rate to build accurate and robust bootstrap-based ensembles (such as bagging or random forests). The idea of using subsampling as a regularization mechanism is also exploited for the detection of noisy examples. Specifically, examples that are misclassified by a given fraction of the ensemble members are flagged as noise; the optimal value of this threshold is determined by cross-validation. Noisy instances are then either removed (filtering) or have their class labels corrected (cleaning), and a final ensemble is built on the cleaned training data. Another contribution of this thesis is vote-boosting, a sequential ensemble method specifically designed to be robust to class-label noise. Vote-boosting reduces the excessive sensitivity to this type of noise of boosting-based algorithms, such as AdaBoost. In general, boosting-based algorithms progressively modify the weight distribution over the training data to emphasize misclassified instances. This greedy approach can end up assigning excessively high weights to instances whose class label is incorrect. In vote-boosting, by contrast, the emphasis is based on the level of uncertainty (agreement or disagreement) of the ensemble prediction, regardless of the class label. As in boosting, vote-boosting can be analyzed as gradient descent in functional space. One of the open problems in ensemble learning is how to build combinations of strong classifiers. The main difficulty is achieving diversity among the base classifiers without a significant deterioration of their performance and without an excessive increase in computational cost. In this thesis, we propose to build SVM ensembles with the help of randomization and optimization mechanisms. Thanks to this combination of complementary strategies, it is possible to create SVM ensembles that are much faster to train and potentially more accurate than a single optimized SVM. Finally, we have developed a procedure to build heterogeneous ensembles that interpolate their decisions from homogeneous ensembles composed of different types of classifiers. The optimal composition of the ensemble is determined by cross-validation.
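
    As a rough illustration of the vote-boosting emphasis principle only (the thesis derives its own emphasis function; the exponential weighting and the binary setting below are simplifying assumptions introduced here), instance weights can be made to depend on how divided the current ensemble is about an instance rather than on whether its given label is matched.

```python
import numpy as np

def vote_based_weights(votes, emphasis=2.0):
    """Toy emphasis in the spirit of vote-boosting: weight each training
    instance according to the disagreement among the ensemble members,
    independently of the class label assigned to the instance.

    votes: (n_members, n_samples) array of binary 0/1 predictions."""
    agreement = np.abs(votes.mean(axis=0) - 0.5) * 2.0  # 1 = unanimous, 0 = evenly split
    weights = np.exp(-emphasis * agreement)             # disputed instances get more weight
    return weights / weights.sum()

# Example: three members, four instances; instances where the members split
# 2-1 receive more weight than instances classified unanimously.
votes = np.array([[0, 1, 1, 0], [0, 0, 1, 0], [0, 1, 1, 1]])
print(vote_based_weights(votes))
```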

    Exploiting random projections and sparsity with random forests and gradient boosting methods - Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity

    Full text link
    Within machine learning, the supervised learning field aims at modeling the input-output relationship of a system from past observations of its behavior. Decision trees characterize the input-output relationship through a series of nested "if-then-else" questions, the testing nodes, leading to a set of predictions, the leaf nodes. Several such trees are often combined for state-of-the-art performance: random forest ensembles average the predictions of randomized decision trees trained independently in parallel, while tree boosting ensembles train decision trees sequentially to refine the predictions made by the previous ones. The emergence of new applications requires supervised learning algorithms that scale, in computational power and memory space, with the number of inputs, outputs, and observations without sacrificing accuracy. In this thesis, we identify three main areas where decision tree methods could be improved, for which we provide and evaluate original algorithmic solutions: (i) learning over high-dimensional output spaces, (ii) learning with large sample datasets and stringent memory constraints at prediction time, and (iii) learning over high-dimensional sparse input spaces. A first approach to solve learning tasks with a high-dimensional output space, called binary relevance or single target, is to train one decision tree ensemble per output. However, it completely neglects the potential correlations existing between the outputs. An alternative approach, called multi-output decision trees, fits a single decision tree ensemble targeting all the outputs simultaneously, assuming that all outputs are correlated. Nevertheless, both approaches have (i) exactly the same computational complexity and (ii) target extreme output correlation structures. In our first contribution, we show how to combine random projection of the output space, a dimensionality reduction method, with the random forest algorithm, decreasing the learning time complexity. The accuracy is preserved, and may even be improved by reaching a different bias-variance tradeoff. In our second contribution, we first formally adapt the gradient boosting ensemble method to multi-output supervised learning tasks such as multi-output regression and multi-label classification. We then propose to combine single random projections of the output space with gradient boosting on such tasks to adapt automatically to the output correlation structure. The random forest algorithm often generates large ensembles of complex models thanks to the availability of a large number of observations. However, the space complexity of such models, proportional to their total number of nodes, is often prohibitive, and therefore these models are not well suited to stringent memory constraints at prediction time. In our third contribution, we propose to compress these ensembles by solving an L1-based regularization problem over the set of indicator functions defined by all their nodes. Some supervised learning tasks have a high-dimensional but sparse input space, where each observation has only a few input variables with non-zero values. Standard decision tree implementations are not well adapted to treat sparse input spaces, unlike other supervised learning techniques such as support vector machines or linear models. In our fourth contribution, we show how to exploit the input space sparsity algorithmically within decision tree methods.
Our implementation yields a significant speed-up on both synthetic and real datasets, while leading to exactly the same model. It also reduces the memory required to grow such models by exploiting sparse instead of dense memory storage for the input matrix.
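
    A minimal sketch of the first contribution's general idea (randomly projecting a high-dimensional output space before fitting a forest, then decoding predictions with the pseudo-inverse of the projection), assuming scikit-learn and an arbitrary toy dataset; the actual algorithms and decoding schemes studied in the thesis may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.random_projection import GaussianRandomProjection

# Toy multi-output regression problem with a high-dimensional output space.
rng = np.random.RandomState(0)
X = rng.rand(300, 10)
Y = rng.rand(300, 500)          # 500 output variables

# 1. Randomly project the output space to a much lower dimension.
proj = GaussianRandomProjection(n_components=20, random_state=0)
Y_proj = proj.fit_transform(Y)  # shape (300, 20)

# 2. Fit a single multi-output forest on the projected outputs,
#    which is cheaper than fitting it on all 500 original outputs.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y_proj)

# 3. Decode predictions back to the original output space using the
#    pseudo-inverse of the projection matrix.
P = proj.components_                              # shape (20, 500)
Y_pred = forest.predict(X) @ np.linalg.pinv(P).T  # shape (300, 500)
```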

    Class-switching Neural Network Ensembles

    No full text
    This article investigates the properties of class-switching ensembles composed of neural networks and compares them to class-switching ensembles of decision trees and to standard ensemble learning methods, such as bagging and boosting. In a class-switching ensemble, each learner is constructed using a modified version of the training data. This modification consists in switching the class labels of a fraction of training examples that are selected at random from the original training set. Experiments on 20 benchmark classification problems, including real-world and synthetic data, show that class-switching ensembles composed of neural networks can obtain significant improvements in generalization accuracy over single neural networks and over bagging and boosting ensembles. Furthermore, it is possible to build medium-sized ensembles (≈200 networks) whose classification performance is comparable to larger class-switching ensembles (≈1000 learners) of unpruned decision trees.
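
    The label-switching step that defines this ensemble construction can be sketched generically as follows, assuming integer-coded labels and scikit-learn's MLPClassifier as the base network; the switching rate of 0.3, the network size, and the ensemble size are arbitrary illustrative values, not settings reported in the article.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def switch_labels(y, rate, rng):
    """Return a copy of y in which a random fraction `rate` of the examples
    has its class label switched to a different, randomly chosen class."""
    classes = np.unique(y)
    y_sw = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    for i in idx:
        y_sw[i] = rng.choice(classes[classes != y[i]])
    return y_sw

def class_switching_ensemble(X, y, n_networks=25, rate=0.3, seed=0):
    """Train each network on the same inputs but with independently
    switched labels; predictions are later combined by majority vote."""
    rng = np.random.RandomState(seed)
    return [MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                          random_state=seed + k).fit(X, switch_labels(y, rate, rng))
            for k in range(n_networks)]
```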