
    Differential Evolution Algorithm in the Construction of Interpretable Classification Models

    In this chapter, the application of a differential evolution-based approach to induce oblique decision trees (DTs) is described. This type of decision tree uses a linear combination of attributes to build oblique hyperplanes that divide the instance space. Oblique decision trees are more compact and accurate than traditional univariate decision trees. Since differential evolution (DE) is an efficient evolutionary algorithm (EA) designed to solve optimization problems with real-valued parameters, and since finding an optimal hyperplane is a computationally hard task, this metaheuristic (MH) is chosen to conduct an intelligent search for a near-optimal solution. Two methods are described in this chapter: one implements a recursive partitioning strategy to find the most suitable oblique hyperplane for each internal node of a decision tree, and the other conducts a global search for a near-optimal oblique decision tree. A statistical analysis of the experimental results suggests that these methods perform better as decision tree induction procedures than other supervised learning approaches.
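
    A minimal sketch of the node-level search, assuming SciPy's stock differential_evolution (not the authors' implementation) and an illustrative toy data set: DE searches for hyperplane weights w and a bias b that minimize the weighted Gini impurity of the split w.x + b <= 0.

    # Minimal sketch (not the authors' implementation): use SciPy's stock
    # differential evolution to search for an oblique split w.x + b <= 0
    # that minimizes the weighted Gini impurity at one internal node.
    import numpy as np
    from scipy.optimize import differential_evolution

    def gini(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_impurity(params, X, y):
        w, b = params[:-1], params[-1]
        left = X @ w + b <= 0.0
        if left.all() or not left.any():          # degenerate split
            return 1.0
        n = len(y)
        return (left.sum() / n) * gini(y[left]) + ((~left).sum() / n) * gini(y[~left])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy oblique concept

    bounds = [(-1.0, 1.0)] * (X.shape[1] + 1)     # weights and bias
    result = differential_evolution(split_impurity, bounds, args=(X, y), seed=0)
    print("hyperplane:", result.x, "impurity:", result.fun)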

    A survey of cost-sensitive decision tree induction algorithms

    The past decade has seen significant interest in the problem of inducing decision trees that take account of the costs of misclassification and the costs of acquiring the features used for decision making. This survey identifies over 50 algorithms, including approaches that are direct adaptations of accuracy-based methods, that use genetic algorithms, that use anytime methods, and that utilize boosting and bagging. The survey brings together these different studies and novel approaches to cost-sensitive decision tree learning, provides a useful taxonomy and a historical timeline of how the field has developed, and should serve as a useful reference point for future research in this field.
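
    A hedged sketch of the idea common to many of the surveyed algorithms: nodes are labelled (and candidate splits can be scored) by expected misclassification cost under a user-supplied cost matrix rather than by error rate. The cost matrix and data below are illustrative, not taken from the survey.

    # Hedged sketch of cost-sensitive labelling: pick the class that minimizes
    # expected misclassification cost under a cost matrix, not the majority class.
    import numpy as np

    # COST[i][j] = cost of predicting class j when the true class is i
    COST = np.array([[0.0, 1.0],
                     [5.0, 0.0]])   # illustrative: false negatives cost 5x

    def expected_cost(y, predicted_class):
        """Mean cost of assigning predicted_class to every example in y."""
        return np.mean([COST[true, predicted_class] for true in y])

    def cheapest_label(y):
        """Cost-sensitive leaf labelling: class minimizing expected cost."""
        costs = [expected_cost(y, c) for c in range(COST.shape[1])]
        return int(np.argmin(costs)), min(costs)

    y_node = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
    print(cheapest_label(y_node))   # prefers class 1 despite majority class 0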

    Experimental and Data-driven Workflows for Microstructure-based Damage Prediction

    Material fatigue is the most frequent cause of mechanical failure. The degradation mechanisms that determine component lifetime under comparatively pronounced cyclic loads are well understood. Under loads in the macroscopically elastic regime, however, i.e., in (very) high cycle fatigue, the internal structure of a material and the interaction of crystallographic defects determine the lifetime. Under these circumstances, the internal degradation phenomena at the microscopic scale are largely reversible and do not lead to the formation of critical damage that can grow continuously. However, depending on the local microstructural conditions, some grain ensembles in polycrystalline metals are prone to damage initiation, crack formation and crack growth, and therefore act as weak points. As a consequence, components subjected to such loads often exhibit a pronounced scatter in lifetime. Since a comprehensive mechanistic understanding of these degradation processes in different materials is lacking, current modelling efforts typically predict the mean lifetime and its variance with only unsatisfactory accuracy. This, in turn, complicates component design and makes the use of safety factors during the dimensioning process necessary. A remedy is to collect extensive data on the influencing factors and their effect on the formation of initial fatigue damage. Data scarcity still hampers data scientists and modelling experts who, despite small sample sizes and incomplete feature spaces, attempt to derive microstructural dependencies, train data-driven predictive models, or parameterize physical, rule-based models. The fact that only few critical damage sites occur relative to the entire specimen volume, and that high cycle fatigue exhibits a multitude of different dependencies, implies several requirements for data acquisition and processing. Most importantly, the measurement techniques must be sensitive enough to capture nuanced variations in the specimen state, the entire routine must be efficient, and correlative microscopy must link spatial information from the different measurements. The main objective of this work is to establish a workflow that remedies the data shortage, so that the future virtual design of components can be made more efficient, reliable, and sustainable. To this end, this work proposes a combined experimental and data-processing workflow for generating multimodal data sets on fatigue damage. The focus lies on the occurrence of local slip bands, crack initiation, and the growth of microstructurally short cracks. The workflow combines fatigue testing of mesoscale specimens to increase the sensitivity of damage detection, complementary characterization, multimodal registration and data fusion of the heterogeneous data, and image-processing-based damage localization and assessment. Mesoscale bending resonance testing makes it possible to reach the high cycle fatigue regime in comparatively short time spans while offering improved resolution of the damage evolution.
    Depending on the complexity of the individual image processing tasks and on data availability, either rule-based image processing methods or representation learning are deployed in a targeted manner. For example, semantic segmentation of damage sites ensures that important fatigue features can be extracted from microscopy images. A high degree of automation is emphasized throughout the workflow, and wherever possible, the generalizability of individual workflow elements was examined. The workflow is applied to a ferritic steel (EN 1.4003). The resulting data set links, among other things, large distortion-corrected microstructure data with damage localization and its cyclic evolution. In the course of this work, the data set is examined with respect to its information content by means of detailed analytical studies of individual damage formation. In this way, novel quantitative insights into microstructure-induced plastic deformation and crack arrest mechanisms were obtained, among others. Furthermore, grain-wise feature vectors and binary damage categories derived from the data set are used to train a random forest classifier and to evaluate its predictive quality. The proposed workflow has the potential to lay the foundation for future data mining and data-driven modelling of microstructure-sensitive fatigue. It permits the efficient acquisition of statistically representative data sets with simultaneously high information content and can be extended to a wide range of materials.
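
    A minimal sketch of the final modelling step mentioned above, assuming scikit-learn; the feature names and data are invented placeholders, not the thesis data set: a random forest is trained on grain-wise feature vectors with binary damage labels.

    # Sketch of the final modelling step: a random forest trained on grain-wise
    # feature vectors with binary damage labels. Features and data are
    # illustrative placeholders, not the thesis data set.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n_grains = 500
    X = np.column_stack([
        rng.uniform(5, 80, n_grains),    # e.g. grain size (um)
        rng.uniform(0, 1, n_grains),     # e.g. Schmid factor
        rng.uniform(0, 60, n_grains),    # e.g. misorientation to neighbours (deg)
    ])
    y = rng.integers(0, 2, n_grains)     # binary damage label per grain

    # class_weight="balanced" reflects that damaged grains are typically rare
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                                 random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    print("balanced accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))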

    DATA MINING AND IMAGE CLASSIFICATION USING GENETIC PROGRAMMING

    Genetic programming (GP), a capable machine learning and search method motivated by Darwinian evolution, is an evolutionary learning algorithm which automatically evolves computer programs in the form of trees to solve problems. This thesis studies the application of GP to data mining and image processing. Knowledge discovery and data mining have been widely used in business, healthcare, and scientific fields. In data mining, classification is supervised learning that identifies new patterns and maps the data to predefined targets. A GP-based classifier is developed in order to perform these mappings. GP has been investigated in a series of studies for classifying data; however, certain aspects have not formerly been studied. We propose an optimized GP classifier based on a combination of pruning subtrees and a new fitness function. An orthogonal least squares algorithm is also applied in the training phase to create a robust GP classifier. The proposed GP classifier is validated by 10-fold cross-validation. Three areas were studied in this thesis. The first investigation resulted in an optimized genetic-programming-based classifier that directly solves multi-class classification problems. Instead of defining static thresholds as boundaries to differentiate between multiple labels, our work presents a method of classification where a GP system learns the relationships among experiential data and models them mathematically during the evolutionary process. Our approach has been assessed on six multiclass datasets. The second investigation was to develop a GP classifier to segment and detect brain tumors in magnetic resonance imaging (MRI) images. The findings indicated the high accuracy of brain tumor classification provided by our GP classifier and confirm the strong ability of the developed technique for complicated image classification problems. The third was to develop a hybrid system for multiclass imbalanced data classification using GP and SMOTE, which was tested on satellite images. The findings showed that the proposed approach improves both training and test results when the SMOTE technique is incorporated. We also compared our approach in terms of speed with previous GP algorithms. The analysis illustrates that the developed classifier provides a productive and rapid method for classification tasks that outperforms previous methods on more challenging multiclass classification problems. We tested the approaches presented in this thesis on publicly available datasets and images, and the findings were statistically tested to establish the robustness of the developed approaches.
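
    A minimal sketch of the evaluation kernel of such a tree-based GP classifier, with invented toy data; the evolutionary loop and the proposed pruning and orthogonal least squares steps are omitted for brevity.

    # Minimal sketch: programs are arithmetic expression trees; an instance is
    # labelled by the sign of the program output. Selection, crossover and
    # mutation (the actual evolution) are omitted.
    import numpy as np

    def evaluate(tree, x):
        """Recursively evaluate an expression tree on feature vector x."""
        if isinstance(tree, tuple):
            op, left, right = tree
            a, b = evaluate(left, x), evaluate(right, x)
            if op == "+": return a + b
            if op == "-": return a - b
            if op == "*": return a * b
            if op == "/": return a / b if abs(b) > 1e-9 else 1.0  # protected division
        if isinstance(tree, str):              # terminal: feature reference "x0", "x1", ...
            return x[int(tree[1:])]
        return float(tree)                     # terminal: constant

    def fitness(tree, X, y):
        """Classification accuracy when sign(program output) encodes the class."""
        preds = [1 if evaluate(tree, row) > 0 else 0 for row in X]
        return np.mean(np.array(preds) == y)

    # A hand-written candidate program: class 1 iff x0 * x1 - 0.5 > 0
    program = ("-", ("*", "x0", "x1"), 0.5)
    rng = np.random.default_rng(2)
    X = rng.uniform(0, 1, size=(100, 2))
    y = (X[:, 0] * X[:, 1] > 0.5).astype(int)
    print("fitness:", fitness(program, X, y))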

    Local Rule-Based Explanations of Black Box Decision Systems

    The recent years have witnessed the rise of accurate but obscure decision systems which hide the logic of their internal decision processes from users. The lack of explanations for the decisions of black box systems is a key ethical issue, and a limitation to the adoption of machine learning components in socially sensitive and safety-critical contexts. In this paper we focus on the problem of black box outcome explanation, i.e., explaining the reasons for the decision taken on a specific instance. We propose LORE, an agnostic method able to provide interpretable and faithful explanations. LORE first learns a local interpretable predictor on a synthetic neighborhood generated by a genetic algorithm. It then derives from the logic of the local interpretable predictor a meaningful explanation consisting of: a decision rule, which explains the reasons for the decision; and a set of counterfactual rules, suggesting the changes in the instance's features that would lead to a different outcome. Extensive experiments show that LORE outperforms existing methods and baselines both in the quality of explanations and in the accuracy of mimicking the black box.
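
    A hedged sketch of the LORE recipe using scikit-learn; note that the paper generates the neighborhood with a genetic algorithm, whereas this sketch substitutes plain Gaussian perturbation, and all data and names are illustrative.

    # Hedged sketch: explain one prediction of a black box by fitting an
    # interpretable surrogate on a synthetic neighborhood of the instance and
    # reading a rule off it. Gaussian perturbation stands in for the paper's
    # genetic neighborhood generation.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)
    black_box = RandomForestClassifier(random_state=0).fit(X, y)  # stand-in black box

    instance = X[0]
    neighborhood = instance + rng.normal(scale=0.5, size=(1000, 3))
    labels = black_box.predict(neighborhood)                      # black-box outcomes

    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(neighborhood, labels)

    # The root-to-leaf path followed by the instance is the decision rule;
    # sibling branches suggest counterfactual changes.
    print(export_text(surrogate, feature_names=["f0", "f1", "f2"]))
    print("explained outcome:", surrogate.predict(instance.reshape(1, -1))[0])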

    Enhancing Classification and Regression Tree-Based Models by means of Mathematical Optimization

    This PhD dissertation bridges the disciplines of Operations Research and Machine Learning by developing novel Mathematical Optimization formulations and numerical solution approaches to build classification and regression tree-based models. Contrary to classic classification and regression trees, built in a greedy heuristic manner, formulating the design of the tree model as an optimization problem allows us to easily include, either as hard or soft constraints, desirable global structural properties. In this PhD dissertation, we illustrate this flexibility to model: sparsity, as a proxy for interpretability, by controlling the number of non-zero coefficients, the number of predictor variables and, in the case of functional ones, the proportion of the domain used for prediction; an important social criterion, the fairness of the model, which aims to avoid predictions that discriminate against race or other sensitive features; and cost-sensitivity for groups at risk, by ensuring an acceptable accuracy performance for them. Moreover, we provide in a natural way the impact that continuous predictor variables have on each individual prediction, thus enhancing the local explainability of tree models. All the approaches proposed in this thesis are formulated as Continuous Optimization problems that are scalable with respect to the size of the training sample; they are studied theoretically, tested on real data sets, and competitive in terms of prediction accuracy against benchmarks. This, together with the good properties summarized above, is illustrated throughout the chapters of this thesis. This PhD dissertation is organized as follows. The state of the art in the field of (optimal) decision trees is fully discussed in Chapter 1, while the next four chapters present our methodology. Chapter 2 introduces in detail the general framework that threads the chapters of this thesis: a randomized tree with oblique cuts. In particular, we present our proposal for classification problems, which naturally provides probabilistic output on class membership tailored to each individual, in contrast to the most popular existing approaches, where all individuals in the same leaf node are assigned the same probability. Preferences on classification rates in critical classes are successfully handled through cost-sensitive constraints. Chapter 3 extends the methodology for classification in Chapter 2 to additionally handle sparsity, modeled by means of regularizations with polyhedral norms added to the objective function. The sparsest tree case is studied theoretically, and we show that some classification accuracy can easily be traded for a gain in sparsity. In Chapter 4, the findings of Chapters 2 and 3 are adapted to construct sparse trees for regression. Theoretical properties of the solutions are explored. The scalability of our approach with respect to the size of the training sample, as well as local explanations on the continuous predictor variables, are illustrated. Moreover, we show how this methodology can avoid the discrimination of sensitive groups through fairness constraints. Chapter 5 extends the methodology for regression in Chapter 4 to consider functional predictor variables instead. Simultaneously, the detection of a reduced number of intervals that are critical for prediction is performed.
    The sparsity in the proportion of the domain of the functional predictor variables to be used is also modeled through a regularization term added to the objective function. The obtained trade-off between accuracy and sparsity is illustrated. Finally, Chapter 6 closes the thesis with general conclusions and future lines of research.
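
    A toy sketch of the flavor of these formulations, not the thesis's actual models: a single soft oblique split is fitted by continuous optimization of a cross-entropy loss plus an l1 (polyhedral) norm that encourages sparse coefficients.

    # Toy sketch: fit one soft oblique split by continuous optimization, with an
    # l1 regularizer driving coefficients to zero for sparsity. Illustrative
    # only; the thesis's randomized-tree formulations are richer.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] - 2 * X[:, 1] > 0).astype(float)   # only 2 of 5 features matter

    def objective(params, lam=0.05):
        w, b = params[:-1], params[-1]
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # soft routing probability
        p = np.clip(p, 1e-9, 1 - 1e-9)
        nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        return nll + lam * np.sum(np.abs(w))        # cross-entropy + l1 sparsity

    # Powell is derivative-free, so the non-smooth l1 term is unproblematic
    res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="Powell")
    print("coefficients:", np.round(res.x[:-1], 2))  # most should be near zero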

    Tree-based ensembles unveil the microhabitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.): Introducing XGBoost to eco-informatics

    Random Forests (RFs) and Gradient Boosting Machines (GBMs) are popular approaches for habitat suitability modelling in environmental flow assessment. However, both present some limitations theoretically solved by alternative tree-based ensemble techniques (e.g. conditional RFs or oblique RFs). Among them, eXtreme Gradient Boosting machines (XGBoost) has proven to be another promising technique that mixes subroutines developed for RFs and GBMs. To inspect the capabilities of these alternative techniques, RFs and GBMs were compared with conditional RFs, oblique RFs and XGBoost by modelling, at the micro-scale, the habitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.). XGBoost outperformed the other approaches, particularly conditional and oblique RFs, although there were no statistical differences with standard RFs and GBMs. The partial dependence plots highlighted the lacustrine origins of pumpkinseed and the preference for lentic habitats of bleak. However, the latter showed a larger tolerance for rapid microhabitats found in run-type river segments, which is likely to hinder the management of flow regimes to control its invasion. The difference in computational burden and, especially, the characteristics of data sets on microhabitat use (low data prevalence and high overlap between categories) led us to conclude that, in the short term, XGBoost is not destined to replace properly optimised RFs and GBMs in the process of habitat suitability modelling at the micro-scale.

    This project had the support of Fundacion Biodiversidad, of the Spanish Ministry for Ecological Transition. We want to thank the volunteering students of the Universitat Politecnica de Valencia: Marina de Miguel, Carlos A. Puig-Mengual, Cristina Barea, Rares Hugianu, and Pau Rodriguez. R. Munoz-Mas benefitted from a postdoctoral Juan de la Cierva fellowship from the Spanish Ministry of Science, Innovation and Universities (ref. FJCI-2016-30829). This research was supported by the Government of Catalonia (ref. 2017 SGR 548).

    Muñoz-Mas, R.; Gil-Martínez, E.; Oliva-Paterna, FJ.; Belda, E.; Martinez-Capel, F. (2019). Tree-based ensembles unveil the microhabitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.): Introducing XGBoost to eco-informatics. Ecological Informatics, 53:1-12. https://doi.org/10.1016/j.ecoinf.2019.100974
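
    A minimal sketch of the pipeline shape compared in the study, assuming the xgboost Python package and scikit-learn's partial dependence utilities; the variable names and data are invented placeholders rather than the survey data.

    # Sketch: an XGBoost classifier on microhabitat variables, inspected with
    # partial dependence. Variables and data are illustrative placeholders.
    import numpy as np
    from xgboost import XGBClassifier
    from sklearn.inspection import partial_dependence

    rng = np.random.default_rng(5)
    n = 400
    X = np.column_stack([
        rng.uniform(0.0, 1.5, n),    # e.g. water velocity (m/s)
        rng.uniform(0.1, 2.0, n),    # e.g. depth (m)
        rng.integers(0, 5, n),       # e.g. substrate class
    ])
    y = (X[:, 0] < 0.4).astype(int)  # toy "prefers lentic microhabitats" signal

    model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1,
                          eval_metric="logloss")
    model.fit(X, y)

    pd = partial_dependence(model, X, features=[0])  # suitability vs. velocity
    print(pd["average"][0][:5])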

    Data classification using genetic programming.

    Genetic programming (GP), a field of artificial intelligence, is an evolutionary algorithm which evolves a population of trees representing programs; these programs are used to solve problems. This dissertation investigates the use of genetic programming for data classification. In machine learning, data classification is the process of allocating a class label to an instance of data, and a classifier is created in order to perform these allocations. Several studies have investigated the use of GP to solve data classification problems and have shown that GP is able to create classifiers with high classification accuracies. However, certain aspects have not previously been investigated. Five areas were investigated in this dissertation. The first was an investigation into how discretisation could be incorporated into a GP algorithm; an adaptive discretisation algorithm was proposed, which outperformed certain existing methods. The second was a comparison of GP representations for binary data classification. The findings indicated that, of the representations examined (arithmetic trees, decision trees, and logical trees), the decision trees performed best. The third was to investigate the use of the encapsulation genetic operator and its effect on data classification. The findings revealed that an improvement in both training and test results was achieved when encapsulation was incorporated. The fourth was an investigative analysis of several hybridisations of a GP algorithm with a genetic algorithm in order to evolve a population of ensembles. Four methods were proposed, and these methods outperformed certain existing GP and ensemble methods. Finally, the fifth area was to investigate an ensemble construction method for classification in which GP evolves a single ensemble. The proposed method resulted in an improvement in training and test accuracy when compared to the standard GP algorithm. The methods proposed in this dissertation were tested on publicly available data sets, and the results were statistically tested in order to determine the effectiveness of the proposed approaches.
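
    A hedged sketch of only the final combination step used by such evolved ensembles, majority voting, with trivial placeholder functions standing in for evolved GP programs.

    # Sketch of the ensemble combination step: majority vote over the class
    # predictions of individual programs. The members here are trivial
    # placeholders, not evolved GP trees.
    import numpy as np

    def member_a(x): return int(x[0] > 0.5)         # placeholder evolved program
    def member_b(x): return int(x[1] > 0.3)         # placeholder evolved program
    def member_c(x): return int(x[0] + x[1] > 0.8)  # placeholder evolved program

    def ensemble_predict(members, x):
        """Majority vote over the individual programs' class predictions."""
        votes = [m(x) for m in members]
        return int(np.bincount(votes, minlength=2).argmax())

    X = np.array([[0.9, 0.1], [0.2, 0.9], [0.1, 0.1]])
    for row in X:
        print(row, "->", ensemble_predict([member_a, member_b, member_c], row))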

    Contributions to comprehensible classification

    The doctoral thesis described in this report has contributed to the improvement of two types of comprehensible classification algorithms: consolidated decision tree algorithms and PART-style rule induction algorithms. Regarding the contributions to the consolidation of decision tree algorithms, a new resampling strategy has been proposed that adjusts the number of subsamples so that the class distribution of the subsamples can be changed without losing information. Using this strategy, the consolidated version of C4.5 (CTC) obtains better results than a broad set of comprehensible algorithms based on genetic and classical algorithms. Three new algorithms have been consolidated: a variant of CHAID (CHAID*) and the Probability Estimation Tree versions of C4.5 and CHAID* (C4.4 and CHAIC). All the consolidated algorithms obtain better results than their base decision tree algorithms, with three consolidated algorithms ranking among the four best in a comparison. Finally, the effect of pruning on simple and consolidated decision tree algorithms has been analysed, concluding that the pruning strategy proposed in this thesis obtains the best results. Regarding the contributions to PART-style rule induction algorithms, a first proposal changes several aspects of how PART generates partial trees and extracts rules from them, which results in classifiers with better generalization ability and lower structural complexity than those generated by PART. A second proposal uses fully grown trees instead of partially grown ones, and generates rule sets that achieve even better classification results and lower structural complexity. These two new proposals and the original PART algorithm have been complemented with CHAID*-based variants to observe whether these benefits can be transferred to other decision tree algorithms; it has indeed been observed that PART-style algorithms based on CHAID* also create classifiers that are simpler and classify better than CHAID.
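
    A hedged sketch of the resampling idea described above, not the CTC implementation: the number of class-balanced subsamples adapts so that every majority-class example is used, changing the class distribution without losing information.

    # Sketch: instead of one undersampled subsample, draw as many balanced
    # subsamples as needed to cover every majority-class example.
    import numpy as np

    def balanced_subsamples(y, rng):
        minority = np.flatnonzero(y == 1)
        majority = rng.permutation(np.flatnonzero(y == 0))
        k = int(np.ceil(len(majority) / len(minority)))   # subsample count adapts
        chunks = np.array_split(majority, k)
        # each subsample = all minority examples + one disjoint majority chunk
        return [np.concatenate([minority, chunk]) for chunk in chunks]

    rng = np.random.default_rng(6)
    y = np.array([0] * 90 + [1] * 10)
    subs = balanced_subsamples(y, rng)
    print(len(subs), "subsamples; sizes:", [len(s) for s in subs])
    # a decision tree (e.g. C4.5) would be built on each and then consolidated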