
    Incremental construction of classifier and discriminant ensembles

    We discuss approaches to incrementally construct an ensemble. The first constructs an ensemble of classifiers, choosing a subset from a larger set; the second constructs an ensemble of discriminants, where a classifier is used for some classes only. We investigate criteria including accuracy, significant improvement, diversity, correlation, and the role of search direction. For discriminant ensembles, we test subset selection and trees. Fusion is by voting or by a linear model. Using 14 classifiers on 38 data sets, incremental search finds small, accurate ensembles in polynomial time. The discriminant ensemble uses a subset of discriminants and is simpler, interpretable, and accurate. We see that an incremental ensemble has higher accuracy than bagging and the random subspace method, and comparable accuracy to AdaBoost but with fewer classifiers.

    We would like to thank the three anonymous referees and the editor for their constructive comments, pointers to related literature, and pertinent questions, which allowed us to better situate our work as well as organize the manuscript and improve the presentation. This work has been supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program (EA-TUBA-GEBIP/2001-1-1), Bogazici University Scientific Research Project 05HA101, and Turkish Scientific Technical Research Council TUBITAK EEEAG 104EO79.
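    A minimal sketch of the forward-selection idea described above, assuming a pool of already-fitted scikit-learn-style classifiers, a held-out validation set, and integer class labels; the pool, data names, and plain-accuracy criterion are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def forward_select_ensemble(pool, X_val, y_val):
    """Greedily grow an ensemble: add the classifier that most improves
    validation accuracy of the majority vote; stop when no candidate helps.
    Assumes integer class labels 0..K-1."""
    chosen, best_acc = [], 0.0
    preds = {i: clf.predict(X_val) for i, clf in enumerate(pool)}
    while True:
        best_i = None
        for i in preds:
            if i in chosen:
                continue
            votes = np.stack([preds[j] for j in chosen + [i]])
            # majority vote over each validation sample (column)
            maj = np.apply_along_axis(
                lambda col: np.bincount(col).argmax(), 0, votes)
            acc = accuracy_score(y_val, maj)
            if acc > best_acc:  # "significant improvement" simplified to any gain
                best_acc, best_i = acc, i
        if best_i is None:
            return [pool[i] for i in chosen], best_acc
        chosen.append(best_i)
```

    Each pass evaluates at most one candidate per pool member, so the search runs in a number of evaluations quadratic in the pool size, consistent with the polynomial-time claim above.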

    Differential Evolution Algorithm in the Construction of Interpretable Classification Models

    In this chapter, the application of a differential evolution-based approach to induce oblique decision trees (DTs) is described. This type of decision tree uses a linear combination of attributes to build oblique hyperplanes dividing the instance space. Oblique decision trees are more compact and accurate than traditional univariate decision trees. On the other hand, as differential evolution (DE) is an efficient evolutionary algorithm (EA) designed to solve optimization problems with real-valued parameters, and since finding an optimal hyperplane is a computationally hard task, this metaheuristic (MH) is chosen to conduct an intelligent search for a near-optimal solution. Two methods are described in this chapter: one implementing a recursive partitioning strategy to find the most suitable oblique hyperplane at each internal node of a decision tree, and the other conducting a global search for a near-optimal oblique decision tree. A statistical analysis of the experimental results suggests that these methods perform better as decision tree induction procedures in comparison with other supervised learning approaches.
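    As an illustration of the node-level search, here is a minimal sketch using SciPy's `differential_evolution` to fit one oblique hyperplane (weights w, bias b) that minimizes 0-1 loss at a single node; the objective, the [-1, 1] bounds, and binary labels are simplifying assumptions, not the chapter's exact formulation:

```python
import numpy as np
from scipy.optimize import differential_evolution

def fit_oblique_split(X, y):
    """Search for hyperplane parameters theta = (w, b) such that
    sign(X @ w + b) best separates two classes (labels in {0, 1})."""
    d = X.shape[1]

    def misclassification(theta):
        w, b = theta[:d], theta[d]
        pred = (X @ w + b > 0).astype(int)
        return np.mean(pred != y)  # 0-1 loss; DE copes with its discontinuity

    bounds = [(-1.0, 1.0)] * (d + 1)  # assumes normalized attribute ranges
    result = differential_evolution(misclassification, bounds, seed=0)
    return result.x[:d], result.x[d], result.fun
```

    The recursive-partitioning method described above would call such a routine at each internal node on the instances reaching it, whereas the global variant encodes an entire tree in one real-valued vector.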

    Combined optimization algorithms applied to pattern classification

    Accurate classification by minimizing the error on test samples is the main goal in pattern classification. Combinatorial optimization is a well-known method for solving minimization problems; however, only a few examples of classifiers are described in the literature where combinatorial optimization is used in pattern classification. Recently, there has been a growing interest in combining classifiers and improving the consensus of results for greater accuracy. In the light of the "No Free Lunch Theorems", we analyse the combination of simulated annealing, a powerful combinatorial optimization method that produces high-quality results, with the classical perceptron algorithm. This combination is called the LSA machine. Our analysis aims at finding paradigms for problem-dependent parameter settings that ensure high classification results. Our computational experiments on a large number of benchmark problems lead to results that either outperform or are at least competitive with results published in the literature. Apart from parameter settings, our analysis focuses on a difficult problem in computation theory, namely the network complexity problem. The depth vs. size problem of neural networks is one of the hardest problems in theoretical computing, with very little progress over the past decades. In order to investigate this problem, we introduce a new recursive learning method for training hidden layers in constant-depth circuits. Our findings contribute a) to the field of Machine Learning, as the proposed method is applicable to training feedforward neural networks, and b) to the field of circuit complexity, by proposing an upper bound for the number of hidden units sufficient to achieve a high classification rate. One of the major findings of our research is that the size of the network can be bounded by the input size of the problem, with an approximate upper bound of 8·√(2ⁿ/n) threshold gates being sufficient for a small error rate, where n := log |S_L| and S_L is the training set.
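    A minimal sketch of the simulated-annealing-plus-perceptron idea (the LSA machine combines the two); the energy function, geometric cooling schedule, and Gaussian perturbation below are illustrative assumptions, not the thesis's exact algorithm:

```python
import numpy as np

def lsa_style_training(X, y, T0=1.0, cooling=0.95, steps=2000, rng=None):
    """Anneal perceptron weights w: propose a random perturbation and
    accept it with the Metropolis rule on the training error (labels in {0,1})."""
    rng = rng or np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    energy = lambda w: np.mean((X @ w > 0).astype(int) != y)
    e, T = energy(w), T0
    for _ in range(steps):
        cand = w + rng.normal(scale=0.1, size=w.shape)  # local move
        e_cand = energy(cand)
        # always accept improvements; accept worse moves with prob exp(-dE/T)
        if e_cand <= e or rng.random() < np.exp((e - e_cand) / T):
            w, e = cand, e_cand
        T *= cooling  # geometric cooling schedule
    return w, e
```

    Unlike plain perceptron updates, the annealed search can escape configurations where no single-sample correction reduces the overall training error.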

    Enhancing Classification and Regression Tree-Based Models by means of Mathematical Optimization

    This PhD dissertation bridges the disciplines of Operations Research and Machine Learning by developing novel Mathematical Optimization formulations and numerical solution approaches to build classification and regression tree-based models. Contrary to classic classification and regression trees, built in a greedy heuristic manner, formulating the design of the tree model as an optimization problem allows us to easily include desirable global structural properties, either as hard or soft constraints. In this PhD dissertation, we illustrate this flexibility to model: sparsity, as a proxy for interpretability, by controlling the number of non-zero coefficients, the number of predictor variables and, in the case of functional ones, the proportion of the domain used for prediction; an important social criterion, the fairness of the model, which aims to avoid predictions that discriminate on the basis of race or other sensitive features; and cost-sensitivity for groups at risk, by ensuring an acceptable accuracy performance for them. Moreover, we provide in a natural way the impact that continuous predictor variables have on each individual prediction, thus enhancing the local explainability of tree models. All the approaches proposed in this thesis are formulated as Continuous Optimization problems that are scalable with respect to the size of the training sample, are studied theoretically, are tested on real data sets, and are competitive in terms of prediction accuracy against benchmarks. This, together with the good properties summarized above, is illustrated through the different chapters of this thesis.

    This PhD dissertation is organized as follows. The state of the art in the field of (optimal) decision trees is fully discussed in Chapter 1, while the next four chapters present our methodology. Chapter 2 introduces in detail the general framework that threads through the chapters of this thesis: a randomized tree with oblique cuts. In particular, we present our proposal for classification problems, which naturally provides probabilistic output on class membership tailored to each individual, in contrast to the most popular existing approaches, where all individuals in the same leaf node are assigned the same probability. Preferences on classification rates in critical classes are successfully handled through cost-sensitive constraints. Chapter 3 extends the methodology for classification in Chapter 2 to additionally handle sparsity. This is modeled by means of regularizations with polyhedral norms added to the objective function. The sparsest tree case is studied theoretically. We show that we can easily trade off some classification accuracy for a gain in sparsity. In Chapter 4, the findings obtained in Chapters 2 and 3 are adapted to construct sparse trees for regression. Theoretical properties of the solutions are explored. The scalability of our approach with respect to the size of the training sample, as well as local explanations on the continuous predictor variables, are illustrated. Moreover, we show how this methodology can avoid the discrimination of sensitive groups through fairness constraints. Chapter 5 extends the methodology for regression in Chapter 4 to consider functional predictor variables instead. Simultaneously, it detects a reduced number of intervals that are critical for prediction.
    The sparsity in the proportion of the domain of the functional predictor variables to be used is also modeled through a regularization term added to the objective function. The resulting trade-off between accuracy and sparsity is illustrated. Finally, Chapter 6 closes the thesis with general conclusions and future lines of research.
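    To make the sparsity mechanism concrete, here is a minimal sketch of a single oblique split trained by continuous optimization with an L1 (polyhedral-norm) regularizer, so that raising the penalty zeroes out predictor coefficients; the logistic surrogate loss and the derivative-free solver are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np
from scipy.optimize import minimize

def sparse_oblique_split(X, y, lam=0.1):
    """Fit one split on score X @ w + b: logistic loss plus an L1 penalty
    on w; larger lam trades classification accuracy for sparsity."""
    d = X.shape[1]

    def objective(theta):
        w, b = theta[:d], theta[d]
        z = X @ w + b
        # numerically stable logistic loss for labels y in {0, 1}
        loss = np.mean(np.logaddexp(0.0, z) - y * z)
        return loss + lam * np.abs(w).sum()  # polyhedral-norm regularizer

    theta0 = np.zeros(d + 1)
    res = minimize(objective, theta0, method="Powell")  # handles the L1 kink
    return res.x[:d], res.x[d]
```

    Counting the non-zero entries of the returned w for a grid of lam values traces the accuracy-versus-sparsity trade-off discussed above.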

    K-means based clustering and context quantization


    Entropy-based machine learning algorithms applied to genomics and pattern recognition

    Transcription factors (TF) are proteins that interact with DNA to regulate the transcription of DNA to RNA and play key roles in both healthy and cancerous cells. Thus, gaining a deeper understanding of the biological factors underlying transcription factor (TF) binding specificity is important for understanding the mechanism of oncogenesis. As large biological datasets become more readily available, machine learning (ML) algorithms have proven to be an important and useful set of tools for cancer researchers. However, there remain many areas of potential improvement for these ML models, including a higher degree of model interpretability and overall accuracy. In this thesis, we present decision tree (DT) methods applied to DNA sequence analysis that result in highly interpretable and accurate predictions. We propose a boosted decision tree (BDT) model using the binary counts of important DNA motifs to predict the binding specificity of TFs belonging to the same protein family, which bind similar DNA sequences. We then introduce a novel application of Convolutional Decision Trees (CDT) and demonstrate that this approach has distinct advantages over the BDT model while still accurately predicting the binding specificity of TFs. The CDT models are trained using the Cross Entropy (CE) optimization method, a Monte Carlo optimization method based on concepts from information theory related to statistical mechanics. We then further study the CDT model as a general pattern recognition and transfer learning technique and demonstrate that this approach can learn translationally invariant patterns that lead to high classification accuracy while remaining more interpretable and learning higher-quality convolutional filters compared to convolutional neural networks (CNN).
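    A minimal sketch of the Cross Entropy optimization loop mentioned above, applied to learning a single 1-D convolutional filter for sequence classification; the Gaussian parametrization, elite fraction, and external scoring function are illustrative assumptions, not the thesis's exact training procedure:

```python
import numpy as np

def cross_entropy_filter(score, k=8, pop=200, elite_frac=0.1, iters=50):
    """CE method: sample filters from a Gaussian, keep the elite samples
    by score, refit the Gaussian to them, and repeat as it contracts.

    `score(f)` must return a fitness for filter f, e.g. training accuracy
    of a decision-tree node split on the filter's max-activation feature."""
    rng = np.random.default_rng(0)
    mu, sigma = np.zeros(k), np.ones(k)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, k))
        scores = np.array([score(f) for f in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # top scorers
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu
```

    Because candidates are only ever evaluated through `score`, the loop needs no gradients, which is what lets it optimize the discrete, tree-structured models described above.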

    Degree in Statistical Sciences and Techniques (Llicenciatura de ciències i tècniques estadístiques)


    Automatic analysis of malaria infected red blood cell digitized microscope images

    Malaria is one of the three most serious diseases worldwide, affecting millions each year, mainly in the tropics where the most serious illnesses are caused by Plasmodium falciparum. This thesis is concerned with the automatic analysis of images of microscope slides of Giemsa-stained thin films of such malaria-infected blood, so as to segment red blood cells (RBCs) from the background plasma, to accurately and reliably count the cells, to identify those that are infected with a parasite, and thus to determine the degree of infection, or parasitemia. Unsupervised techniques were used throughout owing to the difficulty of obtaining large quantities of training data annotated by experts, in particular for total RBC counts. The first two aims were met by optimisation of Fisher discriminants. For RBC segmentation, a well-known iterative thresholding method due originally to Otsu (1979) was used for scalar features such as the image intensity, and a novel extension of the algorithm was developed for multi-dimensional colour data. Performance of the algorithms was evaluated and compared via ROC analysis, and their convergence properties were studied. Ways of characterising the variability of the image data and, if necessary, of mitigating it were discussed in theory. The size distribution of the objects segmented in this way indicated that optimisation of a Fisher discriminant could further be used for classifying objects as small artefacts, singlet RBCs, or doublets, triplets, etc. of adjoining cells, provided optimisation was via a global search. Application of constraints on the relationships between the sizes of singlet and multiplet RBCs led to a number of tests that enabled clusters of cells to be reliably identified and accurate total RBC counts to be made. Development of an application to make such counts could be very useful both in research laboratories and in improving treatment of malaria. Unfortunately, the very small number of pixels belonging to parasite infections means that it is difficult to segment parasite objects and thus to identify infected RBCs and determine the parasitemia. Preliminary attempts to do so by similar unsupervised means using Fisher discriminants, even when applied in a hierarchical manner, remain inconclusive on the evidence currently available, though they suggest that it may ultimately be possible to develop such a system. Appendices give details of material from old texts no longer easily accessible.
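    For reference, a minimal sketch of the scalar Otsu (1979) threshold used here as the starting point for RBC segmentation; the 256-bin histogram and the bin-edge return convention are simplifying assumptions:

```python
import numpy as np

def otsu_threshold(image):
    """Classic Otsu: choose the grey-level threshold maximizing the
    between-class variance of the resulting fore/background split."""
    hist, edges = np.histogram(image.ravel(), bins=256)
    p = hist / hist.sum()                  # grey-level probabilities
    omega = np.cumsum(p)                   # class-0 probability per threshold
    mu = np.cumsum(p * np.arange(256))     # cumulative mean (bin index as level)
    mu_total = mu[-1]
    # between-class variance for every candidate threshold t
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    t = np.nanargmax(sigma_b)
    return edges[t + 1]  # intensity value separating the two classes
```

    The multi-dimensional extension described in the thesis replaces this single grey-level feature with colour data; the sketch above covers only the well-known scalar case.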