576 research outputs found

    On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems

    Full text link
    We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of training data. A two-step process is applied : 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform. Since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite the aforementioned transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves.Comment: Appeared in 2018 IEEE International Congress on Big Data (BigData Congress). arXiv admin note: text overlap with arXiv:1902.0935

    A new approach to fuzzy random forest generation

    Get PDF
    Random forests have proved to be very effective classifiers, which can achieve very high accuracies. Although a number of papers have discussed the use of fuzzy sets for coping with uncertain data in decision tree learning, fuzzy random forests have not been particularly investigated in the fuzzy community. In this paper, we first propose a simple method for generating fuzzy decision trees by creating fuzzy partitions for continuous variables during the learning phase. Then, we discuss how the method can be used for generating forests of fuzzy decision trees. Finally, we show how these fuzzy random forests achieve accuracies higher than two fuzzy rule-based classifiers recently proposed in the literature. Also, we highlight how fuzzy random forests are more tolerant to noise in datasets than classical crisp random forests

    On Distributed Fuzzy Decision Trees for Big Data

    Get PDF
    Fuzzy decision trees (FDTs) have shown to be an effective solution in the framework of fuzzy classification. The approaches proposed so far to FDT learning, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme shaped according to the MapReduce programming model for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are, therefore, used as an input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world publicly available big datasets for evaluating the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability varying the number of computing units; and 3) ability to efficiently accommodate an increasing dataset size. We have demonstrated that the proposed scheme turns out to be suitable for managing big datasets even with a modest commodity hardware support. Finally, we have used the distributed decision tree learning algorithm implemented in the MLLib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative analysis. © 1993-2012 IEEE

    Global Entropy Based Greedy Algorithm for discretization

    Get PDF
    Discretization algorithm is a crucial step to not only achieve summarization of continuous attributes but also better performance in classification that requires discrete values as input. In this thesis, I propose a supervised discretization method, Global Entropy Based Greedy algorithm, which is based on the Information Entropy Minimization. Experimental results show that the proposed method outperforms state of the art methods with well-known benchmarking datasets. To further improve the proposed method, a new approach for stop criterion that is based on the change rate of entropy was also explored. From the experimental analysis, it is noticed that the threshold based on the decreasing rate of entropy could be more effective than a constant number of intervals in the classification such as C5.0

    A Survey of Neural Trees

    Full text link
    Neural networks (NNs) and decision trees (DTs) are both popular models of machine learning, yet coming with mutually exclusive advantages and limitations. To bring the best of the two worlds, a variety of approaches are proposed to integrate NNs and DTs explicitly or implicitly. In this survey, these approaches are organized in a school which we term as neural trees (NTs). This survey aims to present a comprehensive review of NTs and attempts to identify how they enhance the model interpretability. We first propose a thorough taxonomy of NTs that expresses the gradual integration and co-evolution of NNs and DTs. Afterward, we analyze NTs in terms of their interpretability and performance, and suggest possible solutions to the remaining challenges. Finally, this survey concludes with a discussion about other considerations like conditional computation and promising directions towards this field. A list of papers reviewed in this survey, along with their corresponding codes, is available at: https://github.com/zju-vipa/awesome-neural-treesComment: 35 pages, 7 figures and 1 tabl

    Development of Machine Learning Techniques for Diabetic Retinopathy Risk Estimation

    Get PDF
    La retinopatia diabètica (DR) és una malaltia crònica. És una de les principals complicacions de diabetis i una causa essencial de pèrdua de visió entre les persones que pateixen diabetis. Els pacients diabètics han de ser analitzats periòdicament per tal de detectar signes de desenvolupament de la retinopatia en una fase inicial. El cribratge precoç i freqüent disminueix el risc de pèrdua de visió i minimitza la càrrega als centres assistencials. El nombre dels pacients diabètics està en augment i creixements ràpids, de manera que el fa difícil que consumeix recursos per realitzar un cribatge anual a tots ells. L’objectiu principal d’aquest doctorat. la tesi consisteix en construir un sistema de suport de decisions clíniques (CDSS) basat en dades de registre de salut electrònic (EHR). S'utilitzarà aquest CDSS per estimar el risc de desenvolupar RD. En aquesta tesi doctoral s'estudien mètodes d'aprenentatge automàtic per constuir un CDSS basat en regles lingüístiques difuses. El coneixement expressat en aquest tipus de regles facilita que el metge sàpiga quines combindacions de les condicions són les poden provocar el risc de desenvolupar RD. En aquest treball, proposo un mètode per reduir la incertesa en la classificació dels pacients que utilitzen arbres de decisió difusos (FDT). A continuació es combinen diferents arbres, usant la tècnica de Fuzzy Random Forest per millorar la qualitat de la predicció. A continuació es proposen diverses tècniques d'agregació que millorin la fusió dels resultats que ens dóna cadascun dels arbres FDT. Per millorar la decisió final dels nostres models, proposo tres mesures difuses que s'utilitzen amb integrals de Choquet i Sugeno. La definició d’aquestes mesures difuses es basa en els valors de confiança de les regles. En particular, una d'elles és una mesura difusa que es troba en la qual l'estructura jeràrquica de la FDT és explotada per trobar els valors de la mesura difusa. El resultat final de la recerca feta ha donat lloc a un programari que es pot instal·lar en centres d’assistència primària i hospitals, i pot ser usat pels metges de capçalera per fer l'avaluació preventiva i el cribatge de la Retinopatia Diabètica.La retinopatía diabética (RD) es una enfermedad crónica. Es una de las principales complicaciones de diabetes y una causa esencial de pérdida de visión entre las personas que padecen diabetes. Los pacientes diabéticos deben ser examinados periódicamente para detectar signos de diabetes. desarrollo de retinopatía en una etapa temprana. La detección temprana y frecuente disminuye el riesgo de pérdida de visión y minimiza la carga en los centros de salud. El número de pacientes diabéticos es enorme y está aumentando rápidamente, lo que lo hace difícil y Consume recursos para realizar una evaluación anual para todos ellos. El objetivo principal de esta tesis es construir un sistema de apoyo a la decisión clínica (CDSS) basado en datos de registros de salud electrónicos (EHR). Este CDSS será utilizado para estimar el riesgo de desarrollar RD. En este tesis doctoral se estudian métodos de aprendizaje automático para construir un CDSS basado en reglas lingüísticas difusas. El conocimiento expresado en este tipo de reglas facilita que el médico pueda saber que combinaciones de las condiciones son las que pueden provocar el riesgo de desarrollar RD. En este trabajo propongo un método para reducir la incertidumbre en la clasificación de los pacientes que usan árboles de decisión difusos (FDT). A continuación se combinan diferentes árboles usando la técnica de Fuzzy Random Forest para mejorar la calidad de la predicción. Se proponen también varias políticas para fusionar los resultados de que nos da cada uno de los árboles (FDT). Para mejorar la decisión final propongo tres medidas difusas que se usan con las integrales Choquet y Sugeno. La definición de estas medidas difusas se basa en los valores de confianza de las reglas. En particular, uno de ellos es una medida difusa descomponible en la que se usa la estructura jerárquica del FDT para encontrar los valores de la medida difusa. Como resultado final de la investigación se ha construido un software que puede instalarse en centros de atención médica y hospitales, i que puede ser usado por los médicos de cabecera para hacer la evaluación preventiva y el cribado de la Retinopatía Diabética.Diabetic retinopathy (DR) is a chronic illness. It is one of the main complications of diabetes, and an essential cause of vision loss among people suffering from diabetes. Diabetic patients must be periodically screened in order to detect signs of diabetic retinopathy development in an early stage. Early and frequent screening decreases the risk of vision loss and minimizes the load on the health care centres. The number of the diabetic patients is huge and rapidly increasing so that makes it hard and resource-consuming to perform a yearly screening to all of them. The main goal of this Ph.D. thesis is to build a clinical decision support system (CDSS) based on electronic health record (EHR) data. This CDSS will be utilised to estimate the risk of developing RD. In this Ph.D. thesis, I focus on developing novel interpretable machine learning systems. Fuzzy based systems with linguistic terms are going to be proposed. The output of such systems makes the physician know what combinations of the features that can cause the risk of developing DR. In this work, I propose a method to reduce the uncertainty in classifying diabetic patients using fuzzy decision trees. A Fuzzy Random forest (FRF) approach is proposed as well to estimate the risk for developing DR. Several policies are going to be proposed to merge the classification results achieved by different Fuzzy Decision Trees (FDT) models to improve the quality of the final decision of our models, I propose three fuzzy measures that are used with Choquet and Sugeno integrals. The definition of these fuzzy measures is based on the confidence values of the rules. In particular, one of them is a decomposable fuzzy measure in which the hierarchical structure of the FDT is exploited to find the values of the fuzzy measure. Out of this Ph.D. work, we have built a CDSS software that may be installed in the health care centres and hospitals in order to evaluate and detect Diabetic Retinopathy at early stages
    • …
    corecore