37,088 research outputs found

    On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems

    Full text link
    We present a new distributed fuzzy partitioning method to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems. The proposed algorithm builds a fixed number of fuzzy sets for all variables and adjusts their shape and position to the real distribution of training data. A two-step process is applied : 1) transformation of the original distribution into a standard uniform distribution by means of the probability integral transform. Since the original distribution is generally unknown, the cumulative distribution function is approximated by computing the q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy partition in the transformed attribute space using a fixed number of equally distributed triangular membership functions. Despite the aforementioned transformation, the definition of every fuzzy set in the original space can be recovered by applying the inverse cumulative distribution function (also known as quantile function). The experimental results reveal that the proposed methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT) induction algorithm to maintain classification accuracy with up to 6 million fewer leaves.Comment: Appeared in 2018 IEEE International Congress on Big Data (BigData Congress). arXiv admin note: text overlap with arXiv:1902.0935

    Decision diagrams in machine learning: an empirical study on real-life credit-risk data.

    Get PDF
    Decision trees are a widely used knowledge representation in machine learning. However, one of their main drawbacks is the inherent replication of isomorphic subtrees, as a result of which the produced classifiers might become too large to be comprehensible by the human experts that have to validate them. Alternatively, decision diagrams, a generalization of decision trees taking on the form of a rooted, acyclic digraph instead of a tree, have occasionally been suggested as a potentially more compact representation. Their application in machine learning has nonetheless been criticized, because the theoretical size advantages of subgraph sharing did not always directly materialize in the relatively scarce reported experiments on real-world data. Therefore, in this paper, starting from a series of rule sets extracted from three real-life credit-scoring data sets, we will empirically assess to what extent decision diagrams are able to provide a compact visual description. Furthermore, we will investigate the practical impact of finding a good attribute ordering on the achieved size savings.Advantages; Classifiers; Credit scoring; Data; Decision; Decision diagrams; Decision trees; Empirical study; Knowledge; Learning; Real life; Representation; Size; Studies;

    Online Tool Condition Monitoring Based on Parsimonious Ensemble+

    Full text link
    Accurate diagnosis of tool wear in metal turning process remains an open challenge for both scientists and industrial practitioners because of inhomogeneities in workpiece material, nonstationary machining settings to suit production requirements, and nonlinear relations between measured variables and tool wear. Common methodologies for tool condition monitoring still rely on batch approaches which cannot cope with a fast sampling rate of metal cutting process. Furthermore they require a retraining process to be completed from scratch when dealing with a new set of machining parameters. This paper presents an online tool condition monitoring approach based on Parsimonious Ensemble+, pENsemble+. The unique feature of pENsemble+ lies in its highly flexible principle where both ensemble structure and base-classifier structure can automatically grow and shrink on the fly based on the characteristics of data streams. Moreover, the online feature selection scenario is integrated to actively sample relevant input attributes. The paper presents advancement of a newly developed ensemble learning algorithm, pENsemble+, where online active learning scenario is incorporated to reduce operator labelling effort. The ensemble merging scenario is proposed which allows reduction of ensemble complexity while retaining its diversity. Experimental studies utilising real-world manufacturing data streams and comparisons with well known algorithms were carried out. Furthermore, the efficacy of pENsemble was examined using benchmark concept drift data streams. It has been found that pENsemble+ incurs low structural complexity and results in a significant reduction of operator labelling effort.Comment: this paper has been published by IEEE Transactions on Cybernetic
    • …
    corecore