
    Learning from the machine: interpreting machine learning algorithms for point- and extended-source classification

    We investigate star-galaxy classification for astronomical surveys in the context of four methods enabling the interpretation of black-box machine learning systems. The first is outputting and exploring the decision boundaries given by decision-tree-based methods, which enables the visualization of the classification categories. Second, we investigate how the Mutual Information based Transductive Feature Selection (MINT) algorithm can be used to perform feature pre-selection: if one would like to provide only a small number of input features to a machine learning classification algorithm, feature pre-selection provides a method to determine which of the many possible input properties should be selected. Third is the use of the tree-interpreter package to enable popular decision-tree-based ensemble methods to be opened, visualized, and understood. This is done by additional analysis of the tree-based model, determining not only which features are important to the model, but how important a feature is for a particular classification given its value. Lastly, we use decision boundaries from the model to revise an already existing classification method, essentially asking the tree-based method where decision boundaries are best placed and defining a new classification method. We showcase these techniques by applying them to the problem of star-galaxy separation using data from the Sloan Digital Sky Survey (hereafter SDSS). We use the output of MINT and the ensemble methods to demonstrate how more complex decision boundaries improve star-galaxy classification accuracy over the standard SDSS frames approach (reducing misclassifications by up to ≈33%). We then show how tree-interpreter can be used to explore how relevant each photometric feature is when making a classification on an object-by-object basis.
    Comment: 12 pages, 8 figures, 8 tables
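
    The last technique above, asking a tree-based method where a decision boundary is best placed on a single feature, can be illustrated with a one-node decision stump. This is a minimal stdlib-only Python sketch; the feature values and labels are invented for illustration, not taken from SDSS:

```python
def best_stump(values, labels):
    """Scan candidate thresholds on a single feature and return the one
    that best separates the two classes (0 = star, 1 = galaxy), mimicking
    how a single tree node places a decision boundary."""
    best_t, best_acc = None, -1.0
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0  # midpoint between adjacent observed values
        # Predict "galaxy" (1) when the feature exceeds the threshold.
        acc = sum((v > t) == bool(y) for v, y in zip(values, labels)) / len(values)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Invented "concentration-like" feature: galaxies take larger values here.
feature = [0.01, 0.02, 0.03, 0.05, 0.40, 0.55, 0.60, 0.72]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, train_acc = best_stump(feature, labels)
print(threshold, train_acc)
```

A full tree repeats this search recursively over many features, which is why its boundaries can outperform a fixed cut such as the SDSS frames criterion.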

    Integrating Information Theory Measures and a Novel Rule-Set-Reduction Technique to Improve Fuzzy Decision Tree Induction Algorithms

    Machine learning approaches have been successfully applied to many classification and prediction problems. One of the most popular machine learning approaches is decision trees. A main advantage of decision trees is the clarity of the decision model they produce. The ID3 algorithm proposed by Quinlan forms the basis for many decision tree applications. Trees produced by ID3 are sensitive to small perturbations in training data. To overcome this problem, and to handle data uncertainties and spurious precision in data, fuzzy ID3 integrated fuzzy set theory and ideas from fuzzy logic with ID3. Several fuzzy decision tree algorithms and tools exist. However, existing tools are slow, produce a large number of rules, and/or lack support for automatic fuzzification of input data. These limitations make those tools unsuitable for a variety of applications, including those with many features and real-time ones such as intrusion detection. In addition, the large number of rules produced by these tools renders the generated decision model uninterpretable. In this research work, we proposed an improved version of the fuzzy ID3 algorithm. We also introduced a new method for reducing the number of fuzzy rules generated by fuzzy ID3. In addition, we applied fuzzy decision trees to the classification of real and pseudo microRNA precursors. Our experimental results showed that our improved fuzzy ID3 can achieve better classification accuracy and is more efficient than the original fuzzy ID3 algorithm, and that fuzzy decision trees can outperform several existing machine learning algorithms on a wide variety of datasets. In addition, our experiments showed that our fuzzy rule reduction method resulted in a significant reduction in the number of produced rules, consequently improving the comprehensibility of the produced decision model and reducing the fuzzy decision tree execution time. This reduction in the number of rules was accompanied by a slight improvement in the classification accuracy of the resulting fuzzy decision tree. In addition, when applied to the microRNA prediction problem, fuzzy decision trees achieved better results than other machine learning approaches applied to the same problem, including Random Forest, C4.5, SVM, and KNN.
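
    The split criterion that fuzzy ID3 generalises with fuzzy memberships is ID3's information gain: the entropy of the labels minus the weighted entropy after splitting on an attribute. A minimal crisp (non-fuzzy) Python sketch with an invented toy dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """ID3's criterion: entropy reduction from splitting on `attr`."""
    total = entropy(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    weighted = sum(len(ys) / len(labels) * entropy(ys)
                   for ys in by_value.values())
    return total - weighted

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
gain = information_gain(rows, labels, "outlook")
print(gain)  # 1.0: the split separates the classes perfectly
```

Fuzzy ID3 replaces the hard set membership in `by_value` with membership degrees, so each example contributes fractionally to several branches.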

    A new procedure for misbehavior detection in vehicular ad-hoc networks using machine learning

    Misbehavior detection in vehicular ad hoc networks (VANETs) is performed to improve traffic safety and driving accuracy. All the nodes in a VANET communicate with each other through message logs. Malicious nodes in a VANET can cause hazardous situations by sending message logs with tampered values. In this work, various machine learning algorithms are used to detect five main types of attacks: constant attack, constant offset attack, random attack, random offset attack, and eventual attack. First, each attack is detected by different machine learning algorithms using binary classification. Then, a new procedure is created to perform multi-class classification of the attacks using the best algorithm chosen from the different machine learning techniques. The highest accuracy in binary classification is obtained with Naïve Bayes (100%), decision tree (100%), and random forest (100%) for the Type 1 attack, decision tree (100%) for the Type 2 attack, and random forest (98.03%, 95.56%, and 95.55%) for the Type 4, Type 8, and Type 16 attacks, respectively. For the new multi-class classification procedure, the highest accuracy is obtained with the random forest (97.62%) technique. For this work, the VeReMi dataset (a public repository for malicious node detection in VANETs) is used.
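
    As a sketch of how a Naïve Bayes binary classifier can flag a constant attack (a node that always reports the same implausible value), here is a stdlib-only Gaussian Naive Bayes on a single invented feature; the speed values and labels are illustrative, not drawn from the VeReMi dataset:

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev

def gaussian_pdf(x, mu, sigma):
    sigma = max(sigma, 1e-9)  # guard against zero variance
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def fit(xs, ys):
    """Per-class mean, stdev, and prior for a single feature."""
    model = {}
    for cls in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == cls]
        model[cls] = (mean(vals), pstdev(vals), len(vals) / len(ys))
    return model

def predict(model, x):
    # Pick the class maximising prior * Gaussian likelihood.
    return max(model, key=lambda c: model[c][2]
               * gaussian_pdf(x, model[c][0], model[c][1]))

# Honest nodes (0) report plausible speeds; a constant attacker (1)
# always reports the same implausible value.
speeds = [48, 52, 50, 49, 120, 120, 120]
labels = [0, 0, 0, 0, 1, 1, 1]
m = fit(speeds, labels)
print(predict(m, 51), predict(m, 120))  # 0 1
```

Real detectors use several message fields at once (position, speed, heading), but the per-class likelihood logic is the same.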

    A meta-heuristic approach for developing PROAFTN with decision tree

    © 2016 IEEE. Machine learning algorithms are known for their ability to use historical data and examples to predict and classify unknown instances. The decision tree is an efficient machine learning approach that can use data alone, without the involvement of a decision maker, to improve the decision-making process. Multi-Criteria Decision Analysis (MCDA) is another paradigm used for data classification. In this paper, we propose a new fuzzy classification method based on an MCDA method called PROAFTN. To use PROAFTN, a set of parameters needs to be established from data. The proposed approach uses data pre-processing and a canonical genetic algorithm (GA) to obtain these parameters from data. The generated models have been applied to popular datasets selected from several application domains (health, economy, etc.). According to our experimental study, the new model performs significantly better than decision trees in terms of accuracy and the interpretability of the decision rules.
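
    The canonical GA used here to learn PROAFTN's parameters follows the usual loop of selection, crossover, and mutation over encoded candidate solutions. A generic stdlib-only sketch on a bit-string toy objective (OneMax stands in for the real fitness function, which the paper derives from classification performance):

```python
import random

def canonical_ga(fitness, n_genes, pop_size=30, generations=80,
                 p_cross=0.9, p_mut=0.02, seed=1):
    """Canonical GA: binary tournament selection, one-point crossover,
    bit-flip mutation. `fitness` maps a bit list to a value to maximise."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]

    def pick():
        a, b = rng.sample(pop, 2)  # binary tournament
        return list(a if fitness(a) >= fitness(b) else b)

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < p_cross:  # one-point crossover
                cut = rng.randrange(1, n_genes)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for i in range(n_genes):
                    if rng.random() < p_mut:
                        child[i] ^= 1  # bit-flip mutation
                nxt.append(child)
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# OneMax: maximise the number of 1 bits (a stand-in objective).
best = canonical_ga(sum, n_genes=16)
print(sum(best))
```

For PROAFTN, the bit string would instead encode interval thresholds and weights, and fitness would be the classification accuracy of the induced model.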

    Multi-objective optimisation of multi-output neural trees

    We propose an algorithm and a new method to tackle classification problems. We propose a multi-output neural tree (MONT) algorithm, which is an evolutionary learning algorithm trained by the non-dominated sorting genetic algorithm (NSGA)-III. Since evolutionary learning is stochastic, a hypothesis found in the form of a MONT is unique for each run of evolutionary learning, i.e., each hypothesis (tree) generated bears distinct properties compared to any other hypothesis, both in topological space and in parameter space. This leads to a challenging optimisation problem where the aim is to minimise the tree size and maximise the classification accuracy. Therefore, Pareto-optimality concerns were addressed by hypervolume indicator analysis. We used nine benchmark classification learning problems to evaluate the performance of the MONT. As a result of our experiments, we obtained MONTs which are able to tackle the classification problems with high accuracy. Over the set of problems tackled in this study, the MONT performed better than a set of well-known classifiers: multilayer perceptron, reduced-error pruning tree, naive Bayes classifier, decision tree, and support vector machine. Moreover, the performances of three versions of MONT training, using genetic programming, NSGA-II, and NSGA-III, suggest that NSGA-III gives the best Pareto-optimal solution.
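
    The core of NSGA-style training is non-dominated sorting over the two competing objectives, tree size and classification error. A minimal Python sketch of Pareto dominance and the first front, with invented (size, error) pairs standing in for evolved hypotheses:

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective and strictly
    better on at least one (both objectives are minimised here)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """The non-dominated set: the first front NSGA-II/III build on."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Invented hypotheses as (tree size, classification error) pairs.
hyps = [(5, 0.10), (8, 0.05), (12, 0.05), (3, 0.30), (8, 0.20)]
front = pareto_front(hyps)
print(front)  # [(5, 0.1), (8, 0.05), (3, 0.3)]
```

NSGA-II/III then repeat this sorting to rank the whole population into successive fronts, and a diversity measure (crowding distance or reference points) breaks ties within a front.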

    Predicting Customer Loyalty Using Machine Learning for Hotel Industry

    The popularity of machine learning is growing and the demand for it is increasing in various fields, including the tourism and hospitality sector, specifically the hotel industry. The purpose of this research is to apply machine learning classification techniques to predict customer loyalty in a hotel company so that the company can use the results to create possible solutions for customer relationship management. The experiment is performed by implementing the CRISP-DM methodology with three proposed algorithms, namely decision tree, random forest, and logistic regression, and the results are compared with each other using confusion matrices to determine the best algorithm among them. The dataset used is obtained from the Findbulous technology company. From the analysis results, the logistic regression, decision tree, and random forest algorithms generate accuracy scores of 57.83%, 71.44%, and 69.91%, respectively. For further improvement, this research approach can be applied to other datasets or with new algorithms to identify each algorithm's strengths and limitations.
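
    Comparing classifiers via a confusion matrix reduces to counting (actual, predicted) pairs; accuracy is the mass on the diagonal. A stdlib-only sketch with invented loyalty labels (not the Findbulous data):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """counts[(actual, predicted)]; enough to derive accuracy,
    precision, and recall when comparing models."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    return sum(v for (t, p), v in cm.items() if t == p) / len(y_true)

# 1 = loyal, 0 = not loyal, for a handful of hypothetical customers.
actual =    [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
acc = accuracy(actual, predicted)
print(acc)  # 0.75
```

In practice each of the three models produces its own predicted column, and the model with the best confusion-matrix-derived scores is selected.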

    Multispectral Image Analysis Using Random Forest

    Classical methods for classification of pixels in multispectral images include supervised classifiers such as the maximum-likelihood classifier, neural network classifiers, fuzzy neural networks, support vector machines, and decision trees. Recently, there has been an increase of interest in ensemble learning, a method that generates many classifiers and aggregates their results. Breiman proposed Random Forest in 2001 for classification and clustering. Random Forest grows many decision trees for classification. To classify a new object, the input vector is run through each decision tree in the forest. Each tree gives a classification, and the forest chooses the classification having the most votes. Random Forest provides a robust algorithm for classifying large datasets. However, the potential of Random Forest has not been fully explored for analyzing multispectral satellite images. To evaluate the performance of Random Forest, we classified multispectral images using various classifiers, namely the maximum-likelihood classifier, a neural network, a support vector machine (SVM), and Random Forest, and compared their results.
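
    The per-pixel voting described above can be sketched in a few lines; the three stump "trees" and the band values are invented for illustration:

```python
from collections import Counter

def forest_predict(trees, pixel):
    """Run a feature vector through every tree and return the majority
    vote, as Random Forest does for each multispectral pixel."""
    votes = [tree(pixel) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical decision stumps over spectral bands.
trees = [
    lambda px: "water" if px["nir"] < 0.1 else "vegetation",
    lambda px: "vegetation" if px["nir"] - px["red"] > 0.2 else "water",
    lambda px: "water" if px["green"] > px["nir"] else "vegetation",
]
pixel = {"nir": 0.45, "red": 0.10, "green": 0.08}
pred = forest_predict(trees, pixel)
print(pred)  # vegetation
```

A real forest trains each tree on a bootstrap sample with random feature subsets, which is what makes the aggregated vote robust.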

    Des-q: a quantum algorithm to construct and efficiently retrain decision trees for regression and binary classification

    Decision trees are widely used in machine learning due to their simplicity in construction and interpretability. However, as data sizes grow, traditional methods for constructing and retraining decision trees become increasingly slow, scaling polynomially with the number of training examples. In this work, we introduce a novel quantum algorithm, named Des-q, for constructing and retraining decision trees in regression and binary classification tasks. Assuming the data stream produces small increments of new training examples, we demonstrate that our Des-q algorithm significantly reduces the time required for tree retraining, achieving a poly-logarithmic time complexity in the number of training examples, even accounting for the time needed to load the new examples into quantum-accessible memory. Our approach involves building a decision tree algorithm that performs k-piecewise linear tree splits at each internal node. These splits simultaneously generate multiple hyperplanes, dividing the feature space into k distinct regions. To determine the k suitable anchor points for these splits, we develop an efficient quantum-supervised clustering method, building upon the q-means algorithm of Kerenidis et al. Des-q first efficiently estimates each feature weight using a novel quantum technique to estimate the Pearson correlation. Subsequently, we employ weighted distance estimation to cluster the training examples into k disjoint regions and then proceed to expand the tree using the same procedure. We benchmark the performance of the simulated version of our algorithm against the state-of-the-art classical decision tree for regression and binary classification on multiple datasets with numerical features. Further, we show that the proposed algorithm exhibits performance similar to the state-of-the-art decision tree while significantly speeding up the periodic tree retraining.
    Comment: 48 pages, 4 figures, 4 tables
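
    The feature-weight step of Des-q has a straightforward classical counterpart: weight each feature by the magnitude of its Pearson correlation with the label. A stdlib-only Python sketch with invented feature columns (the quantum algorithm estimates the same quantity, only faster on quantum-accessible memory):

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Feature weight = |correlation with the label|; f1 tracks the label
# closely, f2 is noisier (both columns invented for illustration).
features = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [3, 1, 4, 1, 5, 9],
}
label = [0, 0, 0, 1, 1, 1]
weights = {name: abs(pearson(vals, label)) for name, vals in features.items()}
print(weights)
```

Des-q then feeds these weights into a weighted distance for supervised clustering, so label-relevant features dominate the placement of the k split anchors.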