
    Statistical Comparisons of the Top 10 Algorithms in Data Mining for Classification Task

    This work builds on the study of the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) community in December 2006. We revisit the same study, but apply statistical tests to establish a more appropriate and better-justified ranking of classifiers for classification tasks. Current theoretical and empirical work on comparing several methods advocates tests that are more appropriate for this purpose; in particular, recent studies recommend a set of simple and robust non-parametric tests for statistical comparisons of classifiers. In this paper, we propose to perform non-parametric statistical testing using the Friedman test with corresponding post-hoc tests for comparing several classifiers over multiple data sets. These tests provide a sounder basis for judging the relevance of the algorithms
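The Friedman test the abstract proposes ranks each classifier within every data set and checks whether the mean ranks differ more than chance would allow. A minimal pure-Python sketch, with made-up accuracy numbers for illustration (the real study would use the ICDM top-10 algorithms and many data sets):

```python
# Hedged sketch: Friedman test over classifier accuracies on multiple
# data sets. The accuracy values below are invented for illustration.

def average_ranks(rows):
    """Rank classifiers within each data set (1 = best accuracy),
    averaging ranks for ties, then average the ranks over data sets."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])  # best first
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend the tie group
            avg = (i + j) / 2 + 1           # average rank for ties
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j in range(k):
            totals[j] += ranks[j]
    n = len(rows)
    return [t / n for t in totals]

def friedman_statistic(rows):
    """Friedman chi-square statistic for k classifiers on N data sets."""
    n, k = len(rows), len(rows[0])
    r = average_ranks(rows)
    return 12.0 * n / (k * (k + 1)) * (
        sum(rj * rj for rj in r) - k * (k + 1) ** 2 / 4.0)

# Accuracy of 3 classifiers (columns) on 5 data sets (rows) -- toy data.
acc = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.92, 0.83, 0.81],
    [0.85, 0.84, 0.80],
    [0.91, 0.87, 0.78],
]
print(average_ranks(acc))        # mean rank per classifier
print(friedman_statistic(acc))   # compare against the chi-square table
```

The statistic is compared against a chi-square distribution with k-1 degrees of freedom; if significant, post-hoc tests (e.g. Nemenyi) compare the classifiers pairwise.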

    Impacts of frequent itemset hiding algorithms on privacy preserving data mining

    Thesis (Master) -- Izmir Institute of Technology, Computer Engineering, Izmir, 2010. Includes bibliographical references (leaves 54-58). Text in English; abstract in Turkish and English; x, 69 leaves. The relentless growth of computing capabilities and the collection of large amounts of data in recent years have made data mining a popular analysis tool. Association rules (frequent itemsets), classification, and clustering are the main methods used in data mining research. The first part of this thesis implements and compares two frequent itemset mining algorithms that work without candidate itemset generation: Matrix Apriori and FP-Growth. The comparison revealed that Matrix Apriori performs better thanks to its faster data structure. One of the great challenges of data mining is finding hidden patterns without violating data owners' privacy; privacy preserving data mining came into prominence as a solution. In the second study of the thesis, the Matrix Apriori algorithm is modified and a frequent itemset hiding framework is developed. Four frequent itemset hiding algorithms are proposed such that: i) all versions work without pre-mining, so the privacy breach caused by the knowledge obtained from finding frequent itemsets is prevented in advance; ii) efficiency is increased, since no pre-mining is required; iii) supports are found during the hiding process, and at the end the sanitized dataset and its frequent itemsets are given as outputs, so no post-mining is required; iv) the heuristics use pattern lengths rather than transaction lengths, eliminating the possibility of distorting more valuable data
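The core idea of itemset hiding is to sanitize transactions so that a sensitive itemset's support falls below the mining threshold. A naive baseline sketch of that idea (not the thesis's Matrix Apriori-based framework; the transactions and the victim-item choice are illustrative assumptions):

```python
# Hedged sketch of itemset hiding: lower the support of a sensitive
# itemset below min_support by deleting one of its items from
# supporting transactions. Toy data; a naive baseline, not the
# Matrix Apriori framework described in the thesis.

def support(transactions, itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def hide_itemset(transactions, sensitive, min_support):
    """Remove one item of the sensitive itemset from supporting
    transactions until its support drops below min_support."""
    txs = [set(t) for t in transactions]    # work on a sanitized copy
    victim = min(sensitive)                 # naive, deterministic choice
    for t in txs:
        if support(txs, sensitive) < min_support:
            break                           # already hidden
        if sensitive <= t:
            t.discard(victim)               # sanitize this transaction
    return txs

txs = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"b", "c"}]
sanitized = hide_itemset(txs, {"a", "b"}, min_support=2)
print(support(sanitized, {"a", "b"}))   # support now below the threshold
```

The thesis's heuristics improve on such a baseline by choosing which item to drop using pattern lengths, so that less valuable data is distorted.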

    A Study of the Behavior of the K* Algorithm on International Databases

    This paper presents an experimental study of the K* algorithm, which was compared with five classification algorithms from the top ten data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM): C4.5, SVM, kNN, Naive Bayes, and CART. The experimental results show a satisfactory performance of the K* algorithm in comparison with these approaches

    Pattern Classification using Artificial Neural Networks

    Classification is a data mining (machine learning) technique used to predict group membership for data instances. Pattern classification involves building a function that maps the input feature space to an output space of two or more classes. Neural networks (NN) are an effective tool in the field of pattern classification, using training and testing data to build a model. However, the success of the networks is highly dependent on the performance of the training process, and hence on the training algorithm. Many training algorithms have been proposed to improve the performance of neural networks. In this project, we make a comparative study of training a feedforward neural network using three algorithms: the Backpropagation Algorithm, the Modified Backpropagation Algorithm, and the Optical Backpropagation Algorithm. These algorithms differ only in their error functions. We train the neural networks using these algorithms on 75 instances from the iris dataset (taken from the UCI repository and then normalised), 25 from each class. The total number of epochs required to reach the desired degree of accuracy is referred to as the convergence rate. The basic criteria of the comparison are the convergence rate and the classification accuracy. To check the efficiency of the three training algorithms, graphs are plotted of the number of epochs vs. the mean square error (MSE). The training process continues until the MSE falls to 0.01. The effects of the momentum and learning rate on the performance of the algorithms are also observed. The comparison is then extended to compare the performance of the multilayer feedforward network with a probabilistic network
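The plain backpropagation loop the project starts from can be sketched in a few lines: forward pass through one sigmoid hidden layer, squared-error deltas, and training until the MSE target is reached. A hedged, self-contained sketch using a tiny Boolean problem as a stand-in for the normalised iris data (the network size, learning rate, and data here are assumptions, not the project's actual settings):

```python
# Hedged sketch of standard backpropagation: one hidden layer, sigmoid
# units, squared-error loss, trained until MSE < 0.01 (the project's
# target) or an epoch cap. The AND problem stands in for iris data.
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 0, 0, 1]                        # AND targets (toy data)
H = 4                                   # hidden units (assumed)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
lr = 0.5                                # learning rate (assumed)

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, o

def mse():
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(X, Y)) / len(X)

start = mse()
for epoch in range(5000):               # epoch cap
    for x, y in zip(X, Y):
        h, o = forward(x)
        d_o = (o - y) * o * (1 - o)     # output delta (squared error)
        for j in range(H):
            d_h = d_o * w2[j] * h[j] * (1 - h[j])   # hidden delta
            w2[j] -= lr * d_o * h[j]
            for i in range(2):
                w1[j][i] -= lr * d_h * x[i]
            b1[j] -= lr * d_h
        b2 -= lr * d_o
    if mse() < 0.01:                    # convergence criterion
        break

print(start, mse())                     # error before vs after training
```

The modified and optical variants the project compares change only the error term (d_o here), which is why the comparison isolates the effect of the error function.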

    A COMPARISON OF THE C4.5 AND NAÏVE BAYES ALGORITHMS ON STUDENT GRADUATION DATA

    Musi Charitas Catholic University (UKMC) is a private university in Palembang. Accreditation is one method used to assess the feasibility of a university against predetermined criteria, and one of the accreditation criteria is the evaluation of the length of study of graduating students. It is therefore important for universities to analyze whether students who have completed their studies graduated on time. Data mining is one method that can be used for this classification process; its main methods include estimation, prediction, classification, and association. Classification was carried out using the C4.5 and Naive Bayes algorithms on the attributes gender, age, address, and GPA. The comparison shows that C4.5 performs better than Naive Bayes: C4.5 reaches an accuracy of 92.43% versus 91.12% for Naive Bayes, a difference of 1.31%. Precision and recall were also measured: precision is 98.86% for C4.5 and 95.76% for Naive Bayes, while recall is 93.09% for C4.5 and 94.37% for Naive Bayes. Keywords: data mining; classification; C4.5 algorithm; Naïve Bayes algorithm; performance
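The accuracy, precision, and recall figures reported above all derive from the same confusion-matrix counts. A small sketch of how they are computed (the label names and toy predictions are invented for illustration, not the study's data):

```python
# Hedged sketch: accuracy, precision, and recall from true/false
# positive and negative counts. Toy labels stand in for the study's
# graduation data.

def metrics(y_true, y_pred, positive="on-time"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred)
             if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = ["on-time", "on-time", "late", "late", "on-time"]
y_pred = ["on-time", "late", "late", "on-time", "on-time"]
print(metrics(y_true, y_pred))
```

Note how the metrics can disagree, as in the abstract: C4.5 wins on accuracy and precision while Naive Bayes wins on recall, because precision penalizes false positives and recall penalizes false negatives.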

    Analysis and Prediction of Student Performance by Using A Hybrid Optimized BFO-ALO Based Approach: Student Performance Prediction using Hybrid Approach

    Data mining offers effective solutions for a variety of industries, including education. Research in education is expanding rapidly because of the large quantity of student data that can be used to uncover valuable learning behavior patterns. This research presents a method for forecasting the academic performance of students in Portuguese and mathematics subjects, described by 33 attributes. Forecasting the educational attainment of students is a popular field of study in the modern period. Previous research has employed a variety of classification algorithms to forecast student performance, and educational data mining still needs considerable research to improve the precision of classification techniques and predict how well students will do in school. In this study, we built a student performance prediction method that combines optimization techniques: the popular BFO- and ALO-based optimization techniques were applied to the data set. Python was used to process all the files and to conduct a performance comparison analysis. We compared our model's performance with various existing baseline models and examined the accuracy with which the hybrid algorithm predicted the student data set, verifying the expected classification accuracy by calculation. The experimental findings indicate that the BFO-ALO-based hybrid model, with a 94.5 percent success rate, is the preferred choice among all the methods
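At its core, using a metaheuristic such as BFO or ALO here means scoring candidate solutions (for example, binary masks over the 33 attributes) with a fitness function tied to prediction accuracy and keeping the best one. A heavily hedged sketch with plain random search standing in for the BFO-ALO hybrid, and a made-up fitness function standing in for real model accuracy:

```python
# Hedged sketch of optimizer-driven model tuning. Random search here is
# only a stand-in for the BFO-ALO hybrid; the fitness function is an
# invented surrogate for classification accuracy on the student data.
import random

random.seed(42)

def fitness(candidate):
    """Surrogate score for a binary mask over 33 attributes: this toy
    version simply rewards selecting attributes 0-4 and mildly
    penalizes the rest (a real fitness would train and score a model)."""
    return sum(candidate[:5]) - 0.1 * sum(candidate[5:])

def random_search(n_attrs=33, iters=200):
    best, best_score = None, float("-inf")
    for _ in range(iters):
        cand = [random.randint(0, 1) for _ in range(n_attrs)]
        s = fitness(cand)
        if s > best_score:              # keep the best-scoring candidate
            best, best_score = cand, s
    return best, best_score

mask, score = random_search()
print(len(mask), score)
```

Real BFO and ALO differ from random search in how new candidates are generated (chemotaxis-style moves, ant-lion trapping dynamics), but the evaluate-and-keep-the-best loop is the shared skeleton.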

    Feature Selection and Classification Methods for Decision Making: A Comparative Analysis

    The use of data mining methods in corporate decision making has been increasing over the past decades. Their popularity can be attributed to better data mining algorithms, increased computing performance, and results that can be measured and applied to decision making. The effective use of data mining methods to analyze various types of data has shown great advantages in many application domains. Some data sets need little preparation to be mined, whereas others, in particular high-dimensional data sets, must be preprocessed first because mining high-dimensional data is complex and inefficient. Feature selection (attribute selection) is one of the techniques used for dimensionality reduction, and previous research has shown that data mining results can be improved in accuracy and efficacy by selecting the most significant attributes. This study analyzes vehicle service and sales data from multiple car dealerships. Its purpose is to find a model that better classifies existing customers as new car buyers based on their vehicle service histories. Several feature selection methods (Information Gain, Correlation-Based Feature Selection, Relief-F, Wrapper, and Hybrid methods) were used to reduce the number of attributes in the data sets. The data sets with the selected attributes were then run through three popular classification algorithms, Decision Trees, k-Nearest Neighbor, and Support Vector Machines, and the results were compared and analyzed. The study concludes with a comparative analysis of the feature selection methods and their effects on the different classification algorithms within the domain. As a basis of comparison, the same procedures were run on a standard data set from the financial domain
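Information Gain, the first of the feature selection methods listed, ranks an attribute by how much it reduces label entropy when the data is split on it. A small sketch on an invented "service history" table (the columns and labels are illustrative assumptions, not the study's data):

```python
# Hedged sketch of Information Gain attribute ranking, one of the
# feature selection methods the study compares. Toy data only.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction from splitting on one attribute index."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# columns: frequent_service?, warranty_repair?  label: bought a new car?
rows = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
labels = ["buy", "buy", "skip", "skip"]
gains = {a: info_gain(rows, labels, a) for a in (0, 1)}
print(gains)   # attribute 0 perfectly predicts the label here
```

Selecting the top-ranked attributes by this score is the dimensionality-reduction step that then feeds the Decision Tree, k-NN, and SVM classifiers.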

    OPTIMIZATION OF LUNG CANCER CLASSIFICATION METHOD USING EDA-BASED MACHINE LEARNING

    Lung cancer is one of the three deadliest diseases in the world and is developing rapidly. On this basis, the researchers conducted a study to predict the factors that influence lung cancer, using data mining and classification techniques. Several popular classification algorithms were compared to find the most accurate for lung cancer classification: K-Nearest Neighbor, Random Forest Classifier, Logistic Regression, Linear SVM, Naïve Bayes, Decision Tree, Random Forest, Gradient Boosting, Kernel SVM, and MLPClassifier. These algorithms were chosen because a study the researchers found on the Kaggle platform compared them on a breast cancer dataset; in that previous work, SVM obtained an accuracy of 96.47%, Neural Networks 97.06%, and Naïve Bayes 91.18%. The difference in this study is that the same range of machine learning algorithms is applied to a lung cancer dataset, to see whether the accuracy results differ. The results show that the most accurate algorithms were Random Forest and Gradient Boosting, both with an accuracy value of 100%; this matches the previous studies, except that there Gradient Boosting had a higher accuracy value than Random Forest. Based on the data used in this study, the factors that most influence the prediction of a lung cancer diagnosis are obesity and coughing up blood.
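The study's core procedure is a comparison loop: fit each classifier on the same training split and rank them by test accuracy. A hedged sketch with two toy classifiers and invented data standing in for the ten scikit-learn models and the Kaggle lung-cancer dataset:

```python
# Hedged sketch of a classifier comparison loop. Two toy classifiers
# and an invented binary dataset stand in for the study's ten models
# and its lung-cancer data.

def majority_fit(X, y):
    """Baseline: always predict the most common training label."""
    label = max(set(y), key=y.count)
    return lambda x: label

def nn1_fit(X, y):
    """1-nearest-neighbour by squared Euclidean distance."""
    def predict(x):
        i = min(range(len(X)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x)))
        return y[i]
    return predict

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

# toy features, e.g. (obesity score, coughing-up-blood flag) -- invented
X_train = [(0, 0), (1, 1), (2, 1), (0, 1)]
y_train = [0, 1, 1, 1]
X_test = [(2, 1), (0, 0)]
y_test = [1, 0]

models = {"majority": majority_fit, "1-nn": nn1_fit}
scores = {name: accuracy(fit(X_train, y_train), X_test, y_test)
          for name, fit in models.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

With real models the loop is the same shape; a 100% test accuracy, as reported above, usually warrants a check for leakage or an overly easy split before trusting the ranking.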