171,031 research outputs found

    Statistical Comparisons of the Top 10 Algorithms in Data Mining for Classification Task

    This work builds on the study of the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) community in December 2006. We revisit the same study, but apply statistical tests to establish a more appropriate and better-justified ranking of classifiers for classification tasks. Current studies and practices on the theoretical and empirical comparison of several methods and approaches have advocated tests that are more appropriate. In particular, recent studies recommend a set of simple and robust non-parametric tests for statistical comparisons of classifiers. In this paper, we propose to perform non-parametric statistical testing using the Friedman test, with post-hoc tests suited to the comparison of several classifiers on multiple data sets. The tests provide a better basis for judging the relevance of these algorithms.
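
As a rough illustration of the Friedman test mentioned in this abstract, the sketch below ranks classifiers within each data set and computes the Friedman chi-square statistic from the average ranks. The function names and the accuracy values are hypothetical and are not taken from the paper; the post-hoc step is omitted because the abstract does not say which procedure is used.

```python
def average_ranks(scores):
    """Average rank of each classifier over all data sets.

    scores[i][j] is the score of classifier j on data set i (higher is better).
    Rank 1 is the best classifier on a data set; ties share the mean rank.
    """
    n_datasets = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend the group while the next classifier has the same score.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
            for idx in order[i:j + 1]:
                rank_sums[idx] += mean_rank
            i = j + 1
    return [s / n_datasets for s in rank_sums]


def friedman_statistic(scores):
    """Return (chi-square statistic, average ranks) for the Friedman test."""
    n = len(scores)
    k = len(scores[0])
    ranks = average_ranks(scores)
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, ranks


# Hypothetical accuracies: 4 data sets (rows) x 3 classifiers (columns).
example_scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.95, 0.90, 0.50],
    [0.70, 0.75, 0.60],
]
statistic, ranks = friedman_statistic(example_scores)
print(statistic, ranks)
```

Under the null hypothesis that all classifiers perform equally, the statistic is compared against a chi-square distribution with k - 1 degrees of freedom, an approximation that is only reasonable when the numbers of data sets and classifiers are not too small.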

    An automated machine learning approach for predicting chemical laboratory material consumption

    This paper addresses a relevant business analytics need of a chemical company that is undergoing an Industry 4.0 transformation. In this company, quality tests are executed at the Analytical Laboratories (AL), which receive production samples and execute several instrumental analyses. To improve AL stock warehouse management, a Machine Learning (ML) project was developed, aiming to estimate AL materials consumption based on weekly plans of sample analyses. Following the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology, several iterations were executed, in which three input variable selection strategies and two sets of AL materials (top 10 and all consumed materials) were tested. To reduce the modeling effort, an Automated Machine Learning (AutoML) procedure was adopted, which automatically selects the best ML model among six distinct regression algorithms. Using real data from the chemical company and a realistic rolling window evaluation, several ML train and test iterations were executed. The AutoML results were compared with two time series forecasting methods, the ARIMA methodology and a deep learning Long Short-Term Memory (LSTM) model. Overall, competitive results were achieved by the best AutoML models, particularly for the top 10 set of materials. FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/202
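
The rolling window evaluation mentioned in this abstract can be sketched as a generator of train and test index ranges over a time-ordered series. The function name, the fixed-size (sliding) window, and the sizes used in the example are assumptions made for illustration; the abstract does not state the window configuration used in the paper.

```python
def rolling_window_splits(n_samples, train_size, test_size, step):
    """Yield (train_indices, test_indices) for a fixed-size rolling window.

    Each test block immediately follows its training block in time, and the
    whole window advances by `step` observations between iterations.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield (
            list(range(start, train_end)),
            list(range(train_end, train_end + test_size)),
        )
        start += step


# Hypothetical setup: 10 weekly observations, train on 4, test on the next 2.
for train_idx, test_idx in rolling_window_splits(10, 4, 2, 2):
    print(train_idx, test_idx)
```

Because every test block lies strictly after its training block, a model is never evaluated on observations that precede its training data.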

    Prediksi Churn dengan Algoritma Self Organizing Map's Kohonen (SOM) dan Backpropagation (BP)

    ABSTRACT: Churn prediction is a data mining task, namely a classification task that aims to predict which customers are likely to churn. This thesis uses two artificial neural network methods to predict customer churn. The first is the Backpropagation algorithm, which has high prediction accuracy. The second is the Kohonen Self Organizing Maps algorithm, which is well suited to clustering and can be used to group data based on the patterns learned from it. Based on the function of each algorithm, this thesis uses the SOM-BP algorithm, a combination of the two algorithms above. The data used in this thesis is the Tournament sample data. Accuracy is measured with three parameters: lift curve, top decile, and f-measure. For undersampled data, the best lift curve result is obtained by SOM, which can guess 59% at 10% of customers; the top decile of SOM at 10% of customers is 1.3; and the best f-measure is obtained by SOM-BP, with a value of 0.3991. Keywords: Artificial Neural Network, Backpropagation Network Algorithm, Kohonen Self Organizing Maps Algorithm, Churn Prediction, Lift curve, Top Decile, f-measure
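
Two of the evaluation measures named in this abstract, top decile and f-measure, can be sketched as follows. The definitions here are the common ones (churn rate among the 10% highest-scored customers divided by the overall churn rate, and the harmonic mean of precision and recall); whether the thesis uses exactly these definitions is an assumption, and the function names and data are hypothetical.

```python
def top_decile_lift(y_true, scores):
    """Churn rate in the top 10% of customers by score, over the base rate."""
    n = len(y_true)
    base_churners = sum(y_true)
    if base_churners == 0:
        raise ValueError("at least one churner is required")
    top_n = max(1, n // 10)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(y_true[i] for i in ranked[:top_n]) / top_n
    return top_rate / (base_churners / n)


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the churn class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 20 customers, 4 churners, scores decreasing with index.
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1] + [0] * 10
model_scores = [1.0 - 0.05 * i for i in range(20)]
print(top_decile_lift(labels, model_scores))
```

A top decile lift above 1 means the model concentrates churners among its highest-scored customers more than random selection would.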

    Expert cancer model using supervised algorithms with a LASSO selection approach

    Breast cancer is currently one of the most critical contributors to mortality in the medical field. Each year, a large number of men and women face cancer-related death owing to the lack of early diagnosis systems and proper treatment. To tackle the issue, various data mining approaches have been analyzed to build an effective model that helps to identify the different stages of deadly cancers. The study proposes an early cancer disease model based on five supervised algorithms: logistic regression (henceforth LR), decision tree (henceforth DT), random forest (henceforth RF), support vector machine (henceforth SVM), and K-nearest neighbor (henceforth KNN). After appropriate preprocessing of the dataset, the least absolute shrinkage and selection operator (LASSO) was used for feature selection (FS) with a 10-fold cross-validation (CV) approach. Employing LASSO with 10-fold cross-validation is a novel step introduced in this research. Afterwards, different performance evaluation metrics were measured to assess the predictions of the proposed algorithms. The results indicate that the RF classifier achieved the highest accuracy, approximately 99.41%, with the integration of LASSO. Finally, a comprehensive comparison was carried out on the Wisconsin breast cancer (diagnostic) dataset (WBCD) against some current works that use all features.
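
To illustrate how LASSO performs feature selection, the sketch below fits an L1-penalised least-squares model by coordinate descent and shows a weak feature being driven to exactly zero. It is a minimal illustration under stated assumptions, not the paper's pipeline: the function names and toy data are hypothetical, features are assumed to be centred (no intercept), and the penalty `lam` is fixed by hand rather than chosen by the 10-fold cross-validation described in the abstract.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                # Prediction for sample i using every feature except j.
                partial = sum(X[i][m] * w[m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical toy data: feature 0 is strongly related to y, feature 1 weakly.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
weights = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print(weights, selected)
```

Features whose coefficients remain non-zero are the ones kept; a larger `lam` discards more features.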

    Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining

    The notion of meta-mining has appeared recently and extends traditional meta-learning in two ways. First, it does not learn meta-models that provide support only for the learning algorithm selection task, but ones that support the whole data-mining process. In addition, it abandons the so-called black-box approach to algorithm description followed in meta-learning. Now, in addition to the datasets, algorithms and workflows also have descriptors. For the latter two, these descriptions are semantic, describing properties of the algorithms. With the availability of descriptors for both datasets and data mining workflows, the traditional modelling techniques followed in meta-learning, typically based on classification and regression algorithms, are no longer appropriate. Instead, we are faced with a problem whose nature is much more similar to the problems that appear in recommendation systems. The most important meta-mining requirements are that suggestions should use only dataset and workflow descriptors, and that the cold-start problem be handled, e.g. providing workflow suggestions for new datasets. In this paper we take a different view of the meta-mining modelling problem and treat it as a recommender problem. To account for the meta-mining specificities, we derive a novel metric-based-learning recommender approach. Our method learns two homogeneous metrics, one in the dataset space and one in the workflow space, and a heterogeneous one in the dataset-workflow space. All learned metrics reflect similarities established from the dataset-workflow preference matrix. We demonstrate our method on meta-mining over biological (microarray datasets) problems. The application of our method is not limited to the meta-mining problem; its formulation is general enough that it can be applied to problems with similar requirements.