171,031 research outputs found
Statistical Comparisons of the Top 10 Algorithms in Data Mining for Classification Task
This work builds on the study of the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) community in December 2006. We address the same study, but apply statistical tests to establish a more appropriate and better justified ranking of classifiers for classification tasks. Current studies and practices in the theoretical and empirical comparison of methods advocate tests that are more appropriate. Accordingly, recent studies recommend a set of simple and robust non-parametric tests for the statistical comparison of classifiers. In this paper, we propose to perform non-parametric statistical tests, namely the Friedman test with corresponding post-hoc tests, for the comparison of several classifiers on multiple data sets. The tests provide a better basis for judging the relevance of these algorithms.
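As an illustration of the kind of procedure the abstract describes, the following is a minimal sketch of the Friedman statistic over a table of classifier accuracies, followed by a Nemenyi-style critical difference. The abstract does not say which post-hoc test the authors used, so the Nemenyi choice, the function names, the accuracy values, and the critical value `q_alpha = 2.343` (the value commonly tabulated for three classifiers at a 0.05 level) are all assumptions made for the example.

```python
import math

def rank_row(scores):
    """Rank one data set's scores: rank 1 = highest score, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman(results):
    """results[d][c] = score of classifier c on data set d.
    Returns the average rank per classifier and the Friedman chi-square statistic."""
    n, k = len(results), len(results[0])
    rows = [rank_row(r) for r in results]
    avg_ranks = [sum(row[j] for row in rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    return avg_ranks, chi2

def nemenyi_cd(k, n, q_alpha):
    """Critical difference in average rank (assumed Nemenyi post-hoc test)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Made-up accuracies: 4 data sets (rows) x 3 classifiers (columns).
accuracies = [
    [0.90, 0.85, 0.80],
    [0.88, 0.84, 0.79],
    [0.93, 0.90, 0.86],
    [0.81, 0.80, 0.75],
]
avg_ranks, chi2 = friedman(accuracies)
cd = nemenyi_cd(k=3, n=4, q_alpha=2.343)
print(avg_ranks, chi2, cd)
```

Two classifiers would be declared significantly different under this sketch when their average ranks differ by more than `cd`. The Friedman statistic itself is compared against a chi-square distribution with k - 1 degrees of freedom, which is an approximation that is usually considered rough for so few data sets.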
An automated machine learning approach for predicting chemical laboratory material consumption
This paper addresses a relevant business analytics need of a chemical company that is undergoing an Industry 4.0 transformation. In this company, quality tests are executed at the Analytical Laboratories (AL), which receive production samples and execute several instrumental analyses. To improve the AL stock warehouse management, a Machine Learning (ML) project was developed, aiming to estimate the AL materials consumption based on weekly plans of sample analyses. Following the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology, several iterations were executed, in which three input variable selection strategies and two sets of AL materials (top 10 and all consumed materials) were tested. To reduce the modeling effort, an Automated Machine Learning (AutoML) approach was adopted, which automatically selects the best ML model among six distinct regression algorithms. Using real data from the chemical company and a realistic rolling window evaluation, several ML train and test iterations were executed. The AutoML results were compared with two time series forecasting methods, the ARIMA methodology and a deep learning Long Short-Term Memory (LSTM) model. Overall, competitive results were achieved by the best AutoML models, particularly for the top 10 set of materials. FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/202
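The rolling window evaluation mentioned in the abstract can be sketched roughly as follows. The window sizes, the step, the weekly consumption values, and the naive mean forecast are all made up for illustration; the abstract does not give the authors' actual settings, and their models were AutoML regressors, ARIMA, and LSTM rather than a mean forecast.

```python
def rolling_window_splits(n_obs, train_size, test_size, step):
    """Return (train_indices, test_indices) pairs for a fixed-size window
    that slides forward through a time-ordered series."""
    splits = []
    start = 0
    while start + train_size + test_size <= n_obs:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        splits.append((train, test))
        start += step
    return splits

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical weekly consumption of one laboratory material.
weekly_consumption = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

errors = []
for train_idx, test_idx in rolling_window_splits(
    len(weekly_consumption), train_size=5, test_size=2, step=2
):
    train = [weekly_consumption[i] for i in train_idx]
    actual = [weekly_consumption[i] for i in test_idx]
    # Placeholder model: forecast every test week with the training mean.
    forecast = sum(train) / len(train)
    errors.append(mean_absolute_error(actual, [forecast] * len(actual)))

print(errors)
```

Each iteration trains only on data that precedes its test weeks, which is what makes this scheme more realistic for forecasting than a random train/test split.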
Churn Prediction with the Kohonen Self-Organizing Map (SOM) and Backpropagation (BP) Algorithms
ABSTRACT: Churn prediction is a data mining task, specifically a classification task that aims to identify customers who are likely to churn. This thesis uses two artificial neural network methods to predict customer churn. The first is the Backpropagation algorithm, which offers high prediction accuracy. The second is the Kohonen Self-Organizing Map, an algorithm well suited to clustering, which can be used to group data according to the patterns learned from them. Based on the function of each algorithm, this thesis uses a SOM-BP algorithm that combines the two. The data used are the Tournament sample data. Accuracy is measured with three parameters: the lift curve, the top decile, and the f-measure. For undersampled data, the best lift curve result is obtained by SOM, which predicts 59% at 10% of customers; the top decile measure for SOM at 10% of customers is 1.3; and the best f-measure is 0.3991, obtained by SOM-BP. Keywords: Artificial Neural Network, Backpropagation Network Algorithm, Kohonen Self-Organizing Map Algorithm, Churn Prediction, Lift curve, Top Decile, f-measure
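Two of the evaluation measures named in the abstract can be sketched as below. The definitions are the commonly used ones (top-decile lift as the churn rate among the highest-scored 10% of customers divided by the overall churn rate, and the f-measure as the harmonic mean of precision and recall). The abstract does not spell out its exact definitions, so these, along with the scores, labels, and threshold, are assumptions.

```python
def top_decile_lift(scores, labels, fraction=0.10):
    """Churn rate in the top `fraction` of customers by score, divided by the
    overall churn rate. Assumes at least one churner (label 1) is present."""
    n = len(scores)
    n_top = max(1, round(n * fraction))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(labels[i] for i in ranked[:n_top]) / n_top
    base_rate = sum(labels) / n
    return top_rate / base_rate

def f_measure(predicted, labels):
    """Harmonic mean of precision and recall for the positive (churn) class."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 20 customers with integer churn scores 20 (highest) to 1.
scores = list(range(20, 0, -1))
labels = [0] * 20
for churner in (0, 1, 5, 12):  # hypothetical churners
    labels[churner] = 1
predicted = [1 if s >= 18 else 0 for s in scores]  # hypothetical threshold

print(top_decile_lift(scores, labels), f_measure(predicted, labels))
```

A lift of 1.0 means the model does no better than selecting customers at random; values above 1.0 mean the highest-scored customers contain proportionally more churners than the population as a whole.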
Expert cancer model using supervised algorithms with a LASSO selection approach
Breast cancer is currently one of the most critical contributors to mortality in the medical field. Each year, a large number of men and women die of cancer owing to the lack of early diagnosis systems and proper treatment. To tackle the issue, various data mining approaches have been analyzed to build an effective model that helps to identify the different stages of deadly cancers. The study proposes an early cancer disease model based on five supervised algorithms: logistic regression (henceforth LR), decision tree (henceforth DT), random forest (henceforth RF), support vector machine (henceforth SVM), and K-nearest neighbor (henceforth KNN). After appropriate preprocessing of the dataset, the least absolute shrinkage and selection operator (LASSO) was used for feature selection (FS) with a 10-fold cross-validation (CV) approach. Employing LASSO with 10-fold cross-validation is a novel step introduced in this research. Afterwards, several performance evaluation metrics were measured to assess the predictions of the proposed algorithms. The results indicate that the RF classifier achieved the highest accuracy, approximately 99.41%, when integrated with LASSO. Finally, a comprehensive comparison was carried out on the Wisconsin breast cancer (diagnostic) dataset (WBCD) against some current works that use all features.
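To make the feature selection step more concrete, here is a minimal pure-Python sketch of LASSO fitted by coordinate descent, with the penalty strength chosen by k-fold cross-validation and features kept when their coefficient is non-zero. It is not the authors' pipeline: the squared-error objective on centered data without an intercept, the fold assignment, the penalty grid, and the toy data are all assumptions, and a real study would more likely rely on an established library.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent.
    Assumes centered data, so no intercept is fitted."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w

def cv_mse(X, y, lam, k=10):
    """Mean squared prediction error over k folds (fold = row index modulo k)."""
    n, total = len(X), 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        w = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        total += sum(
            (y[i] - sum(wj * xj for wj, xj in zip(w, X[i]))) ** 2 for i in test
        )
    return total / n

def select_features(X, y, lams, k=10):
    """Pick the penalty with the lowest CV error, refit, and keep non-zero features."""
    best = min(lams, key=lambda lam: cv_mse(X, y, lam, k))
    w = lasso_fit(X, y, best)
    return best, [j for j, wj in enumerate(w) if abs(wj) > 1e-8]

# Toy centered data: the target depends strongly on feature 0, weakly on feature 1.
X_toy = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y_toy = [2.1, 1.9, -1.9, -2.1]
best_lam, selected = select_features(X_toy, y_toy, lams=[0.05, 0.5], k=2)
print(best_lam, selected)
```

The selected feature indices would then be used to restrict the inputs of the downstream classifiers (LR, DT, RF, SVM, KNN in the abstract).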
Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining
The notion of meta-mining has appeared recently and extends the traditional
meta-learning in two ways. First it does not learn meta-models that provide
support only for the learning algorithm selection task but ones that support
the whole data-mining process. Second, it abandons the so-called black-box
approach to algorithm description followed in meta-learning: in addition to
the datasets, algorithms and workflows now also have descriptors. For the
latter two these descriptions are semantic, describing properties of the
algorithms. With the availability of descriptors both for datasets and data
mining workflows the traditional modelling techniques followed in
meta-learning, typically based on classification and regression algorithms, are
no longer appropriate. Instead we are faced with a problem the nature of which
is much more similar to the problems that appear in recommendation systems. The
most important meta-mining requirements are that suggestions should use only
dataset and workflow descriptors and that the cold-start problem, e.g. providing
workflow suggestions for new datasets, must be handled.
In this paper we take a different view on the meta-mining modelling problem
and treat it as a recommender problem. In order to account for the meta-mining
specificities we derive a novel metric-based-learning recommender approach. Our
method learns two homogeneous metrics, one in the dataset and one in the
workflow space, and a heterogeneous one in the dataset-workflow space. All
learned metrics reflect similarities established from the dataset-workflow
preference matrix. We demonstrate our method on meta-mining over biological
(microarray datasets) problems. The application of our method is not limited to
the meta-mining problem; its formulation is general enough that it can be
applied to problems with similar requirements.
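One way to picture a heterogeneous dataset-workflow similarity that works from descriptors alone, and therefore also for a new dataset with no preference history, is a bilinear score between projected descriptor vectors. The sketch below is only an illustration: the abstract does not give the parametrization, so the projection matrices `U` and `V`, the scoring form, and all descriptor values are hypothetical, and the matrices are set by hand here rather than learned from a preference matrix.

```python
def project(matrix, vector):
    """Return matrix^T @ vector, mapping a descriptor vector into the shared space."""
    n_cols = len(matrix[0])
    return [
        sum(matrix[i][c] * vector[i] for i in range(len(vector)))
        for c in range(n_cols)
    ]

def heterogeneous_score(x, w, U, V):
    """Similarity between dataset descriptors x and workflow descriptors w,
    computed as the dot product of their projections (assumed form)."""
    px, pw = project(U, x), project(V, w)
    return sum(a * b for a, b in zip(px, pw))

def recommend(x, workflows, U, V):
    """Rank workflow names for a dataset using descriptors only."""
    scores = {name: heterogeneous_score(x, w, U, V) for name, w in workflows.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical projections: 2 dataset descriptors and 3 workflow descriptors
# are both mapped into a 2-dimensional shared space.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

new_dataset = [1.0, 0.0]  # descriptors of a dataset never seen before
workflows = {
    "wf_a": [1.0, 0.0, 0.0],
    "wf_b": [0.0, 1.0, 0.0],
    "wf_c": [0.5, 0.5, 1.0],
}
print(recommend(new_dataset, workflows, U, V))
```

Because the ranking depends only on descriptor vectors, it can be produced for a dataset that has no entries in the preference matrix, which is the cold-start setting the abstract refers to.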