6 research outputs found
Integration of Data Mining and Data Warehousing: a practical methodology
The ever-growing repositories of data in all fields pose new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only numeric data, and little research has tackled the problem of clustering high-cardinality nominal variables to gain better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with warehousing to discover knowledge from data. For seamless integration, the mined data has to be modeled in the form of a data warehouse schema. Schema generation is a complex manual task that requires familiarity with both the domain and warehousing, so automated schema-generation techniques are needed to remove these dependencies. To meet growing analytical needs and overcome the existing limitations, we propose in this paper a novel methodology that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. Results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.
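The abstract's starting point is that mixed numeric and nominal variables need a common footing before clustering. One standard way to obtain it (shown here purely as an illustration; the paper does not specify which measure it uses) is a Gower-style dissimilarity that range-scales numeric gaps and counts nominal mismatches:

```python
def mixed_distance(a, b, numeric_idx, ranges):
    """Gower-style dissimilarity between two mixed-type records.

    numeric_idx: positions holding numeric values
    ranges: observed value range per numeric position, used for scaling
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            r = ranges[i]
            total += abs(x - y) / r if r else 0.0  # range-scaled numeric gap
        else:
            total += 0.0 if x == y else 1.0        # simple nominal mismatch
    return total / len(a)

# Hypothetical records of the form (age, colour preference)
d = mixed_distance((25, "red"), (30, "red"), numeric_idx={0}, ranges={0: 10.0})
```

Any distance-based clusterer (e.g. k-medoids) can then operate on the resulting pairwise matrix, which is one common route to clustering mixed numeric and nominal data.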
Pemilihan Fitur Untuk Klasifikasi Loyalitas Pelanggan Terhadap Merek Produk Fast Moving Consumer Goods (Studi Kasus: Mie Instan)
Feature selection is an important and widely used technique in data mining preprocessing: it speeds up data mining algorithms and improves mining performance, such as prediction accuracy and the comprehensibility of results. This study discusses feature subset selection for classifying customer loyalty to fast moving consumer goods brands (taking instant noodles as a case study) and analyzes the features that affect decision tree classification performance.

The data used in this study come from questionnaires distributed to instant noodle customers in Lampung Province. The data obtained have heterogeneous features, so the features were first transformed into homogeneous ones. In this study we combine the UFT (unsupervised feature transformation) and DMI (dynamic mutual information) methods for feature selection. The UFT method transforms non-numerical features into numerical features, so that heterogeneous features become homogeneous; the DMI method is then used for feature selection. The transformed features are classified using the decision tree algorithm, and the classification results are used to compare performance across the dataset before feature selection, after feature selection using DMI, after selection by p-value, and after selection by the researchers' estimate.

Testing the predictive classification models identified the features that affect the performance of the decision tree for customer loyalty. The performance improvement appears when the DMI feature selection method is applied with five features: accuracy, precision, recall and F-measure all increase compared with using all features (before feature selection), the p-value feature selection method, and the researchers' estimate, reaching 76.68%, 74.4%, 76.7% and 73.5%, respectively. The influential features are expenditure, average consumption, customer age, address and the reason for switching brands.
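The DMI method itself is not reproduced here, but its underlying quantity, the mutual information between a feature and the class label, can be sketched in plain Python; ranking features by this score and keeping the top k mirrors the selection step described above (the `select_top_k` helper and the toy data are illustrative, not the study's implementation):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def select_top_k(rows, labels, k):
    """Rank feature columns by MI with the class label; keep the k best indices."""
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        scores.append((mutual_information(col, labels), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy data: feature 0 predicts the label perfectly, feature 1 is constant.
rows = [("a", "c"), ("a", "c"), ("b", "c"), ("b", "c")]
best = select_top_k(rows, [0, 0, 1, 1], 1)
```

DMI additionally removes already-recognized instances between selection rounds, which this static ranking omits; the study's reported gain comes from keeping only the five highest-scoring features.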
Multivariate Correlation Analysis for Supervised Feature Selection in High-Dimensional Data
The main theme of this dissertation is multivariate correlation analysis on different data types, for which we identify and define various research gaps. To address these gaps we develop novel techniques that account for the relevance of features to the target and the redundancy of features among themselves. Our techniques aim at handling homogeneous data, i.e., only continuous or only categorical features; mixed data, i.e., continuous and categorical features; and time series.
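Relevance to the target and redundancy among features, the two criteria the dissertation names, are commonly combined in an mRMR-style greedy search. The sketch below is a generic stand-in for that family of methods, not the dissertation's own techniques: each candidate is scored as its relevance minus its average redundancy with already-selected features, using mutual information over categorical columns:

```python
from collections import Counter
from math import log

def mi(xs, ys):
    """Mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def mrmr(rows, labels, k):
    """Greedy max-relevance, min-redundancy selection over feature columns."""
    cols = list(zip(*rows))
    remaining = list(range(len(cols)))
    chosen = []
    while remaining and len(chosen) < k:
        best, best_score = None, None
        for j in remaining:  # ascending order, so the lowest index wins ties
            rel = mi(cols[j], labels)
            red = (sum(mi(cols[j], cols[s]) for s in chosen) / len(chosen)
                   if chosen else 0.0)
            score = rel - red
            if best is None or score > best_score:
                best, best_score = j, score
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With a feature that merely duplicates an already-chosen one, the redundancy term cancels its relevance, so the search moves on to genuinely new information, which is the behaviour the relevance/redundancy framing is after.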
Variable selection for classification in complex ophthalmic data: a multivariate statistical framework
Variable selection is an essential part of model-building for classification or prediction. Among its challenges are heterogeneous variance-covariance matrices, differing scales of variables, non-normally distributed data and missing data. Statistical methods exist for variable selection; however, these are often univariate, make restrictive assumptions about the distribution of the data, or are expensive in terms of the computational power required. In this thesis I focus on filter methods of variable selection that are computationally fast, and I propose a metric of discrimination. The main objectives of this thesis are (1) to propose a novel Signal-to-Noise Ratio (SNR) discrimination metric accommodating heterogeneous variance-covariance matrices, (2) to develop a multiple forward selection (MFS) algorithm employing the novel SNR metric, (3) to assess the performance of the MFS-SNR algorithm compared to alternative methods of variable selection, (4) to investigate the ability of the MFS-SNR algorithm to carry out variable selection when data are not normally distributed and (5) to apply the MFS-SNR algorithm to the task of variable selection from real datasets. The MFS-SNR algorithm was implemented in the R programming environment. It calculates the SNR for subsets of variables, identifying the optimal variable during each round of selection as whichever causes the largest increase in SNR. A dataset was simulated comprising 10 variables: 2 discriminating variables, 7 non-discriminating variables and one non-discriminating variable which enhanced the discriminatory performance of other variables. In simulations, the frequency of each variable's selection was recorded, and the probability of correct classification (PCC) and area under the curve (AUC) were calculated for sets of selected variables. I assessed the ability of the MFS-SNR algorithm to select variables when data are not normally distributed using simulated data.
I compared the MFS-SNR algorithm to filter methods utilising information gain, chi-square statistics and the Relief-F algorithm, as well as to support vector machines and an embedded method using random forests. A version of the MFS algorithm utilising Hotelling's T2 statistic (MFS-T2) was included in this comparison. The MFS-SNR algorithm selected all 3 variables relevant to discrimination with higher or equivalent frequencies to competing methods in all scenarios. Following non-normal variable transformation, the MFS-SNR algorithm still selected the variables known to be relevant to discrimination in the simulated scenarios. Finally, I studied both the MFS-SNR and MFS-T2 algorithms' ability to carry out variable selection for disease classification using several clinical datasets from ophthalmology. These datasets represented a spectrum of quality issues such as missingness, imbalanced group sizes, heterogeneous variance-covariance matrices and differing variable scales. In 3 out of 4 datasets the MFS-SNR algorithm outperformed the MFS-T2 algorithm; in the fourth, both produced the same variable selection results. In conclusion, I have demonstrated that the novel SNR is an extension of Hotelling's T2 statistic accommodating heterogeneity of variance-covariance matrices. The MFS-SNR algorithm is capable of selecting the relevant variables whether or not data are normally distributed. In the simulated scenarios the MFS-SNR algorithm performs at least as well as competing methods, and it outperforms the MFS-T2 algorithm when selecting variables from real clinical datasets.
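The thesis's SNR metric handles full heterogeneous variance-covariance matrices; as a simplified illustration only, the forward-selection loop can be sketched with a diagonal (per-feature) two-class signal-to-noise criterion, where class variances enter separately so unequal spreads are tolerated and each round adds whichever variable most increases the score. None of the names or formulas below are taken from the thesis:

```python
def snr(rows, labels, feats):
    """Diagonal two-class signal-to-noise score for a subset of feature indices."""
    c0, c1 = sorted(set(labels))  # exactly two classes assumed
    g0 = [r for r, y in zip(rows, labels) if y == c0]
    g1 = [r for r, y in zip(rows, labels) if y == c1]
    total = 0.0
    for j in feats:
        stats = []
        for grp in (g0, g1):
            col = [r[j] for r in grp]
            m = sum(col) / len(col)
            v = sum((x - m) ** 2 for x in col) / len(col)
            stats.append((m, v))
        (m0, v0), (m1, v1) = stats
        # squared mean separation over the summed class variances;
        # keeping the variances separate is what permits heterogeneity
        total += (m0 - m1) ** 2 / (v0 + v1 + 1e-12)
    return total

def forward_select(rows, labels, k):
    """Each round keeps whichever remaining variable most increases the SNR."""
    remaining = list(range(len(rows[0])))
    chosen = []
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda j: snr(rows, labels, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

The thesis's multivariate SNR additionally captures interactions between variables (how one variable can enhance another's discrimination), which this diagonal stand-in cannot express.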