
    Integration of Data Mining and Data Warehousing: a practical methodology

    The ever-growing repositories of data in all fields pose new challenges to modern analytical systems. Real-world datasets, with mixed numeric and nominal variables, are difficult to analyze and require effective visual exploration that conveys the semantic relationships in the data. Traditional data mining techniques such as clustering handle only numeric data, and little research has tackled the problem of clustering high-cardinality nominal variables to gain better insight into the underlying dataset. Several works in the literature have demonstrated the feasibility of integrating data mining with warehousing to discover knowledge from data. For seamless integration, the mined data must be modeled as a data warehouse schema. Schema generation is a complex manual task requiring domain and warehousing expertise, so automated techniques are needed to generate warehouse schemas and remove these dependencies. To meet growing analytical needs and overcome these limitations, we propose a novel methodology that permits efficient analysis of mixed numeric and nominal data, effective visual data exploration, automatic warehouse schema generation, and integration of data mining and warehousing. The proposed methodology is evaluated through a case study on a real-world dataset. Results show that multidimensional analysis can be performed in an easier and more flexible way to discover meaningful knowledge from large datasets.
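The abstract does not detail how mixed numeric and nominal data are clustered. A common approach to this problem is a k-prototypes-style dissimilarity (Huang, 1998): squared Euclidean distance on the numeric attributes plus a weighted mismatch count on the nominal ones. The sketch below illustrates that idea only; the weighting factor `gamma`, the record layout, and the example data are assumptions, not the paper's method.

```python
# Mixed-type dissimilarity in the spirit of k-prototypes: numeric part is
# squared Euclidean distance, nominal part is a gamma-weighted mismatch count.
def mixed_distance(a, b, numeric_idx, nominal_idx, gamma=1.0):
    """Dissimilarity between two mixed-type records a and b."""
    num = sum((a[i] - b[i]) ** 2 for i in numeric_idx)   # numeric attributes
    cat = sum(1 for i in nominal_idx if a[i] != b[i])    # nominal attributes
    return num + gamma * cat

def assign_clusters(records, prototypes, numeric_idx, nominal_idx, gamma=1.0):
    """One assignment step: map each record to its nearest prototype."""
    return [
        min(range(len(prototypes)),
            key=lambda k: mixed_distance(r, prototypes[k],
                                         numeric_idx, nominal_idx, gamma))
        for r in records
    ]

# Hypothetical records of the form (age, income, city, segment).
records = [(25, 30.0, "NYC", "A"), (27, 32.0, "NYC", "A"),
           (60, 80.0, "LA", "B")]
prototypes = [(26, 31.0, "NYC", "A"), (60, 80.0, "LA", "B")]
labels = assign_clusters(records, prototypes, [0, 1], [2, 3])
print(labels)  # → [0, 0, 1]
```

A full clustering loop would alternate this assignment step with recomputing prototypes (means for numeric attributes, modes for nominal ones).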

    Pemilihan Fitur Untuk Klasifikasi Loyalitas Pelanggan Terhadap Merek Produk Fast Moving Consumer Goods (Studi Kasus: Mie Instan)

    Feature selection is an important and frequently used technique in data mining preprocessing: it directly accelerates data mining algorithms and improves mining performance, such as prediction accuracy and comprehensibility of results. This study addresses feature-subset selection for classifying customer loyalty to brands of fast moving consumer goods (taking instant noodles as a case study) and analyzes the features that affect decision tree classification performance. The data were collected through questionnaires distributed to instant noodle customers in Lampung Province. Because the collected data contained heterogeneous features, the features were transformed into homogeneous ones. This study combines the UFT (unsupervised feature transformation) and DMI (dynamic mutual information) methods: UFT transforms non-numeric features into numeric ones, so that the heterogeneous features become homogeneous, and DMI performs the feature selection. The transformed features are classified with a decision tree algorithm, and the classification results are used to compare performance across the dataset before feature selection, after DMI feature selection, after p-value feature selection, and after the researchers' own estimate. The tests of the predictive classification models identified the features that affect decision tree performance for customer loyalty. The performance improvement is clearest when the DMI feature selection method is used with five features: accuracy, precision, recall, and F-measure all increased compared with using all features (before selection), the p-value method, and the researchers' estimate, reaching 76.68%, 74.4%, 76.7%, and 73.5%, respectively. The influential features are expenditure, average consumption, age, address, and the reason for switching brands.
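The abstract's DMI method builds on mutual information between a feature and the class label, but its dynamic variant is not specified here. As a hedged illustration of the underlying quantity, the sketch below ranks discrete features by plain mutual information with the label; the feature names and toy data are invented for the example.

```python
# Rank discrete features by mutual information with the class label.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) in bits for two discrete sequences of equal length."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_features(columns, labels):
    """Feature names sorted by decreasing MI with the labels."""
    return sorted(columns, key=lambda name: -mutual_information(columns[name], labels))

# Toy data: "expenditure" predicts loyalty, "age_band" is independent of it.
columns = {
    "expenditure": [0, 0, 1, 1, 2, 2],
    "age_band":    [0, 1, 0, 1, 0, 1],
}
labels = ["loyal", "loyal", "loyal", "switch", "switch", "switch"]
print(rank_features(columns, labels))  # → ['expenditure', 'age_band']
```

Keeping only the top-k features from such a ranking (k = 5 in the study) is what the reported accuracy/precision/recall comparison then evaluates with a decision tree.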

    Multivariate Correlation Analysis for Supervised Feature Selection in High-Dimensional Data

    The main theme of this dissertation is multivariate correlation analysis on different data types, for which we identify and define several research gaps. To address these gaps, we develop novel techniques that capture both the relevance of features to the target and the redundancy of features amongst themselves. Our techniques handle homogeneous data, i.e., only continuous or only categorical features; mixed data, i.e., continuous and categorical features; and time series.
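The dissertation's concrete techniques are not given in this abstract. As a hedged illustration of the relevance-versus-redundancy trade-off it names, the sketch below scores a candidate feature by its absolute Pearson correlation with the target minus its mean absolute correlation with already-selected features, an mRMR-style criterion; the scoring rule and example data are assumptions.

```python
# mRMR-style score: relevance to the target minus redundancy with the
# already-selected features, both measured by absolute Pearson correlation.
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mrmr_score(feature, target, selected):
    """Relevance minus mean redundancy with the selected features."""
    relevance = abs(pearson(feature, target))
    if not selected:
        return relevance
    redundancy = sum(abs(pearson(feature, s)) for s in selected) / len(selected)
    return relevance - redundancy

target = [1.0, 2.0, 3.0, 4.0]
f1 = [1.1, 2.0, 2.9, 4.2]   # relevant to the target
f2 = [1.0, 2.1, 3.1, 3.9]   # also relevant, but redundant given f1
print(mrmr_score(f2, target, selected=[f1]) < mrmr_score(f1, target, selected=[]))
# → True: f2's redundancy with f1 cancels most of its relevance
```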

    Variable selection for classification in complex ophthalmic data: a multivariate statistical framework

    Variable selection is an essential part of model-building for classification or prediction. Among its challenges are heterogeneous variance-covariance matrices, differing scales of variables, non-normally distributed data and missing data. Statistical methods exist for variable selection; however, these are often univariate, make restrictive assumptions about the distribution of data, or are expensive in terms of the computational power required. In this thesis I focus on filter methods of variable selection that are computationally fast, and I propose a metric of discrimination. The main objectives of this thesis are (1) to propose a novel Signal-to-Noise Ratio (SNR) discrimination metric accommodating heterogeneous variance-covariance matrices, (2) to develop a multiple forward selection (MFS) algorithm employing the novel SNR metric, (3) to assess the performance of the MFS-SNR algorithm compared to alternative methods of variable selection, (4) to investigate the ability of the MFS-SNR algorithm to carry out variable selection when data are not normally distributed, and (5) to apply the MFS-SNR algorithm to variable selection from real datasets. The MFS-SNR algorithm was implemented in the R programming environment. It calculates the SNR for subsets of variables, identifying the optimal variable during each round of selection as whichever causes the largest increase in SNR. A dataset was simulated comprising 10 variables: 2 discriminating variables, 7 non-discriminating variables and one non-discriminating variable which enhanced the discriminatory performance of other variables. In simulations the frequency of each variable's selection was recorded, and the probability of correct classification (PCC) and area under the curve (AUC) were calculated for sets of selected variables. Using simulated data, I also assessed the ability of the MFS-SNR algorithm to select variables when data are not normally distributed.
    I compared the MFS-SNR algorithm to filter methods utilising information gain, chi-square statistics and the Relief-F algorithm, as well as to support vector machines and an embedded method using random forests. A version of the MFS algorithm utilising Hotelling's T2 statistic (MFS-T2) was included in this comparison. The MFS-SNR algorithm selected all 3 variables relevant to discrimination with higher or equivalent frequencies to competing methods in all scenarios. Following non-normal variable transformation, the MFS-SNR algorithm still selected the variables known to be relevant to discrimination in the simulated scenarios. Finally, I studied the ability of both the MFS-SNR and MFS-T2 algorithms to carry out variable selection for disease classification using several clinical datasets from ophthalmology. These datasets represented a spectrum of quality issues such as missingness, imbalanced group sizes, heterogeneous variance-covariance matrices and differing variable scales. In 3 out of 4 datasets the MFS-SNR algorithm outperformed the MFS-T2 algorithm; in the fourth study both produced the same variable selection results. In conclusion, I have demonstrated that the novel SNR is an extension of Hotelling's T2 statistic accommodating heterogeneity of variance-covariance matrices. The MFS-SNR algorithm is capable of selecting the relevant variables whether or not data are normally distributed. In the simulated scenarios the MFS-SNR algorithm performs at least as well as competing methods, and it outperforms the MFS-T2 algorithm when selecting variables from real clinical datasets.
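The MFS loop described above greedily adds, at each round, whichever variable most increases a discrimination score. The thesis' multivariate SNR (an extension of Hotelling's T2 for unequal covariance matrices) is not reproduced here; the sketch below uses a simple sum of per-variable standardized mean differences as a stand-in score, so only the forward-selection loop mirrors the description, and the data are invented.

```python
# Greedy multiple-forward-selection loop with a pluggable discrimination score.
from statistics import mean, pstdev

def snr_score(group_a, group_b, subset):
    """Stand-in score: sum over chosen variables of |mean diff| / pooled std."""
    total = 0.0
    for j in subset:
        xa = [row[j] for row in group_a]
        xb = [row[j] for row in group_b]
        pooled = (pstdev(xa) + pstdev(xb)) / 2 or 1e-9  # avoid divide-by-zero
        total += abs(mean(xa) - mean(xb)) / pooled
    return total

def forward_select(group_a, group_b, n_vars, k):
    """Each round, add the variable giving the largest score increase."""
    chosen = []
    while len(chosen) < k:
        best = max((j for j in range(n_vars) if j not in chosen),
                   key=lambda j: snr_score(group_a, group_b, chosen + [j]))
        chosen.append(best)
    return chosen

# Variable 1 separates the groups strongly, variable 0 weakly, variable 2 not.
a = [(0.1, 5.0, 1.0), (0.2, 5.2, 0.9), (0.0, 4.8, 1.1)]
b = [(0.3, 9.0, 1.0), (0.4, 9.2, 1.1), (0.2, 8.8, 0.9)]
print(forward_select(a, b, n_vars=3, k=2))  # → [1, 0]
```

The real MFS-SNR evaluates the multivariate SNR on the whole candidate subset rather than summing univariate terms, which is what lets it exploit variables that only discriminate in combination.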