Search CORE

1 research outputs found

Statistical Approaches for Binary and Categorical Data Modeling

Author: Alalyan Fahdah Abdullah
Publication venue
Publication date: 22/10/2020
Field of study

Nowadays a massive amount of data is generated as the development of technology and services has accelerated. Therefore, the demand for data clustering in order to gain knowledge has increased in many sectors such as medical sciences, risk assessment and product sales. Moreover, binary data has been widely used in various applications including market basket data and text documents analysis. While applying classic widely used k-means method is inappropriate to cluster binary data, we propose an improvement of K-medoids algorithm using binary similarity measures instead of Euclidean distance which is generally deployed in clustering algorithms. In addition to K-medoids clustering method, agglomerative hierarchical clustering methods based on Gaussian probability models have recently shown to be efficient in different applications. However, the emerging of pattern recognition applications where the features are binary or integer-valued demand extending research efforts to such data types. We propose a hierarchical clustering framework for clustering categorical data based on Multinomial and Bernoulli mixture models. We have compared two widely used density-based distances, namely; Bhattacharyya and Kullback-Leibler. The merits of our proposed clustering frameworks have been shown through extensive experiments on clustering text, binary images categorization and images categorization. The development of generative/discriminative approaches for classifying different kinds of data has attracted scholars’ attention. Considering the strengths and weaknesses of both approaches, several hybrid learning approaches which combined the desirable properties of both have been developed. Our contribution is to combine Support Vector Machines (SVMs) and Bernoulli mixture model in order to classify binary data. We propose using Bernoulli mixture model for generating probabilistic kernels for SVM based on information divergence. These kernels make intelligent use of unlabeled binary data to achieve good data discrimination. We evaluate the proposed hybrid learning approach by classifying binary and texture images

Concordia University Research Repository