A New Feature Selection Method Based on Class Association Rule
Feature selection is a key process for supervised learning algorithms. It involves discarding irrelevant attributes from the training dataset from which the models are derived. One of the main feature selection approaches is Filtering, which typically uses a mathematical model to compute a relevance score for each feature in the training dataset and then sorts the features in descending order of those scores. However, most Filtering methods face several challenges, including, but not limited to, considering only feature-class correlation when defining a feature's relevance and not recommending which subset of features to retain. Leaving this decision to the end-user can be impractical, as it demands experience in the application domain, care, accuracy, and time. In this research, we propose a new hybrid Filtering method called Class Association Rule Filter (CARF) that addresses these issues by identifying relevant features through class association rule mining and then using the discovered rules to define weights for the features in the training dataset. More crucially, we propose a new procedure within CARF, based on mutual information, that suggests which subset of features the end-user should retain, hence reducing time and effort. Empirical evaluation using small, medium, and large datasets from various dissimilar domains reveals that CARF reduced the dimensionality of the search space when contrasted with other common Filtering methods. More importantly, the classification models derived by different machine learning algorithms from the subsets of features selected by CARF were highly competitive in terms of various performance measures. These results reflect the quality of the subsets of features selected by CARF and show the impact of the proposed cut-off procedure.
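As a rough illustration of the generic filter pipeline described above (score each feature, rank in descending order, then apply an automatic cut-off instead of leaving the choice to the end-user), the sketch below uses plain mutual information from scikit-learn. It does not reproduce CARF itself: the class-association-rule weights and the exact cut-off procedure are not shown, and the keep-above-the-mean rule is a hypothetical stand-in.

```python
# Minimal sketch of a filter-style feature ranking with an automatic cut-off.
# This is NOT the CARF algorithm (which mines class association rules); it only
# illustrates the generic "score -> rank -> cut-off" pipeline, using mutual
# information as the relevance score.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# 1. Compute a relevance score for every feature (here: mutual information).
scores = mutual_info_classif(X, y, random_state=0)

# 2. Sort features into descending order of their scores.
ranking = np.argsort(scores)[::-1]

# 3. Automatic cut-off (hypothetical rule, not CARF's): keep features whose
#    score exceeds the mean score, rather than asking the end-user for a number.
keep = ranking[scores[ranking] > scores.mean()]
print("Selected feature indices:", keep)
```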
An integrated clustering analysis framework for heterogeneous data
Big data is a growing area of research with some important research challenges that motivate our work. We focus on one such challenge, the variety aspect. First, we introduce our problem by defining heterogeneous data as data about objects that are described by different data types, e.g., structured data, text, time-series, images, etc. Throughout this work we use five datasets for experimentation: a real dataset of prostate cancer data and four synthetic datasets that we have created and made publicly available. Each dataset covers a different combination of data types used to describe objects. Our strategy for clustering is based on fusion approaches, and we compare intermediate and late fusion schemes. We propose an intermediate fusion approach, Similarity Matrix Fusion (SMF), where the integration takes place at the level of calculating similarities. SMF produces a single distance fusion matrix and two uncertainty expression matrices. We then propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids algorithm that utilises the uncertainty calculations to improve clustering performance. We evaluate our results by comparing them to clusterings produced using the individual elements and show that the fusion approach produces equal or significantly better results. We also show that there are advantages in utilising the uncertainty information, as Hk-medoids does. In addition, from a theoretical point of view, our proposed Hk-medoids algorithm has lower computational complexity than the popular PAM implementation of the k-medoids algorithm.
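The sketch below illustrates the intermediate-fusion idea under simple assumptions: each object is described by several elements, one normalised distance matrix is computed per element, and the matrices are averaged into a single fused matrix, with the per-pair standard deviation across elements standing in for an uncertainty expression matrix. A plain k-medoids is then run on the fused matrix; the uncertainty-aware steps that distinguish Hk-medoids are omitted, so this is not the thesis implementation.

```python
# SMF-style intermediate fusion, as a rough sketch. The "uncertainty" matrix
# here (std. dev. across elements) is an illustrative stand-in, not the exact
# SMF definition, and the clustering step is standard k-medoids, not Hk-medoids.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n = 30
elements = {
    "structured": rng.normal(size=(n, 5)),   # e.g. clinical measurements
    "timeseries": rng.normal(size=(n, 40)),  # e.g. one signal per object
}

# One normalised distance matrix per element, then fuse by averaging.
stack = []
for data in elements.values():
    d = squareform(pdist(data, metric="euclidean"))
    stack.append(d / d.max())                # rescale so elements are comparable
stack = np.stack(stack)
fused = stack.mean(axis=0)                   # single distance fusion matrix
uncertainty = stack.std(axis=0)              # element disagreement per object pair


def k_medoids(dist, k, n_iter=100, seed=0):
    """Standard k-medoids on a precomputed distance matrix (illustration only)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = []
        for c in range(k):
            members = np.where(labels == c)[0]
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids.append(members[np.argmin(within)])
        new_medoids = np.array(new_medoids)
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1), medoids


labels, medoids = k_medoids(fused, k=3)
print("Cluster labels on the fused matrix:", labels)
```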
We then employ late fusion, which aggregates the results of clustering the individual elements by combining their cluster labels using an object co-occurrence matrix technique; the final clustering is derived by a hierarchical clustering algorithm. We show that intermediate fusion for clustering heterogeneous data is a feasible and efficient approach using our proposed Hk-medoids algorithm.
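A minimal sketch of the late-fusion scheme follows: each element is clustered on its own, the label vectors are combined into an object co-occurrence matrix, and a final clustering is obtained by hierarchical clustering on one minus that matrix. The per-element clusterer (k-means) and all parameters are illustrative assumptions, not the configuration used in this work.

```python
# Late fusion via an object co-occurrence matrix, as a rough sketch.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 30
views = [rng.normal(size=(n, 5)), rng.normal(size=(n, 20))]  # one matrix per element

# Cluster each element separately (k-means used purely for illustration).
label_sets = [
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(v) for v in views
]

# Co-occurrence: fraction of element-wise clusterings that put i and j together.
co = np.zeros((n, n))
for labels in label_sets:
    co += (labels[:, None] == labels[None, :]).astype(float)
co /= len(label_sets)

# Final clustering from hierarchical clustering on the co-occurrence "distance".
dist = 1.0 - co
np.fill_diagonal(dist, 0.0)
final = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="maxclust")
print("Final cluster labels:", final)
```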