Differential prioritization in feature selection for multiclass molecular classification

Author: Ooi Chia Huey (3624431)

The aim of the thesis is to develop a filter-based feature selection (FS) technique for multiclass molecular classification. Molecular classification involves the classification of samples into groups of biological phenotypes based on high-dimensional gene expression data obtained from microarray experiments. The multiclass nature of the classification problems demands work on two specific areas: (a) differential prioritization and (b) combinations between different decomposition paradigms of FS and classification. FS aims to form, from the larger set of features in the dataset, a smaller subset of features that is capable of producing the best classification accuracy. This subset is called the predictor set. Relevance and redundancy have always been acknowledged as important criteria in the formation of the predictor set in filter-based FS. These two criteria are included as elements in the predictor set score, which measures the goodness of the predictor set. However, especially in a multiclass problem, we propose that a third criterion is necessary for the formation of the predictor set. This third criterion is differential prioritization, a novel criterion which dictates the priority of maximizing relevance relative to the priority of minimizing redundancy. Differential prioritization ensures that the optimal balance between relevance and redundancy is achieved based on the number of classes in the classification problem. This is because as the number of classes increases, the relative importance of minimizing redundancy also increases. For instance, in order to achieve the best accuracy, minimizing redundancy in a 14-class problem is more important than minimizing redundancy in a two-class problem. An outcome of the work on differential prioritization is the development of a superior measure for redundancy. Redundancy in the predictor set is defined as the amount of similarity or repetition of information among the members of the predictor set.
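The trade-off between relevance and redundancy described above can be illustrated with a small sketch. This is not the score defined in the thesis; it is a hypothetical stand-in that assumes precomputed per-feature relevance values, a pairwise similarity matrix with entries in [0, 1], an "antiredundancy" term (one minus the mean pairwise similarity), and a weight `alpha` that sets the priority of relevance against redundancy. All names and numbers are made up for illustration.

```python
from itertools import combinations


def predictor_set_score(relevance, similarity, subset, alpha):
    """Hypothetical score: mean relevance**alpha * antiredundancy**(1 - alpha).

    relevance  -- per-feature relevance values in [0, 1] (assumed precomputed)
    similarity -- symmetric matrix of pairwise similarities in [0, 1]
    subset     -- indices of the features in the candidate predictor set
    alpha      -- assumed weight in (0, 1]; smaller values put more weight
                  on minimizing redundancy
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    mean_relevance = sum(relevance[i] for i in subset) / len(subset)
    pairs = list(combinations(subset, 2))
    mean_similarity = (
        sum(similarity[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    )
    antiredundancy = 1.0 - mean_similarity
    return mean_relevance ** alpha * antiredundancy ** (1.0 - alpha)


# Toy data (made up): features 0 and 1 are both relevant but nearly duplicates.
relevance = [0.90, 0.85, 0.60]
similarity = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]

for alpha in (1.0, 0.5):
    redundant_pair = predictor_set_score(relevance, similarity, (0, 1), alpha)
    diverse_pair = predictor_set_score(relevance, similarity, (0, 2), alpha)
    print(alpha, round(redundant_pair, 3), round(diverse_pair, 3))
```

With `alpha = 1.0` the sketch ignores redundancy and the near-duplicate pair scores higher; with `alpha = 0.5` the less similar pair scores higher. Under these assumptions, a problem with more classes would call for a smaller `alpha`.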
Traditionally, redundancy is measured by directly summing up the pairwise similarity among the members of the predictor set. It is then minimized by defining it as the denominator in a ratio-based predictor set score. This method of measuring and minimizing redundancy faces the problem of singularity at near-minimum redundancy, which results in a skewed representation of the goodness of the predictor set. This motivates us to propose an alternative measure for redundancy which circumvents the aforementioned problem. In multiclass problems, following the 'divide-and-conquer' philosophy, FS may be decomposed into several two-class sub-problems. The manner of the decomposition determines the decomposition paradigm for the FS problem. This is also true for multiclass classification, which may also be decomposed into several two-class sub-problems. The problem of FS and the problem of classification are inevitably linked to each other, since one of the aims of FS is to aid classification. However, there exists no formal approach for systematically combining the twin problems of FS and classification based on the decomposition paradigm used in each problem. Hence, we propose a system for combining the FS and the classification problems which will enable us to examine the effect of different combinations between decomposition paradigms of FS and classification on accuracy in multiclass molecular classification.
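As a rough illustration of decomposing a multiclass problem into two-class sub-problems, the sketch below enumerates sub-problems under two commonly used paradigms, one-vs-all and pairwise. The abstract does not name the paradigms studied in the thesis, so these two, and the function names, are assumptions chosen for illustration only.

```python
from itertools import combinations, product


def one_vs_all(classes):
    """One sub-problem per class: that class against all remaining classes."""
    return [((c,), tuple(o for o in classes if o != c)) for c in classes]


def pairwise(classes):
    """One sub-problem per unordered pair of classes."""
    return [((a,), (b,)) for a, b in combinations(classes, 2)]


# Assumed set of paradigms; the thesis may consider others.
PARADIGMS = {"one-vs-all": one_vs_all, "pairwise": pairwise}


def paradigm_combinations():
    """All (FS paradigm, classification paradigm) pairings."""
    return list(product(PARADIGMS, PARADIGMS))


classes = list(range(14))  # e.g. a 14-class problem
for name, decompose in PARADIGMS.items():
    print(name, len(decompose(classes)), "two-class sub-problems")
for fs_paradigm, clf_paradigm in paradigm_combinations():
    print("FS:", fs_paradigm, "| classification:", clf_paradigm)
```

For K classes, one-vs-all yields K sub-problems and pairwise yields K(K-1)/2, so the choice of paradigm on each side changes both the number and the composition of the sub-problems being combined.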

FigShare