Differential prioritization in feature selection for multiclass molecular classification
The aim of the thesis is to develop a filter-based feature selection (FS) technique for multiclass molecular classification. Molecular classification involves the classification of samples into groups of biological phenotypes based on high-dimensional gene expression data obtained from
microarray experiments. The multiclass nature of the classification problems demands work on
two specific areas: (a) differential prioritization and (b) combinations of different
decomposition paradigms of FS and classification.
FS aims to form, from the larger set of features in the dataset, a smaller subset of features that is capable of producing the best classification accuracy. This subset is called the predictor set.
Relevance and redundancy have always been acknowledged as important criteria in the
formation of the predictor set in filter-based FS. These two criteria are included as elements in
the predictor set score, which measures the goodness of the predictor set. However, especially
in a multiclass problem, we propose that a third criterion is necessary for the formation of the
predictor set. This third criterion is differential prioritization, a novel criterion which
dictates the priority of maximizing relevance relative to the priority of minimizing redundancy.
Differential prioritization ensures that the optimal balance between relevance and redundancy is
achieved based on the number of classes in the classification problem. This is because as the
number of classes increases, the relative importance of minimizing redundancy also increases.
For instance, in order to achieve the best accuracy, minimizing redundancy in a 14-class
problem is more important than minimizing redundancy in a two-class problem.
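As a rough illustration of this trade-off, the sketch below assumes a hypothetical score of the form relevance^alpha * antiredundancy^(1 - alpha), where alpha in (0, 1] stands in for the degree of prioritization. The abstract does not state the actual formula, so the function names, the form of the score, and the numbers are all assumptions made for illustration only.

```python
def weighted_score(relevance, antiredundancy, alpha):
    """Hypothetical predictor set score (an assumption, not the thesis formula).

    alpha close to 1 prioritizes maximizing relevance;
    alpha close to 0 prioritizes minimizing redundancy.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return (relevance ** alpha) * (antiredundancy ** (1.0 - alpha))


# Two made-up candidate predictor sets: (relevance, antiredundancy).
high_relevance = (0.9, 0.4)   # very relevant, but members overlap heavily
low_redundancy = (0.6, 0.8)   # less relevant, but members are dissimilar


def preferred(alpha):
    """Return the name of the candidate set that scores higher for this alpha."""
    a = weighted_score(*high_relevance, alpha)
    b = weighted_score(*low_redundancy, alpha)
    return "high_relevance" if a > b else "low_redundancy"


for alpha in (0.9, 0.2):
    print(alpha, preferred(alpha))
```

With these illustrative numbers, a large alpha favors the more relevant set and a small alpha favors the less redundant one, which is the kind of shift the text describes as the number of classes grows.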
An outcome of the work on differential prioritization is the development of a superior measure
for redundancy. Redundancy in the predictor set is defined as the amount of similarity or repetition
of information among the members of the predictor set. Traditionally, redundancy is measured
by directly summing up the pairwise similarity among the members of the predictor set. It is
then minimized by defining it as the denominator in a ratio-based predictor set score. This
method of measuring and minimizing redundancy faces the problem of singularity at near-minimum
redundancy, which results in a skewed representation of the goodness of the predictor
set. This motivates us to come up with an alternative measure for redundancy which
circumvents the aforementioned problem.
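The singularity can be seen in a minimal sketch of the traditional ratio-based score. It assumes that similarity is an absolute correlation and that relevance is a single number; the helper names and the toy matrices are hypothetical and not taken from the thesis.

```python
import numpy as np


def pairwise_redundancy(corr):
    """Sum of absolute pairwise similarities over distinct feature pairs."""
    upper = np.triu_indices_from(corr, k=1)
    return float(np.abs(corr[upper]).sum())


def ratio_score(relevance, corr):
    """Ratio-based score: relevance divided by summed redundancy."""
    return relevance / pairwise_redundancy(corr)


def make_corr(rho, n=3):
    """Toy n x n similarity matrix with every off-diagonal entry equal to rho."""
    return np.full((n, n), rho) + (1.0 - rho) * np.eye(n)


# Relevance is held fixed while redundancy shrinks towards its minimum.
scores = [ratio_score(1.0, make_corr(rho)) for rho in (0.1, 0.01, 0.001)]
print(scores)
```

Although relevance never changes, the score grows roughly tenfold each time the pairwise similarity drops by a factor of ten, so near-minimum redundancy dominates the score and distorts comparisons between predictor sets.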
In multiclass problems, following the 'divide-and-conquer' philosophy, FS may be decomposed
into several two-class sub-problems. The manner of the decomposition determines the decomposition paradigm for the FS problem. This is also true for multiclass classification,
which may also be decomposed into several two-class sub-problems. The problem of FS and
the problem of classification are inevitably linked to each other, since one of the aims of FS is to aid classification. However, there exists no formal approach for systematically combining the
twin problems of FS and classification based on the decomposition paradigm used in each
problem. Hence, we propose a system for combining the FS and the classification problems
which will enable us to examine the effect of different combinations between decomposition
paradigms of FS and classification on accuracy in multiclass molecular classification.
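To make the idea of combining decomposition paradigms concrete, the sketch below enumerates two-class sub-problems under two widely used decompositions, one-vs-all and pairwise (one-vs-one), and then lists every pairing of an FS paradigm with a classification paradigm. The abstract does not name the paradigms it studies, so these two choices and all function names are assumptions.

```python
from itertools import combinations, product


def one_vs_all(classes):
    """One sub-problem per class: that class against all the others."""
    return [((c,), tuple(o for o in classes if o != c)) for c in classes]


def pairwise(classes):
    """One sub-problem per unordered pair of classes."""
    return [((a,), (b,)) for a, b in combinations(classes, 2)]


# Assumed paradigms; the same options are offered to FS and to classification.
PARADIGMS = {"one_vs_all": one_vs_all, "pairwise": pairwise}


def paradigm_combinations():
    """Every (FS paradigm, classification paradigm) pairing."""
    return list(product(PARADIGMS, PARADIGMS))


classes = list(range(14))  # e.g. a 14-class problem
for name, decompose in PARADIGMS.items():
    print(name, len(decompose(classes)), "two-class sub-problems")
print(paradigm_combinations())
```

Each pairing returned by `paradigm_combinations` would correspond to one experimental configuration whose classification accuracy could then be compared.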