University of Edinburgh. College of Science and Engineering. School of Informatics. Institute for Communicating and Collaborative Systems.
Abstract
We examine the task of feature selection, which is a method of forming simplified
descriptions of complex data for use in probabilistic classifiers. Feature selection typically
requires a numerical measure or metric of the desirability of a given set of features.
The thesis considers a number of existing metrics, with particular attention to
those based on entropy and other quantities derived from information theory. A useful
new perspective on feature selection is provided by the concepts of partitioning and
encoding of data by a feature set. The ideas of partitioning and encoding, together
with the theoretical shortcomings of existing metrics, motivate a new class of feature
selection metrics based on conditional entropy. The simplest of the new metrics is
referred to as expected partition entropy or EPE.
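As a rough illustration of the kind of quantity involved, the sketch below computes the conditional entropy of the class label given the partition cell that a feature set assigns to each item. This is a minimal sketch under stated assumptions, not the thesis's definition of EPE, which is not given in this abstract: the function name `conditional_entropy`, the dictionary-based data layout, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(rows, labels, feature_set):
    """Return H(class | partition cell) in bits.

    Assumption: each row is a dict of feature values, and a feature set
    partitions the data by the tuple of values it assigns to each row.
    """
    # Group class counts by partition cell.
    cells = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in feature_set)
        cells[key][label] += 1

    n = len(labels)
    total = 0.0
    for counts in cells.values():
        cell_size = sum(counts.values())
        # Entropy of the class distribution inside this cell.
        cell_entropy = -sum(
            (c / cell_size) * math.log2(c / cell_size) for c in counts.values()
        )
        # Weight each cell by its empirical probability.
        total += (cell_size / n) * cell_entropy
    return total


# Hypothetical toy data: feature "a" determines the class, "b" does not.
rows = [
    {"a": 0, "b": 0},
    {"a": 0, "b": 1},
    {"a": 1, "b": 0},
    {"a": 1, "b": 1},
]
labels = ["x", "x", "y", "y"]

informative = conditional_entropy(rows, labels, ["a"])
uninformative = conditional_entropy(rows, labels, ["b"])
```

In this toy case a feature set whose partition separates the classes yields a lower conditional entropy than one that does not, which is the intuition behind using such a quantity to rank feature sets.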
The performance of the new and existing metrics is compared in experiments on
a simplified form of part-of-speech tagging and on classification of Reuters news
stories by topic. In order to conduct the experiments, a new class of accelerated feature
selection search algorithms is introduced; a member of this class is found to provide
significantly increased speed with minimal loss in performance, as measured by feature
selection metrics and accuracy on test data. The comparative performance of existing
metrics is also analysed, giving rise to a new general conjecture regarding the wrapper
class of metrics. Each wrapper is inherently tied to a specific type of classifier. The
experimental results support the idea that a wrapper selects feature sets which perform
well in conjunction with its own particular classifier, but this good performance cannot
be expected to carry over to other types of model.
The new metrics introduced in this thesis prove to have substantial advantages over
a representative selection of other feature selection mechanisms: mutual information,
frequency-based cutoff, the Koller-Sahami information loss measure, and two different
types of wrapper method. Feature selection using the new metrics easily outperforms
other filter-based methods such as mutual information; additionally, our approach attains
comparable performance to a wrapper method, but at a fraction of the computational
expense. Finally, members of the new class of metrics succeed in a case where
the Koller-Sahami metric fails to provide a meaningful criterion for feature selection.