Robust methods in data mining

By K.S. Mwitondi

Abstract

The thesis focuses on two problems in Data Mining: clustering, an exploratory technique for grouping similar observations together, and classification, a technique for assigning new observations to one of a set of known groups. A thorough study of the two problems, which are also known in the Machine Learning literature as unsupervised and supervised classification respectively, is central to decision making in different fields, and the thesis seeks to contribute towards that end.

In the first part of the thesis we consider whether robust methods can be applied to clustering. In particular, we perform clustering on fuzzy data using two methods originally developed for outlier detection. The fuzzy data clusters are characterised by two intersecting lines such that points belonging to the same cluster lie close to the same line. This part of the thesis also investigates a new application of finite mixtures of normals to the fuzzy data problem.

The second part of the thesis addresses issues relating to classification, in particular classification trees and boosting. The boosting algorithm is a relative newcomer to the classification portfolio that seeks to enhance the performance of classifiers by iteratively re-weighting the data according to their previous classification status. We explore the performance of "boosted" trees (mainly stumps) based on 3 different models, all characterised by a sine-wave boundary. We also carry out a thorough study of the factors that affect the boosting algorithm.

Other results include a new look at the concept of randomness in the classification context, particularly because the form of randomness in both training and testing data directly affects the accuracy and reliability of domain-partitioning rules. Further, we provide statistical interpretations of some of the classification-related concepts originally used in Computer Science, Machine Learning and Artificial Intelligence. This is important since there exists a need for a unified interpretation of some of the "landmark" concepts in various disciplines, as a step towards seeking the principles that can guide and strengthen practical applications.
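The re-weighting idea behind boosting can be illustrated with a minimal sketch. It is not taken from the thesis: it assumes the standard discrete AdaBoost update, a single hypothetical data model (points labelled by whether they lie above the curve y = sin(x)), and an exhaustive stump search. All function names are illustrative.

```python
import math
import random

def make_sine_data(n, seed=0):
    """Sample n points in [0, 2*pi] x [-2, 2]; label +1 above y = sin(x), else -1.
    This data model is an assumption made for illustration only."""
    rng = random.Random(seed)
    X = [(rng.uniform(0, 2 * math.pi), rng.uniform(-2, 2)) for _ in range(n)]
    y = [1 if v > math.sin(u) else -1 for u, v in X]
    return X, y

def stump_predict(x, j, t, s):
    """A stump splits coordinate j at threshold t; s = +1 or -1 sets its orientation."""
    return s if x[j] > t else -s

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the lowest weighted error.
    Weights sum to 1, so flipping the orientation turns error e into 1 - e."""
    best = None
    for j in range(2):
        for t in sorted({x[j] for x in X}):
            err = sum(wi for x, yi, wi in zip(X, y, w)
                      if stump_predict(x, j, t, 1) != yi)
            for s, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, j, t, s)
    return best

def adaboost(X, y, rounds):
    """Discrete AdaBoost: fit a stump, then re-weight the data by its mistakes."""
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard the logarithm
        alpha = 0.5 * math.log((1 - err) / err)
        # Misclassified points gain weight; correctly classified points lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, t, s))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, j, t, s) for a, j, t, s in ensemble)
    return 1 if score >= 0 else -1

def error_rate(ensemble, X, y):
    """Fraction of points the ensemble misclassifies."""
    return sum(predict(ensemble, x) != yi for x, yi in zip(X, y)) / len(y)
```

Calling `adaboost(X, y, 1)` and `adaboost(X, y, 40)` on the same sample and comparing `error_rate` on the training data gives a rough sense of how much a single stump gains from boosting under these assumptions.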

Publisher: Statistics (Leeds)
Year: 2003
OAI identifier: oai:etheses.whiterose.ac.uk:807
