21 research outputs found

    Similarity Measures for Categorical Data -- A Comparative Study

    Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper, we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures achieve consistently high performance.
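
To make the idea of a categorical similarity measure concrete, here is a minimal sketch of one simple option, the overlap measure (the fraction of attributes on which two records agree), combined with a nearest-neighbor outlier score. The function names, the toy records, and the choice of k are illustrative assumptions, not details taken from the paper, and the abstract does not say which measures it compares.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_outlier_scores(records, k=2):
    """Score each record by its dissimilarity to its k-th most similar neighbor.

    A higher score means the record is farther from its neighbors.
    """
    scores = []
    for i, record in enumerate(records):
        sims = sorted(
            (overlap_similarity(record, other)
             for j, other in enumerate(records) if j != i),
            reverse=True,
        )
        scores.append(1.0 - sims[k - 1])
    return scores


# Hypothetical toy data: three similar records and one that shares no values.
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "huge", "square"),
]
scores = knn_outlier_scores(records, k=2)
most_anomalous = scores.index(max(scores))
```

The overlap measure treats every mismatch as equally severe. Data-driven measures, by contrast, typically weight matches and mismatches using attribute value frequencies observed in the data set.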

    A Framework for Exploring Categorical Data

    In this paper, we present a framework for categorical data analysis which allows such data sets to be explored using a rich set of techniques that are only applicable to continuous data sets. We introduce the concept of separability statistics in the context of exploratory categorical data analysis. We show how these statistics can be used as a way to map categorical data to continuous space given a labeled reference data set. This mapping enables visualization of categorical data using techniques that are applicable to continuous data. We show that in the transformed continuous space, the performance of the standard k-nn based outlier detection technique is comparable to the performance of the k-nn based outlier detection technique using the best of the similarity measures designed for categorical data. The proposed framework can also be used to devise similarity measures best suited for a particular type of data set.

    A study of time series noise reduction techniques in the context of land cover change detection

    The purpose of this study is to introduce concepts relevant to the performance of (i) change detection algorithms within (ii) various regional contexts with differing noise characteristics according to (iii) differing strategies of noise reduction. The relevant interrelations of these three elements are presented, and focused analysis is presented from the perspective of varying (i) and (iii) for a comparative analysis across (ii). Six smoothing methods have been studied in this work: the Savitzky-Golay (SG) method [7], the Savitzky-Golay method iterated to the upper envelope (SG-Itr) [3], Harmonic Analysis of Time Series (HANTS) [6], the Double Logistic function fitting method (DL) [1], the Data Assimilation method (DA) [5], and a naive outlier identification and imputation scheme (SO). In this work, we enumerate three general data characteristics, especially relevant in the MODIS EVI data, which a given noise reduction technique may take advantage of: neighborhood coherence, quality annotation, and background model. For a noise reduction technique, we identify the following two questions to be of relevance:
    • Which observations in the time series should be imputed?
    • How are these observations to be imputed?
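
As an illustration of the first method listed, the following is a minimal sketch of Savitzky-Golay smoothing using the standard 5-point quadratic kernel, (-3, 12, 17, 12, -3) / 35. The window length, the polynomial order, and the choice to leave the two points at each edge unsmoothed are assumptions made for this example; the abstract does not state the parameters used in the study.

```python
# Standard Savitzky-Golay convolution weights for a 5-point window
# and a quadratic (or cubic) local polynomial fit.
SG_COEFFS = (-3, 12, 17, 12, -3)
SG_NORM = 35


def savgol5(series):
    """Smooth a sequence with a 5-point quadratic Savitzky-Golay filter.

    Interior points are replaced by the value of the local least-squares
    polynomial at the window center. The first and last two points are
    returned unchanged (an assumption made for simplicity).
    """
    values = list(series)
    smoothed = [float(v) for v in values]
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        smoothed[i] = sum(c * v for c, v in zip(SG_COEFFS, window)) / SG_NORM
    return smoothed


# Hypothetical series with a single noisy spike at index 3.
noisy = [1, 1, 1, 8, 1, 1, 1]
smooth = savgol5(noisy)
```

Because the filter fits a quadratic within each window, it reproduces any quadratic signal exactly at interior points while damping isolated spikes like the one above.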