21 research outputs found

From benchtop to raceway : spectroscopic signatures of dynamic biological processes in algal communities.

Author: August Andrew
Boriah Varun
Collins Aaron M.
Dempster Thomas A.
Dwyer Brian P.
Garcia Omar Fidel
Gharagozloo Patricia E.
Hanson David T.
James Scott Carlton
Janardhanam Vijay
Jones Howland D. T.
Lopez-Nieves Samuel
McGowen John A.
Parchert Kylea Joy
Powell Amy Jo
Reichardt Thomas A.
Roesgen John
Ruffing Anne M.
Sansom Kurt
Timlin Jerilyn Ann
Trahan Christine Alexandra
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 01/09/2012

Crossref

UNT Digital Library

Similarity Measures for Categorical Data -- A Comparative Study

Author: Shyam Boriah
Varun Chandola
Vipin Kumar
Publication date: 01/01/2007

Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.
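
As a rough illustration of the task this abstract describes, the sketch below scores categorical records for outlierness using a nearest-neighbour scheme. It assumes the simple overlap (matching) measure as the similarity function; the abstract does not name the measures it compares, so this choice, the function names, and the toy records are all hypothetical and are not taken from the paper.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree.

    Assumption: the plain overlap measure, used here only as a baseline.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_outlier_scores(records, k=2):
    """Score each record by 1 minus its mean similarity to its k most similar neighbours."""
    scores = []
    for i, record in enumerate(records):
        # Similarities to every other record, most similar first.
        sims = sorted(
            (overlap_similarity(record, other)
             for j, other in enumerate(records) if j != i),
            reverse=True,
        )
        nearest = sims[:k]
        scores.append(1.0 - sum(nearest) / len(nearest))
    return scores


# Hypothetical toy data: three similar records and one that differs on most attributes.
records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
scores = knn_outlier_scores(records, k=2)
most_anomalous = scores.index(max(scores))  # index of the record least like its neighbours
```

Swapping in a different similarity function changes only overlap_similarity, which is the kind of comparison the abstract describes.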

CiteSeerX

University of Minnesota Digital Conservancy

A Framework for Exploring Categorical Data

Author: Shyam Boriah
Varun Chandola
Vipin Kumar
Publication date: 01/04/2012

In this paper, we present a framework for categorical data analysis which allows such data sets to be explored using a rich set of techniques that are only applicable to continuous data sets. We introduce the concept of separability statistics in the context of exploratory categorical data analysis. We show how these statistics can be used as a way to map categorical data to continuous space given a labeled reference data set. This mapping enables visualization of categorical data using techniques that are applicable to continuous data. We show that in the transformed continuous space, the performance of the standard k-nn based outlier detection technique is comparable to the performance of the k-nn based outlier detection technique using the best of the similarity measures designed for categorical data. The proposed framework can also be used to devise similarity measures best suited for a particular type of data set.

CiteSeerX

Crossref

A study of time series noise reduction techniques in the context of land cover change detection

Author: Boriah Shyam
Brugere Ivan
Chen Xi
Kumar Vipin
Mithal Varun
VangalaReddy Sruthi
Publication date: 12/08/2011

The purpose of this study is to introduce concepts relevant to performance of (i) change detection algorithms within (ii) various regional contexts with differing noise characteristics according to (iii) differing strategies of noise reduction. The relevant interrelations of these three elements are presented, and focused analysis is presented from the perspective of varying (i) and (iii) for a comparative analysis across (ii). Six smoothing methods have been studied in this work: the Savitzky-Golay (SG) method [7], the Savitzky-Golay method iterated to the upper envelope (SG-Itr) [3], Harmonic Analysis of Time Series (HANTS) [6], the Double Logistic function fitting method (DL) [1], the Data Assimilation method (DA) [5], and a naive outlier identification and imputation scheme (SO). In this work, we enumerate three general data characteristics, especially relevant in the MODIS EVI data, which a given noise reduction technique may take advantage of: neighborhood coherence, quality annotation and background model. For a noise reduction technique we identify the following two questions to be of relevance:
• Which observations in the time series should be imputed?
• How are these observations to be imputed?
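
The two questions above (which observations to impute, and how) can be made concrete with a minimal sketch. The abstract does not specify how its naive scheme works, so everything here is an assumption: an observation is flagged when it deviates from the median of its temporal neighbours by more than a fixed threshold, and flagged values are replaced by linear interpolation between the nearest unflagged observations. The function names, window size, threshold, and sample values are hypothetical.

```python
from statistics import median


def flag_outliers(series, window=2, threshold=0.2):
    """Question 1: decide which observations should be imputed.

    Assumption: flag a value that differs from the median of its
    neighbours (up to `window` steps on either side) by more than `threshold`.
    """
    flags = []
    n = len(series)
    for i, value in enumerate(series):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = [series[j] for j in range(lo, hi) if j != i]
        if not neighbours:
            flags.append(False)  # a single-point series has nothing to compare against
            continue
        flags.append(abs(value - median(neighbours)) > threshold)
    return flags


def impute(series, flags):
    """Question 2: decide how flagged observations are imputed.

    Assumption: linear interpolation between the nearest unflagged values,
    falling back to the single available neighbour at the series ends.
    """
    result = list(series)
    good = [i for i, flagged in enumerate(flags) if not flagged]
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None and right is None:
            continue  # nothing reliable to interpolate from
        if left is None:
            result[i] = series[right]
        elif right is None:
            result[i] = series[left]
        else:
            w = (i - left) / (right - left)
            result[i] = series[left] * (1 - w) + series[right] * w
    return result


# Hypothetical vegetation-index values with one sudden dip at index 2.
evi = [0.30, 0.32, 0.05, 0.36, 0.38, 0.40]
flags = flag_outliers(evi)
cleaned = impute(evi, flags)
```

Keeping detection and imputation as separate functions mirrors the two questions, so either step can be replaced independently.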

CiteSeerX

University of Minnesota Digital Conservancy