A Scalable Classification Algorithm for Very Large Datasets

Dursun Delen; Jin-Hwa Kim; Marilyn G. Kletke

Abstract

Today's organisations collect and store massive amounts of data from customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale well enough to work effectively and efficiently with such very large datasets. This study constructs a new scalable classification algorithm, the Iterative Refinement Algorithm (IRA), that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively refines it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600,000 records for a seven-class classification problem) produced more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.

Keywords: massive datasets, data mining, rule induction, classification, knowledge bases, refinement techniques
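
The abstract does not give IRA's actual rule representation or refinement procedure, so the following is only a minimal Python sketch of the general pattern it describes: learn initial knowledge from a first subset, then revise it chunk by chunk. The class `ChunkedRuleLearner`, the helpers `chunks` and `train`, the vote-table rule format, and the choice to update only on misclassified records are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter, defaultdict
from itertools import islice


def chunks(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


class ChunkedRuleLearner:
    """Toy rule table (hypothetical, not IRA): (feature index, value) -> class counts."""

    def __init__(self):
        self.rules = defaultdict(Counter)
        self.default = Counter()  # overall class counts, used when no rule matches

    def predict(self, features):
        # Each matching (index, value) rule votes with its class counts.
        votes = Counter()
        for i, value in enumerate(features):
            votes.update(self.rules.get((i, value), {}))
        if not votes:
            votes = self.default
        return votes.most_common(1)[0][0] if votes else None

    def _absorb(self, features, label):
        for i, value in enumerate(features):
            self.rules[(i, value)][label] += 1
        self.default[label] += 1

    def fit_initial(self, block):
        """Build the initial knowledge from the first subset of the data."""
        for features, label in block:
            self._absorb(features, label)

    def refine(self, block):
        """Update the rules using only the records the current rules get wrong."""
        errors = [(f, y) for f, y in block if self.predict(f) != y]
        for features, label in errors:
            self._absorb(features, label)
        return len(errors)


def train(records, chunk_size):
    """Initial fit on the first chunk, then one refinement pass per later chunk."""
    learner = ChunkedRuleLearner()
    blocks = chunks(records, chunk_size)
    first = next(blocks, None)
    if first is not None:
        learner.fit_initial(first)
        for block in blocks:
            learner.refine(block)
    return learner
```

Because `chunks` consumes its input lazily, the full dataset never has to be held in memory at once, which is the property that makes this kind of scheme attractive for very large datasets.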

Available Versions

Research Papers in Economics

Last time updated on 14/01/2014