KDX: An Indexer for Support Vector Machines
Navneet Panda
- Publication date: 2008
Abstract
Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the “top-k” best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings (e.g., γ and σ) without performance compromise. Through theoretical analysis, and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective.

An earlier version of this paper appeared in the 2005 SIAM International Conference on Data Mining [24]. This version differs from the previous submission in
- providing a detailed cost analysis under different scenarios, specifically designed to meet the varying needs of accuracy, speed and space requirements,
- developing an approach for insertion and deletion of instances,
- presenting the specific computations as well as the geometric properties used in performing the same, and
- providing detailed algorithms for each of the operations necessary to create and use the index structure.

Index Terms: Support vector machine, indexing, top-k retrieval.
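To make the scalability problem concrete, the following minimal sketch shows the naive baseline the abstract refers to: every instance in the dataset is scored against a trained SVM and the k highest scores are kept. It is not the KDX index. It assumes that "best match" means the largest decision value under a Gaussian kernel; the function names, the `coeffs` layout (one α·y product per support vector), and the toy data are all hypothetical and chosen only for illustration.

```python
import heapq
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, gamma):
    """SVM score f(x) = sum_i coeffs[i] * K(sv_i, x) + bias.

    Assumption: coeffs[i] already holds alpha_i * y_i for support vector i.
    """
    return bias + sum(
        c * gaussian_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coeffs)
    )


def naive_top_k(dataset, support_vectors, coeffs, bias, gamma, k):
    """Score every instance; return the k best as (score, index) pairs."""
    scored = (
        (decision_value(x, support_vectors, coeffs, bias, gamma), i)
        for i, x in enumerate(dataset)
    )
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, scored)


# Toy, made-up data: one positive and one negative support vector.
support_vectors = [(0.0, 0.0), (4.0, 4.0)]
coeffs = [1.0, -1.0]
dataset = [(0.1, 0.0), (3.9, 4.0), (1.0, 1.0), (0.0, 0.2), (2.0, 2.0)]

top2 = naive_top_k(dataset, support_vectors, coeffs, bias=0.0, gamma=0.5, k=2)
top2_indices = [i for _, i in top2]
```

Each query costs one kernel evaluation per (instance, support vector) pair, so the work grows linearly with the dataset size. This full scan is the cost that an index pruning the search space aims to avoid.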