CONQUEST: A distributed tool for constructing summaries of high-dimensional discrete-attributed datasets

By Jie Chi, Mehmet Koyutürk and Ananth Grama

Abstract

The problem of constructing bounded-error summaries of binary-attributed data of very high dimension is an important and difficult one. These summaries enable more expensive analysis techniques to be applied efficiently with little loss in accuracy. Recent work in this area has resulted in the use of discrete linear algebraic transforms to construct such summaries efficiently. This paper addresses the problem of constructing summaries of distributed datasets. Specifically, the problem can be stated as follows: given a set of n discrete-attributed vectors distributed across p sites, construct a summary of k << n vectors such that each of the input vectors is within a given bounded distance from some output vector. In addition to being algorithmically efficient (i.e., doing no more work than the corresponding serial algorithm), the distributed formulation must have low parallelization overheads. We present CONQUEST, a tool that achieves excellent performance and scalability for summarizing distributed datasets. In contrast to traditional parallel techniques that distribute the kernel operations, CONQUEST uses a less aggressive parallel formulation that relies on the principle of sampling to reduce communication overhead while maintaining high accuracy. Specifically, each individual site computes its local patterns independently. Various sites cooperate within dynamically orchestrated workgroups to construct consensus patterns from these local patterns. Individual sites then decide to participate in the consensus or leave the group. Experimental results on a set of Intel Xeon servers demonstrate that this strategy is capable of excellent performance in terms of compression time, compression ratio, and accuracy with respect to postprocessing tasks. The communication overhead associated with CONQUEST is also shown to be minimal, making it ideally suited to wide-area deployment.
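
The problem statement in the abstract can be made concrete with a small sketch. The code below is not CONQUEST's algorithm: it assumes Hamming distance as the bounded distance, uses a simple greedy pass in place of the discrete linear algebraic transforms the abstract mentions, and merges local summaries by exact deduplication rather than by workgroup consensus. All function names (`hamming`, `greedy_summary`, `is_bounded_error`, `distributed_summary`) are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_summary(vectors: List[Vector], eps: int) -> List[Vector]:
    """Pick representatives so that every input is within eps of one of them.

    Assumption: a greedy single pass, used here only as a stand-in for the
    summarization kernel described in the paper.
    """
    reps: List[Vector] = []
    for v in vectors:
        # Keep v as a new representative only if no existing one covers it.
        if not any(hamming(v, r) <= eps for r in reps):
            reps.append(v)
    return reps


def is_bounded_error(vectors: List[Vector], reps: List[Vector], eps: int) -> bool:
    """Check the summary property: each input lies within eps of some output."""
    return all(any(hamming(v, r) <= eps for r in reps) for v in vectors)


def distributed_summary(sites: List[List[Vector]], eps: int) -> List[Vector]:
    """Summarize each site locally, then merge the local summaries.

    Assumption: merging removes exact duplicates only, so the eps bound
    established at each site still holds for the merged result.
    """
    local_reps = [r for site in sites for r in greedy_summary(site, eps)]
    return list(dict.fromkeys(local_reps))  # order-preserving deduplication
```

Each site works only on its own vectors, and only the (much smaller) local summaries are exchanged, which mirrors the abstract's point about keeping communication overhead low. Merging local summaries with a second approximate pass would shrink the result further but could loosen the error bound, which is why this sketch deduplicates exactly.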

Topics: distributed data mining
Year: 2004
DOI identifier: 10.1137/1.9781611972740.15
OAI identifier: oai:CiteSeerX.psu:10.1.1.215.3042
Provided by: CiteSeerX
