The widening performance gap between CPU and disk is significant for hash join performance. Most current hash join methods try to reduce the volume of data transferred between memory and disk. In this paper, we try to reduce hash join times by reducing random I/O. We study how current algorithms incur random I/O, and propose a new hash join method, Seq+, that converts much of the random I/O to sequential I/O. Seq+ uses a new organization for hash buckets on disk, and larger input and output buffer sizes. We introduce the technique of batch writes to reduce the bucket-write cost, and the concepts of write- and read-groups of hash buckets to reduce the bucket-read cost. We derive a cost model for our method, and present formulas for choosing various algorithm parameters, including input and output buffer sizes. Our performance study shows that the new hash join method performs many times better than current algorithms under various environments. Since our cost functions under-estimate the cost of current algorithms and over-estimate the cost of Seq+, the actual performance gain of Seq+ is likely to be even greater.
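To illustrate why larger buffers and batched writes can lower I/O time, the following is a minimal sketch of a generic disk cost model. It is not the cost model derived in this paper. The function name `io_time_ms` and the seek, rotational, and per-page transfer times are hypothetical values chosen only for illustration. The sketch assumes that every I/O request pays one seek and one rotational delay, and that the pages within a single request are contiguous on disk.

```python
def io_time_ms(total_pages, pages_per_io,
               seek_ms=8.0, rotate_ms=4.0, transfer_ms_per_page=0.1):
    """Estimate the time to move total_pages between memory and disk.

    Assumption (illustrative, not taken from the paper): each I/O request
    pays one seek plus one rotational delay, then transfers pages_per_io
    contiguous pages at a fixed per-page rate.
    """
    if total_pages < 0 or pages_per_io <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_io > 0")
    # Number of separate requests, rounded up so a partial batch still counts.
    requests = -(-total_pages // pages_per_io)
    positioning = requests * (seek_ms + rotate_ms)
    transfer = total_pages * transfer_ms_per_page
    return positioning + transfer


if __name__ == "__main__":
    pages = 100_000
    # Larger batches amortize the positioning cost over more pages.
    for batch in (1, 8, 64, 512):
        seconds = io_time_ms(pages, batch) / 1000.0
        print(f"{batch:4d} pages per request: {seconds:8.1f} s")
```

Under these assumed parameters the transfer term is the same for every batch size, so any difference in total time comes from the number of positioning operations. That is the effect that batch writes and larger input and output buffers aim to exploit.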