Scalable Blocking for Very Large Databases
In the field of database deduplication, the goal is to find approximately
matching records within a database. Blocking is a typical stage in this process
that involves cheaply finding candidate pairs of records that are potential
matches for further processing. We present here Hashed Dynamic Blocking, a new
approach to blocking designed to address datasets larger than those studied in
most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which
leverages the insight that rare matching values and rare intersections of
values are predictive of a matching relationship. We also present a novel use
of Locality Sensitive Hashing (LSH) to build blocking key values for huge
databases with a convenient configuration to control the trade-off between
precision and recall. HDB achieves massive scale by minimizing data movement,
using a compact block representation, and greedily pruning ineffective candidate
blocks with a Count-Min Sketch, an approximate counting data structure. We
benchmark the algorithm on real-world datasets in excess of one
million rows, demonstrating that its running time scales linearly
in this range. Furthermore, we execute HDB on a 530-million-row
industrial dataset, detecting 68 billion candidate pairs in less than three
hours at a cost of $307 on a major cloud service.
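
To illustrate how approximate counting can flag candidate blocks that are too
large to be useful, the following is a minimal Python sketch of a Count-Min
Sketch applied to blocking keys. It is not the HDB implementation: the class
CountMinSketch, the function split_blocks, the max_block_size threshold, and
the width and depth defaults are all hypothetical names and values chosen for
this example, and the treatment of oversized blocks in the actual algorithm may
well differ.

```python
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Approximate counter: estimates never undercount but may overcount."""

    def __init__(self, width=2048, depth=4):
        # width and depth are illustrative defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One salted hash per row, so rows behave like independent hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(4, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over rows is the estimate least inflated by collisions.
        return min(self.table[row][col] for row, col in self._columns(key))


def split_blocks(keyed_records, max_block_size):
    """Separate right-sized blocks from keys flagged as oversized.

    keyed_records is a list of (blocking_key, record_id) pairs.
    Hypothetical helper: assumes a block is "ineffective" when its
    estimated size exceeds max_block_size.
    """
    sketch = CountMinSketch()
    for key, _ in keyed_records:
        sketch.add(key)

    kept = defaultdict(list)
    oversized = set()
    for key, record_id in keyed_records:
        if sketch.estimate(key) <= max_block_size:
            kept[key].append(record_id)
        else:
            oversized.add(key)
    return dict(kept), oversized


# Toy data: "smith" appears three times, so it exceeds a limit of 2.
records = [
    ("smith", 1), ("smith", 2), ("smith", 3),
    ("jones", 4), ("jones", 5),
    ("lee", 6),
]
kept_blocks, oversized_keys = split_blocks(records, max_block_size=2)
```

Because the estimate is never below the true count, every block that is kept
really is within the size limit. The cost of the approximation is that a small
block can occasionally be flagged as oversized when its key collides with
heavier keys in every row of the table.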