Detecting Near-Duplicates in Large-Scale Short Text Databases

Authors: Caichun Gong, Xueqi Cheng, Shuo Bai, Yulan Huang

Abstract

Near-duplicates are abundant in short text databases, and detecting and eliminating them is of great importance. SimFinder, proposed in this paper, is a fast algorithm for identifying all near-duplicates in large-scale short text databases. An ad hoc term weighting scheme measures each term's discriminative ability, and a fixed number of terms are extracted to form a feature list for each short text. SimFinder generates several fingerprints for each feature list, and only texts that share a fingerprint are compared with each other. An optimization procedure makes SimFinder more efficient. Experiments indicate that SimFinder is an effective solution for short text duplicate detection, with almost linear time and storage complexity. Both its precision and its recall are promising.

Key words: duplicate detection; short text; term weighting; optimization
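The pipeline outlined in the abstract (weight terms, keep a feature list per text, derive several fingerprints, compare only texts that share one) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme, the fingerprint construction, or the final similarity test. The sketch therefore assumes inverse document frequency as a stand-in weighting, hypothetical leave-one-out fingerprints over the feature list, and a Jaccard threshold for the final comparison. All function names and the parameters `k` and `threshold` are illustrative.

```python
import hashlib
import math
from collections import Counter, defaultdict
from itertools import combinations


def idf_weights(docs):
    """Stand-in term weighting (assumption): inverse document frequency."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def feature_list(doc, weights, k=4):
    """Keep the k highest-weighted distinct terms, stored in sorted order."""
    terms = set(doc.split())
    top = sorted(terms, key=lambda t: (-weights[t], t))[:k]
    return tuple(sorted(top))


def fingerprints(features):
    """Hypothetical scheme: hash every leave-one-out subset of the feature list,
    so two texts whose feature lists differ in one term still share a fingerprint."""
    if len(features) <= 1:
        subsets = [features]
    else:
        subsets = [features[:i] + features[i + 1:] for i in range(len(features))]
    return {hashlib.md5(" ".join(s).encode("utf-8")).hexdigest() for s in subsets}


def jaccard(a, b):
    """Assumed similarity test: Jaccard overlap of the two term sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def find_near_duplicates(docs, k=4, threshold=0.6):
    """Return sorted index pairs (i, j), i < j, judged to be near-duplicates."""
    weights = idf_weights(docs)
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        for fp in fingerprints(feature_list(doc, weights, k)):
            buckets[fp].append(idx)
    pairs = set()
    # Only texts that landed in the same fingerprint bucket are ever compared.
    for ids in buckets.values():
        for i, j in combinations(ids, 2):
            if jaccard(docs[i], docs[j]) >= threshold:
                pairs.add((i, j))
    return sorted(pairs)


docs = [
    "cheap flights to paris book now",
    "cheap flights to paris book today",
    "meeting moved to friday afternoon",
]
print(find_near_duplicates(docs))  # → [(0, 1)]
```

Bucketing by fingerprint is what keeps the number of pairwise comparisons small: the third text never shares a bucket with the first two, so it is never compared against them. Whether this yields the near-linear behaviour reported in the abstract depends on the actual fingerprint scheme and optimization procedure, which this sketch does not reproduce.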

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.98.61...

Last time updated on 23/10/2014