Detecting Near-Duplicates in Large-Scale Short Text Databases

Abstract

Near-duplicates are abundant in short text databases, and detecting and eliminating them is of great importance. This paper proposes SimFinder, a fast algorithm that identifies all near-duplicates in large-scale short text databases. An ad hoc term weighting scheme measures each term's discriminative ability, and a certain number of terms are extracted to form a feature list for each short text. SimFinder generates several fingerprints for each feature list, and only texts that share a fingerprint are compared with each other. An optimization procedure makes SimFinder more efficient. Experiments indicate that SimFinder is an effective solution for short text duplicate detection, with almost linear time and storage complexity. Both its precision and its recall are promising.

Key words: duplicate detection; short text; term weighting; optimization
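
The pipeline the abstract outlines (weight terms, extract a feature list, generate several fingerprints, compare only texts that share a fingerprint) can be illustrated with a minimal sketch. This is not SimFinder itself, since the abstract gives neither the weighting scheme nor the fingerprint construction. The sketch assumes plain IDF as the term weight, builds fingerprints by hashing every subset of the feature list that omits one term, and verifies candidates with Jaccard similarity. All function names, the parameter `k`, and the threshold are hypothetical.

```python
import hashlib
import math
from collections import Counter, defaultdict
from itertools import combinations


def idf_weights(docs):
    """Assumed stand-in for the paper's ad hoc term weighting: plain IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def feature_list(doc, weights, k=4):
    """Keep the k highest-weighted distinct terms, in a canonical order."""
    terms = set(doc.split())
    top = sorted(terms, key=lambda t: (-weights.get(t, 0.0), t))[:k]
    return sorted(top)


def fingerprints(features, drop=1):
    """Hash every subset that omits `drop` features (hypothetical scheme)."""
    size = max(1, len(features) - drop)
    return {
        hashlib.md5(" ".join(subset).encode("utf-8")).hexdigest()
        for subset in combinations(features, size)
    }


def jaccard(a, b):
    """Token-set Jaccard similarity, used here as the verification step."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def near_duplicates(docs, k=4, threshold=0.6):
    """Return index pairs (i, j), i < j, of texts judged near-duplicates."""
    weights = idf_weights(docs)
    buckets = defaultdict(list)
    for i, doc in enumerate(docs):
        for fp in fingerprints(feature_list(doc, weights, k)):
            buckets[fp].append(i)
    pairs = set()
    # Only texts that landed in the same fingerprint bucket are compared.
    for ids in buckets.values():
        for i, j in combinations(ids, 2):
            if jaccard(docs[i], docs[j]) >= threshold:
                pairs.add((i, j))
    return sorted(pairs)


docs = [
    "cheap flights to paris book now",
    "cheap flights to paris book today",
    "weather forecast for london tomorrow",
]
print(near_duplicates(docs))
```

In this toy input the two flight texts differ in exactly the term that IDF ranks highest, so a single fingerprint over the full feature list would miss the pair. Hashing several subsets lets the texts still meet in one bucket, which is one plausible reason for generating more than one fingerprint per feature list.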
