Duplicate Detection in the Reuters Collection

By M. Sanderson

Abstract

While conducting some experiments with the Reuters collection, it was discovered that contained within it were a number of documents that were exact duplicates of each other (see Figure 1). A short study was conducted to try to discover how many such documents there were. The results of this study revealed that the notion of a duplicate document was not as simple as first thought.

The contents of this report are as follows. A brief review of previous duplicate detection research will be presented, followed by a description of the methods and results of the duplicate detection work conducted here. In addition, there is an appendix holding the document ids of the various types of duplicate found.

Publisher: Department of Computing Science
Year: 1997
OAI identifier: oai:eprints.whiterose.ac.uk:4571
