We describe a large-scale application of methods for finding plagiarism in
research document collections. The methods are applied to a collection of
284,834 documents collected by arXiv.org over a 14 year period, covering a few
different research disciplines. The methodology efficiently detects a variety
of problematic author behaviors, and heuristics are developed to reduce the
number of false positives. The methods are also efficient enough to implement
as a real-time submission screen for a collection many times larger.Comment: Sixth International Conference on Data Mining (ICDM'06), Dec 200