This paper presents an in-depth investigation of cross-language plagiarism detection
methods on a recently introduced open dataset, which contains parallel and
comparable collections of documents with multiple characteristics (different
genres, languages and sizes of texts). We investigate cross-language plagiarism
detection methods for 6 language pairs at 2 granularities of text units in
order to draw robust conclusions about the best methods while closely analyzing
correlations across document styles and languages.

Comment: Accepted to BUCC (10th Workshop on Building and Using Comparable
Corpora), co-located with ACL 201