2 research outputs found

    Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase

    The efficiency and scalability of plagiarism detection systems have become a major challenge because of the vast amount of textual data available on the Internet in many languages. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of the original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to both diverse languages and different types of obfuscation. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation captures semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are taken as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, sets a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show that performance improves when the obfuscation type is considered during threshold tuning. Accordingly, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their degree of obfuscation. With the online tuning approach, no separate training dataset is required. We applied the proposed method to available datasets in English, Persian and Arabic on the text alignment task, to evaluate its robustness from the language perspective as well. As our experimental results confirm, our efficient approach achieves considerable performance on the different datasets in various languages. Our online threshold tuning approach, which uses no training dataset, works as well as, or in some cases better than, the training-based method.

    The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).

    Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y
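    The seed-selection step described in the abstract above (aggregate word vectors per sentence, pair each suspicious sentence with its most similar source sentence, then filter by Jaccard similarity) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the aggregation is assumed to be a plain average, and the names `embed`, `find_seeds`, `cos_min` and `jac_min`, the threshold values, and the whitespace tokenization are all hypothetical. The online tuning and the merging of seeds are not shown.

```python
import math

def embed(sentence, vectors, dim):
    """Average the word vectors of the tokens that appear in `vectors`.
    Averaging is an assumed stand-in for the 'simple aggregation function'."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def jaccard(a, b):
    """Jaccard similarity of the word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def find_seeds(suspicious, source, vectors, dim, cos_min=0.8, jac_min=0.2):
    """Return (suspicious_idx, source_idx, similarity) for each suspicious
    sentence whose best-matching source sentence passes both thresholds.
    The threshold values are illustrative, not taken from the paper."""
    if not source:
        return []
    source_vecs = [embed(t, vectors, dim) for t in source]
    seeds = []
    for i, sent in enumerate(suspicious):
        s_vec = embed(sent, vectors, dim)
        best_j, best_sim = max(
            ((j, cosine(s_vec, t_vec)) for j, t_vec in enumerate(source_vecs)),
            key=lambda pair: pair[1],
        )
        if best_sim >= cos_min and jaccard(sent, source[best_j]) >= jac_min:
            seeds.append((i, best_j, best_sim))
    return seeds
```

    In this sketch `vectors` is any mapping from a word to a list of `dim` floats, for example vectors loaded from a pretrained embedding model for the language in question.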

    Efficient Methods for Plagiarism Detection in Large Corpora (Métodos eficientes de deteção de plágio em grandes corpora)

    The growing volume of information published on the Web, in the form of literary, scientific and academic publications, requires constant verification of the integrity of new (suspicious) documents against existing (source) documents. This creates a need to improve both the efficiency of reducing the search space in large sets of source documents and the effectiveness of detecting increasingly sophisticated plagiarism. This dissertation describes a methodology based on two stages: (i) indexing the source corpus with an open-source search engine and extracting candidate source documents by searching for relevant words and textual features; (ii) locating plagiarized passages in suspicious documents with a robust hybrid metric created by applying genetic programming to the characteristics of plagiarized data. The experimental results show a significant reduction in processing time, due to the stratification of the corpus, as well as a high success rate in detecting plagiarized passages of different kinds: literal, modified and obfuscated.
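    Stage (i) of the methodology above, reducing the search space by retrieving candidate source documents for a suspicious document, can be illustrated with a small in-memory inverted index. This is a rough sketch under assumptions, not the dissertation's system: the dissertation uses an open-source search engine that is not named here, and the function names `build_index` and `retrieve_candidates`, the parameters `top_k` and `n_terms`, and the rule for picking "relevant words" (most frequent tokens longer than three characters) are all hypothetical. The genetic-programming metric of stage (ii) is not shown.

```python
from collections import Counter, defaultdict

def build_index(source_docs):
    """Map each term to the set of source document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in source_docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def retrieve_candidates(index, suspicious_text, top_k=3, n_terms=5):
    """Rank source documents by how many query terms they share with the
    suspicious document and return the ids of the top_k matches."""
    # Assumed relevance heuristic: the most frequent tokens longer than 3 characters.
    counts = Counter(t for t in suspicious_text.lower().split() if len(t) > 3)
    query = [term for term, _ in counts.most_common(n_terms)]
    hits = Counter()
    for term in query:
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common(top_k)]
```

    Only the documents returned by `retrieve_candidates` would then be passed to the more expensive passage-level comparison, which is where the reduction in processing time would come from in a setup like this.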