    Counting co-occurrences in citations to identify plagiarised text fragments

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40802-1_19
    Research in external plagiarism detection is mainly concerned with comparing the textual contents of a suspicious document against the contents of a collection of original documents. More recently, methods that try to detect plagiarism based on citation patterns have been proposed. These methods are particularly useful for detecting plagiarism in scientific publications. In this work, we assess the value of identifying co-occurrences in citations by checking whether this method can identify cases of plagiarism in a dataset of scientific papers. Our results show that most of the cases in which co-occurrences were found indeed correspond to plagiarised passages.
    This work was partially funded by CNPq (478979/2012-6). Solange Pertile's 5-month internship at the NLE Lab of Universitat Politècnica de València was funded by CAPES. P. Rosso's work was carried out in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and the European Commission WIQ-EI IRSES (no. 269180) and DIANA-APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research projects. We thank the authors of [5] for sharing their dataset with us and Enrique Flores for the preliminary brainstorming on how to identify co-occurrences in citations.
    Pertile, SDL.; Rosso, P.; Moreira, VP. (2013). Counting co-occurrences in citations to identify plagiarised text fragments. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer Verlag (Germany). 150-154.
    https://doi.org/10.1007/978-3-642-40802-1_19
    CrossCheck, http://www.crossref.org/crosscheck/
    Journal of Zhejiang University-Science, http://www.zju.edu.cn/jzus/
    PAN, http://www.pan.webis.de
    Plagiarism corpus, http://www.c2learn.com/plagiarism/corpus/v1/
    Alzahrani, S., Palade, V., Salim, N., Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications. JASIST 63(2), 286–312 (2012)
    Barrón-Cedeño, A., Vila, M., Marti, A., Rosso, P.: Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics 39(4) (2013)
    Cortez, E., da Silva, A.S., Gonçalves, M.A., de Moura, E.S.: Ondux: on-demand unsupervised learning for information extraction. In: SIGMOD, pp. 807–818 (2010)
    Gipp, B., Meuschke, N.: Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence. In: DocEng, pp. 249–258 (2011)
    Gupta, P., Rosso, P.: Text reuse with ACL (upward) trends. In: ACL 2012 Special Workshop on Rediscovering 50 Years of Discoveries, pp. 76–82 (2012)
    McCabe, D.L.: Cheating among college and university students: A North American perspective. International Journal for Educational Integrity 1 (2005)
    Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Language Resources and Evaluation 45(1), 45–62 (2011)
    Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Stamatatos, E., Rosso, P., Stein, B.: Overview of the 5th International Competition on Plagiarism Detection. In: CLEF 2013 - Working Notes (September 2013)
    Ritt, M., Costa, A.M., Mergen, S., Orengo, V.M.: An integer linear programming approach for approximate string comparison. European Journal of Operational Research 198(3), 706–714 (2009)
    Zhang, Y.: Crosscheck: an effective tool for detecting plagiarism. Learned Publishing 23, 9–14 (2010)
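    The abstract of this entry does not spell out how co-occurrences in citations are counted. As a rough illustration only, the sketch below assumes one plausible reading: two citations co-occur when they appear in the same passage, and a pair of documents is flagged when the same citation pair co-occurs in both. The function names, the passage-level window, and the citation keys are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def citation_pairs(passages):
    """Collect the unordered citation pairs that appear together in a passage.

    `passages` is a list of passages, each given as a list of citation keys.
    (Assumption: a passage is the co-occurrence window.)
    """
    pairs = set()
    for cited in passages:
        # Sort so that (a, b) and (b, a) map to the same pair.
        pairs.update(combinations(sorted(set(cited)), 2))
    return pairs


def shared_cooccurrences(suspicious, source):
    """Citation pairs that co-occur in both documents."""
    return citation_pairs(suspicious) & citation_pairs(source)


# Hypothetical citation keys, one inner list per passage.
suspicious = [["gipp2011", "potthast2011"], ["zhang2010"]]
source = [["potthast2011", "gipp2011", "ritt2009"], ["mccabe2005"]]

common = shared_cooccurrences(suspicious, source)
print(len(common), sorted(common))
```

    A higher count of shared pairs would mark a document pair as a candidate for manual inspection; any threshold would have to be tuned on annotated data.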

    Combining content- and citation-based metrics for plagiarism detection in scientific papers

    No full text
    The large number of scientific documents available online makes it easier for students and researchers to reuse text from other authors, and makes it harder to verify the originality of a given text. Reusing text without crediting the source is considered plagiarism. A number of studies have reported on the high prevalence of plagiarism in academia. As a result, many institutions and researchers have worked on systems that automate the plagiarism detection process. Most existing work assesses plagiarism by analysing the similarity of the textual content of documents. More recently, similarity metrics have been proposed that ignore the text and analyse only the citations and/or bibliographic references shared between documents. However, cases in which the author does not reference the original source may go unnoticed by metrics based only on the analysis of references/citations. In this context, the proposed solution is based on the hypothesis that combining content similarity metrics with citation/reference similarity metrics can improve the quality of plagiarism detection. Two forms of combination are proposed: (i) the scores produced by the similarity metrics are used to rank pairs of documents, and (ii) the scores of the metrics are used to build feature vectors that machine learning algorithms use to classify the documents. The experiments were performed with real datasets of scientific papers. The experimental evaluation confirms the hypothesis when the machine-learning combination of the similarity metrics is compared with the simple combination. In addition, both combinations showed gains over the metrics applied individually.
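    The two forms of combination described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: Jaccard similarity over word trigrams (content) and over shared reference identifiers (citations), the plain average used for ranking, and all names and sample documents are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_ngrams(text, n=3):
    """Set of word n-grams of a lower-cased, whitespace-tokenised text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def feature_vector(doc_a, doc_b):
    """[content similarity, citation similarity] for a document pair.

    Form (ii): such a vector could be passed to a classifier.
    """
    content = jaccard(word_ngrams(doc_a["text"]), word_ngrams(doc_b["text"]))
    citation = jaccard(doc_a["refs"], doc_b["refs"])
    return [content, citation]


def rank_pairs(pairs):
    """Form (i): rank document pairs by the mean of their metric scores."""
    scored = [(sum(feature_vector(a, b)) / 2, a["id"], b["id"]) for a, b in pairs]
    return sorted(scored, reverse=True)


# Hypothetical documents with made-up reference identifiers.
d1 = {"id": "d1", "refs": ["r1", "r2", "r3"],
      "text": "plagiarism detection compares suspicious text against sources"}
d2 = {"id": "d2", "refs": ["r1", "r2"],
      "text": "plagiarism detection compares suspicious text against originals"}
d3 = {"id": "d3", "refs": ["r9"],
      "text": "integer linear programming for string comparison"}

ranked = rank_pairs([(d1, d3), (d1, d2)])
print(ranked[0][1:])  # ids of the most similar pair
```

    In the machine-learning variant, the vectors returned by `feature_vector` would be labelled and used to train a classifier in place of the fixed average.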

    Comparing and Combining Content- and Citation-based Approaches for Plagiarism Detection

    Full text link
    The vast number of scientific publications available online makes it easier for students and researchers to reuse text from other authors and makes it harder to check the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies report on the high prevalence of plagiarism in academia. As a consequence, numerous institutions and researchers are dedicated to devising systems to automate the process of checking for plagiarism. This work focuses on the problem of detecting text reuse in scientific papers. In this context, the contributions of this paper are twofold: (i) we survey the existing approaches for plagiarism detection based on content, based on content and structure, and based on citations and references; and (ii) we compare content- and citation-based approaches with the goal of evaluating whether they are complementary and whether their combination can improve the quality of the detection. We carried out experiments with real datasets of scientific papers and concluded that a combination of the methods can be beneficial.
    This work was funded by CNPq project 478979/2012-6. S. L. Pertile receives a grant from CAPES. We thank Parth Gupta for sharing his results with us. We are grateful to the anonymous reviewers who made several suggestions to improve the article. Finally, we thank the voluntary annotators for identifying the significant reuse cases.
    Pertile, SDL.; Moreira, VP.; Rosso, P. (2015). Comparing and Combining Content- and Citation-based Approaches for Plagiarism Detection. Journal of the Association for Information Science and Technology. 1-16. doi:10.1002/asi.23593