71 research outputs found

    Counting co-occurrences in citations to identify plagiarised text fragments

    Research in external plagiarism detection is mainly concerned with the comparison of the textual contents of a suspicious document against the contents of a collection of original documents. More recently, methods that try to detect plagiarism based on citation patterns have been proposed. These methods are particularly useful for detecting plagiarism in scientific publications. In this work, we assess the value of identifying co-occurrences in citations by checking whether this method can identify cases of plagiarism in a dataset of scientific papers. Our results show that most of the cases in which co-occurrences were found indeed correspond to plagiarised passages.
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40802-1_19
    This work was partially funded by CNPq (478979/2012-6). Solange Pertile's 5-month internship at the NLE Lab of Universitat Politècnica de València was funded by CAPES. P. Rosso's work was carried out in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and the European Commission WIQ-EI IRSES (no. 269180) and DIANA-APPLICATIONS: Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research projects. We thank the authors of [5] for sharing their dataset with us and Enrique Flores for the preliminary brainstorming on how to identify co-occurrences in citations.
    Pertile, SDL.; Rosso, P.; Moreira, VP. (2013). Counting co-occurrences in citations to identify plagiarised text fragments. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer Verlag (Germany). 150-154. https://doi.org/10.1007/978-3-642-40802-1_19
    CrossCheck, http://www.crossref.org/crosscheck/
    Journal of Zhejiang University-Science, http://www.zju.edu.cn/jzus/
    PAN, http://www.pan.webis.de
    Plagiarism corpus, http://www.c2learn.com/plagiarism/corpus/v1/
    Alzahrani, S., Palade, V., Salim, N., Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications. JASIST 63(2), 286–312 (2012)
    Barrón-Cedeño, A., Vila, M., Marti, A., Rosso, P.: Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics 39(4) (2013)
    Cortez, E., da Silva, A.S., Gonçalves, M.A., de Moura, E.S.: Ondux: on-demand unsupervised learning for information extraction. In: SIGMOD, pp. 807–818 (2010)
    Gipp, B., Meuschke, N.: Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence. In: DocEng, pp. 249–258 (2011)
    Gupta, P., Rosso, P.: Text reuse with ACL (upward) trends. In: ACL 2012 Special Workshop on Rediscovering 50 Years of Discoveries, pp. 76–82 (2012)
    McCabe, D.L.: Cheating among college and university students: A North American perspective. International Journal for Educational Integrity 1 (2005)
    Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-language plagiarism detection. Language Resources and Evaluation 45(1), 45–62 (2011)
    Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Stamatatos, E., Rosso, P., Stein, B.: Overview of the 5th International Competition on Plagiarism Detection. In: CLEF 2013 - Working Notes (September 2013)
    Ritt, M., Costa, A.M., Mergen, S., Orengo, V.M.: An integer linear programming approach for approximate string comparison. European Journal of Operational Research 198(3), 706–714 (2009)
    Zhang, Y.: Crosscheck: an effective tool for detecting plagiarism. Learned Publishing 23, 9–14 (2010)
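The abstract for this entry does not say how co-occurrences in citations are counted, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that citations have already been normalised to bracketed keys such as [Gipp2011] and that both documents are split into fragments; the names CITATION, citations_in, co_occurring_fragments and the min_shared threshold are hypothetical.

```python
import re

# Assumption: citations appear as bracketed author-year keys, e.g. [Gipp2011].
CITATION = re.compile(r"\[([A-Za-z]+\d{4})\]")


def citations_in(fragment):
    """Return the set of citation keys found in one text fragment."""
    return set(CITATION.findall(fragment))


def co_occurring_fragments(suspicious, source, min_shared=2):
    """List (i, j, shared) for fragment pairs sharing at least min_shared citations.

    suspicious and source are lists of text fragments (for example
    sentences or paragraphs); i and j index into those lists.
    """
    hits = []
    for i, susp_fragment in enumerate(suspicious):
        susp_cites = citations_in(susp_fragment)
        for j, src_fragment in enumerate(source):
            shared = susp_cites & citations_in(src_fragment)
            if len(shared) >= min_shared:
                hits.append((i, j, sorted(shared)))
    return hits
```

A fragment pair flagged this way is only a candidate for manual inspection; two passages can legitimately cite the same sources.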

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    An approach to source-code plagiarism detection investigation using latent semantic analysis

    This thesis looks at three aspects of source-code plagiarism. The first aspect is concerned with creating a definition of source-code plagiarism; the second is concerned with describing the findings gathered from investigating the Latent Semantic Analysis information retrieval algorithm for source-code similarity detection; and the final aspect is concerned with the proposal and evaluation of a new algorithm that combines Latent Semantic Analysis with plagiarism detection tools. A recent review of the literature revealed that there is no commonly agreed definition of what constitutes source-code plagiarism in the context of student assignments. This thesis first analyses the findings from a survey carried out to gain insight into the perspectives of UK Higher Education academics who teach programming on computing courses. Based on the survey findings, a detailed definition of source-code plagiarism is proposed. Secondly, the thesis investigates the application of an information retrieval technique, Latent Semantic Analysis, to derive semantic information from source-code files. Various parameters drive the effectiveness of Latent Semantic Analysis. The performance of Latent Semantic Analysis using various parameter settings and its effectiveness in retrieving similar source-code files when optimising those parameters are evaluated. Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility for investigating the importance of source-code fragments with regard to their contribution towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of suspicious files and source-code fragments.

    CITREC: An Evaluation Framework for Citation-Based Similarity Measures based on TREC Genomics and PubMed Central

    Citation-based similarity measures such as Bibliographic Coupling and Co-Citation are an integral component of many information retrieval systems. However, comparisons of the strengths and weaknesses of measures are challenging due to the lack of suitable test collections. This paper presents CITREC, an open evaluation framework for citation-based and text-based similarity measures. CITREC prepares the data from the PubMed Central Open Access Subset and the TREC Genomics collection for a citation-based analysis and provides tools necessary for performing evaluations of similarity measures. To account for different evaluation purposes, CITREC implements 35 citation-based and text-based similarity measures, and features two gold standards. The first gold standard uses the Medical Subject Headings (MeSH) thesaurus and the second uses the expert relevance feedback that is part of the TREC Genomics collection to gauge similarity. CITREC additionally offers a system that allows creating user-defined gold standards to adapt the evaluation framework to individual information needs and evaluation purposes.
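For readers unfamiliar with the two measures named in this abstract, here is a minimal sketch of their standard definitions on a toy citation graph. It is not CITREC code; the CITES data and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy citation graph: citing paper -> set of papers it cites.
CITES = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z"},
    "C": {"X", "W"},
}


def bibliographic_coupling(cites, a, b):
    """Bibliographic coupling: number of references that papers a and b share."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def co_citation_counts(cites):
    """Co-citation: for each pair of cited papers, count the papers citing both."""
    counts = defaultdict(int)
    for refs in cites.values():
        # Sorting makes each pair key order-independent, e.g. ("X", "Y").
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return dict(counts)
```

In this toy graph, A and B are coupled through the two references they share (Y and Z), while Y and Z are co-cited twice because both A and B cite them together.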

    La atribución de fuentes en la escritura académica de alumnos de grado : relevamiento de estrategias

    Academic writing has its own conventions and patterns which make it different from other types of writing. Composing academic essays and papers poses difficulties to students, who need specific instruction to acquire the rules on which different academic discourses are built. Every discipline has its own mechanisms that imply specific discourse strategies that function as models. According to Swales (1990), academic writing implies knowledge of the discipline's conventions. These conventions and discourse strategies will be present in the students' texts as essential elements in their future professional writing demands and should be acquired during their undergraduate years. This work explores key attribution strategies required in advanced EFL university student compositions and analyzes to what extent a group of learners at the National University of Córdoba (UNC) in Argentina use them in writing their own texts and to what degree they acknowledge and incorporate secondary sources in their work. The students were asked to write an assignment in class, as part of the requirements for the Language V writing project. Four basic linguistic resources of secondary source use were analyzed: citation, paraphrasing, quotation format, and the use of reporting verbs. The findings offer insights into student practices and suggest the need for greater and continuous pedagogical support to enable students to achieve competence in secondary source use.

    Investigating plagiarism in the academic context

    Doctoral thesis, Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Inglês: Estudos Linguísticos e Literários, Florianópolis, 2016.
    This thesis, organised into seven chapters, is about plagiarism in the academic context. It presents different perspectives to be considered in order to define plagiarism, and an investigation of its origin and specificities in academia. Then, two different panoramas of how plagiarism has been treated are presented: 1) at UFSC (Universidade Federal de Santa Catarina), where some problems were identified; and 2) at the UoB (University of Birmingham), as a series of consistent efforts have been made in the UK since 2002 to deal with plagiarism. The objective was to analyse the approach adopted at the UoB in order to support the development of suggestions to improve the situation at UFSC. It was thus possible to produce a proposal for UFSC to work on detection and prevention through the creation and adoption of anti-plagiarism policies. These policies include the establishment of specific institutional rules and of an institutional structure to deal with cases of plagiarism, the offer of courses on academic writing, and the guided use of detection software. The thesis also explores the difference between intentional and unintentional plagiarism, as well as some strategies that are used to conceal especially the former, such as translation. The intention was to emphasise aspects of plagiarism other than the ethical concerns that usually receive the focus, which are relevant but beyond the reach of teachers and linguists. Results pointed to the need for long-term changes in education, such as the teaching of academic writing skills, and also for shorter-term measures, such as the implementation of policies to better approach plagiarism in universities. Such measures may provide a more effective means to combat plagiarism.

    Enhancing computer-aided plagiarism detection

    Asimilación y diseminación de los artículos derivados de las tesis doctorales en medicina en la literatura científica

    The aim of this thesis is to investigate the publications derived from doctoral theses in medicine. The research examines the full text of a collection of theses and of articles published by the same authors in order to determine, through a textual similarity analysis, which articles originate from the doctoral theses; these are termed derived articles. Specifically, the research seeks to delimit the characteristics of these derived articles in terms of publication, dissemination, and assimilation in the scientific literature. The textual similarity analysis uses the discourse organisation of article sections, IMRaD (Introduction, Methodology, Results, Discussion), plus References. The software used to detect textual similarity between theses and articles automatically is the anti-plagiarism program Turnitin. Statistical inference analyses were used to determine which discourse sections are predictive in identifying derived articles. The thesis approaches the study from three perspectives: identification of the articles, based on textual similarity analysis; dissemination models, measured by visibility indicators of the journals in which the articles are published; and assimilation models, measured by indicators based on the citations the articles receive in the scientific literature. The thesis analyses the potential predictive power of bibliographic references in identifying derived articles. The results of the content-based textual similarity analysis are compared with those of the metric for the references section. Overall, the research examines a series of questions concerning derived articles, a topic that, to our knowledge, has been little investigated in the scientific literature. This thesis, whose objective is the identification of derived articles, could have potential applications for evaluating the publications of doctoral candidates and for giving universities an overview of the scientific output arising from doctoral theses.