8 research outputs found

    Approaches for Candidate Document Retrieval

    Plagiarism has become a serious problem, mainly because of the wide availability of electronic documents. Online document retrieval is an important part of a modern anti-plagiarism tool. This paper describes the architecture and concepts of a real-world document retrieval system that forms part of general anti-plagiarism software. A similar system was developed as part of a nationwide plagiarism detection solution at Masaryk University. The design can be adapted to many situations. The recommendations provided stem from several years of experience operating the system. Proper use of such systems contributes to a gradual improvement in the quality of student theses.

    Heterogeneous Queries for Synoptic and Phrasal Search

    This paper describes our approach to the Plagiarism Detection – Source Retrieval task of PAN 2014. We combined and improved the methodology we used at PAN 2012 and PAN 2013. Our system combines three types of queries: keyword-based queries, paragraph-based queries, and header-based queries. The queries are further distinguished by other properties, such as whether they are phrase or positional queries. According to these properties, the queries are submitted to one of two search engines, ChatNoir or Indri. A query's position in the document is used for search control; minimizing the total number of executed queries is the system's priority. Downloaded documents are textually compared with the suspicious document, and if a similarity is found, the downloaded document is reported.
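
    The construction and routing of the three query types can be illustrated by the following minimal sketch, assuming very simple tokenization and routing rules; the helper names, stop-word list, and thresholds are invented for illustration and are not taken from the paper.

    # Hypothetical sketch of three-way query construction and engine routing.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

    def keyword_queries(text, per_query=6, max_queries=5):
        # Most frequent non-stop-word terms, grouped into fixed-size keyword queries.
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
        ranked = [w for w, _ in Counter(words).most_common(per_query * max_queries)]
        return [" ".join(ranked[i:i + per_query]) for i in range(0, len(ranked), per_query)]

    def paragraph_queries(text, per_query=8):
        # The first few content words of each paragraph form an ordinary keyword query.
        queries = []
        for par in text.split("\n\n"):
            words = [w for w in re.findall(r"[a-z]+", par.lower()) if w not in STOP_WORDS]
            if len(words) >= per_query:
                queries.append(" ".join(words[:per_query]))
        return queries

    def header_queries(text):
        # Short non-empty lines are treated as headers and turned into phrase queries.
        return ['"%s"' % line.strip() for line in text.splitlines()
                if line.strip() and len(line.split()) <= 6]

    def route(query):
        # Phrase queries go to Indri; plain keyword queries go to ChatNoir.
        return "indri" if query.startswith('"') else "chatnoir"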

    Improving Synoptic Querying for Source Retrieval

    Source retrieval is the part of the plagiarism discovery process in which only a selected set of candidate documents is retrieved from a large corpus of potential source documents and passed on for detailed document comparison in order to highlight potential plagiarism. This paper describes the methodology and architecture of the source retrieval system developed for the PAN 2015 lab on uncovering plagiarism, authorship, and social software misuse. The system is based on our previous systems used at PAN since 2012; the majority of their features were adopted, with some improvements described in this paper. The paper analyzes the methodology used, discusses query performance, and explains many implementation settings in the source retrieval process. The source retrieval subsystem forms an integral part of a modern system for plagiarism discovery.

    Determining Window Size from Plagiarism Corpus for Stylometric Features

    The sliding window concept is a common method for computing a profile of a document with unknown structure. This paper outlines an experiment with a stylometric word-based feature in order to determine an optimal size of the sliding window. It was conducted for a vocabulary richness method called ‘average word frequency class’, using the PAN 2015 source retrieval training corpus for plagiarism detection. The paper shows the pros and cons of stop word removal for sliding-window document profiling and discusses the use of the selected feature for intrinsic plagiarism detection. The experiment resulted in a recommendation to set the sliding window to around 100 words in length when computing a text profile with the average word frequency class stylometric feature.
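
    As a rough illustration of the feature itself, the sketch below computes an average word frequency class profile over sliding windows. It is a simplified, self-contained approximation in which frequency classes are derived from the document alone rather than from a reference corpus, and the window and step sizes are placeholders.

    # Hypothetical sketch of sliding-window profiling with the average word
    # frequency class (AWFC) feature.
    import math
    import re
    from collections import Counter

    def awfc_profile(text, window=100, step=20):
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(words)
        f_max = max(freq.values())
        # Frequency class of a word: floor(log2(f_most_frequent / f_word)).
        cls = {w: math.floor(math.log2(f_max / f)) for w, f in freq.items()}
        profile = []
        for start in range(0, max(1, len(words) - window + 1), step):
            chunk = words[start:start + window]
            profile.append(sum(cls[w] for w in chunk) / len(chunk))
        return profile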

    Three Way Search Engine Queries with Multi-feature Document Comparison for Plagiarism Detection

    In this paper, we describe our approach at the PAN 2012 plagiarism detection competition. Our candidate retrieval system is based on the extraction of three different types of Web queries, and it narrows their execution by skipping certain passages of the input document. We created queries based on keyword extraction, intrinsic plagiarism detection, and header extraction, and we compared the performance of the constructed queries during the PAN 2012 test process. The proposed methodology was the best performing one in long-term operation and also the most cost-effective. Our detailed comparison system is based on detecting common features of several types in the input document pair (in the final submission we used two types of features: sorted word 5-grams and unsorted stop word 8-grams). We propose a method of computing so-called valid intervals from those features, represented by their offset and length attributes in both the source and the suspicious document. Previous works use the ordering of features as the measure of distance, which is not usable for multiple types of features that have no natural ordering among them. From the valid intervals we compute the final detections in a post-processing phase, where we merge neighbouring valid intervals and remove certain types of overlapping detections. We further discuss other approaches that we explored but did not use in our final submission. We also discuss the performance aspects of our program, the parameter settings, and the relevance of the current PAN 2012 rules (including the plagdet score) to real-world plagiarism detection systems.
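
    A minimal sketch of the common-feature idea follows: it matches sorted word 5-grams between a source and a suspicious document and merges nearby matches into intervals. The gap threshold, the restriction to a single feature type, and the merging rule are simplified placeholders, not the validity rules used in the submission.

    # Hypothetical sketch: sorted word 5-gram matching and interval merging.
    import re
    from collections import defaultdict

    def word_spans(text):
        # (word, start offset, end offset) for every word in the text.
        return [(m.group(0).lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

    def sorted_ngrams(text, n=5):
        # Map each sorted word n-gram to the character offsets of its occurrences.
        spans = word_spans(text)
        grams = defaultdict(list)
        for i in range(len(spans) - n + 1):
            key = tuple(sorted(w for w, _, _ in spans[i:i + n]))
            start, end = spans[i][1], spans[i + n - 1][2]
            grams[key].append((start, end - start))
        return grams

    def common_feature_intervals(source, suspicious, n=5, gap=200):
        src, susp = sorted_ngrams(source, n), sorted_ngrams(suspicious, n)
        matches = sorted(off for key in src.keys() & susp.keys() for off in susp[key])
        # Merge matches that lie closer than `gap` characters in the suspicious text.
        intervals = []
        for start, length in matches:
            if intervals and start - (intervals[-1][0] + intervals[-1][1]) <= gap:
                prev_start, prev_len = intervals[-1]
                intervals[-1] = (prev_start, max(prev_len, start + length - prev_start))
            else:
                intervals.append((start, length))
        return intervals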

    Institutional Repository Driven by Access Rights as a Part of Plagiarism Detection Systems

    Masaryk University (MU) has developed an institutional repository with a plagiarism detection service as an extension of the university information system (IS). The repository offers various options for storing research outputs and eventually publishing them in accordance with copyright. Setting the access mode is managed by approval process support in the repository; the university therefore had to define rules and processes for proposing and approving access modes in order to set the proper access rights. The article advocates the hypothesis that the implementation of a university repository must focus not only on technical tasks but also on methodological tasks. The paper describes both kinds of tasks as well as the benefits of deploying an institutional repository driven by access rights, where some files can be hidden from common users. Our approach is based on the idea that even inaccessible files remain usable in a limited access mode and are valuable sources for plagiarism detection tools and related services.

    Electronegativity Equalization Method: Parameterization and Validation for Large Sets of Organic, Organohalogene and Organometal Molecule

    The Electronegativity Equalization Method (EEM) is a fast approach for charge calculation. A challenging part of the EEM is its parameterization, which is performed using ab initio charges obtained for a set of molecules. The goal of our work was to perform the EEM parameterization for selected sets of organic, organohalogen and organometal molecules. We have performed the most robust parameterization published so far. The EEM parameterization was based on 12 training sets selected from a database of predicted 3D structures (NCI DIS) and from a database of crystallographic structures (CSD); each set contained from 2,000 to 6,000 molecules. We have shown that the number of molecules in the training set is very important for the quality of the parameters. We have improved the EEM parameters (STO-3G MPA charges) for elements that were already parameterized.
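
    For context, the charge calculation that these parameters feed can be written as a small linear system; the sketch below solves it for one molecule. The A, B and kappa values are per-parameterization constants, and the example numbers are invented placeholders rather than the parameters derived in the paper.

    # Hypothetical sketch of the core EEM linear system:
    #   B_i*q_i + kappa*sum_{j!=i}(q_j/R_ij) - chi_bar = -A_i   for each atom i,
    #   sum_i q_i = Q_total                                      (charge conservation).
    import numpy as np

    def eem_charges(coords, A, B, kappa, total_charge=0.0):
        """coords: (N, 3) atom positions; A, B: per-atom EEM parameters."""
        coords = np.asarray(coords, dtype=float)
        n = len(coords)
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        M = np.zeros((n + 1, n + 1))
        rhs = np.zeros(n + 1)
        for i in range(n):
            M[i, :n] = kappa / np.where(dist[i] > 0, dist[i], 1.0)
            M[i, i] = B[i]                 # overwrites the dummy diagonal entry
            M[i, n] = -1.0                 # column for the equalized electronegativity
            rhs[i] = -A[i]
        M[n, :n] = 1.0                     # total molecular charge constraint
        rhs[n] = total_charge
        solution = np.linalg.solve(M, rhs)
        return solution[:n]                # atomic charges; solution[n] is chi_bar

    # Example with made-up parameters for a two-atom molecule:
    # eem_charges([[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]], A=[2.4, 3.1], B=[1.0, 1.2], kappa=0.5)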

    Academic Plagiarism Detection
