8 research outputs found

    Approaches for Candidate Document Retrieval

    Plagiarism has become a serious problem, mainly because of the wide availability of electronic documents. Online document retrieval is an important part of a modern anti-plagiarism tool. This paper describes the architecture and concepts of a real-world document retrieval system that forms part of general anti-plagiarism software. A similar system was developed as part of a nationwide plagiarism detection solution at Masaryk University. The design can be adapted to many situations. The recommendations provided stem from several years of experience operating the system. Proper use of such systems contributes to a gradual improvement in the quality of student theses.

    Heterogeneous Queries for Synoptic and Phrasal Search

    This paper describes our approach to the Plagiarism Detection – Source Retrieval task of PAN 2014. We combined and improved the methodology we used at PAN 2012 and PAN 2013. Our system combines three types of queries: keyword-based queries, paragraph-based queries, and header-based queries. The queries are further distinguished by other properties, such as whether they are phrase or positional queries. According to these properties, the queries are submitted to one of two search engines, ChatNoir or Indri. A query's position in the document is used for search control; minimizing the total number of executed queries is the system's priority. Downloaded documents are textually compared with the suspicious document, and if a similarity is found, the downloaded document is reported.
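
    The construction and routing of the three query types can be illustrated by the following minimal sketch, assuming very simple tokenization and routing rules; the helper names, stop-word list, and thresholds are invented for illustration and are not taken from the paper.

    # Hypothetical sketch of three-way query construction and engine routing.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

    def keyword_queries(text, per_query=6, max_queries=5):
        # Most frequent non-stop-word terms, grouped into fixed-size keyword queries.
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
        ranked = [w for w, _ in Counter(words).most_common(per_query * max_queries)]
        return [" ".join(ranked[i:i + per_query]) for i in range(0, len(ranked), per_query)]

    def paragraph_queries(text, per_query=8):
        # The first few content words of each paragraph form an ordinary keyword query.
        queries = []
        for par in text.split("\n\n"):
            words = [w for w in re.findall(r"[a-z]+", par.lower()) if w not in STOP_WORDS]
            if len(words) >= per_query:
                queries.append(" ".join(words[:per_query]))
        return queries

    def header_queries(text):
        # Short non-empty lines are treated as headers and turned into phrase queries.
        return ['"%s"' % line.strip() for line in text.splitlines()
                if line.strip() and len(line.split()) <= 6]

    def route(query):
        # Phrase queries go to Indri; plain keyword queries go to ChatNoir.
        return "indri" if query.startswith('"') else "chatnoir"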

    Improving Synoptic Querying for Source Retrieval

    Source retrieval is the part of the plagiarism discovery process in which only a selected set of candidate documents is retrieved from a large corpus of potential source documents and passed on for detailed document comparison in order to highlight potential plagiarism. This paper describes the methodology and architecture of the source retrieval system developed for the PAN 2015 lab on uncovering plagiarism, authorship, and social software misuse. The system is based on our previous systems used at PAN since 2012; the majority of their features were adopted, with some improvements described in this paper. The paper analyzes the methodology used, discusses query performance, and explains many implementation settings in the source retrieval process. The source retrieval subsystem forms an integral part of a modern system for plagiarism discovery.

    Determining Window Size from Plagiarism Corpus for Stylometric Features

    The sliding window concept is a common method for computing a profile of a document with unknown structure. This paper outlines an experiment with a stylometric word-based feature in order to determine an optimal size of the sliding window. It was conducted for a vocabulary richness method called ‘average word frequency class’, using the PAN 2015 source retrieval training corpus for plagiarism detection. The paper shows the pros and cons of stop word removal for sliding-window document profiling and discusses the use of the selected feature for intrinsic plagiarism detection. The experiment resulted in a recommendation to set the sliding window to around 100 words in length when computing a text profile with the average word frequency class stylometric feature.
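
    As a rough illustration of the feature itself, the sketch below computes an average word frequency class profile over sliding windows. It is a simplified, self-contained approximation in which frequency classes are derived from the document alone rather than from a reference corpus, and the window and step sizes are placeholders.

    # Hypothetical sketch of sliding-window profiling with the average word
    # frequency class (AWFC) feature.
    import math
    import re
    from collections import Counter

    def awfc_profile(text, window=100, step=20):
        words = re.findall(r"[a-z']+", text.lower())
        freq = Counter(words)
        f_max = max(freq.values())
        # Frequency class of a word: floor(log2(f_most_frequent / f_word)).
        cls = {w: math.floor(math.log2(f_max / f)) for w, f in freq.items()}
        profile = []
        for start in range(0, max(1, len(words) - window + 1), step):
            chunk = words[start:start + window]
            profile.append(sum(cls[w] for w in chunk) / len(chunk))
        return profile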

    Three Way Search Engine Queries with Multi-feature Document Comparison for Plagiarism Detection

    In this paper, we describe our approach at the PAN 2012 plagiarism detection competition. Our candidate retrieval system is based on the extraction of three different types of Web queries, and it narrows their execution by skipping certain passages of the input document. We created queries based on keyword extraction, intrinsic plagiarism detection, and header extraction, and we compared the performance of the constructed queries during the PAN 2012 test process. The proposed methodology was the best performing one in long-term operation and also the most cost-effective. Our detailed comparison system is based on detecting common features of several types in the input document pair (in the final submission we used two types of features: sorted word 5-grams and unsorted stop word 8-grams). We propose a method of computing so-called valid intervals from those features, represented by their offset and length attributes in both the source and the suspicious document. Previous works use the ordering of features as the measure of distance, which is not usable for multiple types of features that have no natural ordering among them. From the valid intervals we compute the final detections in a post-processing phase, where we merge neighbouring valid intervals and remove certain types of overlapping detections. We further discuss other approaches that we explored but did not use in our final submission. We also discuss the performance aspects of our program, the parameter settings, and the relevance of the current PAN 2012 rules (including the plagdet score) to real-world plagiarism detection systems.
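
    A minimal sketch of the common-feature idea follows: it matches sorted word 5-grams between a source and a suspicious document and merges nearby matches into intervals. The gap threshold, the restriction to a single feature type, and the merging rule are simplified placeholders, not the validity rules used in the submission.

    # Hypothetical sketch: sorted word 5-gram matching and interval merging.
    import re
    from collections import defaultdict

    def word_spans(text):
        # (word, start offset, end offset) for every word in the text.
        return [(m.group(0).lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

    def sorted_ngrams(text, n=5):
        # Map each sorted word n-gram to the character offsets of its occurrences.
        spans = word_spans(text)
        grams = defaultdict(list)
        for i in range(len(spans) - n + 1):
            key = tuple(sorted(w for w, _, _ in spans[i:i + n]))
            start, end = spans[i][1], spans[i + n - 1][2]
            grams[key].append((start, end - start))
        return grams

    def common_feature_intervals(source, suspicious, n=5, gap=200):
        src, susp = sorted_ngrams(source, n), sorted_ngrams(suspicious, n)
        matches = sorted(off for key in src.keys() & susp.keys() for off in susp[key])
        # Merge matches that lie closer than `gap` characters in the suspicious text.
        intervals = []
        for start, length in matches:
            if intervals and start - (intervals[-1][0] + intervals[-1][1]) <= gap:
                prev_start, prev_len = intervals[-1]
                intervals[-1] = (prev_start, max(prev_len, start + length - prev_start))
            else:
                intervals.append((start, length))
        return intervals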

    Institutional Repository Driven by Access Rights as a Part of Plagiarism Detection Systems

    Masaryk University (MU) has developed an institutional repository with a plagiarism detection service as an extension of the university information system (IS). The repository offers various options for storing research outputs and eventually publishing them in accordance with copyright. Setting the access mode is managed by approval process support in the repository; the university therefore had to define rules and processes for proposing and approving access modes in order to set the proper access rights. The article advocates the hypothesis that the implementation of a university repository must focus not only on technical tasks but also on methodological tasks. The paper describes both kinds of tasks as well as the benefits of deploying an institutional repository driven by access rights, where some files can be hidden from common users. Our approach is based on the idea that even inaccessible files remain usable in a limited access mode and are valuable sources for plagiarism detection tools and related services.

    Electronegativity Equalization Method: Parameterization and Validation for Large Sets of Organic, Organohalogene and Organometal Molecule

    The Electronegativity Equalization Method (EEM) is a fast approach for charge calculation. A challenging part of the EEM is its parameterization, which is performed using ab initio charges obtained for a set of molecules. The goal of our work was to perform the EEM parameterization for selected sets of organic, organohalogen and organometal molecules. We have performed the most robust parameterization published so far. The EEM parameterization was based on 12 training sets selected from a database of predicted 3D structures (NCI DIS) and from a database of crystallographic structures (CSD); each set contained from 2,000 to 6,000 molecules. We have shown that the number of molecules in the training set is very important for the quality of the parameters. We have improved the EEM parameters (STO-3G MPA charges) for elements that were already parameterized.
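
    For context, the charge calculation that these parameters feed can be written as a small linear system; the sketch below solves it for one molecule. The A, B and kappa values are per-parameterization constants, and the example numbers are invented placeholders rather than the parameters derived in the paper.

    # Hypothetical sketch of the core EEM linear system:
    #   B_i*q_i + kappa*sum_{j!=i}(q_j/R_ij) - chi_bar = -A_i   for each atom i,
    #   sum_i q_i = Q_total                                      (charge conservation).
    import numpy as np

    def eem_charges(coords, A, B, kappa, total_charge=0.0):
        """coords: (N, 3) atom positions; A, B: per-atom EEM parameters."""
        coords = np.asarray(coords, dtype=float)
        n = len(coords)
        dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        M = np.zeros((n + 1, n + 1))
        rhs = np.zeros(n + 1)
        for i in range(n):
            M[i, :n] = kappa / np.where(dist[i] > 0, dist[i], 1.0)
            M[i, i] = B[i]                 # overwrites the dummy diagonal entry
            M[i, n] = -1.0                 # column for the equalized electronegativity
            rhs[i] = -A[i]
        M[n, :n] = 1.0                     # total molecular charge constraint
        rhs[n] = total_charge
        solution = np.linalg.solve(M, rhs)
        return solution[:n]                # atomic charges; solution[n] is chi_bar

    # Example with made-up parameters for a two-atom molecule:
    # eem_charges([[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]], A=[2.4, 3.1], B=[1.0, 1.2], kappa=0.5)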

    Academic Plagiarism Detection
