
    Dynamic Thresholding Mechanisms for IR-Based Filtering in Efficient Source Code Plagiarism Detection

    To solve the time-inefficiency issue, string-matching-based source code plagiarism detection compares only potential pairs, where potentiality is defined through a fast yet order-insensitive similarity measurement (adapted from Information Retrieval) and only pairs whose similarity degrees are greater than or equal to a particular threshold are selected. Defining such a threshold is not trivial, since it should yield a high efficiency improvement with low effectiveness reduction (if any reduction is unavoidable). This paper proposes two thresholding mechanisms, namely a range-based and a pair-count-based mechanism, that dynamically tune the threshold based on the distribution of the resulting similarity degrees. According to our evaluation, both mechanisms are more practical than manual threshold assignment, since they are more proportional to efficiency improvement and effectiveness reduction.
    Comment: The 2018 International Conference on Advanced Computer Science and Information Systems (ICACSIS
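A minimal sketch of the range-based idea described above, assuming (this is an illustrative assumption, not the paper's exact mechanism) that the threshold is placed a fixed fraction of the way between the lowest and highest observed similarity degrees:

```python
# Sketch of a range-based dynamic threshold: the cutoff adapts to the
# spread of the observed similarity degrees instead of being fixed by hand.
# The fraction value 0.75 is an illustrative assumption, not the paper's.
def range_based_threshold(similarities, fraction=0.75):
    lo, hi = min(similarities), max(similarities)
    return lo + fraction * (hi - lo)

def select_potential_pairs(pair_similarities, fraction=0.75):
    """Keep only pairs whose similarity degree meets the dynamic threshold."""
    threshold = range_based_threshold(pair_similarities.values(), fraction)
    return {pair: s for pair, s in pair_similarities.items() if s >= threshold}

sims = {("a", "b"): 0.92, ("a", "c"): 0.40, ("b", "c"): 0.85}
print(select_potential_pairs(sims))  # only the high-similarity pairs survive
```

Because the cutoff moves with the distribution, a batch of mostly dissimilar submissions keeps few pairs for the expensive string-matching stage, which is the efficiency gain the abstract describes.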

    Viewing functions as token sequences to highlight similarities in source code

    The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem, since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique based not only on token sequence matching but also on the factorization of the function call graphs. The factorization process merges shared chunks (factors) of code to cope, in particular, with inlining and outlining. The resulting call graph offers a view of the similarities together with their nesting relations, and it is useful to infer metrics quantifying similarity at the function level.
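The token-sequence half of the approach can be sketched as follows. This is not the paper's algorithm, only an illustration of viewing function bodies as token sequences and scoring their overlap; the crude regex tokenizer is an assumption for the example:

```python
import difflib
import re

# View each function body as a token sequence, so identifier renaming
# changes individual tokens but leaves most of the sequence intact.
def tokenize(source: str):
    # Identifiers, integer literals, or single non-space characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def token_similarity(fn_a: str, fn_b: str) -> float:
    """Ratio in [0, 1] of matching tokens between two function bodies."""
    a, b = tokenize(fn_a), tokenize(fn_b)
    return difflib.SequenceMatcher(None, a, b).ratio()

original = "def total(xs): return sum(xs)"
renamed  = "def amount(values): return sum(values)"
print(round(token_similarity(original, renamed), 2))
```

A renamed copy still scores highly because only the identifier tokens differ; handling inlining and outlining is what the call-graph factorization adds on top.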

    The similarity metric

    A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and that it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness, we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history; this results in the first completely automatically computed whole mitochondrial phylogeny tree. Second, we fully automatically compute the language tree of 52 different languages.
    Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected version, to appear in IEEE Trans Inform. T
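In practice, the noncomputable Kolmogorov complexity K(x) is approximated by the compressed length C(x) of a real compressor, giving the normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A small sketch using zlib (DEFLATE, the same family as gzip) in place of the compressors named in the abstract:

```python
import zlib

# Normalized compression distance: a computable stand-in for the
# normalized information distance, using compressed length as a
# practical approximation of Kolmogorov complexity.
def C(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy dog " * 19 + b"a lazy cat sleeps "
c = bytes(range(256)) * 4
print(ncd(a, b) < ncd(a, c))  # similar inputs score closer to 0
```

Values near 0 indicate that one input compresses well given the other (they share structure); values near 1 indicate no exploitable similarity. With real compressors the result can slightly exceed 1.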

    Hunting for Pirated Software Using Metamorphic Analysis

    In this paper, we consider the problem of detecting software that has been pirated and modified. We analyze a variety of detection techniques that have previously been studied in the context of malware detection. For each technique, we empirically determine the detection rate as a function of the degree of modification of the original code. We show that the code must be modified extensively before we fail to distinguish it reliably, and that our results offer a significant improvement over previous related work. Our approach can be applied retroactively to any existing software; hence, it is both practical and effective.

    Detecting and Analyzing Text Reuse with BLAST

    In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biological sequences such as DNA and protein sequences. I explain the original BLAST algorithm in depth, step by step, and also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST-based text reuse detection method by comparing it against the previous state of the art and show that the proposed method beats it by a large margin. I apply the method to a dataset of 3 million documents of scanned Finnish newspapers and journals, which have been converted into text using OCR (Optical Character Recognition) software. I categorize the results into three categories: everyday text reuse, long-term reuse, and viral news. I describe each category, provide examples, and propose a novel method of calculating a virality score for the clusters.
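The core BLAST idea of seed-and-extend carries over to text directly: find short exact matches (seeds) shared by two documents, then extend each seed outward while the texts keep agreeing. A toy sketch of that idea, not the thesis's implementation (which handles OCR noise and uses BLAST's scored, gapped extension):

```python
# Toy seed-and-extend in the spirit of BLAST: exact k-mer seeds, then
# ungapped extension of each seed left and right while characters match.
def find_seeds(a: str, b: str, k: int = 8):
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            yield i, j

def extend(a: str, b: str, i: int, j: int, k: int = 8):
    """Grow an exact k-mer seed into a maximal exact match."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and a[start_i - 1] == b[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + k, j + k
    while end_i < len(a) and end_j < len(b) and a[end_i] == b[end_j]:
        end_i += 1
        end_j += 1
    return a[start_i:end_i]

a = "breaking news: the parliament passed the bill today"
b = "reprint: the parliament passed the bill yesterday evening"
matches = {extend(a, b, i, j) for i, j in find_seeds(a, b)}
print(max(matches, key=len))
```

The k-mer index makes seeding fast enough to scale, which is why the same strategy that searches genome databases can cluster reused passages across millions of newspaper pages.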