
    Dynamic Thresholding Mechanisms for IR-Based Filtering in Efficient Source Code Plagiarism Detection

    To solve the time-inefficiency issue, string-matching-based source code plagiarism detection compares only potential pairs, where potentiality is defined through a fast yet order-insensitive similarity measurement (adapted from Information Retrieval) and only pairs whose similarity degrees are greater than or equal to a particular threshold are selected. Defining such a threshold is not trivial, since it should yield a high efficiency improvement with low effectiveness reduction (if any reduction is unavoidable). This paper proposes two thresholding mechanisms, namely a range-based and a pair-count-based mechanism, that dynamically tune the threshold based on the distribution of the resulting similarity degrees. According to our evaluation, both mechanisms are more practical than manual threshold assignment, since they are more proportional to efficiency improvement and effectiveness reduction.
    Comment: The 2018 International Conference on Advanced Computer Science and Information Systems (ICACSIS
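A minimal sketch of the range-based idea described above, assuming (this is an illustrative assumption, not the paper's exact mechanism) that the threshold is placed a fixed fraction of the way between the lowest and highest observed similarity degrees:

```python
# Sketch of a range-based dynamic threshold: the cutoff adapts to the
# spread of the observed similarity degrees instead of being fixed by hand.
# The fraction value 0.75 is an illustrative assumption, not the paper's.
def range_based_threshold(similarities, fraction=0.75):
    lo, hi = min(similarities), max(similarities)
    return lo + fraction * (hi - lo)

def select_potential_pairs(pair_similarities, fraction=0.75):
    """Keep only pairs whose similarity degree meets the dynamic threshold."""
    threshold = range_based_threshold(pair_similarities.values(), fraction)
    return {pair: s for pair, s in pair_similarities.items() if s >= threshold}

sims = {("a", "b"): 0.92, ("a", "c"): 0.40, ("b", "c"): 0.85}
print(select_potential_pairs(sims))  # only the high-similarity pairs survive
```

Because the cutoff moves with the distribution, a batch of mostly dissimilar submissions keeps few pairs for the expensive string-matching stage, which is the efficiency gain the abstract describes.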

    Viewing functions as token sequences to highlight similarities in source code

    The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem, since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique based not only on token sequence matching but also on the factorization of the function call graphs. The factorization process merges shared chunks (factors) of code to cope, in particular, with inlining and outlining. The resulting call graph offers a view of the similarities together with their nesting relations, and it is useful to infer metrics quantifying similarity at the function level.
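The token-sequence half of the approach can be sketched as follows. This is not the paper's algorithm, only an illustration of viewing function bodies as token sequences and scoring their overlap; the crude regex tokenizer is an assumption for the example:

```python
import difflib
import re

# View each function body as a token sequence, so identifier renaming
# changes individual tokens but leaves most of the sequence intact.
def tokenize(source: str):
    # Identifiers, integer literals, or single non-space characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def token_similarity(fn_a: str, fn_b: str) -> float:
    """Ratio in [0, 1] of matching tokens between two function bodies."""
    a, b = tokenize(fn_a), tokenize(fn_b)
    return difflib.SequenceMatcher(None, a, b).ratio()

original = "def total(xs): return sum(xs)"
renamed  = "def amount(values): return sum(values)"
print(round(token_similarity(original, renamed), 2))
```

A renamed copy still scores highly because only the identifier tokens differ; handling inlining and outlining is what the call-graph factorization adds on top.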

    The similarity metric

    A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and that it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness, we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history; this results in the first completely automatically computed whole mitochondrial phylogeny tree. Second, we fully automatically compute the language tree of 52 different languages.
    Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected version, to appear in IEEE Trans Inform. T
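In practice, the noncomputable Kolmogorov complexity K(x) is approximated by the compressed length C(x) of a real compressor, giving the normalized compression distance NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A small sketch using zlib (DEFLATE, the same family as gzip) in place of the compressors named in the abstract:

```python
import zlib

# Normalized compression distance: a computable stand-in for the
# normalized information distance, using compressed length as a
# practical approximation of Kolmogorov complexity.
def C(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy dog " * 19 + b"a lazy cat sleeps "
c = bytes(range(256)) * 4
print(ncd(a, b) < ncd(a, c))  # similar inputs score closer to 0
```

Values near 0 indicate that one input compresses well given the other (they share structure); values near 1 indicate no exploitable similarity. With real compressors the result can slightly exceed 1.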

    Hunting for Pirated Software Using Metamorphic Analysis

    In this paper, we consider the problem of detecting software that has been pirated and modified. We analyze a variety of detection techniques that have previously been studied in the context of malware detection. For each technique, we empirically determine the detection rate as a function of the degree of modification of the original code. We show that the code must be modified extensively before we fail to distinguish it reliably, and that our results offer a significant improvement over previous related work. Our approach can be applied retroactively to any existing software; hence, it is both practical and effective.

    Detecting and Analyzing Text Reuse with BLAST

    In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biological sequences such as DNA and protein sequences. I explain the original BLAST algorithm in depth, step by step, and also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST-based text reuse detection method by comparing it against the previous state of the art and show that the proposed method beats it by a large margin. I apply the method to a dataset of 3 million documents of scanned Finnish newspapers and journals, which have been converted into text using OCR (Optical Character Recognition) software. I categorize the results into three categories: everyday text reuse, long-term reuse, and viral news. I describe each category, provide examples, and propose a novel method of calculating a virality score for the clusters.
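The core BLAST idea of seed-and-extend carries over to text directly: find short exact matches (seeds) shared by two documents, then extend each seed outward while the texts keep agreeing. A toy sketch of that idea, not the thesis's implementation (which handles OCR noise and uses BLAST's scored, gapped extension):

```python
# Toy seed-and-extend in the spirit of BLAST: exact k-mer seeds, then
# ungapped extension of each seed left and right while characters match.
def find_seeds(a: str, b: str, k: int = 8):
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            yield i, j

def extend(a: str, b: str, i: int, j: int, k: int = 8):
    """Grow an exact k-mer seed into a maximal exact match."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and a[start_i - 1] == b[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + k, j + k
    while end_i < len(a) and end_j < len(b) and a[end_i] == b[end_j]:
        end_i += 1
        end_j += 1
    return a[start_i:end_i]

a = "breaking news: the parliament passed the bill today"
b = "reprint: the parliament passed the bill yesterday evening"
matches = {extend(a, b, i, j) for i, j in find_seeds(a, b)}
print(max(matches, key=len))
```

The k-mer index makes seeding fast enough to scale, which is why the same strategy that searches genome databases can cluster reused passages across millions of newspaper pages.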