12 research outputs found

    Dynamic Thresholding Mechanisms for IR-Based Filtering in Efficient Source Code Plagiarism Detection

    To address the time inefficiency of string-matching-based source code plagiarism detection, only potential pairs are compared; potentiality is defined through a fast yet order-insensitive similarity measurement adapted from Information Retrieval, and only pairs whose similarity degrees are greater than or equal to a particular threshold are selected. Defining such a threshold is not trivial, since the threshold should yield a high efficiency improvement and, if effectiveness reduction is unavoidable, keep that reduction low. This paper proposes two thresholding mechanisms, a range-based and a pair-count-based mechanism, that dynamically tune the threshold based on the distribution of the resulting similarity degrees. According to our evaluation, both mechanisms are more practical to use than manual threshold assignment, since they relate more proportionally to efficiency improvement and effectiveness reduction. Comment: The 2018 International Conference on Advanced Computer Science and Information Systems (ICACSIS).
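    The abstract does not give the exact formulas, so the following is only a minimal Python sketch of the two ideas as described: a range-based cut-off placed at a fixed position within the observed range of similarity degrees, and a pair-count-based cut-off that keeps a fixed fraction of the highest-scoring pairs. The function names and the 0.75 and 0.10 parameters are illustrative assumptions, not values from the paper.

        # Hypothetical sketch of the two dynamic thresholding mechanisms.
        def range_based_threshold(similarities, position=0.75):
            # Place the threshold at a fixed position within the observed
            # range of similarity degrees (range-based mechanism).
            lo, hi = min(similarities), max(similarities)
            return lo + position * (hi - lo)

        def pair_count_threshold(similarities, keep_fraction=0.10):
            # Choose the threshold so that only the top keep_fraction of
            # pairs survive (pair-count-based mechanism).
            ranked = sorted(similarities, reverse=True)
            keep = max(1, int(len(ranked) * keep_fraction))
            return ranked[keep - 1]

        # Only pairs at or above the threshold go on to the slower
        # string-matching comparison.
        sims = [0.12, 0.35, 0.41, 0.78, 0.80, 0.91]
        threshold = range_based_threshold(sims)
        candidates = [s for s in sims if s >= threshold]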

    TF-IDF Inspired Detection for Cross-Language Source Code Plagiarism and Collusion

    Several computing courses allow students to choose which programming language they want to use to complete a programming task. This can lead to cross-language code plagiarism and collusion, in which the copied code file is rewritten in another programming language. In response, this paper proposes a detection technique that is able to accurately compare code files written in various programming languages while requiring only limited effort to accommodate each language at development stage. The only language-dependent component of the technique is the source code tokeniser, and no code conversion is applied. The impact of coincidental similarity is reduced by applying a TF-IDF-inspired weighting, in which rare matches are prioritised. Our evaluation shows that the technique outperforms techniques commonly used in academia for handling language-conversion disguises, and is comparable to those techniques when dealing with conventional disguises.
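    Only the tokeniser is language-dependent in the technique above, so the weighting itself can be illustrated generically. The sketch below is an assumption-laden illustration rather than the authors' implementation: tokens shared by two submissions are weighted by inverse document frequency so that rare matches count more, and a whitespace split stands in for the real, language-specific tokeniser.

        # Illustrative TF-IDF-style weighting of token matches.
        import math
        from collections import Counter

        def idf_weights(token_sets):
            # token_sets: one collection of tokens per submission in the corpus.
            n = len(token_sets)
            df = Counter(tok for toks in token_sets for tok in set(toks))
            return {tok: math.log(n / count) for tok, count in df.items()}

        def weighted_overlap(tokens_a, tokens_b, idf):
            # Sum the IDF weights of tokens shared by both submissions,
            # so matches on rare tokens dominate the score.
            shared = set(tokens_a) & set(tokens_b)
            return sum(idf.get(tok, 0.0) for tok in shared)

        corpus = [set(src.split()) for src in ("int x = 0", "x := 0", "print x")]
        idf = idf_weights(corpus)
        score = weighted_overlap(corpus[0], corpus[1], idf)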

    Revisiting the challenges and surveys in text similarity matching and detection methods

    The massive amount of information from the internet has revolutionized the field of natural language processing. One of the challenges is estimating the similarity between texts; this has remained an open research problem even though various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods of text similarity research. The paper identified four issues and several applications of text similarity matching and classified current studies into intrinsic, extrinsic, and hybrid approaches. We then identified the methods and classified them into lexical, syntactic, semantic, structural, and hybrid similarity. Furthermore, this study also analyzed and discussed method improvements, current limitations, and open challenges on this topic for future research directions.
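    As a concrete illustration of the simplest family identified above, lexical similarity, the sketch below compares two texts purely by word overlap (Jaccard similarity). It is an illustrative example of the category, not a method proposed in the survey.

        # Lexical similarity: Jaccard overlap of word sets, ignoring
        # syntax and semantics entirely.
        def jaccard(text_a: str, text_b: str) -> float:
            a, b = set(text_a.lower().split()), set(text_b.lower().split())
            return len(a & b) / len(a | b) if a | b else 0.0

        # 4 shared words out of 6 distinct words, roughly 0.67
        print(jaccard("the cat sat on the mat", "a cat sat on a mat"))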

    Software Plagiarism Detection Using N-grams

    Plagiarism is an act of copying in which one does not rightfully credit the original source. The motivations behind plagiarism can vary from completing academic courses to gaining economic advantage. Plagiarism exists in various domains where people want to take credit for work, including literature, art, and software, all of which carry a notion of authorship. In this thesis we conduct a systematic literature review on the topic of source code plagiarism detection methods; then, based on the literature, we propose a new detection approach that combines similarity detection and authorship identification, introduce our tokenization method for source code, and lastly evaluate the model using real-life data sets. The goal of our model is to point out possible plagiarism in a collection of documents, which in this thesis means a collection of source code files written by various authors. The data used by our statistical methods consist of three datasets: (1) documents from the University of Helsinki's first programming course, (2) documents from the University of Helsinki's advanced programming course, and (3) submissions to a source code re-use competition. The statistical methods in this thesis are inspired by the theory of search engines; they draw on data mining when detecting similarity between documents and on machine learning when classifying a document with its most likely author in authorship identification. Results show that our similarity detection model can be used successfully to retrieve documents for further plagiarism inspection, but false positives appear quickly even when using a high threshold that controls the minimum allowed level of similarity between documents. We were unable to use the results of authorship identification in our study, as the results of our machine learning model were not accurate enough to be used sensibly. This was possibly caused by the high similarity between documents, which is due to the restricted tasks and a course setting that teaches a specific programming style over the timespan of the course.
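    The thesis's own tokenisation and classifier are not reproduced here, but the search-engine-inspired screening step can be sketched in a simplified, hypothetical form: submissions are compared by the overlap of their token n-grams, and pairs above a chosen threshold are flagged for manual inspection. A whitespace split stands in for the real tokeniser, and the trigram size is an arbitrary choice.

        # Simplified n-gram similarity screening for plagiarism inspection.
        def ngrams(tokens, n=3):
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

        def ngram_similarity(src_a, src_b, n=3):
            a, b = ngrams(src_a.split(), n), ngrams(src_b.split(), n)
            union = a | b
            return len(a & b) / len(union) if union else 0.0

        # A higher threshold reduces false positives but may miss disguised
        # copies, which mirrors the trade-off reported above.
        sim = ngram_similarity("for i in range ( 10 ) :", "for j in range ( 10 ) :")
        flagged = sim >= 0.5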

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis becomes difficult, time-consuming, and, in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote it. The specimen files were also analyzed for authorship using static analysis methods so that prediction accuracies could be compared with those of the new, dynamic-analysis-based method. Experiments indicate that the new method can provide better accuracy of author attribution for files of unknown provenance, especially when the specimen file has been obfuscated.
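    The abstract describes the pipeline only at a high level (execution traces collected in an instrumented emulator, fed to a supervised learner), so the sketch below is schematic: using mnemonic counts as features and a random forest as the classifier is an assumption made for illustration, not the paper's exact setup.

        # Schematic trace-to-classifier pipeline for authorship prediction.
        from collections import Counter
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.ensemble import RandomForestClassifier

        def trace_features(trace):
            # trace: list of executed instruction mnemonics from the emulator.
            return dict(Counter(trace))

        # Hypothetical training data: traces labelled with known authors.
        traces = [["mov", "add", "jmp", "mov"], ["push", "call", "ret", "push"]]
        authors = ["author_a", "author_b"]

        vec = DictVectorizer()
        X = vec.fit_transform([trace_features(t) for t in traces])
        clf = RandomForestClassifier(n_estimators=100).fit(X, authors)

        # Predict the likely author of an unknown (possibly obfuscated) specimen.
        unknown = vec.transform([trace_features(["mov", "add", "mov", "jmp"])])
        print(clf.predict(unknown))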

    Code similarity and clone search in large-scale source code data

    Software development benefits tremendously from the Internet, with online code corpora that enable instant sharing of source code and online developer guides and documentation. Nowadays, duplicated code (i.e., code clones) exists not only within or across software projects but also between online code repositories and websites. We call these "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and re-use of outdated code. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them have become outdated or violate their original license and are potentially harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boilerplate code. We also find that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique using multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to that of seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
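    Siamese's multiple code representations and ranking are not reproduced here; the sketch below only illustrates the underlying token-based clone search idea under simplifying assumptions: fragments are indexed by token trigrams, and a query returns fragments sharing enough trigrams with the query snippet. The fragment identifier so_post_42 is made up for the example.

        # Toy token-based clone search over an inverted trigram index.
        from collections import defaultdict

        def token_trigrams(code):
            toks = code.split()
            return {" ".join(toks[i:i + 3]) for i in range(len(toks) - 2)}

        index = defaultdict(set)  # trigram -> ids of fragments containing it

        def add_fragment(frag_id, code):
            for tri in token_trigrams(code):
                index[tri].add(frag_id)

        def search(snippet, min_shared=1):
            hits = defaultdict(int)
            for tri in token_trigrams(snippet):
                for frag_id in index.get(tri, ()):
                    hits[frag_id] += 1
            return sorted((i for i, c in hits.items() if c >= min_shared),
                          key=lambda i: -hits[i])

        add_fragment("so_post_42", "for ( int i = 0 ; i < n ; i ++ ) sum += a [ i ] ;")
        print(search("for ( int i = 0 ; i < n ; i ++ )"))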

    Efficient processing of similarity queries with applications

    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large volumes of data of many different types. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low-dimensional to high-dimensional data, from a single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators and analyzes and optimizes their performance. The first contribution is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar, but not necessarily equal, values. We realize these SGB operators by extending the standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming-distance range queries (namely, selects and joins). In the third and last contribution, we develop a system for similarity query processing and optimization in an in-memory, distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.
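    The HA-Index itself is not described in enough detail in the abstract to reproduce, so the sketch below illustrates a Hamming-distance range select with the generic pigeonhole partitioning trick instead: each code is split into k+1 segments, and any code within distance k of the query must share at least one segment exactly, which yields a small candidate set to verify.

        # Hamming-distance range select via pigeonhole segment filtering.
        from collections import defaultdict

        K = 2          # distance threshold
        SEGS = K + 1   # number of segments per code

        def segments(code):
            step = len(code) // SEGS
            return [(i, code[i * step:(i + 1) * step]) for i in range(SEGS)]

        def hamming(a, b):
            return sum(x != y for x, y in zip(a, b))

        collection = ["10110110", "10110011", "01100110"]
        index = defaultdict(list)
        for code in collection:
            for key in segments(code):
                index[key].append(code)

        def range_select(query, k=K):
            candidates = {c for key in segments(query) for c in index.get(key, ())}
            return [c for c in candidates if hamming(query, c) <= k]

        print(range_select("10110111"))  # the third code is rejected by verification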

    “Navegar é preciso” (“To navigate is necessary”): the Experiential Continuum of Programming with the Web

    This thesis seeks to understand how Web search is valued and what that experience implies for university Computing students solving programming problems. Starting from the perception of experience revealed in the scenario of Web-supported programming practice, the research unfolds along two investigative paths: the student's action and the operation of the medium. The ready availability of programming solutions mobilizes the discussion about copying, while the compelling environment of the Web brings the search process into question. The thesis begins with the initial concern raised by a student's remark, “If it isn't on the Internet, it doesn't exist”, in the context of a Data Structures course. As related scientific work, we analyze studies that discuss source code plagiarism and the miscalibrated sense of one's own knowledge promoted by Web access. We then approach this initial concern through the theory advanced by John Dewey to formulate an understanding of experience and to discuss the principles of continuity and interaction, also bringing Jorge Larrosa, Alberto Cupani, Marshall McLuhan, and William Powers into the discussion. Next, we problematize the research object and the conditions of the Web context, reaffirming the need to treat it as something to be reflected on and considered. The methodological framework includes data from a quantitative questionnaire applied to university students (n = 149 responses), which provides evidence of how the experience with the Web is valued, grounded in the importance, satisfaction, and indispensability that students attribute to the medium. These results are deepened in a qualitative stage, in which the implications of the relationship with the medium become apparent in interviews with students who took part in programming challenge activities. The thesis thus moves toward an authentic conception that navigating is necessary, but that it is also necessary to remain attentive to the situation of each experience, since each experience serves as an instrument for promoting the experiential continuum, a principle discussed by Dewey. We observe that students regard their experience with the Web as positive and are aware of its good and bad implications. However, contrary to our initial assumption, students give up part of their critical awareness and of the educational development of the experience in exchange for advantages that characterize an essentially technological way of acting in time, in protocols, and in habits.

    The Partial Order Kernel and its Application to Understanding the Regulatory Grammar of Conserved Non-coding Elements

    Conserved non-coding elements (CNEs) are regions of non-coding DNA which have remained evolutionarily conserved across various species over millions of years and are found to cluster near genes involved in early embryonic development, suggesting that they play an important role as regulatory elements. Indeed, many CNEs have been shown to act as enhancers; however, not all regulatory elements are conserved and, in some cases, deletion of CNEs did not result in any notable phenotypes. These opposing findings indicate that the functions of CNEs are still poorly understood and further research on these elements is needed to uncover the reasons for their extreme conservation. The aim of this thesis is to investigate the use and development of algorithms for decoding the regulatory grammar of CNEs. Initially, an assessment of several methods for functional classification of CNEs is provided. The results obtained using these methods are validated by functional assays and their limitations in capturing the grammar of CNEs are discussed. Motivated by these limitations, a partial order graph representation of the sequence of transcription factor binding sites (TFBSs) in a CNE that allows efficient handling of the overlapping sites is introduced. A dynamic programming-based method for aligning two such graphs and identifying regulatory signatures composed of co-occurring TFBSs is proposed and evaluated. The results demonstrate the predictive ability of this method, which can be used to prioritise regions for experimental validation. Building on this method, the partial order kernel (POKer) for comparison of strings containing alternative substrings and represented by partial order graphs is introduced. The POKer is evaluated in different sequence comparison tasks, including visual localisation. An approach using the POKer for functional classification of CNEs is introduced and its effectiveness in capturing the grammar of CNEs is demonstrated. Finally, the implications of the results presented in this work for modelling the evolution of CNEs are discussed.
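    As a loose illustration of the representation this thesis builds on, the sketch below shows how a set of possibly overlapping predicted TFBSs could be turned into a partial order: a directed edge links site A to site B only when A ends before B begins, so overlapping sites end up on alternative paths. The site names and coordinates are invented, and neither the thesis's actual graph construction nor the POKer alignment is reproduced here.

        # Building a partial order over (possibly overlapping) TFBS predictions.
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class TFBS:
            name: str    # transcription factor
            start: int   # position within the CNE
            end: int     # exclusive end position

        def partial_order_edges(sites):
            # Edge a -> b whenever a ends before b begins.
            ordered = sorted(sites, key=lambda s: s.start)
            return [(a, b) for i, a in enumerate(ordered)
                           for b in ordered[i + 1:] if a.end <= b.start]

        sites = [TFBS("SOX2", 0, 8), TFBS("PAX6", 5, 12), TFBS("OTX2", 15, 22)]
        for a, b in partial_order_edges(sites):
            print(a.name, "->", b.name)
        # SOX2 and PAX6 overlap, so neither precedes the other; both precede OTX2.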