3,833 research outputs found

    A Machine Learning Approach for Plagiarism Detection

    Get PDF
    Plagiarism detection is gaining increasing importance due to requirements for integrity in education. The existing research has investigated the problem of plagrarim detection with a varying degree of success. The literature revealed that there are two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis has developed two novel approaches to address both of these methods. Firstly a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in the stylometric features (most common words) in order to characterise the document authorship as described in chapter 4. The results revealed that LSA based stylometry has outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure in order to predict which author has written a particular book being tested. The proposed method has successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method has relied on the use of the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The Intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship. The model’s generation procedure focuses on just one author as an attempt to summarise aspects of an author’s style in a definitive and clear-cut manner. The thesis has also proposed a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) training dataset, but divides that dataset up into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out-cross-validation method. Results indicated that by integrating deep analysis (LSA) and Stylometric analysis, hidden changes can be identified whether or not a reference collection exists

    Plagiarism detection for Indonesian texts

    Get PDF
    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need of using automatic plagiarism checker is becoming more real. However, researches on Plagiarism Detection Systems (PDS) in Indonesian documents have not been well developed, since most of them deal with detecting duplicate or near-duplicate documents, have not addressed the problem of retrieving source documents, or show tendency to measure document similarity globally. Therefore, systems resulted from these researches are incapable of referring to exact locations of ``similar passage'' pairs. Besides, there has been no public and standard corpora available to evaluate PDS in Indonesian texts. To address the weaknesses of former researches, this thesis develops a plagiarism detection system which executes various methods of plagiarism detection stages in a workflow system. In retrieval stage, a novel document feature coined as phraseword is introduced and executed along with word unigram and character n-grams to address the problem of retrieving source documents, whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed to address the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally-weighted significant terms to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers, and by algorithmic random generation. Using this corpus, the performance evaluation of the proposed methods was performed in three scenarios. On the first scenario which evaluated source retrieval performance, some methods using phraseword and token features were able to achieve the optimum recall rate 1. On the second scenario which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated in 4 levels of measures: character, passage, document, and cases. The experiment results showed that methods resulted from using token as seeds have higher scores than Alvi's algorithm in all 4 levels of measures both in artificial and simulated plagiarism cases. In case detection, our systems outperform Alvi's algorithm in recognizing copied, shaked, and paraphrased passages. However, Alvi's recognition rate on summarized passage is insignificantly higher than our system. The same tendency of experiment results were demonstrated on the third experiment scenario, only the precision rates of Alvi's algorithm in character and paragraph levels are higher than our system. The higher Plagdet scores produced by some methods in our system than Alvi's scores show that this study has fulfilled its objective in implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. Being run at our test document corpus, Alvi's highest scores of recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when it was tested on PAN'14 corpus. Thus, this study has contributed in creating a standard evaluation corpus for assessing PDS for Indonesian documents. Besides, this study contributes in a source retrieval algorithm which introduces phrasewords as document features, and a paragraph-based text alignment algorithm which relies on two different strategies. One of them is to apply local-word weighting used in text summarization field to select seeds for both discriminating paragraph pair candidates and matching process. The proposed detection algorithm results in almost no multiple detection. This contributes to the strength of this algorithm

    Revisiting the challenges and surveys in text similarity matching and detection methods

    Get PDF
    The massive amount of information from the internet has revolutionized the field of natural language processing. One of the challenges was estimating the similarity between texts. This has been an open research problem although various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods of text similarity research. This paper identified four issues and several applications of text similarity matching. It classified current studies based on intrinsic, extrinsic, and hybrid approaches. Then, we identified the methods and classified them into lexical-similarity, syntactic-similarity, semantic-similarity, structural-similarity, and hybrid. Furthermore, this study also analyzed and discussed method improvement, current limitations, and open challenges on this topic for future research directions
    • …
    corecore