6 research outputs found

    Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase

    The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to the different types of obfuscation found in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity between documents in order to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, sentence pairs with the highest similarity are taken as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including a Jaccard similarity threshold and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, used as the benchmark, fixes a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when the obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their degree of obfuscation. With the online tuning approach, no separate training dataset is required to train the system. We applied the proposed method to available English, Persian, and Arabic datasets for the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach achieves considerable performance on different datasets in various languages, and the online threshold tuning approach, which needs no training data, works as well as, or in some cases better than, the training-based method.
    The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).
    Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y
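
    As a rough, illustrative sketch of the text-alignment step described above (not the authors' implementation), the following Python fragment averages word vectors into sentence representations, pairs each suspicious sentence with its most similar source sentence as a candidate seed, and filters seeds with a Jaccard-similarity cut-off. The embedding source and both threshold values are assumptions.

        import numpy as np

        def sentence_vector(sentence, embeddings, dim=300):
            """Average the word vectors of a sentence (simple aggregation)."""
            vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
            return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

        def jaccard(a, b):
            """Token-level Jaccard similarity used to filter weak seeds."""
            sa, sb = set(a.split()), set(b.split())
            return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

        def find_seeds(src_sents, susp_sents, embeddings, cos_thr=0.85, jac_thr=0.2):
            """Pair each suspicious sentence with its most similar source sentence.

            `embeddings` is any mapping word -> vector (e.g. pre-trained word2vec);
            cos_thr and jac_thr are illustrative thresholds, not tuned values."""
            seeds = []
            src_vecs = [sentence_vector(s, embeddings) for s in src_sents]
            for i, susp in enumerate(susp_sents):
                sv = sentence_vector(susp, embeddings)
                sims = [np.dot(sv, v) / (np.linalg.norm(sv) * np.linalg.norm(v) + 1e-9)
                        for v in src_vecs]
                j = int(np.argmax(sims))
                if sims[j] >= cos_thr and jaccard(susp, src_sents[j]) >= jac_thr:
                    seeds.append((j, i, sims[j]))  # (source idx, suspicious idx, score)
            return seeds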

    Towards Detecting Textual Plagiarism Using Machine Learning Methods

    Master's thesis in Information and Communication Technology, Universitetet i Agder, 2015.
    Textual plagiarism is passing off someone else's text as your own. The current state of the art in plagiarism detection performs well, but often relies on a series of manually determined thresholds over similarity metrics to decide whether an author is guilty of plagiarism or not. These thresholds are optimized for a single data set and are not optimal for all situations or forms of plagiarism. The detection methodologies are also complex enough that a professional familiar with the algorithms is needed to adjust them properly. Using a pre-classified data set, machine learning methods allow teachers and examiners without knowledge of the methodology to use a plagiarism detection tool tailored to their needs. This thesis demonstrates that a machine learning methodology, with no thresholds to set by hand, can match, and in some cases surpass, the top methodologies in the current state of the art. With more work, future methodologies may outperform both the best commercial and the best freely available methodologies.
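
    A minimal sketch of the general idea, not the thesis' actual feature set or model: similarity metrics that would otherwise be compared against hand-set thresholds become features for a classifier trained on a pre-classified corpus. The three features and the random-forest choice here are illustrative assumptions.

        from sklearn.ensemble import RandomForestClassifier

        # Each row: similarity metrics computed for a (source, suspicious) passage pair.
        # Hypothetical features: cosine similarity, Jaccard overlap, longest-common-
        # subsequence ratio. Labels: 1 = plagiarised, 0 = not plagiarised.
        X_train = [
            [0.91, 0.45, 0.60],
            [0.12, 0.02, 0.05],
            [0.78, 0.30, 0.41],
            [0.20, 0.05, 0.08],
        ]
        y_train = [1, 0, 1, 0]

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train, y_train)                 # the model learns the decision boundary
        print(clf.predict([[0.85, 0.40, 0.50]]))  # no manually tuned threshold needed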

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for an automatic plagiarism checker is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most existing work deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity globally. As a result, the systems produced by this research cannot point to the exact locations of "similar passage" pairs. In addition, no public, standard corpus has been available to evaluate PDS for Indonesian texts. To address the weaknesses of earlier research, this thesis develops a plagiarism detection system that executes the various stages of plagiarism detection in a workflow system. In the retrieval stage, a novel document feature coined "phraseword" is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, addresses the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally weighted significant terms, so as to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers and by algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall rate of 1. In the second scenario, which evaluated detection performance, our system was compared with Alvi's algorithm at four levels of measurement: character, passage, document, and case. The results showed that methods using tokens as seeds score higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaken, and paraphrased passages, while Alvi's recognition rate on summarized passages is only marginally higher than ours. The third experiment scenario showed the same tendency, except that the precision of Alvi's algorithm at the character and paragraph levels is higher than our system's. The higher Plagdet scores produced by some methods in our system compared with Alvi's show that this study has fulfilled its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test document corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus, so this study has also contributed a standard evaluation corpus for assessing PDS for Indonesian documents. Finally, the study contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies, one of which applies local word weighting, as used in text summarization, to select seeds both for discriminating paragraph-pair candidates and for the matching process. The proposed detection algorithm produces almost no multiple detections, which contributes to its strength.
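
    To make the retrieval stage concrete, here is a minimal sketch that ranks candidate source documents by word-unigram and character n-gram overlap; the novel phraseword feature is not reproduced, and the vectorizer settings are assumptions rather than the thesis' configuration.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity
        from sklearn.pipeline import FeatureUnion

        source_docs = ["full text of source document one",
                       "full text of source document two"]
        suspicious = "full text of the suspicious document"

        # Word unigrams plus character 4-grams as retrieval features (assumed settings).
        features = FeatureUnion([
            ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
            ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4))),
        ])
        doc_matrix = features.fit_transform(source_docs)
        query_vec = features.transform([suspicious])

        # Rank source documents; the top candidates go on to the detection stage.
        scores = cosine_similarity(query_vec, doc_matrix)[0]
        ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        print(ranked[:5])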

    Métodos eficientes de deteção de plágio em grandes corpora (Efficient methods for plagiarism detection in large corpora)

    The increasing volume of information published on the Web, in the form of literary, scientific, and academic publications, requires constant verification of the integrity of new (suspicious) documents against the existing (source) documents. This raises the need to improve both the efficiency of reducing the search space over large sets of source documents and the effectiveness of detecting increasingly sophisticated plagiarism. This dissertation describes a methodology based on two steps: (i) indexing the source corpus with an open-source search engine and extracting candidate source documents by searching for relevant keywords and textual features; (ii) locating plagiarized passages in suspicious documents with a robust hybrid metric created by applying genetic programming to the characteristics of plagiarized data. The experimental results show a significant reduction in processing time, due to the stratification of the corpus, as well as the ability to efficiently detect literal, modified, and obfuscated plagiarized passages.
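
    Step (i) reads roughly like the sketch below: index the source corpus with an off-the-shelf open-source engine and query it with relevant words taken from the suspicious document. The abstract does not name the engine, so Whoosh is only a stand-in here, and the query construction is an assumption; the genetic-programming metric of step (ii) is not reproduced.

        import os
        from whoosh.index import create_in
        from whoosh.fields import Schema, ID, TEXT
        from whoosh.qparser import QueryParser, OrGroup

        # Step (i): index the source corpus.
        schema = Schema(doc_id=ID(stored=True), content=TEXT)
        os.makedirs("indexdir", exist_ok=True)
        ix = create_in("indexdir", schema)
        writer = ix.writer()
        for doc_id, text in [("src1", "text of source one"), ("src2", "text of source two")]:
            writer.add_document(doc_id=doc_id, content=text)
        writer.commit()

        # Query with relevant words drawn from the suspicious document (OR semantics),
        # keeping only the top-ranked candidates for the detailed comparison step.
        relevant_words = "plagiarism detection corpora"   # placeholder keyword query
        with ix.searcher() as searcher:
            query = QueryParser("content", ix.schema, group=OrGroup).parse(relevant_words)
            candidates = [hit["doc_id"] for hit in searcher.search(query, limit=10)]
        print(candidates)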

    Métodos de Deteção Automática de Plágio Extrínseco em Textos de Grande Dimensão (Methods for Automatic Extrinsic Plagiarism Detection in Large Texts)

    The practice of plagiarism in documents, books, and the arts in general has serious consequences for society. The existence of dishonest people in academia, industry, and the press who appropriate the intellectual property of others has led some organizations to produce anti-plagiarism rules and to adopt technological means to confront and prevent the spread of this problem. Automatic Plagiarism Detection (PAD) systems are undoubtedly the main means used to identify plagiarism in text documents available on the Web. To obfuscate the fraudulent act (hide the plagiarism) in a large text document, plagiarists sometimes extract short phrases, which are then manipulated and transformed from active to passive voice and vice versa, with their words replaced by synonyms and antonyms [ASA12, AIAA15, ASI+17]. Moreover, with larger pairs of text, the text alignment process becomes tedious, which makes it less efficient and even less effective, especially when obfuscation is attempted. The goal of this work was to propose less complex PAD methods that make the Detailed Analysis stage more efficient and more effective. To this end, we developed two PAD methods. The first is a plagiarism detection method that uses a recursive segmentation approach, splitting the source document into three blocks, in order to identify small and large plagiarized segments containing paraphrases effectively and with a high level of temporal efficiency. The second is Plagiarism Search by Vector Scanning. This method uses word embeddings (word2vec) without resorting to matrix calculations, and is capable of detecting both small and large plagiarized segments, even under a high level of obfuscation, efficiently and effectively. The results presented in Chapter 4 demonstrate the efficacy and efficiency of the methods proposed in this dissertation.
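
    As an illustration of the vector-scanning idea only (the recursive three-block segmentation and the dissertation's exact scoring are not reproduced), a suspicious passage can be compared against sliding windows of the source one mean word2vec vector at a time, avoiding a full similarity matrix; the window size and similarity threshold below are assumptions.

        import numpy as np

        def mean_vector(tokens, embeddings, dim=300):
            """Mean word2vec vector of a token window (zero vector if no hits)."""
            vecs = [embeddings[t] for t in tokens if t in embeddings]
            return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

        def vector_scan(source_tokens, suspicious_tokens, embeddings,
                        window=50, step=25, threshold=0.8):
            """Scan the source with a sliding window and flag windows whose mean
            vector is close (cosine similarity) to the suspicious passage vector.
            `embeddings` is any word -> vector mapping; window/step/threshold
            are illustrative values, not the dissertation's settings."""
            target = mean_vector(suspicious_tokens, embeddings)
            hits = []
            for start in range(0, max(1, len(source_tokens) - window + 1), step):
                win = source_tokens[start:start + window]
                v = mean_vector(win, embeddings)
                denom = np.linalg.norm(v) * np.linalg.norm(target) + 1e-9
                sim = float(np.dot(v, target) / denom)
                if sim >= threshold:
                    hits.append((start, start + len(win), sim))  # candidate plagiarized span
            return hits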