6 research outputs found
Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase
[EN] The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of the original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to the different types of obfuscation found in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents for plagiarism detection. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing the representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are taken as the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including a Jaccard similarity threshold and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, used as the benchmark, fixes a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improved performance when the obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically according to their degree of obfuscation. With the online tuning approach, no separate training dataset is required to train the system. We applied the proposed method to available datasets in English, Persian, and Arabic on the text alignment task, in order to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, this efficient approach achieves considerable performance on different datasets in various languages. Our online threshold tuning approach, which requires no training dataset, works as well as, and in some cases better than, the training-based method.
The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).
Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications, 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y
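The pipeline this abstract describes, averaging word vectors into a sentence representation, pairing the most similar sentences as seeds, and filtering the seeds with a Jaccard threshold, can be sketched as follows. The toy three-dimensional word vectors and the threshold values are illustrative assumptions, not the authors' actual embedding model or tuned parameters:

```python
from math import sqrt

# Toy 3-d word vectors; hypothetical stand-ins for the pretrained
# embeddings the paper relies on (for illustration only).
WORD_VECS = {
    "students": (0.90, 0.10, 0.00), "copy":     (0.10, 0.90, 0.20),
    "text":     (0.20, 0.80, 0.10), "pupils":   (0.85, 0.15, 0.05),
    "passages": (0.25, 0.75, 0.15), "exam":     (0.00, 0.10, 0.95),
    "hard":     (0.05, 0.00, 0.90),
}

def sentence_vector(sentence):
    """Aggregate word vectors with a simple mean to represent a sentence."""
    vecs = [WORD_VECS[w] for w in sentence.split() if w in WORD_VECS]
    if not vecs:
        return None
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def jaccard(s1, s2):
    w1, w2 = set(s1.split()), set(s2.split())
    return len(w1 & w2) / len(w1 | w2)

def find_seeds(source_sents, suspicious_sents, cos_thr=0.95, jac_thr=0.05):
    """Pair sentences whose embedding similarity exceeds cos_thr, then keep
    only pairs that also pass a Jaccard filter (thresholds illustrative)."""
    seeds = []
    for i, src in enumerate(source_sents):
        v_src = sentence_vector(src)
        if v_src is None:
            continue
        for j, sus in enumerate(suspicious_sents):
            v_sus = sentence_vector(sus)
            if v_sus is None:
                continue
            if cosine(v_src, v_sus) >= cos_thr and jaccard(src, sus) >= jac_thr:
                seeds.append((i, j))
    return seeds

# The paraphrased sentence is picked up as a seed; the unrelated one is not.
print(find_seeds(["students copy text"], ["pupils copy passages", "exam hard"]))
```

The offline/online tuning discussed in the abstract corresponds to how `cos_thr` and `jac_thr` are chosen: fixed once on a training corpus, or adapted per document from the candidates' own statistics.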
Towards Detecting Textual Plagiarism Using Machine Learning Methods
Master's thesis in Information and Communication Technology, Universitetet i Agder, 2015.
Textual plagiarism is passing off someone else's text as your own. The current
state of the art in plagiarism detection performs well, but often uses a series of
manually determined thresholds of metrics in order to determine whether an author
is guilty of performing plagiarism or not. These thresholds are optimized for
a single data set and are not optimal for all situations or forms of plagiarism. The
detection methodologies also require a professional familiar with the algorithms
in order to be properly adjusted, due to their complexity. Using a pre-classified
data set, machine learning methods allow teachers and examiners without knowledge
of the methodology to use a plagiarism detection tool specifically designed
for their needs.
This thesis demonstrates that a methodology using machine learning, without
the need to set thresholds, can match, and in some cases surpass, the top methodologies
in the current state of the art. With more work, future methodologies may
possibly outperform both the best commercial and the best freely available methodologies.
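The core idea above, replacing hand-tuned thresholds with a classifier trained on a pre-classified data set, might be sketched as follows. The similarity features, the toy training pairs, and the 1-nearest-neighbour learner are illustrative assumptions; the thesis's actual feature set and classifier may differ:

```python
def features(source, suspect):
    """Similarity metrics that become learned features rather than
    manually tuned thresholds (this particular trio is illustrative)."""
    a, b = set(source.split()), set(suspect.split())
    jaccard = len(a & b) / len(a | b)
    containment = len(a & b) / len(b) if b else 0.0
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    return (jaccard, containment, length_ratio)

def knn_predict(train, x, k=1):
    """Classify a feature vector by majority vote of its k nearest
    labelled neighbours (squared Euclidean distance)."""
    dist = lambda p: sum((pi - xi) ** 2 for pi, xi in zip(p, x))
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

# Hypothetical pre-classified pairs: 1 = plagiarism, 0 = unrelated.
train_pairs = [
    ("the cat sat on the mat", "the cat sat on a mat", 1),
    ("the cat sat on the mat", "stock prices fell sharply today", 0),
    ("neural networks learn representations",
     "neural networks learn useful representations", 1),
    ("neural networks learn representations", "the weather is sunny", 0),
]
train = [(features(s, t), y) for s, t, y in train_pairs]
```

The decision boundary is learned from the labelled pairs, so a teacher can retrain the tool on a corpus matching their own assignments instead of adjusting metric thresholds by hand.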
Plagiarism detection for Indonesian texts
As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most existing work deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity globally. Systems resulting from this research are therefore incapable of pointing to the exact locations of ``similar passage'' pairs. Besides, no public, standard corpus has been available to evaluate PDS on Indonesian texts.
To address the weaknesses of earlier research, this thesis develops a plagiarism detection system that executes the various stages of plagiarism detection in a workflow system. In the retrieval stage, a novel document feature coined a phraseword is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are partially copied or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, addresses the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching such pairs are based on locally weighted significant terms, chosen to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created, partly through simulation by human writers and partly by algorithmic random generation.
Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall rate of 1. In the second scenario, which evaluated detection performance, the system was compared to Alvi's algorithm at four levels of measurement: character, passage, document, and case. The results showed that methods using tokens as seeds scored higher than Alvi's algorithm at all four levels, on both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaked (shuffled), and paraphrased passages, although Alvi's recognition rate on summarized passages is marginally higher than ours. The third scenario showed the same tendency, except that the precision of Alvi's algorithm at the character and paragraph levels was higher than that of our system. The higher Plagdet scores produced by some of our methods show that this study has fulfilled its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts.
When run on our test document corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus. This study has thus contributed a standard evaluation corpus for assessing PDS on Indonesian documents. It also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies, one of which applies local word weighting, borrowed from the text summarization field, to select seeds both for discriminating candidate paragraph pairs and for the matching process. The proposed detection algorithm produces almost no multiple detections, which contributes to its strength.
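The retrieval stage described above, ranking candidate source documents by overlapping document features, might be sketched as follows using character n-grams. The n-gram length and Jaccard scoring are illustrative assumptions; the thesis's phraseword feature is not reproduced here:

```python
def char_ngrams(text, n=4):
    """Character n-gram profile of a document (n = 4 is an assumption;
    the thesis also uses word unigrams and its 'phraseword' feature)."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def retrieve_sources(suspicious, sources, top_k=2):
    """Rank candidate source documents by Jaccard overlap of n-gram
    profiles and return the top_k positively scoring document ids."""
    q = char_ngrams(suspicious)
    scored = []
    for doc_id, text in sources.items():
        s = char_ngrams(text)
        score = len(q & s) / len(q | s) if (q | s) else 0.0
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]

# Hypothetical two-document source collection.
sources = {
    "thesis":  "plagiarism detection for indonesian academic texts",
    "recipes": "a guide to traditional indonesian cooking",
}
print(retrieve_sources("detection of plagiarism in indonesian texts", sources, top_k=1))
```

Character n-grams survive word-order changes and partial copying, which is why they work as a retrieval feature even when the suspicious text is lightly obfuscated.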
Métodos eficientes de deteção de plágio em grandes corpora [Efficient plagiarism detection methods for large corpora]
The increasing volume of information published on the Web, in the form of literary, scientific, and academic publications, requires constant verification of the integrity of new (suspicious) documents against the existing (source) documents. Hence the need to improve both the efficiency of reducing the search space over large sets of source documents and the effectiveness of detecting increasingly sophisticated plagiarism. This dissertation describes a methodology based on two acts: (i) indexing the source corpus with an open-source search engine, and extracting candidate source documents by searching for relevant words and textual features; (ii) locating plagiarized passages in suspicious documents with a hybrid metric created by applying genetic programming to the characteristics of plagiarized data. The experimental results show a significant reduction in processing time, owing to the stratification of the corpus, as well as a high success rate in detecting plagiarized excerpts of different kinds: literal, modified, and obfuscated.
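The first act of the two-act methodology, indexing the source corpus and retrieving candidates by relevant words, can be sketched with a tiny in-memory inverted index standing in for the open-source search engine. The `min_hits` cutoff and the corpus below are illustrative assumptions:

```python
from collections import defaultdict

def build_index(corpus):
    """Act (i): index the source corpus. A minimal inverted index
    stands in for the search engine used in the dissertation."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def candidate_sources(index, suspicious, min_hits=2):
    """Query the index with the suspicious document's words and keep
    sources sharing at least min_hits words (cutoff is illustrative)."""
    hits = defaultdict(int)
    for word in set(suspicious.lower().split()):
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    return sorted(d for d, c in hits.items() if c >= min_hits)

# Hypothetical source corpus; only src1 shares relevant words.
corpus = {
    "src1": "genetic programming evolves robust similarity metrics",
    "src2": "a quick guide to cooking pasta",
}
index = build_index(corpus)
print(candidate_sources(index, "robust metrics derived from genetic programming"))
```

Pruning the candidate set this way is what shrinks the search space before the expensive passage-level comparison of act (ii).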
Métodos de Deteção Automática de Plágio Extrínseco em Textos de Grande Dimensão [Automatic Extrinsic Plagiarism Detection Methods for Large Texts]
The practice of plagiarism in documents, books, and the arts in general has serious consequences for society. The existence of dishonest people in academia, in industry, and in the press who appropriate the intellectual property of others has led some organizations to produce rules against plagiarism and to adopt technological means to confront and prevent the propagation of this evil.
Plagiarism Automatic Detection (PAD) systems are undoubtedly the main means used to identify situations involving plagiarism in text documents available on the Web.
To disguise the fraudulent act (to hide the plagiarism) in a large text document, plagiarists sometimes extract short phrases, which are then manipulated and transformed from active to passive voice and vice versa, with words replaced by synonyms and antonyms [ASA12, AIAA15, ASI+17]. On the other hand, with larger text pairs the text alignment process is tedious, which makes it less efficient and even less effective, especially when obfuscation has been attempted.
This work aimed to propose less complex PAD methods that make the Detailed Analysis process more efficient and more effective. To this end, we developed two PAD methods. The first is a plagiarism detection method that recursively segments the source document into three blocks, in order to identify both small and large plagiarized segments containing paraphrases, effectively and with a high level of temporal efficiency. The second is Plagiarism Search by Vector Scanning. This method uses word embeddings (word2vec) without resorting to matrix calculations, and it can detect both small and large plagiarized segments, even under a high level of obfuscation, efficiently and with a high level of effectiveness.
The results presented in Chapter 4 demonstrate the efficacy and efficiency of the methods proposed in this dissertation.
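The vector-scanning idea above, sliding a window over the suspicious text and comparing window-averaged word vectors against a source segment using plain loops rather than matrix operations, might be sketched as follows. The toy two-dimensional vectors, window size, and threshold are illustrative assumptions, not the dissertation's actual word2vec model or parameters:

```python
from math import sqrt

# Toy 2-d embeddings; hypothetical stand-ins for a trained word2vec model.
VEC = {
    "dogs":   (0.90, 0.10), "cats": (0.85, 0.20), "bark":    (0.80, 0.30),
    "stocks": (0.10, 0.90), "rose": (0.15, 0.85), "sharply": (0.20, 0.80),
}

def mean_vec(words):
    """Average the embeddings of the known words in a segment."""
    vecs = [VEC[w] for w in words if w in VEC]
    if not vecs:
        return None
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def scan(source_segment, suspicious_words, window=2, thr=0.99):
    """Slide a window over the suspicious text and flag window start
    positions whose mean vector lies close to the source segment's
    mean vector -- plain loops, no matrix operations."""
    target = mean_vec(source_segment.split())
    if target is None:
        return []
    hits = []
    for i in range(len(suspicious_words) - window + 1):
        w = mean_vec(suspicious_words[i:i + window])
        if w is not None and cosine(w, target) >= thr:
            hits.append(i)
    return hits

# "cats bark" is flagged as close to "dogs bark"; the stock-news words are not.
print(scan("dogs bark", ["cats", "bark", "stocks", "rose"], window=2, thr=0.99))
```

Because each window only needs one running mean and one dot product, the scan stays linear in the document length, which is the efficiency argument the dissertation makes for avoiding full matrix comparisons.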