276 research outputs found

    Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques

    Plagiarism detection is one of the most heavily researched areas in the Natural Language Processing (NLP) community. A good plagiarism detection system draws on a range of NLP methods, including semantics, named entities, and paraphrase handling, and produces detailed plagiarism reports. Detecting cross-lingual plagiarism requires deep knowledge of advanced methods and algorithms for effective text similarity checking. Plagiarists, for their part, keep refining the ways they hide their identity to avoid being caught. They evade detection with techniques such as paraphrasing, synonym replacement, mismatched citations, and translation from one language to another. Image Content Plagiarism Detection (ICPD) has gained importance; it uses advanced image content processing to identify instances of plagiarism and to ensure the integrity of image content. Plagiarism extends beyond textual content, since images such as figures, graphs, and tables can also be plagiarized. However, image content plagiarism detection remains an unaddressed challenge, so there is a critical need for methods and systems that detect it. In this paper, a system is implemented to detect plagiarism from the contents of images such as figures, graphs, and tables. Alongside statistical algorithms such as Jaccard and Cosine, the system introduces semantic algorithms such as LSA, BERT, and WordNet, which performed better at detecting plagiarism efficiently and accurately. Comment: NLPTT2023 publication, 10 pages
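
The two statistical measures named in this abstract, Jaccard and cosine similarity, can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the whitespace tokenizer and the function names `tokenize`, `jaccard`, and `cosine` are assumptions made here, and the abstract does not say how the text extracted from images is preprocessed.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's
    # preprocessing is not described in the abstract.
    return text.lower().split()


def jaccard(a, b):
    # Jaccard similarity: |A ∩ B| / |A ∪ B| over the sets of tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    # Cosine similarity between term-frequency vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    caption_1 = "accuracy of the model on the test set"
    caption_2 = "accuracy of the model on the training set"
    print(jaccard(caption_1, caption_2), cosine(caption_1, caption_2))
```

Both measures compare surface tokens only, so a synonym swap lowers the score even when the meaning is unchanged. That limitation is the usual reason for adding semantic methods such as those the abstract lists.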

    Using deep learning models for learning semantic text similarity of Arabic questions

    Question-answering platforms serve millions of users seeking knowledge and solutions to their everyday problems. However, many knowledge seekers struggle to find the right answer among similar answered questions, and writers responding to questions feel that they have to repeat the same answers many times for similar questions. This research tackles the problem of learning the semantic text similarity between questions by using deep learning. Three models are implemented to address this problem: i) a supervised machine learning model using XGBoost trained with pre-defined features, ii) an adapted Siamese-based deep learning recurrent architecture trained with pre-defined features, and iii) a pre-trained deep bidirectional transformer based on the BERT model. The proposed models were evaluated on a reference Arabic dataset from the mawdoo3.com company. The evaluation results show that the BERT-based model outperforms the other two models with F1=92.99%, the Siamese-based model comes second with F1=89.048%, and XGBoost, as the baseline model, achieves the lowest result with F1=86.086%.

    Using plagiarism detection software : the other side of the coin

    The conclusions of this article are the result of a study conducted over three years, based on the expertise files that the author established as a scientific collaborator of the current IRAFPA. The use of similarity detection software was systematic for each case. The aim of this article is to demonstrate the absurdity of a persistent belief in universities: that it would be sufficient to call on the services of a computer services company specialising in so-called "anti-plagiarism" software to curb such cases. We will show, by example, what can and cannot be expected of them, and then we will compare the two most widespread in France, Urkund and Compilatio

    Taxonomy of Academic Plagiarism Methods (TAKSONOMIJA METODA AKADEMSKOG PLAGIRANJA)

    The article gives an overview of the plagiarism domain, with a focus on academic plagiarism. It defines plagiarism and explains the origin of the term, as well as plagiarism-related terms. It identifies the extent of the plagiarism domain and then focuses on the subdomain of text documents, for which it gives an overview of current classifications and taxonomies and then proposes a more comprehensive classification according to several criteria: origin and purpose, technical implementation, consequences, complexity of detection, and the number of linguistic sources. The article proposes a new classification of academic plagiarism and describes kinds and methods of plagiarism, types and categories of plagiarized work, approaches to and phases of plagiarism detection, and the classification of methods and algorithms for plagiarism detection. Although the title explicitly targets the academic community, the article is sufficiently general and interdisciplinary to be useful to many other professionals, such as software developers, linguists, and librarians.

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most existing work deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity globally. The resulting systems are therefore unable to point to the exact locations of "similar passage" pairs. In addition, no public, standard corpus has been available for evaluating PDS on Indonesian texts. To address these weaknesses, this thesis develops a plagiarism detection system that runs various methods for each plagiarism detection stage in a workflow system. In the retrieval stage, a novel document feature called the phraseword is introduced and used along with word unigrams and character n-grams to retrieve source documents whose contents are partially copied or obfuscated in a suspicious document. The detection stage, which uses a two-step paragraph-based comparison, aims to detect and locate source-obfuscated passage pairs. The seeds for matching these passage pairs are based on locally weighted significant terms, in order to capture paraphrased and summarized passages. In addition to the system, an evaluation corpus was created through simulation by human writers and through algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall rate of 1. In the second scenario, which evaluated detection performance, our system was compared with Alvi's algorithm at four levels of measurement: character, passage, document, and case. The results showed that methods using tokens as seeds score higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shuffled ("shaked"), and paraphrased passages. However, Alvi's recognition rate on summarized passages is slightly, though not significantly, higher than that of our system. The third scenario showed the same tendency, except that the precision rates of Alvi's algorithm at the character and paragraph levels are higher than those of our system. The fact that some methods in our system produce higher Plagdet scores than Alvi's algorithm shows that this study has fulfilled its objective of implementing a competitive, state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test corpus, Alvi's highest recall, precision, and Plagdet scores, and its detection rate on no-plagiarism cases, correspond to its scores on the PAN'14 corpus. This study has thus contributed a standard evaluation corpus for assessing PDS on Indonesian documents. It also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies. One of them applies local word weighting, as used in text summarization, to select seeds both for discriminating paragraph pair candidates and for the matching process. The proposed detection algorithm produces almost no multiple detections, which adds to its strength.
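
To make the retrieval stage described in this abstract more concrete, here is a rough sketch of ranking candidate source documents by character n-gram overlap. It is not the thesis's algorithm: the phraseword feature is not defined in the abstract and is left out, and the containment score, the trigram default, and the names `char_ngrams`, `containment`, and `rank_sources` are all assumptions chosen for illustration.

```python
def char_ngrams(text, n=3):
    # Normalize case and whitespace, then collect the set of character n-grams.
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def containment(suspicious, source, n=3):
    # Fraction of the suspicious text's n-grams that also occur in the source.
    # Assumption: containment is used here as the score; the abstract does not
    # state which similarity measure the thesis applies.
    query = char_ngrams(suspicious, n)
    if not query:
        return 0.0
    return len(query & char_ngrams(source, n)) / len(query)


def rank_sources(suspicious, sources, n=3):
    # sources: dict mapping a document name to its text (hypothetical layout).
    scored = [(name, containment(suspicious, text, n))
              for name, text in sources.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "doc_a": "methods for plagiarism detection in text",
        "doc_b": "cooking recipes from the archipelago",
    }
    print(rank_sources("plagiarism detection", corpus))
```

Character n-grams tolerate small edits such as changed affixes or typos, which is one plausible reason to combine them with word-level features when the copied content has been obfuscated.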

    A Wolf in Sheep’s Clothing? Critical Discourse Analysis of Five Online Automated Paraphrasing Sites

    Research on academic integrity used to focus mainly on student character and behaviour. It now takes a wider view of the issue as a current teaching and learning challenge that requires pedagogical intervention. It is now the responsibility of staff and institutions to treat the creation of a learning environment that supports academic integrity as a teaching and learning priority. Plagiarism by simply copying other people’s work is a well-known form of misconduct that undermines academic integrity; moreover, technological developments have extended plagiarism to include the generation and copying of computer-generated text. Automated paraphrasing tool (APT) websites have become increasingly common, offering students machine-generated rephrasings of text that they input from their own or others’ writing. These developments present a creeping erosion of academic integrity under the guise of legitimate academic assistance. This also has implications for the arrival of large language model (LLM) generative AI tools. In accessing these sites, students must discern what is a legitimate use of the tool and what may constitute a breach of academic integrity. This study critically analysed the text of five online paraphrasing websites to examine the discourses used to legitimise and encourage APT use in both appropriate and inappropriate ways. We conceptualised these competing discourses using Sheep and Wolf metaphors. In addition, we offer the metaphor of the Educator as a Shepherd, who becomes aware of APT website claims and helps students develop critical language awareness when exposed to these sites. Educators can help students do this by knowing how these sites use language to entice users to circumvent learning.

    Academic integrity : a call to research and action

    Originally published in French: L'urgence de l'intégrité académique, Éditions EMS, Management & société, Caen, 2021 (ISBN 978-2-37687-472-0). The urgency of doing complements the urgency of knowing. Urgency here is not the inconsequential injunction of irrational immediacy. It arises in various contexts for good reasons, when there is a threat to human existence and harm to others. Today, our knowledge-based civilization is put at risk both by new models of knowledge production and by the shamelessness of knowledge delinquents, which exposes the greatest number of people to serious risks. The editors respond swiftly to this diagnosis by setting up a reference tool for academic integrity. Through multiple dialogues between the twenty-five chapters and five major themes, the ethical response shapes pragmatic horizons for action across a range of disciplinary competencies, from science to international diplomacy. An interdisciplinary work that is indispensable for teachers, students, university researchers, and administrators.