    Uncovering highly obfuscated plagiarism cases using fuzzy semantic-based similarity model

    Highly obfuscated plagiarism cases contain unseen and obfuscated texts, which pose difficulties for existing plagiarism detection methods. A fuzzy semantic-based similarity model for uncovering obfuscated plagiarism is presented and compared with five state-of-the-art baselines. Semantic relatedness between words is studied based on part-of-speech (POS) tags and WordNet-based similarity measures. Fuzzy rules are introduced to assess the semantic distance between short source and suspicious texts, implementing the semantic relatedness between words as a membership function of a fuzzy set. To minimize the number of false positives and false negatives, a learning method that combines a permission threshold and a variation threshold is used to decide true plagiarism cases. The proposed model and the baselines are evaluated on 99,033 ground-truth annotated cases extracted from different datasets, including 11,621 (11.7%) handmade paraphrases, 54,815 (55.4%) artificial plagiarism cases, and 32,578 (32.9%) plagiarism-free cases. We conduct extensive experimental verification, including a study of the effects of different segmentation schemes and parameter settings. Results are assessed using precision, recall, F-measure, and granularity on stratified 10-fold cross-validation data. Statistical analysis using paired t-tests shows that the proposed approach's improvements over the baselines are statistically significant, demonstrating the competence of the fuzzy semantic-based model to detect plagiarism beyond literal copying. Additionally, an analysis of variance (ANOVA) test shows the effectiveness of the different segmentation schemes used with the proposed approach.
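
    To make the core mechanism concrete, here is a minimal sketch of the idea in Python, assuming NLTK's WordNet interface: word relatedness from WordNet similarity serves as a fuzzy membership degree, and a short text pair is scored by averaging the best per-word memberships. The Wu-Palmer measure and the aggregation are illustrative choices, not the paper's exact rules or thresholds.

```python
# A sketch of the core mechanism: WordNet-based word relatedness used as
# a fuzzy membership degree, averaged over a short text pair. Requires
# NLTK with the WordNet corpus (nltk.download('wordnet')). The Wu-Palmer
# measure and the aggregation are illustrative, not the paper's rules.
from nltk.corpus import wordnet as wn

def word_relatedness(w1: str, w2: str, pos=wn.NOUN) -> float:
    """Best Wu-Palmer similarity over the WordNet synsets of two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1, pos=pos)
              for s2 in wn.synsets(w2, pos=pos)]
    return max(scores, default=0.0)

def fuzzy_text_similarity(source_words: list[str],
                          suspicious_words: list[str]) -> float:
    """Average, over suspicious words, of each word's best relatedness to
    any source word -- a fuzzy membership degree in [0, 1]."""
    if not suspicious_words:
        return 0.0
    memberships = [max((word_relatedness(w, s) for s in source_words),
                       default=0.0) for w in suspicious_words]
    return sum(memberships) / len(memberships)

degree = fuzzy_text_similarity(["car", "road"], ["vehicle", "street"])
print(f"fuzzy similarity degree: {degree:.2f}")  # high for related words
```

    In the full model, such membership degrees would feed the fuzzy rules together with the learned permission and variation thresholds described above.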

    Plagiarism detection techniques

    Academic dishonesty is a critical concern when evaluating research papers, theses, and students' assignments. Therefore, plagiarism detection is an area of concern for many researchers, especially in the academic field. Other fields, such as plagiarized news, magazine articles, and web resources, are also areas of concern. In that regard, many detection techniques and tools have been developed to address the problem of plagiarism. Different types of texts require different techniques to detect plagiarism. Documents to be retrieved, searched, and then judged for the existence of plagiarism can be classified into two types: programming source code documents and natural language documents.

    The first type is programming source code. Several approaches have been developed for source code plagiarism detection, also called code clone detection (John et al. 1981; Sam 1981; Marguerite et al. 1988; Parker et al. 1989; Wise 1992; Edward 2001a, 2001b; Shauna 2001; Belkhouche et al. 2004; Kim and Choi 2005; Mike et al. 2005; Mozgovoy et al. 2005; Peter and Julian 2005; Seunghak and Iryoung 2005; Chao et al. 2006; Christian and Tahaghoghi 2006; Samuel and Zelda 2006; Son et al. 2006; Jeong-Hoon et al. 2007; Lingxiao et al. 2007). This type of document has a specific structure that is language dependent. The word "language" here refers to a programming language such as FORTRAN, Pascal, C, Java, and many more. Thus, the detection algorithm depends on which programming language is used. Most early techniques targeted a single programming language. For instance, John et al. (1981) developed a plagiarism detection system for FORTRAN source code, and Sam (1981) developed a tool that detects plagiarism in Pascal programs; other such systems can be found in the literature. In addition, other techniques detect code clones across two or more programming languages. For example, Whale (1990) developed a system called Plague that works with Pascal and Prolog source code, and Xin et al. (2004) developed the SID (Shared Information Distance) system, which supports Java and C++ source code. Early code clone detection techniques focused on tracking metrics such as the number of lines, variables, statements, subprograms, and calls to subprograms (a minimal sketch of this idea follows this abstract). More recent research, however, uses the structure or style of the source code; such techniques are called stylometric (i.e., based on style or structure), and some research has also applied them to natural language plagiarism detection. The latest trend for code clone detection uses artificial neural networks (Steve et al. 2007), in which a network is trained on common features of the submitted documents: a number of metrics serve as input units, and a network output with a low error rate measures how related two documents are. In brief, code clone detection techniques aim to locate plagiarized code in one or more programming languages and rely on either metrics or the style/structure of the code.

    The second type is natural language documents, written in English, Arabic, or any other language. Detecting plagiarism in this type of document is much more difficult than in the first because natural languages are not easy to model. In contrast to code clone detection techniques, neither metrics nor structures can be maintained easily in natural language documents.
    Although research on detecting plagiarism in natural language started more than a decade after research on code clones (1981 for code clones vs. 1997 for natural language documents), many applicable techniques and useful tools have been developed for plagiarism detection in natural language documents (Antonio et al. 1997; Culwin et al. 2001; Zaslavsky et al. 2001; Monostori et al. 2002; Bao et al. 2003; Bao et al. 2004; Daniel and Mike 2004; Weir et al. 2004; Xin et al. 2004; Ye et al. 2004; Heon et al. 2005; Hui and Jamie 2005; Stefan and Stuart 2005; Yerra and Ng 2005; Bao et al. 2006a; Bao et al. 2006b; Byung-Ryul et al. 2006; Eissen and Stein 2006; Hui and Jamie 2006; Kang et al. 2006; Koberstein and Ng 2006; Manuel et al. 2006; Sebastian and Thomas 2006; Sorokina et al. 2006; Benno et al. 2007; Liu et al. 2007; Meyer zu Eissen et al. 2007; Rehurek 2007; Romans et al. 2007; Steve et al. 2007). The following sections discuss different representations of natural language documents for use in plagiarism detection.
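
    As a concrete illustration of the metrics-based code clone detection sketched above, the following minimal Python example counts a few structural metrics with the standard ast module and compares programs by the distance between their metric vectors. The metric set, the use of Python rather than FORTRAN or Pascal, and the distance interpretation are all illustrative assumptions.

```python
# A sketch of metrics-based clone detection using Python's ast module
# (purely illustrative: the early systems targeted FORTRAN and Pascal).
# The metric set and the distance interpretation are assumptions.
import ast

def code_metrics(source: str) -> list[int]:
    """Count simple structural metrics: functions, statements, names, calls."""
    funcs = stmts = names = calls = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            funcs += 1
        elif isinstance(node, ast.stmt):
            stmts += 1
        elif isinstance(node, ast.Name):
            names += 1
        elif isinstance(node, ast.Call):
            calls += 1
    return [funcs, stmts, names, calls]

def metric_distance(a: str, b: str) -> float:
    """Euclidean distance between metric vectors; near 0 is suspicious."""
    return sum((x - y) ** 2
               for x, y in zip(code_metrics(a), code_metrics(b))) ** 0.5

prog1 = "def add(a, b):\n    return a + b"
prog2 = "def plus(x, y):\n    return x + y"  # renamed clone of prog1
print(metric_distance(prog1, prog2))         # 0.0: metrics are identical
```

    The renamed clone is invisible to pure metrics, which is exactly why later stylometric and structural techniques were developed.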

    Understanding plagiarism linguistic patterns, textual features, and detection methods

    Plagiarism can take many different forms, ranging from copying texts to adopting ideas without giving credit to the originator. This paper presents a new taxonomy of plagiarism that highlights the differences between literal plagiarism and intelligent plagiarism from the plagiarist's behavioral point of view. The taxonomy supports a deep understanding of the different linguistic patterns in committing plagiarism, for example, changing text into a semantically equivalent form with different words and organization, shortening text through concept generalization and specification, and adopting the ideas and important contributions of others. Different textual features that characterize different plagiarism types are discussed. Systematic frameworks and methods of monolingual, extrinsic, intrinsic, and cross-lingual plagiarism detection are surveyed and correlated with the plagiarism types listed in the taxonomy. We conduct an extensive study of state-of-the-art techniques for plagiarism detection, including character n-gram-based (CNG), vector-based (VEC), syntax-based (SYN), semantic-based (SEM), fuzzy-based (FUZZY), structural-based (STRUC), stylometric-based (STYLE), and cross-lingual (CROSS) techniques. Our study corroborates that existing systems for plagiarism detection focus on copied text but fail to detect intelligent plagiarism when ideas are presented in different words.
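
    As an illustration of one of the surveyed families, here is a minimal sketch of a character n-gram (CNG) comparison in Python: each text is reduced to its set of character trigrams and the pair is scored with the Jaccard coefficient. Both n = 3 and the choice of Jaccard are illustrative, not parameters prescribed by the survey.

```python
# A sketch of the character n-gram (CNG) technique surveyed above: two
# texts are compared by the overlap of their character trigram sets.
# n = 3 and the Jaccard coefficient are illustrative choices.
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """All overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def cng_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two texts' character n-gram profiles."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

original = "the cat sat on the mat"
rewrite = "the cat sat upon the mat"
print(f"CNG similarity: {cng_similarity(original, rewrite):.2f}")
```

    Such literal-overlap measures illustrate the survey's conclusion: they catch copied text well but degrade quickly once ideas are rephrased in different words.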

    Deciphering the Efficacy of No-Attention Architectures in Computed Tomography Image Classification: A Paradigm Shift

    The burgeoning domain of medical imaging has witnessed a paradigm shift with the integration of AI, particularly deep learning, enhancing diagnostic precision and expediting the analysis of Computed Tomography (CT) images. This study introduces an innovative Multilayer Perceptron-driven model, DiagnosticMLP, which sidesteps the computational intensity of attention-based mechanisms in favor of a no-attention architecture that leverages Fourier transforms for global information capture and spatial gating units for local feature emphasis. The methodology encompasses a sophisticated augmentation and patching strategy at the input level, followed by a series of MLP blocks designed to extract hierarchical features and spatial relationships, culminating in a global average pooling layer before classification. Evaluated against state-of-the-art MLP-based models, including MLP-Mixer, FNet, gMLP, and ResMLP, across diverse and extensive CT datasets including abdominal and chest scans, DiagnosticMLP demonstrated a remarkable ability to converge efficiently, with competitive accuracy, F1 scores, and AUC metrics. Notably, on datasets featuring kidney and abdomen disorders, the model showcased superior generalization capabilities, underpinned by its unique design that addresses the complexity inherent in CT imaging. The findings in terms of accuracy and precision-recall balance position DiagnosticMLP as a strong alternative to attention-reliant models, paving the way for streamlined, efficient, and scalable AI tools in medical diagnostics and reinforcing the potential for AI-augmented precision medicine without dependency on attention-based architectures.
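
    The two no-attention ingredients the abstract names, Fourier-transform token mixing and spatial gating, can be sketched as follows in PyTorch, in the spirit of FNet and gMLP. The dimensions, block composition, and residual placement are assumptions for illustration, not the paper's exact DiagnosticMLP design.

```python
# A sketch of the no-attention ingredients named above: FFT-based global
# token mixing (as in FNet) and a spatial gating unit (as in gMLP).
# All dimensions and the block layout are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourierMixing(nn.Module):
    """Mix information globally across patches and channels with a 2D FFT,
    keeping the real part (no attention, no learned mixing weights)."""
    def forward(self, x):                      # x: (batch, patches, dim)
        return torch.fft.fft2(x, dim=(-2, -1)).real

class SpatialGatingUnit(nn.Module):
    """Split channels in half and gate one half by a learned projection
    of the other half along the patch (spatial) dimension."""
    def __init__(self, dim, num_patches):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.proj = nn.Linear(num_patches, num_patches)

    def forward(self, x):                      # x: (batch, patches, dim)
        u, v = x.chunk(2, dim=-1)
        v = self.proj(self.norm(v).transpose(1, 2)).transpose(1, 2)
        return u * v

class NoAttentionBlock(nn.Module):
    """One MLP block: FFT mixing, then a channel MLP with spatial gating."""
    def __init__(self, dim, num_patches, hidden=256):
        super().__init__()
        self.mix = FourierMixing()
        self.norm = nn.LayerNorm(dim)
        self.mlp_in = nn.Linear(dim, hidden * 2)
        self.sgu = SpatialGatingUnit(hidden * 2, num_patches)
        self.mlp_out = nn.Linear(hidden, dim)

    def forward(self, x):
        x = x + self.mix(x)                    # global mixing, no attention
        y = F.gelu(self.mlp_in(self.norm(x)))
        return x + self.mlp_out(self.sgu(y))   # gated local/channel path

patches = torch.randn(2, 64, 128)              # (batch, patches, channels)
out = NoAttentionBlock(dim=128, num_patches=64)(patches)
print(out.shape)                                # torch.Size([2, 64, 128])
```

    Stacking such blocks and finishing with global average pooling and a classifier head gives the overall shape of a no-attention MLP classifier as the abstract describes it.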

    On the use of fuzzy information retrieval for gauging similarity of Arabic documents

    As one of the richest human languages in terms of word constructions and diversity of meanings, Arabic makes judging similarity among statements in documents complex. In this paper, we present a mechanism for gauging the similarity of Arabic documents using a fuzzy IR model. The similarity degree of two documents is the averaged similarity among statements, which are treated as equal even when they have been restructured or reworded. We introduce fuzzy similarity sets such as near duplicate, very similar, similar, slightly similar, dissimilar, and very dissimilar. These similarity sets can be implemented as a spectrum of values ranging from 1 (duplicate) to 0 (different). We built a corpus collection in which all stop words were removed and non-stop words were stemmed using typical Arabic IR techniques. The corpus has 100 documents with 4,477 statements and 54,346 stemmed non-stop words in total. Another 15 query documents with 303 statements and 1,620 words were constructed specifically for our test. Experimental results show that fuzzy IR can be used to determine the extent to which documents are similar or dissimilar, where similarity can be mapped to one of the proposed fuzzy sets. The performance of our fuzzy IR system, measured in fuzzy precision and fuzzy recall, shows that it outperforms Boolean IR in retrieving more documents that have similar content but different synonyms.
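
    A minimal sketch of the scoring pipeline described above: document similarity is the average of each statement's best match in the other document, and the resulting degree is mapped onto the listed fuzzy sets. The word-overlap statement measure and the cut points are illustrative stand-ins, not the paper's membership functions.

```python
# A sketch of the fuzzy document-similarity idea: document similarity is
# the average of each statement's best match in the other document, then
# mapped onto the fuzzy sets above. The Jaccard statement measure and the
# cut points are illustrative, not the paper's membership functions.
def statement_similarity(s1: str, s2: str) -> float:
    """Word-overlap (Jaccard) similarity between two statements."""
    w1, w2 = set(s1.split()), set(s2.split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def document_similarity(doc1: list[str], doc2: list[str]) -> float:
    """Average, over doc1's statements, of the best match in doc2."""
    best = [max(statement_similarity(s, t) for t in doc2) for s in doc1]
    return sum(best) / len(best)

FUZZY_SETS = [(0.95, "near duplicate"), (0.80, "very similar"),
              (0.60, "similar"), (0.40, "slightly similar"),
              (0.20, "dissimilar"), (0.00, "very dissimilar")]

def fuzzy_label(score: float) -> str:
    """Map a [0, 1] similarity degree to the first matching fuzzy set."""
    return next(label for bound, label in FUZZY_SETS if score >= bound)

doc_a = ["the weather is nice today", "we went to the market"]
doc_b = ["the weather is nice today", "we walked to the market"]
score = document_similarity(doc_a, doc_b)
print(f"{score:.2f} -> {fuzzy_label(score)}")   # 0.83 -> very similar
```

    In the paper's setting, the statement measure would operate on stemmed, stop-word-free Arabic text rather than raw English tokens.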

    Evaluation of an Arabic Chatbot Based on Extractive Question-Answering Transfer Learning and Language Transformers

    Chatbots are programs with the ability to understand and respond to natural language in a way that is both informative and engaging. This study explores current trends in applying transformers and transfer learning techniques to Arabic chatbots. The proposed methods used various transformer and semantic embedding models: AraBERT, CAMeLBERT, AraElectra-SQuAD, and AraElectra (Generator/Discriminator). Two datasets were used for the evaluation: one with 398 questions, and another with 1,395 questions and 365,568 documents sourced from Arabic Wikipedia. Extensive experimental work was conducted, evaluating both manually crafted questions and the entire question sets using confidence and similarity metrics. Our results demonstrate that combining the transformer architecture with extractive chatbots can provide more accurate and contextually relevant answers to questions in Arabic. Specifically, the AraElectra-SQuAD model consistently outperformed the other models, achieving an average confidence score of 0.6422 and an average similarity score of 0.9773 on the first dataset, and an average confidence score of 0.6658 and an average similarity score of 0.9660 on the second. The study concludes that AraElectra-SQuAD showed remarkable performance, high confidence, and robustness, highlighting its potential for practical applications in natural language processing tasks for Arabic chatbots. The study suggests that language transformers can be further enhanced and used for various tasks, such as specialized chatbots, virtual assistants, and information retrieval systems for Arabic-speaking users.
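
    A minimal sketch of the extractive question-answering setup, assuming the Hugging Face transformers library; the model identifier below is an assumed placeholder for an AraElectra checkpoint fine-tuned on Arabic SQuAD-style data, not necessarily the exact checkpoint the study used.

```python
# A sketch of extractive QA with a transformers pipeline, mirroring the
# setup above. The model identifier is an assumed placeholder for an
# AraElectra model fine-tuned on Arabic SQuAD-style data; substitute the
# checkpoint actually used in the study.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA",  # assumed checkpoint
)

# Context: "Riyadh lies in the center of Saudi Arabia and is its capital."
context = "تقع مدينة الرياض في وسط المملكة العربية السعودية وهي عاصمتها."
# Question: "What is the capital of Saudi Arabia?"
question = "ما هي عاصمة المملكة العربية السعودية؟"

result = qa(question=question, context=context)
print(result["answer"], result["score"])  # extracted span + confidence
```

    The pipeline's score field corresponds to the confidence metric the evaluation above reports, alongside a separate similarity measure between the extracted answer and the reference.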

    Multi-Slice Generation sMRI and fMRI for Autism Spectrum Disorder Diagnosis Using 3D-CNN and Vision Transformers

    Researchers have explored various potential indicators of ASD, including changes in brain structure and activity, genetics, and immune system abnormalities, but no definitive indicator has yet been found. This study therefore investigates ASD indicators using two types of magnetic resonance imaging (MRI), structural (sMRI) and functional (fMRI), and addresses the issue of limited data availability. Transfer learning is a valuable technique when working with limited data, as it utilizes knowledge gained from a model pre-trained on a domain with abundant data. This study proposes the use of four pre-trained vision models, namely ConvNeXt, MobileNet, Swin, and ViT, on sMRI modalities, and also investigates a 3D-CNN model with both sMRI and fMRI modalities. Our experiments involved different methods of generating data and extracting slices from raw 3D sMRI and 4D fMRI scans along the axial, coronal, and sagittal brain planes. To evaluate our methods, we used the standard NYU neuroimaging dataset from the ABIDE repository to classify ASD subjects against typical control subjects. The performance of our models was evaluated against several baselines, including studies that implemented VGG and ResNet transfer learning models. Our experimental results validate the effectiveness of the proposed multi-slice generation with the 3D-CNN and transfer learning methods, which achieved state-of-the-art results. In particular, the 50 middle slices from fMRI with the 3D-CNN showed profound promise for ASD classification, obtaining a maximum accuracy of 0.8710 and an F1-score of 0.8261 when using the mean of the 4D images across the axial, coronal, and sagittal planes. Additionally, using all fMRI slices except those at the beginning and end of the brain views helped to reduce irrelevant information and achieved a good performance of 0.8387 accuracy and 0.7727 F1-score. Lastly, transfer learning with the ConvNeXt model achieved higher results than the other pre-trained models when using the 50 middle sMRI slices along the axial, coronal, and sagittal planes.
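
    A minimal sketch of the multi-slice generation step in Python with NumPy: the 4D fMRI scan is averaged over time, then the 50 middle slices are taken along each brain plane. The axis-to-plane mapping is an assumption that depends on the orientation of the actual NIfTI volumes.

```python
# A sketch of multi-slice generation: average a 4D fMRI scan over time,
# then take the 50 middle slices along each brain plane. The axis order
# (sagittal, coronal, axial, time) is an assumption about volume layout.
import numpy as np

def middle_slices(volume: np.ndarray, axis: int, count: int = 50) -> np.ndarray:
    """Return `count` consecutive slices centered on the middle of `axis`."""
    mid = volume.shape[axis] // 2
    start = max(mid - count // 2, 0)
    return np.take(volume, range(start, start + count), axis=axis)

fmri = np.random.rand(61, 73, 61, 120)   # stand-in for one 4D fMRI scan
mean_volume = fmri.mean(axis=-1)         # collapse the time dimension

for axis, plane in enumerate(("sagittal", "coronal", "axial")):
    slices = middle_slices(mean_volume, axis=axis)
    print(plane, slices.shape)           # 50 slices taken along each plane
```

    The resulting 2D slices can then be fed to the pre-trained vision models, or stacked back into sub-volumes for the 3D-CNN path the abstract describes.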