2,502 research outputs found

    A new corpus for the evaluation of arabic intrinsic plagiarism detection

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40802-1_6
    The present paper introduces the first corpus for the evaluation of Arabic intrinsic plagiarism detection. The corpus consists of 1024 artificial suspicious documents in which 2833 plagiarism cases have been inserted automatically from source documents.
    This work is the result of the collaboration in the framework of the bilateral research project AECID-PCI AP/043848/11 (Application of Natural Language Processing to the Need of the University) between the Universitat Politècnica de València in Spain and Constantine 2 University in Algeria.
    Bensalem, I.; Rosso, P.; Chikhi, S. (2013). A new corpus for the evaluation of arabic intrinsic plagiarism detection. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer Verlag (Germany). 53-58. https://doi.org/10.1007/978-3-642-40802-1_6

    Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase

    The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages on the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to the different types of obfuscation found in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are considered the candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including Jaccard similarity and merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, which is used as the benchmark, sets a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to filter outlier candidates automatically by their scale of obfuscation. With the online tuning approach, no separate training dataset is required to train the system. We applied our proposed method to available datasets in English, Persian and Arabic on the text alignment task to evaluate the robustness of the proposed methods from the language perspective as well. As our experimental results confirm, our efficient approach can achieve considerable performance on the different datasets in various languages. Our online threshold tuning approach, which uses no training dataset, works as well as, or in some cases even better than, the training-based method.
    The work of Paolo Rosso was partially funded by the Spanish MICINN under the research Project MISMIS-FAKEn-HATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).
    Gharavi, E.; Veisi, H.; Rosso, P. (2020). Scalable and Language-Independent Embedding-based Approach for Plagiarism Detection Considering Obfuscation Type: No Training Phase. Neural Computing and Applications. 32(14):10593-10607. https://doi.org/10.1007/s00521-019-04594-y
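The seed-finding step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word vectors, the choice of a mean as the aggregation function, the use of Jaccard overlap as a simple lower-bound filter, and the names `embed`, `find_seeds`, `min_cosine` and `min_jaccard` are all assumptions made for illustration.

```python
import math

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "feline": [0.9, 0.1, 0.0],
    "sat": [0.0, 1.0, 0.0],
    "rested": [0.1, 0.9, 0.0],
    "stocks": [0.0, 0.0, 1.0],
    "fell": [0.0, 0.2, 0.9],
}
DIM = 3


def embed(sentence):
    """Represent a sentence as the mean of its known word vectors (assumed aggregation)."""
    vecs = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0] * DIM
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def find_seeds(suspicious, sources, min_cosine=0.8, min_jaccard=0.0):
    """For each suspicious sentence, keep its most similar source sentence as a seed
    if it passes both thresholds (threshold values here are arbitrary assumptions)."""
    if not sources:
        return []
    source_vecs = [embed(s) for s in sources]
    seeds = []
    for i, sentence in enumerate(suspicious):
        vec = embed(sentence)
        best_j = max(range(len(sources)), key=lambda j: cosine(vec, source_vecs[j]))
        if (cosine(vec, source_vecs[best_j]) >= min_cosine
                and jaccard(sentence, sources[best_j]) >= min_jaccard):
            seeds.append((i, best_j))
    return seeds
```

In this sketch the paraphrased sentence "the feline rested" is paired with "the cat sat" even though the two share only one word, which is the kind of obfuscated case an embedding comparison is meant to catch; raising `min_jaccard` would discard that seed again.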

    Semantic Similarity Analysis for Paraphrase Identification in Arabic Texts


    Intelligent Plagiarism Detection for Electronic Documents

    Plagiarism detection is the process of finding similarities among electronic documents. Demand for it has grown recently because of the large number of documents available on the internet and the ease of copying and pasting text from relevant documents with simple Control+C and Control+V commands. The proposed solution is to investigate and develop an easy, fast plagiarism detector with multi-language support that checks a document for plagiarism with a single click. The process is supported by an intelligent system that can learn, change, and adapt to the input document; it runs a fast cross-search for the content in both the local repository and the online repository, and links the content of the file to matching content wherever it is found. The supported document types are Word files, text files and, in some cases, PDF files from which the text can be extracted. This is made possible by the DLL file of the Word application that Microsoft provides on the OS. Using the DLL frees us from constraints on how to get the text from files, and lets us load the file into our Delphi project, walk through our methodology, and read the file word by word to guarantee the best working scenarios for the calculation. As a result, this process will help raise document quality, enhance writers' experience of their own work, and protect the copyright of the documents' original authors by providing the concerned institutions with a new, free, easy and fast alternative tool for the plagiarism detection problem.

    Revisiting the challenges and surveys in text similarity matching and detection methods

    The massive amount of information from the internet has revolutionized the field of natural language processing. One of the challenges was estimating the similarity between texts. This has been an open research problem although various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods of text similarity research. This paper identified four issues and several applications of text similarity matching. It classified current studies based on intrinsic, extrinsic, and hybrid approaches. Then, we identified the methods and classified them into lexical-similarity, syntactic-similarity, semantic-similarity, structural-similarity, and hybrid. Furthermore, this study also analyzed and discussed method improvement, current limitations, and open challenges on this topic for future research directions

    A Robust System for Local Reuse Detection of Arabic Text on the Web

    We developed techniques for finding local text reuse on the Web, with an emphasis on the Arabic language. That is, our objective is to develop text reuse detection methods that can detect alternative versions of the same information, and to explore the feasibility of employing such methods on the Web. The results of this research can be thought of as rich tools for information analysts in corporate and intelligence applications. Such tools will become essential in validating and assessing information coming from uncertain origins. These tools will prove useful for detecting reuse in scientific literature too. It is also time for ordinary Web users to become Fact Inspectors, given a tool that allows people to quickly check the validity and originality of statements and their sources, so that they have the opportunity to perform their own assessment of information quality. Local text reuse detection can be divided into two major subtasks: the first is the retrieval of candidate documents that are likely to be the original sources of a given document in a collection of documents, and the second is an extensive pairwise comparison between the given document and each of the retrieved possible sources of text reuse. For this purpose, we develop a new technique to address the challenging problem of candidate document retrieval from the Web. Given an input document d, the problem of local text reuse detection is to detect, from a given document collection, all the possible reused passages between d and the other documents. Comparing the passages of document d with the passages of every other document in the collection is obviously infeasible, especially with large collections such as the Web. Therefore, selecting a subset of the documents that potentially contain text reused with d becomes a major step in the detection problem. In the setting of the Web, the search for such candidate source documents is usually performed through a limited query interface. We developed a new efficient approach to query formulation to retrieve Arabic-based candidate source documents from the Web. The candidate documents are then fed to a local text reuse detection system for detailed similarity evaluation with d. We consider the candidate source document retrieval problem an essential step in the detection of text reuse. Several techniques have previously been proposed for detecting text reuse; however, these techniques were designed for relatively small and homogeneous collections. Furthermore, we are not aware of any actual previous work on Arabic text reuse detection on the Web. This is due to the complexity of the Arabic language as well as the heterogeneity and large scale of the information on the Web, which make the task of text reuse detection on the Web much more difficult than in relatively small and homogeneous collections. We evaluated the work using a collection of documents specially constructed and downloaded from the Web for the evaluation of Web document retrieval in particular and detailed text reuse detection in general. Our work is to a certain degree exploratory rather than definitive, in that this problem has not been investigated before for Arabic documents at Web scale. However, our results show that the methods we described are applicable to Arabic-based reuse detection in practice. The experiments show that around 80% of the Web documents used in the reused cases were successfully retrieved. As for the detailed similarity analysis, the system achieved an overall score of 97.2% based on the precision and recall evaluation metrics.

    Issues Related to the Detection of Source Code Plagiarism in Students Assignments

    Detecting similarity or plagiarism in academic research publications, source code, etc. has long been a complex and time-consuming task. Several algorithms, tools and websites exist that try to find plagiarism or possible plagiarism in such creative works. In this paper we used source code plagiarism detection tools to assess the level of plagiarism in source code. We also investigated issues related to accuracy and challenges in detecting possible plagiarism in students' assignments. In a second study, we evaluated some tools for detecting possible plagiarism in research papers. Results showed that such a decision is not binary and that subjectivity is high. In addition, plagiarism detection tools need to be tunable so that their users can assign criticality levels or weights to categorize and classify different levels of seriousness of plagiarism.