815 research outputs found

    Computational Approaches to Measuring the Similarity of Short Contexts : A Review of Applications and Methods

    Measuring the similarity of short written contexts is a fundamental problem in Natural Language Processing. This article provides a unifying framework by which short context problems can be categorized both by their intended application and proposed solution. The goal is to show that various problems and methodologies that appear quite different on the surface are in fact very closely related. The axes by which these categorizations are made include the format of the contexts (headed versus headless), the way in which the contexts are to be measured (first-order versus second-order similarity), and the information used to represent the features in the contexts (micro versus macro views). The unifying thread that binds together many short context applications and methods is the fact that similarity decisions must be made between contexts that share few (if any) words in common. Comment: 23 pages.
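
    The first-order versus second-order distinction named in the abstract above can be illustrated with a minimal Python sketch. It assumes one common reading of the terms: first-order similarity compares the words two contexts share directly, while second-order similarity compares the co-occurrence profiles of their words. The `COOC` table and both example contexts are hypothetical toy data, not taken from the article.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_order(c1, c2):
    """Compare contexts by the words they contain directly."""
    return cosine(Counter(c1.split()), Counter(c2.split()))


def second_order(c1, c2, cooc):
    """Compare contexts by the summed co-occurrence vectors of their words.

    Words missing from the co-occurrence table contribute nothing.
    """
    def represent(context):
        total = Counter()
        for word in context.split():
            total.update(cooc.get(word, {}))
        return total

    return cosine(represent(c1), represent(c2))


# Hypothetical co-occurrence counts, for illustration only.
COOC = {
    "physician": {"hospital": 3, "patient": 2},
    "doctor": {"hospital": 2, "patient": 3},
}

context_a = "physician arrived"
context_b = "doctor came"
print(first_order(context_a, context_b))         # no shared words
print(second_order(context_a, context_b, COOC))  # shared co-occurrence profile
```

    The two contexts share no words, so the first-order score is zero, yet the second-order score is high because "physician" and "doctor" co-occur with the same terms in the toy table.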

    TAKSONOMIJA METODA AKADEMSKOG PLAGIRANJA (Taxonomy of Academic Plagiarism Methods)

    The article gives an overview of the plagiarism domain, with a focus on academic plagiarism. It defines plagiarism and explains the origin of the term and of related terms. It outlines the extent of the plagiarism domain and then focuses on the subdomain of text documents, for which it reviews current classifications and taxonomies and proposes a more comprehensive classification according to several criteria: origin and purpose, technical implementation, consequences, complexity of detection, and the number of linguistic sources. The article proposes a new classification of academic plagiarism and describes the kinds and methods of plagiarism, its types and categories, the approaches to and phases of plagiarism detection, and the classification of detection methods and algorithms. Although the title explicitly targets the academic community, the article is sufficiently general and interdisciplinary to be useful to other professionals such as software developers, linguists, and librarians.

    An improved extrinsic monolingual plagiarism detection approach of the Bengali text

    Plagiarism is an act of literary fraud: presenting others’ work or ideas without giving credit to the original. All published and unpublished written documents fall under this definition. Plagiarism, which has increased significantly over the last few years, is a concern for students, academicians, and professionals. As a result, several plagiarism detection tools are available for different languages. Unfortunately, very little work has been done and no plagiarism detection software is available for Bengali, even though it is one of the most spoken languages in the world. In this paper, we propose a plagiarism detection tool for the Bengali language that mainly focuses on the educational and newspaper domains. We collected 82 textbooks from the National Curriculum of Textbooks (NCTB), Bangladesh, scraped all articles from 12 reputed newspapers, and compiled a corpus of more than 10 million sentences. The proposed method on the Bengali text corpus shows an accuracy rate of 97.31

    Semantically-informed distance and similarity measures for paraphrase plagiarism identification

    Paraphrase plagiarism identification is a very complex task because plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures can extract semantic information from either an external resource or a distributed representation of words, yielding informative features for training a supervised classifier to detect paraphrase plagiarism. The results indicate that the proposed metrics are consistently good at detecting different types of paraphrase plagiarism. In addition, the results are very competitive with state-of-the-art methods, with the advantage of being a much simpler but equally effective solution.
    This work was partially supported by CONACYT under scholarship 401887, project grants 257383, 258588 and 2016-01-2410 and under the Thematic Networks program (Language Technologies Thematic Network project 281795). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (Prometeo II/2014/030).
    Álvarez Carmona, M.; Franco-Salvador, M.; Villatoro-Tello, E.; Montes Gomez, M.; Rosso, P.; Villaseñor Pineda, L. (2018). Semantically-informed distance and similarity measures for paraphrase plagiarism identification. Journal of Intelligent & Fuzzy Systems. 34(5):2983-2990. https://doi.org/10.3233/JIFS-169483
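
    To make the idea of a semantically-informed edit distance concrete, here is a minimal Python sketch. It is an assumption about the general technique, not the authors' formulation: a word-level Levenshtein distance in which the substitution cost is one minus a word-to-word similarity score. The `toy_sim` function and its lookup table are hypothetical stand-ins for an external resource or word embeddings.

```python
def semantic_edit_distance(src, tgt, sim):
    """Word-level edit distance with similarity-weighted substitutions.

    src, tgt: lists of tokens.
    sim: function returning a similarity in [0, 1] for two words.
    Insertions and deletions cost 1; a substitution costs 1 - sim(a, b).
    """
    n, m = len(src), len(tgt)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 1.0 - sim(src[i - 1], tgt[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,               # delete src word
                dist[i][j - 1] + 1.0,               # insert tgt word
                dist[i - 1][j - 1] + substitution,  # substitute or match
            )
    return dist[n][m]


# Hypothetical similarity scores, for illustration only.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_sim(a, b):
    """Return 1.0 for identical words, else a table lookup (default 0.0)."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def exact_sim(a, b):
    """Similarity that ignores semantics; recovers plain Levenshtein."""
    return 1.0 if a == b else 0.0


original = "the car stopped".split()
reworded = "the automobile stopped".split()
print(semantic_edit_distance(original, reworded, exact_sim))
print(semantic_edit_distance(original, reworded, toy_sim))
```

    With the exact-match similarity the synonym swap costs a full edit, while the toy semantic similarity makes it nearly free, which is the behaviour a paraphrase detector would want from such a feature.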

    English-Persian Plagiarism Detection based on a Semantic Approach

    Plagiarism, defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to the dissemination and publication of knowledge. It has been divided into four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism, sometimes referred to as cross-lingual plagiarism, in which writers meld a translation with their own words and ideas. Building on monolingual plagiarism detection methods, the paper aims to find a way to detect cross-lingual plagiarism. It presents a framework called Multi-Lingual Plagiarism Detection (MLPD) for cross-lingual plagiarism analysis, with the ultimate objective of detecting plagiarism cases. English is the reference language, and Persian materials are back-translated using translation tools. The data for assessing MLPD were obtained from the English-Persian Mizan parallel corpus. Apache Solr was used to crawl and index the documents. The mean accuracy of the proposed method was 98.82% when highly accurate translation tools were employed, whereas with the Google translation service it was 56.9%. These tests demonstrate that improved translation tools enhance the accuracy of the proposed method

    Plagiarism detection for document

    Our project aims to provide semantic plagiarism detection using natural language processing techniques. Plagiarism detection for documents is a very useful technique, as students nowadays depend heavily on the Internet. The wide use and availability of electronic resources make it easy for students, authors, and even academics to access any piece of information and embed it in their own work without proper citation. Our project helps authors, writers, and others to secure their files and keep them safe. This web application lets users upload files easily and check them for plagiarism more efficiently, accurately, and securely

    Monolingual Plagiarism Detection and Paraphrase Type Identification


    A review of detection plagiarism in indonesian language

    Plagiarism is the act of copying another person's work, whether writing, ideas, or other creative output, without citing the source of the work or idea. This practice is disrespectful, violates codes of ethics, and is opposed by all parties, including scientists and governments. It occurs because the internet provides unlimited information services. Many studies have addressed plagiarism. This article reviews how far plagiarism research has progressed for Indonesian-language writing. Knowing how this research has developed will give further research better continuity

    WASABI: a Two Million Song Database Project with Audio and Cultural Metadata plus WebAudio enhanced Client Applications

    This paper presents the WASABI project, started in 2017, which aims at (1) the construction of a 2 million song knowledge base that combines metadata collected from music databases on the Web, metadata resulting from the analysis of song lyrics, and metadata resulting from audio analysis, and (2) the development of semantic applications with high added value to exploit this semantic database. A preliminary version of the WASABI database is already online and will be enriched throughout the project. The main originality of this project is the collaboration between the algorithms that extract semantic metadata from the web and from song lyrics and the algorithms that work on the audio. The following WebAudio enhanced applications are planned as companions for the WASABI database and will be associated with each song: an online mixing table, guitar amp simulations with a virtual pedal-board, audio analysis visualization tools, annotation tools, and a similarity search tool that works by uploading audio extracts or playing a melody on a MIDI device