10 research outputs found

    Textual and structural approaches to detecting figure plagiarism in scientific publications

    Figures play an important role in disseminating ideas and findings, enabling readers to understand the details of a work. This growing reliance on figures has led to a serious problem: taking other people’s figures without giving credit to the source. Although significant effort has gone into methods for estimating pairwise diagram similarity, the research community has paid little attention to detecting instances of figure plagiarism, such as manipulating a figure’s structure, inserting, deleting, or substituting its components, or altering its text content. To address this gap, this project compares the effectiveness of textual and structural figure representations in supporting figure plagiarism detection. The textual comparison method matches figure contents using a word-gram representation with the Jaccard similarity measure, while the structural comparison method compares both the text within components and the relationships between components using a graph edit distance measure. These techniques are evaluated experimentally across seven instances of figure plagiarism, in terms of their similarity values and the precision and recall metrics. The experimental results show that the structural representation of figures slightly outperformed the textual representation in detecting all instances of figure plagiarism.
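The textual comparison described above pairs a word-gram representation with the Jaccard measure. A minimal sketch of that idea, assuming simple whitespace tokenisation and word trigrams (the example sentences and the choice of n=3 are illustrative, not from the paper):

```python
def word_ngrams(text, n=3):
    """Build the set of word n-grams ("word-grams") for a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical text extracted from two figure components.
src = word_ngrams("the process starts when the user submits a request")
sus = word_ngrams("the process starts when a client submits a request")
score = jaccard(src, sus)  # shared trigrams over all trigrams
```

A substitution of a few words leaves some n-grams intact, so the score stays above zero but well below one; larger n makes the measure stricter.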

    A Deep Learning Approach to Persian Plagiarism Detection

    ABSTRACT Plagiarism detection is defined as the automatic identification of reused text materials. The general availability of the internet and easy access to textual information increase the need for automated plagiarism detection. In this regard, different algorithms have been proposed to perform the task of plagiarism detection in text documents. Due to the drawbacks and inefficiency of traditional methods and the lack of proper algorithms for Persian plagiarism detection, in this paper we propose a deep learning based method to detect plagiarism. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods are used to combine the word vectors into a sentence representation. By comparing representations of source and suspicious sentences, sentence pairs with the highest similarity are considered candidates for plagiarism. The final decision on plagiarism is made using a two-level evaluation method. Our method was used in the PAN2016 Persian plagiarism detection contest and achieved 90.6% plagdet, 85.8% recall, and 95.9% precision on the provided data sets. CCS Concepts • Information systems → Near-duplicate and plagiarism detection • Information systems → Evaluation of retrieval results
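The "simple aggregation" of word vectors into a sentence representation can be sketched as component-wise averaging followed by cosine comparison. The tiny embedding table below is entirely made up for illustration; a real system would load pretrained word vectors for the target language:

```python
import math

# Toy 2-d embedding table (illustrative values only, not from the paper).
EMB = {
    "cat": (1.0, 0.0),
    "sat": (0.5, 0.5),
    "mat": (0.9, 0.1),
    "dog": (0.0, 1.0),
}

def sentence_vector(words):
    """Simple aggregation: component-wise average of the word vectors."""
    vecs = [EMB[w] for w in words if w in EMB]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Near-identical sentences yield a similarity close to 1.
sim = cosine(sentence_vector(["cat", "sat", "mat"]),
             sentence_vector(["cat", "sat"]))
```

Averaging discards word order, which is why the paper layers a second evaluation level on top of the candidate pairs found this way.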

    A Survey of Scholarly Data: From Big Data Perspective

    Recently, there has been a shifting focus of organizations and governments towards digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of this generated data satisfy the big data definition; as a result, this scholarly reserve is popularly referred to as big scholarly data. In order to facilitate data analytics for big scholarly data, architectures and services for it need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications like collaborator discovery, expert finding and research recommendation systems, in addition to several others. This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.

    Detekce duplicit v rozsáhlých webových bázích dat (Duplicate Detection in Large Web Databases)

    This master’s thesis analyses the methods used for duplicate document detection and the possibilities of integrating them with a web search engine. It offers an overview of commonly used methods, from which it chooses approximation of the Jaccard similarity measure in combination with shingling. The chosen method is adapted for implementation in the Egothor web search engine environment. The aim of the thesis is to present this implementation, describe its features, and find the most suitable parameters so that detection can run in real time. An important feature of the described method is also the possibility of making dynamic changes over the collection of indexed documents. (Department of Software Engineering, Faculty of Mathematics and Physics)
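The combination the thesis selects, shingling plus an approximation of the Jaccard measure, is commonly realised with MinHash signatures: the fraction of matching signature slots estimates the true Jaccard similarity without comparing full shingle sets. A minimal sketch, assuming character 4-shingles and 64 salted hash functions (parameter choices here are illustrative, not the thesis’s tuned values):

```python
import hashlib
import random

def shingles(text, k=4):
    """Character k-shingles of a whitespace-normalised document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    """For each salted hash function, keep the minimum hash over all shingles."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    sig = []
    for salt in salts:
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature(shingles("duplicate detection in large web databases"))
sig_b = minhash_signature(shingles("a completely unrelated sentence here"))
```

Because signatures are short fixed-length vectors, they can be stored in the search engine’s index and compared in real time, which matches the thesis’s goal.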

    Comparing Texts of the Okmulgee Constitution: Fourteen Instrument Versions and Levenshtein’s Edit Distance Metric

    In 1870, the Five Civilized Tribes and other tribes within the Indian Territory initiated a series of council meetings to deal with seven federal stipulations presented at Fort Smith in 1865 and with new treaties established in 1866. One development was the so-called December 1870 Okmulgee Constitution, fashioned in the Creek capital, which provided a model for a new full-fledged Indian state to replace the Territory. Various versions of the text of that document (and of a revised rendition) were published as part of the official and unofficial record of the sequence of proceedings. This study examined fourteen variants of that Okmulgee Constitution, in terms of the documents’ provenance and of their variability as quantified through the application of Levenshtein’s edit distance algorithm.
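Levenshtein’s edit distance, the metric this study applies to quantify variability between document versions, counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one text into another. A standard dynamic-programming sketch (the example strings are illustrative, not drawn from the constitution texts):

```python
def levenshtein(a, b):
    """Minimum insertions, deletions, and substitutions to turn a into b,
    computed row by row to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

d = levenshtein("kitten", "sitting")  # classic example: distance 3
```

Identical variants score 0, so across fourteen versions the pairwise distances form a matrix whose larger entries flag the most heavily revised renditions.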

    A study on plagiarism detection and plagiarism direction identification using natural language processing techniques

    Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of text are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on the plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than relying only on traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and of statistical, shallow and deep linguistic techniques, using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning to the plagiarism detection task is helpful. This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn from traits of the texts to build a pattern for original and rewritten texts. Thus, the classification or ranking task is to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems addressed in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits. The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse. (EThOS - Electronic Theses Online Service, United Kingdom)