    AntiPlag: Plagiarism Detection on Electronic Submissions of Text Based Assignments

    Plagiarism is a growing problem in academia and a constant concern for universities and other academic institutions. The abundance of resources on the web is making the situation worse. This paper focuses on creating an effective and fast plagiarism detection tool for text-based electronic assignments. Our tool, named AntiPlag, is built on the tri-gram sequence matching technique. Three sets of text-based assignments were tested with AntiPlag, and the results were compared against an existing commercial plagiarism detection tool. AntiPlag produced fewer false positives than the commercial tool, owing to the pre-processing steps it performs. In addition, to reduce detection latency, AntiPlag applies a data clustering technique that makes it four times faster than the commercial tool considered. AntiPlag can be used to separate plagiarised text-based assignments from non-plagiarised ones easily. We therefore present AntiPlag as a fast and effective tool for plagiarism detection on text-based electronic assignments.
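
    As a rough illustration of tri-gram sequence matching, the sketch below scores a suspect text by the fraction of its word tri-grams that also appear in a source text. It is a minimal sketch, not AntiPlag's implementation: the function names, the normalization (lowercasing, dropping punctuation), and the overlap score are assumptions made for the example, and the clustering step is not shown.

```python
import re


def word_trigrams(text):
    """Return the set of word tri-grams in text after light normalization."""
    # Assumption: lowercasing and dropping punctuation stand in for
    # AntiPlag's pre-processing, whose exact steps the abstract does not list.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_overlap(suspect, source):
    """Fraction of the suspect's tri-grams that also occur in the source."""
    suspect_grams = word_trigrams(suspect)
    if not suspect_grams:
        return 0.0  # texts shorter than three words have no tri-grams
    shared = suspect_grams & word_trigrams(source)
    return len(shared) / len(suspect_grams)


if __name__ == "__main__":
    a = "the quick brown fox jumps"
    b = "The quick brown fox sleeps."
    print(trigram_overlap(a, b))
```

    A score near 1 suggests that most of the suspect text is shared with the source; where to set the threshold is a tuning decision the abstract does not specify.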

    Qlusty: Quick and dirty generation of event videos from written media coverage

    Qlusty automatically generates videos describing how different news outlets cover the same event. Through four modules, it identifies events, de-duplicates notes, ranks according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the first two modules and a qualitative analysis of the last two. The results show the potential for achieving our main aim: helping to break the information bubble that is so common in the current news landscape.

    Efficient Similarity Measures for Texts Matching

    Calculating similarity measures for exact text matching is a critical task in pattern matching that deserves close attention. Many similarity measures exist in the literature, but no single best method exists for measuring the closeness of two strings. The objective of this paper is to explore the grammatical properties and features of the generalized n-gram matching technique in order to find exact text in electronic computer applications. Three new similarity measures are proposed to improve the performance of the generalized n-gram method. The new methods yielded high similarity values and a favorable performance-to-price ratio, with low running times. Experiments demonstrated that the new methods are universal and useful for words that can be derived from a word list as a group and for retrieving relevant medical terms from a database. One of the methods achieved the best correlation of values for the evaluation of subjective examinations.
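
    To make the idea of an n-gram similarity measure concrete, the sketch below computes the Dice coefficient over character n-grams and uses it to pick the closest term from a word list. The Dice coefficient is a standard n-gram measure chosen here for illustration; it is not one of the three measures proposed in the paper, and the function names and sample terms are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the list of character n-grams of a word (lowercased)."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over the sets of character n-grams of a and b."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def closest_term(query, terms, n=2):
    """Return the term from a word list that is most similar to the query."""
    return max(terms, key=lambda term: dice_similarity(query, term, n))


if __name__ == "__main__":
    # Hypothetical word list; a misspelled query still finds its neighbour.
    print(closest_term("hepatitus", ["hepatitis", "hepatoma", "nephritis"]))
```

    Using sets ignores repeated n-grams within a word; a multiset variant would count them, at the cost of slightly more bookkeeping.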

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force ($n^2D$) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
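
    The minwise hashing idea that FLASH builds on can be sketched with classical MinHash: the fraction of positions at which two signatures agree estimates the Jaccard similarity of the underlying sets. The code below is a minimal sketch of that classical scheme with k independent hash functions, not FLASH's one-pass variant, reservoir sampling, or GPU code; the function names, the affine hash family, and the use of CRC32 to map tokens to integers are assumptions made for the example.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit token id


def make_hashes(k, seed=0):
    """Draw k affine hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def signature(tokens, hashes):
    """MinHash signature of a non-empty collection of string tokens."""
    # CRC32 gives a deterministic integer id per token (unlike built-in
    # hash(), which is salted per process for strings).
    ids = [zlib.crc32(token.encode("utf-8")) for token in tokens]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hashes(512)
    set_a = {f"w{i}" for i in range(100)}
    set_b = {f"w{i}" for i in range(50, 150)}  # true Jaccard is 1/3
    print(estimate_jaccard(signature(set_a, hashes), signature(set_b, hashes)))
```

    The estimate's variance shrinks as the number of hash functions grows, which is the usual trade-off between signature size and accuracy.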

    Towards Document Plagiarism Detection Based on the Relevance and Fragmentation of the Reused Text

    Plagiarism Detection using Enhanced Relative Frequency Model

    As technology advances, it is becoming harder to protect data from being copied, so detecting copied content is more practical than trying to secure it. Here, content covers digital documents of scientific research, articles in newspapers and journals, and assignments submitted by students. Many tools and algorithms exist to detect plagiarism, but the time complexity of the algorithm matters greatly when a document is compared against a very large data set. Vector-based methods are frequently used in plagiarism detection, but the existing ones have drawbacks. In the SCAM approach, the choice of the 'e' (epsilon) value is a drawback, because 'e' determines the closeness set, and the Daniel approach fails to identify plagiarism when a sentence contains repeated terms. We propose a new algorithm, developed using the concepts of the Relative Frequency Model, that overcomes the drawbacks of the existing methods. In implementing the proposed method, we employed a sentence splitter, a stop-word removal process, and word stemming.
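
    The pre-processing stages named above (sentence splitting, stop-word removal, stemming) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the regex splitter, the tiny stop-word list, and the crude suffix-stripping stemmer are hypothetical stand-ins for whatever components the paper actually used.

```python
import re

# Assumption: a tiny illustrative stop-word list; real systems use fuller ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was", "were"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def naive_stem(word):
    """Crude suffix stripping, a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        words = re.findall(r"[a-z]+", sentence.lower())
        result.append([naive_stem(w) for w in words if w not in STOP_WORDS])
    return result


if __name__ == "__main__":
    print(preprocess("The students copied the reports. Plagiarism is detected!"))
```

    Each sentence's token list could then be turned into a term-frequency vector for comparison; how the proposed algorithm weights those frequencies is not described in the abstract.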

    Paraphrase Plagiarism Identification with Character-level Features

    Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, the majority of these methods fail to reliably detect paraphrase plagiarism due to the high complexity of the task, even for human beings. Paraphrase plagiarism identification consists in automatically recognizing document fragments that contain re-used text, which is intentionally hidden by means of some rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions. Our main hypothesis establishes that the original author's writing style fingerprint prevails in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that gathers both content and style characteristics of texts, represented by means of character-level features. As an additional contribution, we describe the methodology followed for the construction of an appropriate corpus for the task of paraphrase plagiarism identification, which represents a new valuable resource to the NLP community for future research work in this field.

    This work is the result of the collaboration in the framework of the CONACYT Thematic Networks program (RedTTL Language Technologies Network) and the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action. The first author was supported by CONACYT (Scholarship 258345/224483). The second, third, and sixth authors were partially supported by CONACyT (Project Grants 258588 and 2410). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the Grant ALMAMATER (PrometeoII/2014/030).

    Sánchez-Vega, F.; Villatoro-Tello, E.; Montes-Y-Gómez, M.; Rosso, P.; Stamatatos, E.; Villaseñor-Pineda, L. (2019). Paraphrase Plagiarism Identification with Character-level Features. Pattern Analysis and Applications. 22(2):669-681. https://doi.org/10.1007/s10044-017-0674-z
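
    One simple way to represent a text with character-level features, as the abstract above describes in general terms, is a character n-gram frequency profile compared by cosine similarity. The sketch below illustrates that generic representation only; it is not the authors' scheme, and the function names, the choice of n = 3, and the whitespace normalization are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def char_ngram_profile(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * profile_b[gram] for gram, count in profile_a.items())
    norm_a = sqrt(sum(c * c for c in profile_a.values()))
    norm_b = sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is shorter than n characters
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    original = char_ngram_profile("the cat sat on the mat")
    reworded = char_ngram_profile("the cat sits on a mat")
    print(cosine_similarity(original, reworded))
```

    Character n-grams capture sub-word patterns such as affixes and punctuation habits, which is why they are often used as style markers as well as content markers.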

    A review of plagiarism detection in the Indonesian language

    Plagiarism is the act of copying another person's work, whether writing, ideas, creative ideas, or other output, without citing the source of the work or idea. Such an act is disrespectful, violates codes of ethics, and is opposed by all parties, including scientists and government. It happens because the internet provides unlimited information services. Many studies have addressed the theme of plagiarism. This article reviews how far plagiarism research on Indonesian writing has progressed. Knowing how this research has developed will give further research better continuity.