
    Do Language Models Plagiarize?

    Past literature has shown that language models (LMs) often memorize parts of training instances and reproduce them during natural language generation (NLG). However, it is unclear to what extent LMs "reuse" a training corpus; for instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, we therefore study three types of plagiarism (verbatim, paraphrase, and idea) in GPT-2-generated texts, in comparison to its training data, and further analyze the plagiarism patterns of LMs fine-tuned on the domain-specific corpora that are widely used in practice. Our results suggest that (1) all three types of plagiarism exist widely in LMs beyond memorization, (2) both the size and the decoding method of an LM are strongly associated with the degree of plagiarism it exhibits, and (3) fine-tuned LMs' plagiarism patterns vary with the similarity and homogeneity of their corpora. Given that a majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets in generated texts has ethical implications. These patterns are likely to worsen as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals' personal and sensitive information. These findings overall cast doubt on the practicality of current LMs in mission-critical writing tasks and call for more discussion of the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.
    Comment: Accepted to WWW'2
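    As a concrete illustration of the simplest of the three plagiarism types above, here is a minimal sketch of verbatim-overlap detection between a generated text and a training corpus. The tokenization, n-gram length, and function names are illustrative assumptions, not the paper's pipeline, which also covers paraphrase and idea plagiarism.

    ```python
    # Minimal sketch: flag verbatim reuse by checking whether any long word
    # n-gram from a generated text also appears verbatim in the training corpus.
    # The n-gram length and whitespace tokenization are assumptions.

    def word_ngrams(text, n):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def verbatim_overlap(generated, corpus_docs, n=8):
        """Return the n-grams of `generated` that occur verbatim in the corpus."""
        corpus_grams = set()
        for doc in corpus_docs:
            corpus_grams |= word_ngrams(doc, n)
        return word_ngrams(generated, n) & corpus_grams

    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    gen = "we saw the quick brown fox jumps over the lazy dog near the barn"
    print(verbatim_overlap(gen, corpus, n=8))  # non-empty set signals verbatim reuse
    ```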

    Scene illumination classification based on histogram quartering of CIE-Y component

    Despite rapidly expanding research into various aspects of illumination estimation, only a limited number of studies address illumination classification. The growing demand for color constancy, its wide range of applications, and its strong dependence on illumination estimation make this a challenging research topic. An accurate estimate of the illumination in an image provides a better basis for correction and ultimately leads to better color constancy performance. The main purpose of any illumination estimation algorithm, of whatever type and class, is to produce an accurate numerical estimate of the illumination. In scene illumination estimation, handling a large range of illumination with small variations within it is critical. Algorithms that estimate illumination through extensive calculation are expensive in terms of computing resources, and there are several technical limitations to estimating an accurate illumination value. In addition, the use of light temperature in all previous studies leads to complicated and computationally expensive methods. Classification, on the other hand, is appropriate for applications such as photography, where most images are captured under a small set of scene illuminants. This study aims to develop an image illumination classifier capable of classifying images under different illumination levels with acceptable accuracy. The method is tested on real scene images captured with measured illumination levels. It combines physics-based and data-driven (statistical) methods, categorizing images by statistical features extracted from the image's illumination histogram, and the categorization is validated against the scene's measured illumination data. Applying the improved histogram characterization algorithm (histogram quartering) yields the advantage of high accuracy. A trained neural network, with parameters tuned for this specific application, sorts images into predefined groups. Finally, misclassification error percentage, mean squared error (MSE), regression analysis, and response time are used to evaluate performance and accuracy. The result is a highly accurate and straightforward classification system for illumination. The results of this study strongly demonstrate that light intensity, together with a well-tuned neural network, can serve as the light property on which to build a scene illumination classification system.
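    A minimal sketch of the histogram-quartering idea described above, with assumed details: the per-quarter feature set, the Rec. 709 luminance weights standing in for the CIE-Y computation, and the network size are illustrative, not the study's tuned configuration.

    ```python
    # Sketch: compute the CIE-Y (luminance) channel, split its histogram into
    # four quarters, extract simple statistics per quarter, and train a small
    # neural network on the resulting feature vectors.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def cie_y(rgb):
        # Rec. 709 luminance weights as an approximation of the CIE-Y component.
        return 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]

    def quartered_histogram_features(rgb, bins=256):
        y = cie_y(rgb.astype(np.float64))
        hist, _ = np.histogram(y, bins=bins, range=(0, 255), density=True)
        feats = []
        for quarter in np.split(hist, 4):  # histogram quartering
            feats += [quarter.mean(), quarter.std(), quarter.max(), quarter.sum()]
        return np.array(feats)

    def train_classifier(images, labels):
        # `images`: list of HxWx3 uint8 arrays; `labels`: measured illumination class.
        X = np.stack([quartered_histogram_features(im) for im in images])
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
        return clf.fit(X, labels)
    ```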

    Something borrowed: sequence alignment and the identification of similar passages in large text collections

    The following article describes a simple technique for identifying lexically similar passages in large collections of text using sequence alignment algorithms. Primarily used in bioinformatics to identify similar segments of DNA in genome research, sequence alignment has also been employed in many other domains, from plagiarism detection to image processing. While we have applied this approach to a wide variety of text collections, we focus our discussion here on the identification of similar passages in the famous 18th-century Encyclopédie of Denis Diderot and Jean d'Alembert. Reference works such as encyclopedias and dictionaries are generally expected to "reuse" or "borrow" passages from many sources, and Diderot and d'Alembert's Encyclopédie was no exception. Drawn from an immense variety of source material, both French and non-French, many, if not most, of the borrowings in the Encyclopédie are not sufficiently identified (by our standards of modern citation), or are only partially acknowledged in passing. The systematic identification of recycled passages can thus offer a clear indication of the sources the philosophes were exploiting, as well as a way to explore the intertextual relations that accompanied the work's composition and subsequent reception. In the end, we hope this approach to "Encyclopedic intertextuality" using sequence alignment can broaden the discussion of the relationship of Enlightenment thought to previous intellectual traditions, as well as its reuse in the centuries that followed.
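    As a sketch of the underlying technique, the following is a minimal Smith-Waterman local alignment over word tokens, the kind of algorithm the article borrows from bioinformatics; the scoring parameters and the French example passages are illustrative assumptions, not the authors' implementation.

    ```python
    # Smith-Waterman local alignment over word tokens: a high score indicates
    # a shared (possibly reordered or lightly edited) passage between two texts.
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        """Return the best local alignment score between token lists a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    source = "la liberté de penser est le fondement de toute philosophie".split()
    entry = "le fondement de toute philosophie est la liberté de penser".split()
    print(smith_waterman(source, entry))  # high score flags a candidate borrowing
    ```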

    Phishing website detection using genetic algorithm-based feature selection and parameter hypertuning

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics.
    False webpages are created by cyber attackers who seek to mislead users into revealing sensitive and personal information, from credit card details to passwords. Phishing is a class of cyber attacks that misleads users into clicking on false websites, logging into related accounts, and subsequently having their funds stolen. Such attacks increase annually with the exponential growth of e-commerce, which makes it harder to distinguish harmless from false websites. Conventional methods for detecting phishing websites rely on databases of blacklisted and whitelisted sites and cannot detect new phishing websites. To address this problem, researchers are developing machine learning (ML) and deep learning (DL) based methods. This dissertation proposes a hybrid solution that uses genetic algorithms together with ML algorithms to detect phishing based on a website's URL. For evaluation, conventional ML and DL models are compared across feature sets produced by commonly used feature selection methods, such as mutual information and recursive feature elimination. The final model achieves an accuracy of 95.34% on the test set.
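    A minimal sketch of genetic-algorithm feature selection wrapped around an ML classifier, in the spirit of the hybrid approach above; the population size, rates, classifier, and fitness function are assumptions, not the dissertation's tuned settings.

    ```python
    # GA feature selection: individuals are boolean masks over URL features;
    # fitness is cross-validated accuracy of a classifier on the masked features.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    def fitness(mask, X, y):
        if not mask.any():
            return 0.0
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    def ga_select(X, y, pop_size=20, generations=10, mutation_rate=0.05):
        n = X.shape[1]
        pop = rng.random((pop_size, n)) < 0.5  # random initial bit masks
        for _ in range(generations):
            scores = np.array([fitness(ind, X, y) for ind in pop])
            parents = pop[np.argsort(scores)[-pop_size // 2:]]  # keep fittest half
            cuts = rng.integers(1, n, size=pop_size // 2)
            children = np.array([np.concatenate([parents[i % len(parents)][:c],
                                                 parents[(i + 1) % len(parents)][c:]])
                                 for i, c in enumerate(cuts)])  # one-point crossover
            children ^= rng.random(children.shape) < mutation_rate  # bit-flip mutation
            pop = np.vstack([parents, children])
        scores = np.array([fitness(ind, X, y) for ind in pop])
        return pop[scores.argmax()]  # best feature mask found
    ```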

    Software Plagiarism Detection Using N-grams

    Plagiarism is the act of copying without rightfully crediting the original source. Motivations for plagiarism range from completing academic courses to gaining economic advantage. Plagiarism exists in any domain where people want to take credit for work they did not produce, including literature, art, and software, all of which carry a notion of authorship. In this thesis we conduct a systematic literature review of source code plagiarism detection methods, propose, based on that literature, a new approach that combines similarity detection with authorship identification, introduce our tokenization method for source code, and evaluate the model on real-life data sets. The goal of the model is to flag possible plagiarism in a collection of documents, here a collection of source code files written by various authors. Our data consist of three datasets: (1) documents from the University of Helsinki's first programming course, (2) documents from the University of Helsinki's advanced programming course, and (3) submissions to a source code reuse competition. The statistical methods in this thesis are inspired by the theory of search engines: data mining for detecting similarity between documents, and machine learning for classifying a document with its most likely author in authorship identification. Results show that our similarity detection model can successfully retrieve documents for further plagiarism inspection, but false positives appear quickly even with a high threshold on the minimum allowed similarity between documents. We were unable to use the authorship identification results in our study, as our machine learning model's performance was too low to be used sensibly. This was likely caused by the high similarity between documents, which stems from the restricted tasks and a course setting that teaches a specific programming style over the span of the course.
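    A minimal sketch of the n-gram similarity stage described above; the regex tokenizer only approximates the thesis's own tokenization method, and the n-gram length and threshold are assumptions.

    ```python
    # Jaccard similarity over token n-grams of source code: near-1.0 scores
    # between two submissions flag candidate pairs for manual plagiarism review.
    import re

    def tokens(code):
        # Crude lexer: identifiers/keywords, numbers, and single operator characters.
        return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

    def ngram_set(code, n=4):
        toks = tokens(code)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def jaccard(a, b, n=4):
        """Jaccard similarity of the two documents' token n-gram sets."""
        A, B = ngram_set(a, n), ngram_set(b, n)
        return len(A & B) / len(A | B) if A | B else 0.0

    doc1 = "for (int i = 0; i < n; i++) sum += a[i];"
    doc2 = "for (int j = 0; j < n; j++) total += a[j];"
    print(jaccard(doc1, doc2))  # compare against a chosen similarity threshold
    ```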

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of such a manifold helps describe its essential properties and how they vary in space. However, when the manifold evolves through time, joint spatio-temporal modelling is needed to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernova.
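    A minimal sketch of the propagation idea, under stated assumptions: the abstract does not name the spatial model, so a Gaussian mixture stands in for it here, with each snapshot's fit initialized from the previous one so the model is carried forward in time rather than re-learned from scratch.

    ```python
    # First-order Markovian propagation of a spatial probabilistic model:
    # the fit at time t seeds the fit at time t+1.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def track_manifold(snapshots, n_components=8):
        """snapshots: list of (N_t, 3) particle-position arrays, one per time step."""
        models, prev = [], None
        for positions in snapshots:
            gmm = GaussianMixture(
                n_components=n_components,
                means_init=None if prev is None else prev.means_,
                weights_init=None if prev is None else prev.weights_,
            )
            prev = gmm.fit(positions)
            models.append(prev)
        return models  # one probabilistic model of the evolving cavity per snapshot
    ```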

    Acta Cybernetica: Volume 19, Number 1.


    TOPIC CLASSIFICATION USING HYBRID OF UNSUPERVISED AND SUPERVISED LEARNING

    There has been research around representing words in text as vectors, and many proposed models vary in performance as well as application. Text processing is used for content recommendation, sentiment analysis, plagiarism detection, content creation, and language translation, among others. Specifically, we look at the problem of topic detection in the text content of articles, blogs, and summaries. With the enormous amount of text published every minute on the internet, it is imperative to have good algorithms and approaches to analyze all of this content and classify most of it with high confidence for further use. The project works with unsupervised and supervised machine learning algorithms to tackle the topic detection problem. It targets unsupervised learning algorithms such as word2vec, doc2vec, and LDA for corpus and language dictionary learning, to obtain a trained model that understands the semantics of text. The objective is to combine this unsupervised learning with supervised learning algorithms, such as support vector machines and deep learning methods, to analyze and, ideally, improve topic detection accuracy. The project also performs user interest-based modelling, which is orthogonal to topic modelling; the idea is to keep the model free of predefined categories. The results show that hybrid models are comfortably accurate when classifying text into a topic category, and that user interest modelling can be achieved accurately alongside topic detection. The project determines these results without any meta-information about the input text, relying purely on the corpus of the input text. This makes the framework robust, since it has no dependency on the source, length, or any other meta-information of the text content.
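    One representative pairing of the hybrid approach above, sketched with assumed hyperparameters and a toy corpus: unsupervised doc2vec embeddings feeding a supervised SVM topic classifier.

    ```python
    # Unsupervised stage: learn document vectors from the raw corpus (doc2vec).
    # Supervised stage: train an SVM on those vectors to predict topic labels.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.svm import SVC

    docs = ["stock markets rallied on earnings news",
            "the team won the championship final",
            "central bank raised interest rates",
            "the striker scored twice in the match"]
    labels = ["finance", "sports", "finance", "sports"]

    tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
    d2v = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=50)

    X = [d2v.dv[i] for i in range(len(docs))]
    clf = SVC(kernel="rbf").fit(X, labels)

    # Classify unseen text with no meta-information, purely from its content.
    vec = d2v.infer_vector("quarterly profits beat analyst expectations".split())
    print(clf.predict([vec]))
    ```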