30 research outputs found

    A Machine Learning Approach for Plagiarism Detection

    Plagiarism detection is gaining importance because of requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis develops two novel approaches that address both of these methods. First, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in stylometric features (most common words) in order to characterise document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. SVM-based algorithms were used to perform the classification, predicting which author wrote a particular book under test. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book under test to the right author in most cases. Second, the intrinsic detection method relies on the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship. The model’s generation procedure focuses on just one author in an attempt to summarise aspects of that author’s style in a definitive and clear-cut manner. The thesis also proposes a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) training dataset, but divides that dataset into training and test datasets in a novel manner. Both approaches were evaluated using the well-known leave-one-out cross-validation method. Results indicated that by integrating deep analysis (LSA) and stylometric analysis, hidden changes can be identified whether or not a reference collection exists.
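The intrinsic feature set described above (frequencies of the most common words, their relative frequencies, and the deviation of those frequencies across an author's books) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the tokenisation rule, and the use of a population standard deviation as the "deviation" are all assumptions made for illustration, and the LSA and SVM stages are omitted.

```python
from collections import Counter
import re
import statistics


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(books, n):
    """Return the n most frequent words across all books by one author."""
    counts = Counter()
    for text in books:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word within a single book."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]


def style_profile(books, n=3):
    """Mean and deviation of each MCW's relative frequency across books.

    Assumption: "deviation" is taken as the population standard deviation.
    """
    vocabulary = most_common_words(books, n)
    rows = [relative_frequencies(book, vocabulary) for book in books]
    columns = list(zip(*rows))  # one column per vocabulary word
    means = [statistics.mean(column) for column in columns]
    deviations = [statistics.pstdev(column) for column in columns]
    return vocabulary, means, deviations


# Toy usage with two very short "books" by the same hypothetical author.
books = ["the cat and the dog", "the bird and the fish and the frog"]
vocabulary, means, deviations = style_profile(books, n=2)
```

A word whose relative frequency barely moves between books (a small deviation) is a more stable marker of the author's habits than one that fluctuates widely, which is the intuition behind using these deviations as features.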

    The Stylometric Processing of Sensory Open Source Data

    The end goal of this research project is the lone wolf terrorist. The project takes an exploratory approach to the self-radicalisation problem by creating a stylistic fingerprint of a person's personality, or self, from subtle characteristics hidden in that person's writing style. It separates the identity of one person from another based on their writing style. It also separates the writings of suicide attackers from those of 'normal' bloggers by critical slowing down, a dynamical property used to develop early warning signs of tipping points. It identifies changes in a person's moods, or shifts from one state to another, that might indicate a tipping point for self-radicalisation. Research into authorship identity using personality is a relatively new area in the field of neurolinguistics. There are very few methods that model how an individual's cognitive functions present themselves in writing. Here, we develop a novel algorithm, RPAS, which draws on cognitive functions such as aging, sensory processing, abstract or concrete thinking through referential activity, emotional experiences, and a person's internal gender for identity. We use well-known techniques such as Principal Component Analysis, Linear Discriminant Analysis, and the Vector Space Method to cluster multiple anonymous-authored works. We also use a new approach, seriation with noise, to separate subtle features in individuals. We conduct time series analysis using modified variants of lag-1 autocorrelation and the coefficient of skewness, two statistical metrics that change near a tipping point, to track serious life events in an individual through cognitive linguistic markers. In our journey of discovery, we uncover secrets about the Elizabethan playwrights hidden for over 400 years. We uncover markers for depression and anxiety in modern-day writers and, using sensory processing, identify linguistic cues for Alzheimer's disease much earlier than other studies. Applying these techniques to the lone wolf, we can distinguish the writing style used before an attack from other writing.
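The two early-warning metrics named above can be sketched in their standard textbook forms. The abstract says the project uses modified variants, which are not specified here, so the code below is only an assumed baseline: lag-1 autocorrelation and the moment coefficient of skewness, each evaluated over a sliding window. The function names and the window mechanism are illustrative.

```python
def lag1_autocorrelation(series):
    """Standard lag-1 autocorrelation of a numeric sequence."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        return 0.0  # constant series: treat as uncorrelated
    numerator = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    return numerator / denominator


def skewness(series):
    """Moment coefficient of skewness (third moment over variance^1.5)."""
    n = len(series)
    mean = sum(series) / n
    second_moment = sum((x - mean) ** 2 for x in series) / n
    if second_moment == 0:
        return 0.0  # constant series has no asymmetry
    third_moment = sum((x - mean) ** 3 for x in series) / n
    return third_moment / second_moment ** 1.5


def rolling(metric, series, window):
    """Apply a metric to every contiguous window of the given length."""
    return [
        metric(series[i:i + window])
        for i in range(len(series) - window + 1)
    ]


# Toy usage: a hypothetical per-document linguistic marker over time.
marker = [1, 2, 3, 4, 5]
acf_whole = lag1_autocorrelation(marker)
acf_windows = rolling(lag1_autocorrelation, marker, 3)
skew_example = skewness([0, 0, 0, 4])
```

In the critical-slowing-down literature, a sustained rise in the windowed autocorrelation or a drift in skewness is read as a possible sign that a system is approaching a tipping point.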

    Dating Victorians: an experimental approach to stylochronometry

    A thesis submitted for the degree of Doctor of Philosophy of the University of Luton. The writing style of a number of authors writing in English was empirically investigated for the purpose of detecting stylistic patterns in relation to advancing age. The aim was to identify which types of stylistic markers (lexical, syntactical, phonemic, entropic, character-based, and content-based) would best discriminate between early, middle, and late works of the selected authors, and which classification or prediction algorithm is best suited to this task. Two pilot studies were initially conducted. The first concentrated on Christina Georgina Rossetti and Edgar Allan Poe, from whom personal letters and poetry were selected as the genres of study, along with a limited selection of variables. Results suggested that authors and genre vary inconsistently. The second pilot study was based on Shakespeare's plays, using a wider selection of variables to assess their discriminating power in relation to a past study. The selected variables were observed to have satisfactory predictive power and were hence judged suitable for the task. Subsequently, four experiments were conducted using the variables tested in the second pilot study and personal correspondence and poetry from two additional authors, Edna St Vincent Millay and William Butler Yeats. Stepwise multiple linear regression and regression trees were selected for the first two experiments, on prediction, and ordinal logistic regression and artificial neural networks for the two classification experiments. The first experiment revealed inconsistency in accuracy of prediction and in the total number of variables in the final models, both affected by differences in authorship and genre. The second experiment revealed inconsistencies for the same factors in terms of accuracy only. The third experiment showed the total number of variables in the model and the error in the final model to be affected in various degrees by authorship, genre, variable type, and the order in which the variables had been calculated. The last experiment had all measurements affected by the four factors. Examination of whether differences in method within each task play an important part revealed significant influences of method, authorship, and genre for the prediction problems, whereas all factors, including method and various interactions, dominated in the classification problems. Given the current data and methods used, as well as the results obtained, generalizable conclusions for the wider author population have been avoided.

    The anonymous 1821 translation of Goethe's Faust: a cluster analytic approach

    PhD Thesis. This study tests the hypothesis proposed by Frederick Burwick and James McKusick in 2007 that Samuel Taylor Coleridge was the author of the anonymous translation of Goethe's Faust published by Thomas Boosey in 1821. The approach to hypothesis testing is stylometric. Specifically, function word usage is selected as the stylometric criterion, and 80 function words are used to define an 80-dimensional function word frequency profile vector for each text in the corpus of Coleridge's literary works and for a selection of works by a range of contemporary English authors. Each profile vector is a point in 80-dimensional vector space, and cluster analytic methods are used to determine the distribution of profile vectors in the space. If the hypothesis being tested is valid, then the profile for the 1821 translation should be closer in the space to works known to be by Coleridge than to works by the other authors. The cluster analytic results show, however, that this is not the case, and the conclusion is that the Burwick and McKusick hypothesis is falsified relative to the stylometric criterion and analytic methodology used.
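The idea of a function word frequency profile as a point in vector space can be shown with a small sketch. Everything specific here is an assumption for illustration: the five-word list stands in for the study's 80 function words, the toy corpus is invented, and comparing the disputed text to per-author centroids by Euclidean distance is a simple stand-in for the cluster analytic methods the thesis actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical short list; the study itself uses 80 function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]


def profile(text, words=FUNCTION_WORDS):
    """Relative frequency of each function word: one point in vector space."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in words]


def euclidean(u, v):
    """Euclidean distance between two profile vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    """Component-wise mean of a list of profile vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_author(disputed_text, corpus):
    """Return the author whose centroid is closest to the disputed text."""
    target = profile(disputed_text)
    distances = {
        author: euclidean(target, centroid([profile(t) for t in texts]))
        for author, texts in corpus.items()
    }
    return min(distances, key=distances.get), distances


# Toy usage with invented texts for two hypothetical authors.
corpus = {
    "A": ["the the the of cat", "the the of of dog"],
    "B": ["and to in and to", "to in and bird to"],
}
best, distances = nearest_author("the of the the sun", corpus)
```

Under the hypothesis being tested, the disputed text's profile would sit closer to the candidate author's texts than to anyone else's; the thesis reports that this did not hold for Coleridge.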

    Stylistic structures: a computational approach to text classification

    The problem of authorship attribution has received attention both in the academic world (e.g. did Shakespeare or Marlowe write Edward III?) and outside (e.g. is this confession really the words of the accused or was it made up by someone else?). Previous studies by statisticians and literary scholars have sought "verbal habits" that characterize particular authors consistently. By and large, this has meant looking for distinctive rates of usage of specific marker words -- as in the classic study by Mosteller and Wallace of the Federalist Papers. The present study is based on the premiss that authorship attribution is just one type of text classification and that advances in this area can be made by applying and adapting techniques from the field of machine learning. Five different trainable text-classification systems are described, which differ from current stylometric practice in a number of ways, in particular by using a wider variety of marker patterns than customary and by seeking such markers automatically, without being told what to look for. A comparison of the strengths and weaknesses of these systems, when tested on a representative range of text-classification problems, confirms the importance of paying more attention than usual to alternative methods of representing distinctive differences between types of text. The thesis concludes with suggestions on how to make further progress towards the goal of a fully automatic, trainable text-classification system

    A model for stylometric analysis of e-mails for recipient-based personalised writing

    Final degree project, Double Degree in Computer Engineering and Mathematics, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2019/2020. Nowadays, more than 306 billion e-mails are sent daily, in both professional and personal settings. However, even though the channel is the same, our style varies depending on the recipient of the message. Stylometry in e-mails is a recent field of study that tries to parametrise writing style through metrics. Most research in this field focuses on spam detection or on identifying and authenticating message authors. This work proposes a new approach: studying style as a function of the recipient of the e-mail. Progress in this direction would make it possible to personalise e-mail writing systems so that they can generate different messages depending on the recipient. In this work we develop a tool for the stylometric analysis of e-mails, for the Gmail service, which extracts and calculates different metrics from a user's messages. This style analyser has four modules (extraction, preprocessing, typographic correction and style measuring) that cover the phases needed to obtain the style descriptors of each message. Once the metrics have been evaluated on each message, we analyse the results using popular machine learning techniques such as K-Means, Principal Component Analysis and Decision Trees. The objective is to draw conclusions that allow us to propose a model of stylometric analysis of e-mails for personalised writing based on the recipient. This data analysis identifies the eight metrics that best distinguish style according to the receiver of the information. Finally, we present the design of a system that uses these eight metrics to write different e-mails according to the recipient. This model can be useful for personalising natural language generation systems depending on the recipient, or on the audience to which the text is addressed.

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    AIUCD2017 - Book of Abstracts

    This volume collects the abstracts of the papers presented at the AIUCD 2017 conference. AIUCD 2017 took place from 26 to 28 January 2017 in Rome and was organised by Digilab, Università Sapienza, in cooperation with the ITN DiXiT network (Digital Scholarly Editions Initial Training Network). AIUCD 2017 also hosted the third edition of the EADH Day, held on 25 January 2017. The abstracts published in this volume were approved by expert reviewers in the field through an anonymous review process under the responsibility of the AIUCD 2017 International Programme Committee.