
    Authorship Attribution Using Principal Component Analysis and Nearest Neighbor Rule for Neural Networks

    Feature extraction is a common problem in statistical pattern recognition. It refers to a process whereby a data space is transformed into a feature space that, in theory, has exactly the same dimension as the original data space. However, the transformation is designed in such a way that the data set may be represented by a reduced number of "effective" features and yet retain most of the intrinsic information content of the data; in other words, the data set undergoes a dimensionality reduction. Principal component analysis is one such process. In this paper, data collected by counting selected syntactic characteristics in around a thousand paragraphs of each of the sample books underwent a principal component analysis. For comparison, the original data were also processed. The authors of the texts were identified with higher success by the competitive neural networks that used the principal components. The process was repeated on another group of authors, and similar results were obtained.
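    The abstract pairs PCA-based dimensionality reduction with a nearest-neighbor decision rule. Below is a minimal sketch of that pipeline, with a plain 1-nearest-neighbor classifier standing in for the paper's competitive neural network; the feature counts and data are invented stand-ins, not the study's.

```python
# Sketch: PCA for dimensionality reduction, then 1-NN classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy stand-in for the paper's data: rows are paragraphs, columns are
# counts of selected syntactic characteristics (e.g. commas, conjunctions).
n_features = 20
X_author_a = rng.poisson(lam=3.0, size=(500, n_features))
X_author_b = rng.poisson(lam=3.5, size=(500, n_features))
X = np.vstack([X_author_a, X_author_b]).astype(float)
y = np.array([0] * 500 + [1] * 500)

# Keep a handful of principal components, then classify each paragraph
# by its nearest neighbor in the reduced PCA space.
model = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```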

    Effective authorship attribution in large document collections

    Techniques that can effectively identify authors of texts are of great importance in scenarios such as detecting plagiarism and identifying a source of information. A range of attribution approaches has been proposed in recent years, but none of these is particularly satisfactory; some are ad hoc, and most have defects in terms of scalability, effectiveness, and computational cost. Good test collections are critical for the evaluation of authorship attribution (AA) techniques. However, there are no standard benchmarks available in this area; it is almost always the case that researchers build their own test collections. Furthermore, collections that have been explored in AA are usually small, so it is unclear whether the existing approaches are reliable or scalable. We develop several AA collections that are substantially larger than those in the literature; machine learning methods are used to establish the value of using such corpora in AA. The results, also used as baseline results in this thesis, show that the developed text collections can be used as standard benchmarks and are able to clearly distinguish between different approaches. One of the major contributions is that we propose the use of the Kullback-Leibler divergence, a measure of how different two distributions are, to identify authors based on elements of writing style. The results show that our approach is at least as effective as, if not always better than, the best existing attribution method, support vector machines, for two-class AA, and is superior for multi-class AA. Moreover, our proposed method has much lower computational cost and is cheaper to train. Style markers are the key elements of style analysis. We explore several approaches to tokenising documents to extract style markers, examining which marker type works best. We also propose three systems that boost AA performance by combining evidence from various marker types, motivated by the observation that no single type of marker can satisfy all AA scenarios. To address the scalability of AA, we propose the novel task of authorship search (AS), inspired by document search and intended for large document collections. Our results show that AS is reasonably effective at finding documents by a particular author, even within a collection of half a million documents. Beyond search, we also propose an AS-based method to identify authorship. Our method is substantially more scalable than any method published in prior AA research, in terms of both collection size and the number of candidate authors; the discrimination scales up to several hundred authors.
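    A hedged sketch of the Kullback-Leibler idea the abstract outlines: build a smoothed token (style-marker) distribution per candidate author, then attribute a query document to the author whose distribution diverges least from the document's. The tokenisation, smoothing scheme, and toy texts here are illustrative assumptions, not the thesis's exact method.

```python
# Sketch: KL-divergence-based authorship attribution.
import math
from collections import Counter

def distribution(tokens, vocab, alpha=0.5):
    """Laplace-smoothed probability distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q); smoothing guarantees q[w] > 0 for every word."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

author_samples = {
    "author_a": "the ship sailed and the sea was calm and grey".split(),
    "author_b": "a storm rose upon a dark sea that night".split(),
}
query = "the sea was calm".split()

vocab = set(query)
for toks in author_samples.values():
    vocab |= set(toks)

query_dist = distribution(query, vocab)
scores = {
    name: kl_divergence(query_dist, distribution(toks, vocab))
    for name, toks in author_samples.items()
}
# Attribute to the author with minimum divergence from the query.
print(min(scores, key=scores.get), scores)
```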

    Dating Victorians: an experimental approach to stylochronometry

    A thesis submitted for the degree of Doctor of Philosophy of the University of Luton. The writing style of a number of authors writing in English was empirically investigated for the purpose of detecting stylistic patterns in relation to advancing age. The aim was to identify which stylistic markers among lexical, syntactic, phonemic, entropic, character-based, and content-based ones would best discriminate between early, middle, and late works of the selected authors, and which classification or prediction algorithm was most suited to this task. Two pilot studies were initially conducted. The first concentrated on Christina Georgina Rossetti and Edgar Allan Poe, from whom personal letters and poetry were selected as the genres of study, along with a limited selection of variables. Results suggested that authors and genres vary inconsistently. The second pilot study was based on Shakespeare's plays and used a wider selection of variables to assess their discriminating power in relation to a past study. The selected variables were observed to have satisfactory predictive power and were therefore judged suitable for the task. Subsequently, four experiments were conducted using the variables tested in the second pilot study and personal correspondence and poetry from two additional authors, Edna St Vincent Millay and William Butler Yeats. Stepwise multiple linear regression and regression trees were selected for the first two prediction experiments, and ordinal logistic regression and artificial neural networks for the two classification experiments. The first experiment revealed that accuracy of prediction and the total number of variables in the final models were affected inconsistently by differences in authorship and genre. The second experiment revealed inconsistencies for the same factors in terms of accuracy only. The third experiment showed that the total number of variables in the model and the error of the final model were affected in varying degrees by authorship, genre, variable type, and the order in which the variables had been calculated. The last experiment had all measurements affected by all four factors. Examination of whether differences in method within each task play an important part revealed significant influences of method, authorship, and genre for the prediction problems, whereas all factors, including method and various interactions, dominated in the classification problems. Given the current data and methods used, as well as the results obtained, generalizable conclusions for the wider author population have been avoided.
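    One of the thesis's prediction set-ups, a regression tree mapping stylistic-marker values to a date of composition, can be sketched as below. The markers, the linear age drift, and the synthetic data are invented stand-ins for illustration only.

```python
# Sketch: a regression tree predicting year of composition from
# stylistic-marker measurements.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Rows: texts; columns: measurements of stylistic markers
# (lexical, syntactic, character-based, etc.).
n_texts, n_markers = 200, 12
X = rng.normal(size=(n_texts, n_markers))
# Pretend two markers drift roughly linearly with the author's age.
years = 1840 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=3, size=n_texts)

tree = DecisionTreeRegressor(max_depth=4).fit(X, years)
print("predicted year for a new text:", tree.predict(X[:1])[0])
```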

    Making Machines Learn. Applications of Cultural Analytics to the Humanities

    The digitization of several million books by Google in 2011 meant the popularization of a new kind of humanities research powered by the treatment of cultural objects as data. Culturomics, as it is called, was born, and other initiatives resonated with such a methodological approach, as is the case with the recently formed Digital Humanities or Cultural Analytics. Intrinsically, these new quantitative approaches to culture all borrow techniques and methods developed under the wing of the exact sciences, such as computer science, machine learning, and statistics. There are numerous examples of studies that take advantage of the possibilities that treating objects as data has to offer for the understanding of the human. This new data science, now applied to current trends in culture, can also be replicated to study the more traditional humanities. Guided by proper intellectual inquiry, an adequate use of technology may bring answers to questions intractable by other means, or add evidence to long-held assumptions based on a canon built from few examples. This dissertation argues in favor of such an approach. Three different case studies are considered. First, in the more general sense of big and smart data, we collected and analyzed more than 120,000 pictures of paintings from all periods of art history, to gain clear insight into how the beauty of depicted faces, within the framework of neuroscience and evolutionary theory, has changed over time. A second study covers the nuances of the modes of emotion employed by the Spanish Golden Age playwright CalderĂłn de la Barca to empathize with his audience. By means of sentiment analysis, a technique strongly supported by machine learning, we shed some light on the different fictional characters, how they interact, and how they convey messages otherwise invisible to the public. The last case is a study of non-traditional authorship attribution techniques applied to the forefather of the modern novel, the Lazarillo de Tormes. In the end, we conclude that the successful application of cultural analytics and computer science techniques to traditional humanistic endeavours has been enriching and validating.
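    As a minimal illustration of the sentiment-analysis step described for the CalderĂłn study, here is a lexicon-based scorer standing in for the machine-learning technique the dissertation actually uses; the lexicon, characters, and lines are all invented examples.

```python
# Sketch: lexicon-based sentiment scoring of character speeches.
POLARITY = {"love": 1.0, "joy": 1.0, "honour": 0.5,
            "grief": -1.0, "betrayal": -1.0, "fear": -0.5}

def sentiment(line):
    """Mean polarity of the lexicon words found in one line of dialogue."""
    hits = [POLARITY[w] for w in line.lower().split() if w in POLARITY]
    return sum(hits) / len(hits) if hits else 0.0

speeches = {
    "Segismundo": "grief and fear shadow this dream",
    "Rosaura": "honour and love guide my road",
}
for character, line in speeches.items():
    print(character, round(sentiment(line), 2))
```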

    A multi-disciplinary framework for cyber attribution

    Effective cyber security is critical to the prosperity of any nation in the modern world. We have become dependent upon this interconnected network of systems for a number of critical functions within society. As our reliance upon this technology has increased, so have the prospective gains for malicious actors who would abuse these systems for their own personal benefit, at the cost of legitimate users. The result has been an explosion of cyber attacks, or cyber-enabled crimes. The threat from hackers, organised criminals, and even nation states is ever increasing. One of the critical enablers of our cyber security is cyber attribution, the ability to tell who is acting against our systems. A purely technical approach to cyber attribution has been found to be ineffective in the majority of cases, taking too narrow an approach to the attribution problem. A purely technical approach will provide Indicators Of Compromise (IOC), which are suitable for the immediate recovery and clean-up of a cyber event. It fails, however, to ask the deeper question of the origin of the attack, which can be derived from a wider set of analyses and additional sources of data. Unfortunately, due to the wide range of data types and the highly specialist skills required to perform deep-level analysis, there is currently no common framework for analysts to work together towards resolving the attribution problem. This is further exacerbated by a communication barrier between the highly specialised fields and the absence of obviously compatible data types. The aim of the project is to develop a common framework upon which experts from a number of disciplines can add to the overall attribution picture. These experts add their input in the form of a library. First, a process was developed to enable the creation of compatible libraries in different specialist fields. A series of libraries can then be used by an analyst to create an overarching attribution picture. The framework highlights any intelligence gaps, and an analyst can additionally use the list of libraries to suggest a tool or method to fill such a gap. By the end of the project a working framework had been developed, with a number of libraries from a wide range of technical attribution disciplines. These libraries were used to feed real-time intelligence to both technical and non-technical analysts, who were then able to use this information to perform in-depth attribution analysis. The pictorial format of the framework was found to assist in breaking down the communication barrier between disciplines and was suitable as an intelligence product in its own right, providing a useful visual aid for briefings. The simplicity of the library-based system meant that the process was easy to learn, with only a short introduction to the framework required.
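    A speculative sketch of the library-based idea the abstract describes: each specialist discipline contributes a "library" of findings, and the framework merges them into an attribution picture while flagging intelligence gaps. The class names, facets, and fields below are illustrative assumptions, not the project's actual design.

```python
# Sketch: merging per-discipline attribution libraries and listing gaps.
from dataclasses import dataclass, field

@dataclass
class Library:
    discipline: str                               # e.g. "malware analysis"
    findings: dict = field(default_factory=dict)  # facet -> evidence

# Hypothetical facets an overall attribution picture should cover.
REQUIRED_FACETS = ["infrastructure", "tooling", "language", "motive"]

def attribution_picture(libraries):
    """Merge per-discipline findings and list the remaining gaps."""
    picture = {}
    for lib in libraries:
        for facet, evidence in lib.findings.items():
            picture.setdefault(facet, []).append((lib.discipline, evidence))
    gaps = [f for f in REQUIRED_FACETS if f not in picture]
    return picture, gaps

libs = [
    Library("network forensics", {"infrastructure": "C2 hosted on shared VPS"}),
    Library("malware analysis", {"tooling": "custom packer, reused key"}),
]
picture, gaps = attribution_picture(libs)
print("picture:", picture)
print("intelligence gaps:", gaps)
```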

    The anonymous 1821 translation of Goethe's Faust: a cluster analytic approach

    PhD Thesis. This study tests the hypothesis proposed by Frederick Burwick and James McKusick in 2007 that Samuel Taylor Coleridge was the author of the anonymous translation of Goethe's Faust published by Thomas Boosey in 1821. The approach to hypothesis testing is stylometric. Specifically, function word usage is selected as the stylometric criterion, and 80 function words are used to define an 80-dimensional function word frequency profile vector for each text in the corpus of Coleridge's literary works and for a selection of works by a range of contemporary English authors. Each profile vector is a point in 80-dimensional vector space, and cluster analytic methods are used to determine the distribution of profile vectors in the space. If the hypothesis being tested is valid, then the profile for the 1821 translation should be closer in the space to works known to be by Coleridge than to works by the other authors. The cluster analytic results show, however, that this is not the case, and the conclusion is that the Burwick and McKusick hypothesis is falsified relative to the stylometric criterion and analytic methodology used.
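    The method can be sketched in a few lines: represent each text as a function-word frequency profile and cluster the profiles to see which known works the disputed text sits closest to. The word list below is a toy stand-in for the study's 80-word set, and the texts are invented snippets, not the actual corpus.

```python
# Sketch: function-word frequency profiles + hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage

FUNCTION_WORDS = ["the", "and", "of", "to", "in", "a", "that", "it"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    n = len(tokens)
    return np.array([tokens.count(w) / n for w in FUNCTION_WORDS])

texts = {
    "coleridge_1": "the mariner and the albatross in the cold grey sea",
    "coleridge_2": "it is an ancient tale of the sea and of a curse",
    "disputed_1821": "a pact that binds the soul to the spirit of denial",
}
X = np.vstack([profile(t) for t in texts.values()])

# Agglomerative clustering over the profile vectors; in the thesis the
# distances are computed in the full 80-dimensional space.
Z = linkage(X, method="average")
print(list(texts))
print(Z)  # the merge order shows which profiles join first
```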

    Past, present and future of historical information science

    This report evaluates the impact of two decades of research within the framework of history and computing, and sets out a research paradigm and research infrastructure for a future historical information science. A great deal of historical information research has been done in the past; much of it, however, has been done outside the field of history and computing, and not within a community like the Association for History and Computing (AHC). The reason is that the AHC never made a clear statement about which audience it addresses: historians with an interest in computing, or historical information scientists. As a result, neither party has been accommodated, and communication with both 'traditional' history and information science has not been established. A proper research program, based on new developments in information science, is proposed, along with an unambiguous scientific research infrastructure. (author's abstract)

    Essays in political text: new actors, new data, new challenges

    The essays in this thesis explore diverse manifestations and different aspects of political text. The two main contributions on the methodological side are bringing forward novel data on political actors who were overlooked by the existing literature and applying new approaches in text analysis to address substantive questions about them. On the theoretical side, this thesis contributes to the literatures on lobbying, government transparency, post-conflict studies, and gender in politics. In the first paper, on interest groups in the UK, I argue that, contrary to much of the theoretical and empirical literature, mechanisms of attaining access to government in pluralist systems critically depend on the presence of limits on campaign spending. When such limits exist, political candidates invest few resources in fund-raising and, thus, most organizations make few, if any, political donations. I collect and analyse transparency data on government department meetings and show that economic importance is one of the mechanisms that can explain variation in the level of access attained by different groups. Furthermore, I show that Brexit had a diminishing effect on this relationship between economic importance and the level of access. I also study the reported purpose of meetings and, using dynamic topic models, show temporary shifts in the policy agenda during this period. The second paper argues that civil society in post-conflict settings is capable of high-quality deliberation and that, while differing in their focus, both male and female participants can deliver arguments pertaining to the interests of broader societal groups. Using the transcripts of civil society public consultation meetings across the former Yugoslavia, I show that the lack of gender-sensitive transitional justice instruments could stem not from a lack of women's physical or verbal participation, but from the dynamic of speech enclaves and a topical focus, differing between genders, on different aspects of the transitional justice process. Finally, the third paper maps the challenges that lie ahead with the proliferation of research that relies on multiple datasets. In a simulation study I show that, when the linking information is limited to text, noise can potentially occur at different levels and is often hard to anticipate in practice; the choice of record linkage method thus requires balancing these different scenarios. Taken together, the papers in this thesis advance the field of "text as data" and contribute to our understanding of multiple political phenomena.
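    A toy sketch of the record-linkage problem the third paper simulates: linking two datasets when the only linking information is a noisy text field. The department names, similarity measure, and threshold below are illustrative assumptions, not the paper's simulation design.

```python
# Sketch: fuzzy record linkage on text fields only.
import difflib

left = ["Department for Business, Energy & Industrial Strategy",
        "HM Treasury"]
right = ["Dept. for Business Energy and Industrial Strategy",
         "H.M. Treasury", "Home Office"]

def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate above the similarity threshold."""
    scored = [(difflib.SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

for name in left:
    print(name, "->", best_match(name, right))
```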