25 research outputs found

    Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.

    We describe the technical details of our participation in PAN 2012's "traditional" authorship attribution tasks. The main originality of our approach lies in the use of a large quantity of varied features to represent textual data, processed by a maximum entropy machine learning tool. Most of these features make intensive use of natural language processing annotation techniques as well as generic language resources such as lexicons and other linguistic databases. Some of the features were even designed specifically for the target data type (contemporary fiction). Our belief is that richer features, which integrate external knowledge about language, have an advantage over knowledge-poorer ones (such as word and character n-gram frequencies) when training data is scarce (both in raw volume and in the number of training items per target author). Although overall results were average (66% accuracy over the main tasks for the best run), we focus in this paper on the differences between feature sets. While the "rich" linguistic features proved better than character trigrams and word frequencies, the most effective features vary widely from task to task. For the intrusive-paragraphs tasks, we obtained better results (73% and 93%) while still using the maximum entropy engine as an unsupervised clustering tool.
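The pipeline this abstract describes, document-level linguistic features fed to a maximum entropy learner, can be sketched in miniature. The feature set and toy data below are illustrative stand-ins (the paper's actual features rely on full NLP annotation and large resources); the learner is a plain binary logistic regression, which is the maximum entropy model for two classes:

```python
import math
from collections import Counter

# A tiny stand-in for the paper's rich feature set: function-word rates,
# mean word length, and type/token ratio. (The actual system used many
# more features derived from full NLP annotation.)
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

def rich_features(text):
    words = text.lower().split()
    n = max(len(words), 1)
    counts = Counter(words)
    feats = [counts[w] / n for w in FUNCTION_WORDS]
    feats.append(sum(len(w) for w in words) / n)  # mean word length
    feats.append(len(counts) / n)                 # type/token ratio
    return feats

def train_maxent(X, y, epochs=500, lr=0.5):
    """Binary maximum-entropy model (logistic regression), trained by
    gradient ascent on the conditional log-likelihood."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))
            g = t - p  # gradient of the log-likelihood w.r.t. z
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
            b += lr * g
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Even on toy data the point of the abstract is visible: with only a handful of training texts per author, a few knowledge-informed features (function-word rates, word length) can already separate stylistically distinct authors.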

    Monolingual Plagiarism Detection and Paraphrase Type Identification


    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis becomes difficult, time-consuming, and, in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote it. The specimen files were also analyzed for authorship using static analysis methods, to compare their prediction accuracy with that of this new, dynamic-analysis-based method. Experiments indicate that the new method can provide better accuracy of author attribution for files of unknown provenance, especially when the specimen file has been obfuscated.
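As a minimal illustration of behaviour-based attribution, the sketch below builds bag-of-n-gram profiles from execution traces (lists of call names, hypothetical here) and attributes a specimen to the nearest author profile by cosine similarity. The paper's actual method trains a supervised stylometric classifier on instrumented trace data; this simple nearest-profile scheme only approximates that idea:

```python
from collections import Counter
from math import sqrt

def trace_ngrams(trace, n=2):
    """Bag of call n-grams from an execution trace (a list of API or
    syscall names), the kind of behavioural feature dynamic analysis
    collects regardless of how the binary was obfuscated."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(specimen_trace, author_profiles):
    """Nearest-profile attribution: compare the specimen's n-gram
    profile against per-author profiles built from known samples."""
    spec = trace_ngrams(specimen_trace)
    return max(author_profiles, key=lambda a: cosine(spec, author_profiles[a]))
```

The profitable property, per the abstract, is that these features come from observed runtime behaviour rather than from static file contents that an obfuscator can rewrite.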

    Geographic information extraction from texts

    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although substantial progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop will therefore provide a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.

    Psychotropes: Models of Authorship, Psychopathology, and Molecular Politics in Aldous Huxley and Philip K. Dick

    Among the so-called “anti-psychiatrists” of the 1960s and ’70s, it was Félix Guattari who first identified that psychiatry had undergone a “molecular revolution.” It was in fact in a book titled Molecular Revolution, published in 1984, that Guattari proposed that psychotherapy had become, in the decades following the Second World War, far less personal and increasingly alienating. The newly “molecular” practices of psychiatry, Guattari mourned, had served only to fundamentally distance both patients and practitioners from their own minds; they had largely restricted our access, he suggested, to human subjectivity and consciousness. This thesis resumes Guattari’s work on the “molecular” model of the subject. Extending Guattari’s various “schizoanalytic metamodels” of human consciousness and ontology, it rigorously meditates on a simple question: Should we now accept the likely finding that there is no neat, singular, reductive, utilitarian, or unifying “model” for thinking about the human subject, and more specifically the human “author”? Part 1 of this thesis carefully examines a range of psychoanalytic, psychiatric, philosophical, and biomedical models of the human. It studies and reformulates each of them in turn and, all the while, returns to a fundamental position: that no single model, nor combination of them, will suffice. What part 1 seeks to demonstrate, then, is that envisioning these models as different attempts to “know” the human is fruitless—a futile game. Instead, these models should be understood in much the same way as literary critics treat literary commonplaces or topoi; they are akin, I argue, to what Deleuze and Guattari called “images of thought.” In my terminology, they are “psychotropes”: images with their own particular symbolic and mythical functions. Having thus developed a range of theoretical footholds in part 1, part 2 of the thesis—beginning in chapter 4—will put into practice the work of this first part.
It will do so by examining various representations of authorship by two authors in particular: Aldous Huxley and Philip K. Dick. This part will thus demonstrate how these author figures function as “psychoactive scriveners”: they are fictionalising philosophers who both produce and quarrel with an array of paradigmatic psychotropes, disputing those of others and inventing their own to substitute for them. More than this, however, the second part offers a range of detailed and original readings of these authors’ psychobiographies; it argues that even individual authors such as Huxley and Dick can be seen as “psychotropic.” It offers, that is, a series of broad-ranging and speculative explanations for the ideas and themes that appear in their works—explanations rooted in the theoretical work of the first part. Finally, this thesis concludes by reaffirming the importance of these authors’ narcoliteratures—both for present-day and future literary studies, and beyond. For while Huxley and Dick allow us to countenance afresh the range of failures in the history and philosophy of science, they also promise to instruct us—and instruct science—about the ways in which we might move beyond our received mimetic models of the human.

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration, and the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering, in the form of a theoretical model of Brown clustering utility, in order to assist hyper-parameter tuning. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
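One reason the choice of cluster count matters downstream is the standard way Brown clusters are consumed: each word's bit-string path in the merge hierarchy is truncated at several prefix lengths, giving features at multiple granularities (short prefixes are coarse clusters, long prefixes fine ones). A minimal sketch, in which the token-to-path map and the prefix lengths are illustrative assumptions rather than anything prescribed by the paper:

```python
def cluster_prefix_features(token, paths, prefix_lengths=(4, 6, 10)):
    """Turn a word's Brown-cluster bit-string path into prefix features.

    Short prefixes yield coarse clusters shared by many words; longer
    prefixes yield fine-grained ones. Unknown tokens get no features.
    """
    path = paths.get(token)
    if path is None:
        return []
    return ["brown:%d:%s" % (p, path[:p]) for p in prefix_lengths]
```

Because the number of classes chosen at clustering time shapes how informative these prefixes are, the sub-optimality of default settings reported in the abstract propagates directly into any sequence labeller built on such features.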

    Exploring Written Artefacts

    This collection, presented to Michael Friedrich in honour of his academic career at the Centre for the Study of Manuscript Cultures, traces key concepts that scholars associated with the Centre have developed and refined for the systematic study of manuscript cultures. At the same time, the contributions showcase the possibilities of expanding the traditional subject of ‘manuscripts’ to the larger perspective of ‘written artefacts’.

    World Beats

    This fascinating book explores Beat Generation writing from a transnational perspective, using the concept of worlding to place Beat literature in conversation with a far-reaching network of cultural and political formations. Countering the charge that the Beats abroad were at best naïve tourists seeking exoticism for exoticism's sake, World Beats finds that these writers propelled a highly politicized agenda that sought to use the tools of the earlier avant-garde to undermine Cold War and postcolonial ideologies and offer a new vision of engaged literature. With fresh interpretations of central Beat authors Jack Kerouac, Allen Ginsberg, and William Burroughs - as well as usually marginalized writers like Philip Lamantia, Ted Joans, and Brion Gysin - World Beats moves beyond national, continental, or hemispheric frames to show that embedded within Beat writing is an essential universality that brought America to the world and the world to American literature.

    Keys to The Gift

    "Yuri Leving’s Keys to The Gift: A Guide to Vladimir Nabokov’s Novel is a new systematization of the main available data on Nabokov’s most complex Russian novel, The Gift (1934–1939). From notes in Nabokov’s private correspondence to scholarly articles accumulated during the seventy years since the novel’s first appearance in print, this work draws from a broad spectrum of existing material in a succinct and coherent way and provides innovative analyses. The first part of the monograph, “The Novel,” outlines the basic properties of The Gift (plot, characters, style, and motifs) and reconstructs its internal chronology. The second part, “The Text,” describes the creation of the novel and the history of its publication, public and critical reaction, challenges of English translation, and post-Soviet reception. Along with annotations to all five chapters of The Gift, the commentary provides insight into problems of paleography, featuring a unique textological analysis of the novel