2,509 research outputs found

    Boosting word frequencies in authorship attribution

    In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, one of the word embedding models can be used. The proposed method outperforms classical most-frequent-word approaches substantially, usually by a few percentage points depending on the input settings.
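    The normalization described in this abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: the denominator is taken to be the count of the word itself plus the counts of its semantic neighbours, and the neighbour list `NEIGHBORS` is hand-written here as a hypothetical stand-in for what a word embedding model would return.

```python
from collections import Counter


def classic_relative_freq(tokens, word):
    """Occurrences of `word` divided by the total number of tokens."""
    counts = Counter(tokens)
    return counts[word] / len(tokens) if tokens else 0.0


def boosted_relative_freq(tokens, word, neighbors):
    """Occurrences of `word` divided by the number of 'relevant' tokens.

    Assumption: relevant tokens are the word itself plus its semantic
    neighbours; the abstract does not spell out the exact denominator.
    """
    counts = Counter(tokens)
    relevant = counts[word] + sum(counts[n] for n in set(neighbors) - {word})
    return counts[word] / relevant if relevant else 0.0


# Hypothetical neighbour list; in practice it would come from the nearest
# neighbours of each word in a word embedding model.
NEIGHBORS = {"big": ["large", "huge", "great"]}

tokens = "the big dog saw a large cat and a big huge bird".split()

classic = classic_relative_freq(tokens, "big")                     # 2 of 12 tokens
boosted = boosted_relative_freq(tokens, "big", NEIGHBORS["big"])   # 2 of 4 relevant tokens
```

    In this toy text the classical frequency of "big" is diluted by every unrelated token, while the boosted variant only measures how often the author picks "big" over its near-synonyms.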

    Text stylometry for chat bot identification and intelligence estimation.

    Authorship identification is a technique used to identify the author of an unclaimed document by finding traits that match those of the original author. It has great potential for applications in forensics. It can also be used to identify chat bots, a form of intelligent software created to mimic human conversation, by their unique style. The online criminal community is using chat bots as a new way to steal private information and commit fraud and identity theft, so identifying chat bots by their style is becoming essential to counter online criminal activity. Researchers have recognized the need to advance the understanding of chat bots and to design programs that prevent criminal activities, whether identity theft or even a terrorist threat. The more research advances chat bots’ ability to perceive humans, the greater the research community’s responsibility to confront those threats. This research went further by studying whether chat bots exhibit behavioral drift. Stylometric analysis of text has been the goal of many researchers, who have experimented with many features and combinations of features. A novel feature, Term Frequency-Inverse Token Frequency (TF-ITF), was proposed; it builds on Term Frequency-Inverse Document Frequency (TF-IDF) and is computed over byte-level n-grams. Initial experiments on collected data demonstrated the feasibility of this approach. Additional versions of the feature were created and tested for authorship identification. Results demonstrated that the feature successfully identified the authors of texts, and additional experiments showed that it is language independent: it successfully identified the authors of a German text. Furthermore, the feature was used to measure text similarity at the book level and the paragraph level. Finally, a selective combination of features was used to classify text ranging from kindergarten level to scientific research and novels. The feature combination measured the Quality of Writing (QoW) and the complexity of text, a first step toward correlating these with the author’s IQ as a future goal.
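    The idea of weighting byte-level n-grams can be sketched as follows. This is not the TF-ITF feature from the abstract, whose definition is not given there; it is plain TF-IDF (term frequency times the log of inverse document frequency) applied to byte n-grams, and the function names `byte_ngrams` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def byte_ngrams(text, n=3):
    """Return overlapping n-grams over the UTF-8 bytes of `text`."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Standard TF-IDF weights over byte n-grams, one dict per document.

    Assumption: this is ordinary TF-IDF, used only as a stand-in for the
    TF-ITF feature, which the abstract does not define.
    """
    grams_per_doc = [Counter(byte_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter()
    for grams in grams_per_doc:
        df.update(grams.keys())
    n_docs = len(docs)
    vectors = []
    for grams in grams_per_doc:
        total = sum(grams.values())
        vectors.append({
            g: (c / total) * math.log(n_docs / df[g])
            for g, c in grams.items()
        })
    return vectors


docs = ["abcabc", "abcxyz"]
vectors = tfidf_vectors(docs, n=3)
```

    Because the n-grams are taken over raw bytes rather than words, no tokenizer or language-specific preprocessing is needed, which fits the abstract's claim that the feature is language independent.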

    Genre analysis of online encyclopedias : the case of Wikipedia


    Characterization of Prose by Rhetorical Structure for Machine Learning Classification

    Measures of classical rhetorical structure in text can improve accuracy in certain types of stylistic classification tasks such as authorship attribution. This research augments the relatively scarce work in the automated identification of rhetorical figures and uses the resulting statistics to characterize an author's rhetorical style. These characterizations of style can then become part of the feature set of various classification models. Our Rhetorica software identifies 14 classical rhetorical figures in free English text, with generally good precision and recall, and provides summary measures to use in descriptive or classification tasks. Classification models trained on Rhetorica's rhetorical measures paired with lexical features typically performed better at authorship attribution than either set of features used individually. The rhetorical measures also provide new stylistic quantities for describing texts, authors, genres, etc.

    Shakespeare: editions and textural studies


    Weathered Words : Formulaic Language and Verbal Art

    Formulaic phraseology presents the epitome of words worn and weathered by trial and the tests of time. Scholarship on weathered words is exceptionally diverse and interdisciplinary. This volume focuses on verbal art, which makes Oral-Formulaic Theory (OFT) a major point of reference. Yet weathered words are but a part of OFT, and OFT is only a part of scholarship on weathered words. Each of the eighteen essays gathered here brings particular aspects of formulaic language into focus. No volume on such a diverse topic can be all-encompassing, but the essays highlight aspects of the phenomenon that may be eclipsed elsewhere: they diverge not only in style, but sometimes even in how they choose to define “formula.” As such, they offer overlapping frames that complement one another both in their convergences and their contrasts. While they view formulaicity from multifarious angles, they unite in a Picasso of perspectives on which the reader can reflect and draw insight. Peer reviewed.

    Automatic Image Captioning with Style

    This thesis connects two core topics in machine learning, vision and language. The problem of choice is image caption generation: automatically constructing natural language descriptions of image content. Previous research into image caption generation has focused on generating purely descriptive captions; I focus on generating visually relevant captions with a distinct linguistic style. Captions with style have the potential to ease communication and add a new layer of personalisation. First, I consider naming variations in image captions, and propose a method for predicting context-dependent names that takes into account visual and linguistic information. This method makes use of a large-scale image caption dataset, which I also use to explore naming conventions and report naming conventions for hundreds of animal classes. Next I propose the SentiCap model, which relies on recent advances in artificial neural networks to generate visually relevant image captions with positive or negative sentiment. To balance descriptiveness and sentiment, the SentiCap model dynamically switches between two recurrent neural networks, one tuned for descriptive words and one for sentiment words. As the first published model for generating captions with sentiment, SentiCap has influenced a number of subsequent works. I then investigate the sub-task of modelling styled sentences without images. The specific task chosen is sentence simplification: rewriting news article sentences to make them easier to understand. For this task I design a neural sequence-to-sequence model that can work with limited training data, using novel adaptations for word copying and sharing word embeddings. Finally, I present SemStyle, a system for generating visually relevant image captions in the style of an arbitrary text corpus. A shared term space allows a neural network for vision and content planning to communicate with a network for styled language generation. SemStyle achieves competitive results in human and automatic evaluations of descriptiveness and style. As a whole, this thesis presents two complete systems for styled caption generation that are first of their kind and demonstrate, for the first time, that automatic style transfer for image captions is achievable. Contributions also include novel ideas for object naming and sentence simplification. This thesis opens up inquiries into highly personalised image captions; large scale visually grounded concept naming; and more generally, styled text generation with content control.

    Naming and Renaming Texts: Rubrics in Middle High German Miscellany Manuscripts

    This article analyses rubrics in Middle High German miscellany manuscripts of short texts in rhyming couplets (Reimpaargedichte). A corpus of 1433 rubrics from 68 manuscripts was created for this study. As rubrics in medieval manuscripts were not authorial, but composed by scribes, they offer insights into the reception of the texts. This paper analyses their features and functions as a proxy to interrogate the standing and status of Reimpaargedichte between the thirteenth and fifteenth centuries. The main methodology is distant reading, i.e. the application and interpretation of statistical methods on a textual corpus. The features analysed include the length of the rubrics, their level of variation, the presence of author names, and vocabulary. Although no general patterns regarding length or level of variation were detected, some important conclusions can be drawn: 1. there were no clear markers of literary genre in rubrics; 2. authorship was mostly absent, except for some specific cases of famous authors; 3. relatively stable keywords were used to identify particular texts, but they were more common in manuscripts with narrative texts (Erzählungen) and less common in later manuscripts dominated by the genre known as Minnereden. Furthermore, the analysis revealed that rubrics used a series of linguistic procedures to show that they participated in a different speech act than the main text: they embodied an interaction between scribes and readers, in which the former framed the reception of the work.