Boosting word frequencies in authorship attribution

Author: Eder Maciej
Publication date: 02/11/2022

In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, a word embedding model can be used. The proposed method substantially outperforms classical most-frequent-word approaches, usually by a few percentage points depending on the input settings.
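The normalization described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the neighbor list is a hypothetical hand-written stand-in for what a word embedding model would return, and including the target word itself in the denominator is an assumption.

```python
from collections import Counter


def classic_relative_freq(tokens, word):
    """Occurrences of `word` divided by the total number of tokens."""
    return Counter(tokens)[word] / len(tokens)


def boosted_relative_freq(tokens, word, neighbors):
    """Occurrences of `word` divided by the number of 'relevant' tokens.

    Relevant tokens are taken here to be the word itself plus its
    semantic neighbors (assumption: the word counts toward its own
    denominator).
    """
    counts = Counter(tokens)
    relevant = {word, *neighbors}
    denominator = sum(counts[w] for w in relevant)
    return counts[word] / denominator if denominator else 0.0


# Hypothetical neighbor list; in practice it would come from an embedding model.
NEIGHBORS = {"big": ["large", "huge", "great"]}

tokens = "the big dog saw a large cat and a big huge bird".split()
classic = classic_relative_freq(tokens, "big")
boosted = boosted_relative_freq(tokens, "big", NEIGHBORS["big"])
```

With this toy text, the classic frequency of "big" is 2 out of 12 tokens, while the boosted frequency is 2 out of the 4 tokens that belong to its semantic neighborhood.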

arXiv.org e-Print Archive

Text stylometry for chat bot identification and intelligence estimation.

Author: Ali Nawaf
Publication venue: ThinkIR: The University of Louisville's Institutional Repository
Publication date: 01/05/2014

Authorship identification is a technique used to identify the author of an unclaimed document by finding traits that match those of the original author. It has great potential for applications in forensics. It can also be used to identify chat bots, a form of intelligent software created to mimic human conversation, by their unique style. The online criminal community is using chat bots as a new way to steal private information and commit fraud and identity theft, so identifying chat bots by their style is becoming essential to counter online criminal activity. Researchers have recognized the need to advance the understanding of chat bots and to design programs that prevent criminal activities, whether identity theft or even a terrorist threat. The more research advances chat bots’ ability to perceive humans, the greater the research community's duty to confront those threats. This research went further by studying whether chat bots exhibit behavioral drift. Studying text for stylometry has been the goal of many researchers, who have experimented with many features and combinations of features. A novel feature is proposed that applies Term Frequency-Inverse Document Frequency (TF-IDF) to byte-level n-grams; the resulting feature is called Term Frequency-Inverse Token Frequency (TF-ITF). Initial experiments on collected data demonstrated the feasibility of this approach. Additional versions of the feature were created and tested for authorship identification. Results demonstrated that the feature successfully identified the authors of texts, and additional experiments showed that the feature is language independent: it successfully identified the authors of a German text. Furthermore, the feature was used to measure text similarity at the book level and the paragraph level.
Finally, a selective combination of features was used to classify text ranging from kindergarten level to scientific research and novels. The feature combination measured the Quality of Writing (QoW) and the complexity of the text, a first step toward correlating these measures with the author’s IQ, which remains a future goal.
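As a rough illustration of weighting byte-level n-grams, the sketch below computes standard TF-IDF over byte trigrams. The abstract does not define TF-ITF, so this is plain TF-IDF and not the thesis's feature; the function names and the n-gram size are assumptions chosen for the example.

```python
import math
from collections import Counter


def byte_ngrams(text, n=3):
    """Return all overlapping n-grams over the UTF-8 bytes of `text`."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def tfidf_byte_ngrams(docs, n=3):
    """Standard TF-IDF per document: tf = count / total, idf = log(N / df)."""
    counts = [Counter(byte_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each n-gram counted once per document
    num_docs = len(docs)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            gram: (cnt / total) * math.log(num_docs / doc_freq[gram])
            for gram, cnt in c.items()
        })
    return weights


docs = ["abcabc", "abcxyz"]
weights = tfidf_byte_ngrams(docs, n=3)
```

Because the n-grams are taken over raw bytes rather than words, the same code runs unchanged on text in any language, which is consistent with the language independence the abstract reports. An n-gram that occurs in every document, such as `b"abc"` here, receives a weight of zero.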

University of Louisville

Genre analysis of online encyclopedias : the case of Wikipedia

Author: Tereszkiewicz Anna
Publication venue: Uniwersytet Jagiellonski - Wydawnictwo Uniwersytetu Jagiellonskiego
Publication date: 01/01/2010

Jagiellonian University Repository

Characterization of Prose by Rhetorical Structure for Machine Learning Classification

Author: Java James
Publication venue: NSUWorks
Publication date: 01/01/2015

Measures of classical rhetorical structure in text can improve accuracy in certain types of stylistic classification tasks such as authorship attribution. This research augments the relatively scarce work in the automated identification of rhetorical figures and uses the resulting statistics to characterize an author's rhetorical style. These characterizations of style can then become part of the feature set of various classification models. Our Rhetorica software identifies 14 classical rhetorical figures in free English text, with generally good precision and recall, and provides summary measures to use in descriptive or classification tasks. Classification models trained on Rhetorica's rhetorical measures paired with lexical features typically performed better at authorship attribution than either set of features used individually. The rhetorical measures also provide new stylistic quantities for describing texts, authors, genres, etc.

NSU Works

Shakespeare: editions and textual studies

Author: Gabriel Egan
Publication date: 01/01/2006

Loughborough University Institutional Repository

Helen Epigrammatopoios

Author: Elmer David Franklin
Publication venue: University of California Press
Publication date: 23/11/2009

Ancient commentators identify several passages in the Iliad as “epigrams.” This paper explores the consequences of taking the scholia literally and understanding these passages in terms of inscription. Two tristichs spoken by Helen in the teikhoskopia are singled out for special attention. These lines can be construed not only as epigrams in the general sense, but more specifically as captions appended to an image of the Achaeans encamped on the plain of Troy. Since Helen's lines to a certain extent correspond to the function and style of catalogic poetry, reading them specifically as captions leads to a more nuanced understanding of both Homeric poetry and Homeric self-reference. By contrasting Helen's “epigrams” with those of Hektor, one can also discern a gender-based differentiation of poetic functions.

Harvard University - DASH

Weathered Words : Formulaic Language and Verbal Art

Publication venue: The Milman Parry Collection of Oral Literature, Harvard University
Publication date: 07/05/2022

Formulaic phraseology presents the epitome of words worn and weathered by trial and the tests of time. Scholarship on weathered words is exceptionally diverse and interdisciplinary. This volume focuses on verbal art, which makes Oral-Formulaic Theory (OFT) a major point of reference. Yet weathered words are but a part of OFT, and OFT is only a part of scholarship on weathered words. Each of the eighteen essays gathered here brings particular aspects of formulaic language into focus. No volume on such a diverse topic can be all-encompassing, but the essays highlight aspects of the phenomenon that may be eclipsed elsewhere: they diverge not only in style, but sometimes even in how they choose to define “formula.” As such, they offer overlapping frames that complement one another both in their convergences and their contrasts. While they view formulaicity from multifarious angles, they unite in a Picasso of perspectives on which the reader can reflect and draw insight.

Helsingin yliopiston digitaalinen arkisto

Automatic Image Captioning with Style

Author: Mathews Alexander Patrick
Publication date: 01/01/2018

This thesis connects two core topics in machine learning, vision and language. The problem of choice is image caption generation: automatically constructing natural language descriptions of image content. Previous research into image caption generation has focused on generating purely descriptive captions; I focus on generating visually relevant captions with a distinct linguistic style. Captions with style have the potential to ease communication and add a new layer of personalisation. First, I consider naming variations in image captions, and propose a method for predicting context-dependent names that takes into account visual and linguistic information. This method makes use of a large-scale image caption dataset, which I also use to explore naming conventions and report naming conventions for hundreds of animal classes. Next I propose the SentiCap model, which relies on recent advances in artificial neural networks to generate visually relevant image captions with positive or negative sentiment. To balance descriptiveness and sentiment, the SentiCap model dynamically switches between two recurrent neural networks, one tuned for descriptive words and one for sentiment words. As the first published model for generating captions with sentiment, SentiCap has influenced a number of subsequent works. I then investigate the sub-task of modelling styled sentences without images. The specific task chosen is sentence simplification: rewriting news article sentences to make them easier to understand. For this task I design a neural sequence-to-sequence model that can work with limited training data, using novel adaptations for word copying and sharing word embeddings. Finally, I present SemStyle, a system for generating visually relevant image captions in the style of an arbitrary text corpus. A shared term space allows a neural network for vision and content planning to communicate with a network for styled language generation. 
SemStyle achieves competitive results in human and automatic evaluations of descriptiveness and style. As a whole, this thesis presents two complete systems for styled caption generation that are the first of their kind and demonstrate, for the first time, that automatic style transfer for image captions is achievable. Contributions also include novel ideas for object naming and sentence simplification. This thesis opens up inquiries into highly personalised image captions; large-scale visually grounded concept naming; and, more generally, styled text generation with content control.

The Australian National University

Naming and Renaming Texts: Rubrics in Middle High German Miscellany Manuscripts

Author: Gustavo
Publication venue: Milano University Press
Publication date: 31/12/2021

This article analyses rubrics in Middle High German miscellany manuscripts of short texts in rhyming couplets (Reimpaargedichte). A corpus of 1433 rubrics from 68 manuscripts was created for this study. As rubrics in medieval manuscripts were not authorial, but composed by scribes, they offer insights into the reception of the texts. This paper analyses their features and functions as a proxy to interrogate the standing and status of Reimpaargedichte between the thirteenth and fifteenth centuries. The main methodology is distant reading, i.e. the application of statistical methods to a textual corpus and the interpretation of their results. The features analysed include the length of the rubrics, their level of variation, the presence of author names, and vocabulary. Although no general patterns regarding length or level of variation were detected, some important conclusions can be drawn: 1. there were no clear markers of literary genre in rubrics; 2. authorship was mostly absent, except for some specific cases of famous authors; 3. relatively stable keywords were used to identify particular texts, but they were more common in manuscripts with narrative texts (Erzählungen) and less common in later manuscripts dominated by the genre known as Minnereden. Furthermore, the analysis revealed that rubrics used a series of linguistic procedures to show that they participated in a different speech act than the main text: they embodied an interaction between scribes and readers, in which the former framed the reception of the work.

Riviste UNIMI