A robust authorship attribution on big period
Authorship attribution is the task of identifying the writer of an unknown text by assigning it to one of a set of known writers. Each author's writing style is distinct and can be used for discrimination, and several parameters account for variation in that style. When the writing samples collected for an author come from a short period, they contribute efficiently to the identification of an unknown sample. This paper considers the author identification problem in which writing samples are not available from the same time period; instead, the evidence is collected over a long period of time. Character n-gram, word n-gram and PoS n-gram features are used to build the model, as they capture the writer's style in terms of both content and the statistical characteristics of the writing. We applied a support vector machine algorithm for classification, and the experiments produced effective results. When discriminating among multiple authors, corpus selection and construction were the most tedious tasks, and they were carried out effectively. Accuracy was observed to vary with feature type: word and character n-grams showed better accuracy than PoS n-grams
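The character and word n-gram features named in this abstract can be illustrated with a minimal sketch. It assumes naive whitespace tokenization and relative-frequency weighting, neither of which the abstract specifies; the function names are hypothetical, and the SVM step is left out.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(items):
    """Map each distinct item to its share of the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: count / total for item, count in counts.items()}


# Toy sample; a real study would use one profile per text in the corpus.
sample = "the cat sat on the mat"
char_profile = relative_frequencies(char_ngrams(sample, 3))
word_profile = relative_frequencies(word_ngrams(sample, 2))
```

Profiles of this kind would typically be aligned on a shared feature list and passed to a classifier as numeric vectors.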
“Because the computer said so!”: Can computational authorship analysis be trusted?
This study belongs to the domain of authorship analysis (AA), a discipline under the umbrella of forensic linguistics in which writing style is analysed as a means of authorship identification.
Due to advances in natural language processing and machine learning in recent years, interest in computational methods of AA is gaining over traditional stylistic analysis by human experts. It may only be a matter of time before the software will assist, if not replace, a forensic examiner. But can we trust its verdict? The existing computational methods of AA receive critique for the lack of theoretical motivation, black box methodologies and controversial results, and ultimately, many argue that these are unable to deliver viable forensic evidence.
The study replicates a popular algorithm of computational AA in order to open one of the existing black boxes. It takes a closer look at the so-called “bag-of-words” (BoW) approach – a word distributions method used in the majority of AA models, evaluates the parameters that the algorithm bases its conclusions on and offers detailed linguistic explanations for the statistical results these discriminators produce.
The framework behind the design of this study draws on multidimensional analysis – a multivariate analytical approach to linguistic variation. By building on the theory of systemic functional linguistics and variationist sociolinguistics, the study takes steps toward solving the existing problem of the theoretical validity of computational AA
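As a rough illustration of the bag-of-words idea discussed in this abstract, the sketch below builds relative word frequencies for two texts and ranks the words whose frequencies differ most. It is not the algorithm the study replicates; the tokenization, the difference measure and the function names are all assumptions made for the example.

```python
from collections import Counter


def bag_of_words(text):
    """Relative frequency of each word, ignoring word order."""
    tokens = text.lower().split()
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def top_discriminators(text_a, text_b, k=3):
    """Rank words by the absolute difference of their relative frequencies."""
    freq_a, freq_b = bag_of_words(text_a), bag_of_words(text_b)
    vocabulary = set(freq_a) | set(freq_b)
    diffs = {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocabulary}
    # Largest absolute difference first; ties broken alphabetically.
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:k]
```

Listing the highest-ranked words is one simple way to inspect which features drive a frequency-based comparison, which is the kind of inspection the abstract describes.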
Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study in Polish
In stylometric investigations, frequencies of the most frequent words (MFWs)
and character n-grams outperform other style-markers, even if their performance
varies significantly across languages. In inflected languages, word endings
play a prominent role, and hence different word forms cannot be recognized
using generic text tokenization. Countless inflected word forms make
frequencies sparse, making most statistical procedures complicated. Presumably,
applying one of the NLP techniques, such as lemmatization and/or parsing, might
increase the performance of classification. The aim of this paper is to examine
the usefulness of grammatical features (as assessed via POS-tag n-grams) and
lemmatized forms in recognizing authorial profiles, in order to address the
underlying issue of the degree of freedom of choice within lexis and grammar.
Using a corpus of Polish novels, we performed a series of supervised authorship
attribution benchmarks, in order to compare the classification accuracy for
different types of lexical and syntactic style-markers. Even if the performance
of POS-tags as well as lemmatized forms was notoriously worse than that of
lexical markers, the difference was not substantial and never exceeded ca. 15%
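The sparsity argument in this abstract can be made concrete with a toy sketch. The tagged tokens below are hand-annotated for illustration and are not taken from the paper's corpus; the tag names and helper functions are assumptions.

```python
from collections import Counter

# Hand-annotated toy tokens: (word form, lemma, POS tag). Illustrative only.
tagged = [
    ("kot", "kot", "NOUN"),
    ("widzi", "widzieć", "VERB"),
    ("kota", "kot", "NOUN"),
    ("dom", "dom", "NOUN"),
    ("stoi", "stać", "VERB"),
    ("obok", "obok", "ADP"),
    ("domu", "dom", "NOUN"),
]


def type_count(tokens, field):
    """Number of distinct values in the given tuple position."""
    return len({tok[field] for tok in tokens})


def pos_ngrams(tokens, n):
    """Count POS-tag n-grams over the token stream (sentence breaks ignored)."""
    tags = [tok[2] for tok in tokens]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


word_types = type_count(tagged, 0)   # distinct inflected forms
lemma_types = type_count(tagged, 1)  # distinct lemmas
pos_bigrams = pos_ngrams(tagged, 2)
```

In this toy data the seven inflected forms collapse to five lemmas, which is the effect lemmatization has on frequency tables for an inflected language.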
Approaching Questions of Text Reuse in Ancient Greek Using Computational Syntactic Stylometry
We investigate methods by which data from dependency syntax treebanks of ancient Greek can be applied to questions of authorship in ancient Greek historiography. Syntax words (sWords) were constructed from the Ancient Greek Dependency Treebank by tracing the shortest path from each leaf node to the root of each sentence tree. This paper presents the results of a preliminary test of the usefulness of the sWord as a stylometric discriminator. The sWord data was subjected to clustering analysis, and the resultant groupings were in accord with traditional classifications. The use of sWords also allows a more fine-grained heuristic exploration of difficult questions of text reuse. A comparison of relative frequencies of sWords in the directly transmitted Polybius book 1 and the excerpted books 9–10 indicates that the measurements of the two texts are generally very close, but when frequencies do vary, the differences are surprisingly large. These differences reveal that a certain syntactic simplification is a salient characteristic of Polybius’ excerptor, who leaves conspicuous syntactic indicators of his modifications
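The leaf-to-root construction described in this abstract might look roughly like the sketch below. The abstract does not say what is recorded at each node, so joining dependency relation labels is an assumption here, as are the input format, the function name and the example labels.

```python
def leaf_to_root_paths(heads, labels):
    """Build one path string per leaf of a dependency tree.

    heads[i] is the 1-based index of the head of token i+1 (0 marks the root);
    labels[i] is that token's relation label. Assumes a well-formed tree.
    """
    n = len(heads)
    has_child = {h for h in heads if h != 0}
    paths = []
    for token in range(1, n + 1):
        if token in has_child:
            continue  # not a leaf
        path, node = [], token
        while node != 0:
            if len(path) > n:
                raise ValueError("cycle detected: input is not a tree")
            path.append(labels[node - 1])
            node = heads[node - 1]
        paths.append("-".join(path))
    return paths


# Toy tree for "the dog chased a cat"; the labels are illustrative only.
heads = [2, 3, 0, 5, 3]
labels = ["ATR", "SBJ", "PRED", "ATR", "OBJ"]
swords = leaf_to_root_paths(heads, labels)
```

Because each sentence is a tree, the path from a leaf to the root is unique, so each leaf yields exactly one such string.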
Ein Schlachtfeld der Zuschreibung von Autorschaft. Musils propagandistische Beiträge in der Frontzeitung «Heimat» (1918): [A battlefield for authorship attribution. Musil’s propaganda contributions in the soldier’s newspaper «Heimat» (1918)]
This study focuses on Musil’s contributions to Heimat, a propaganda newspaper published by the k.u.k. Kriegspressequartier during the last months of World War I. As the authorship of the Heimat articles is controversial, we performed a series of stylometric analyses, which allowed us to attribute ten texts to the Austrian writer. Our approach introduces new elements and data into the debate on authorship, thus opening a productive dialogue between computational, archival and stylistic research
Crossing linguistic barriers: authorship attribution in Sinhala texts
Authorship attribution involves determining the original author of an anonymous text from a pool of potential authors. The author attribution task has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. While these applications extend beyond any single language, existing research has predominantly centered on English, posing challenges for application in languages such as Sinhala due to linguistic disparities and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform the authorship attribution task even if the topics within the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies were carried out to demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples for each candidate author
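Step (ii) of the solution outlined in this abstract, narrowing a large author pool with similarity search, can be sketched as follows. Cosine similarity over sparse feature dictionaries is an assumption, since the abstract does not name the similarity measure or the index structure; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_set(query, author_profiles, k):
    """Return the k authors whose profiles are most similar to the query."""
    ranked = sorted(
        author_profiles,
        key=lambda author: (-cosine(query, author_profiles[author]), author),
    )
    return ranked[:k]
```

A final classifier, which the abstract leaves unspecified, would then choose among the returned candidates only.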
A Stylometric Analysis of Climate Change Fiction
This work sets out to analyze stylistic changes in Anthropocene fiction over the past 60 years. The starting point for the analysis is Rachel Carson and the presumed beginning of the Anthropocene in the 1960s. The primary insight gained reveals the connections within these novels and the relations among similar writing about climate change, thereby contributing to the field of Environmental Humanities in a fundamental way, as climate change fiction has so far been investigated only through a topic-centered focus.
The corpus compiled for scrutiny here extends to over 84 novels from these years. These novels were selected using a dual approach that draws on the secondary literature as well as on crowdsourced data from Goodreads’ cli-fi lists. The resulting texts are then analyzed with stylo, an R package created specifically for stylometric analysis by humanists. The results are visualized in a network that allows easier interpretation and leads to an understanding of more detailed questions about the nature of the connection between works and about the inspiration and representation of a specific genre of writing. Moreover, the thesis looks diachronically at clustering based on time and topic. Understanding the ways in which authors address and have addressed climate change is one indicator of how climate change is and has been comprehended.
The digital approach applied here is based on distant reading: it covers a larger number of novels and, rather than close reading them, seeks patterns that extend throughout. For a thorough analysis, however, scalable reading is applied to contextualize and investigate the results in more depth. Overall, the results are meant to establish a baseline for discussing climate change fiction in the Anthropocene, which, although gaining more scholarly attention, is still understudied. The hope is not only to gain insight but also to generate visualizations that will provide a helpful resource for fellow scholars