Authorship Attribution Using Principal Component Analysis and Nearest Neighbor Rule for Neural Networks
Feature extraction is a common problem in statistical pattern recognition. It refers to a process whereby a data space is transformed into a feature space that, in theory, has the same dimension as the original data space. However, the transformation is designed so that the data set can be represented by a reduced number of "effective" features while retaining most of the intrinsic information content of the data; in other words, the data set undergoes a dimensionality reduction. Principal component analysis is one such process. In this paper, data collected by counting selected syntactic characteristics in around a thousand paragraphs of each of the sample books underwent a principal component analysis. For comparison, the original data were also processed. Authors of texts were identified with higher success by the competitive neural networks that used the principal components. The process was repeated on another group of authors, and similar results were obtained.
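The combination of PCA for dimensionality reduction with a nearest-neighbour attribution rule can be sketched as follows. This is a minimal illustration with invented feature counts, not the paper's actual data or network; the feature names and values are assumptions.

```python
# Illustrative sketch (not the paper's code): reduce syntactic-feature
# counts with PCA, then attribute a disputed text to the author whose
# training text is nearest in principal-component space.
import numpy as np

def pca_transform(X, n_components):
    """Project rows of X onto the top principal components."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # SVD of the centered data gives the principal axes in Vt.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T, mean, components

def nearest_neighbor_author(train_pcs, labels, mean, components, x):
    """1-NN rule: return the label of the closest training point."""
    z = (x - mean) @ components.T
    dists = np.linalg.norm(train_pcs - z, axis=1)
    return labels[int(np.argmin(dists))]

# Toy data: 6 paragraphs x 4 syntactic-feature counts, two authors.
X = np.array([[12, 3, 7, 1], [11, 4, 6, 2], [13, 3, 8, 1],
              [2, 9, 1, 6], [3, 8, 2, 7], [2, 10, 1, 6]], float)
labels = ["A", "A", "A", "B", "B", "B"]
pcs, mean, comps = pca_transform(X, n_components=2)
print(nearest_neighbor_author(pcs, labels, mean, comps,
                              np.array([12, 4, 7, 2], float)))  # → A
```

The paper's competitive neural network performs a similar role to the 1-NN rule here: both assign a text to whichever author's region of the reduced feature space it falls into.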
Effective authorship attribution in large document collections
Techniques that can effectively identify authors of texts are of great importance in scenarios such as detecting plagiarism and identifying a source of information. A range of attribution approaches has been proposed in recent years, but none is particularly satisfactory; some are ad hoc and most have defects in terms of scalability, effectiveness, and computational cost. Good test collections are critical for evaluation of authorship attribution (AA) techniques. However, there are no standard benchmarks available in this area; it is almost always the case that researchers have their own test collections. Furthermore, collections that have been explored in AA are usually small, and thus whether the existing approaches are reliable or scalable is unclear. We develop several AA collections that are substantially larger than those in the literature; machine learning methods are used to establish the value of using such corpora in AA. The results, also used as baseline results in this thesis, show that the developed text collections can be used as standard benchmarks and are able to clearly distinguish between different approaches. One of the major contributions is that we propose the use of the Kullback-Leibler divergence, a measure of how different two distributions are, to identify authors based on elements of writing style. The results show that our approach is at least as effective as, if not better than, the best existing attribution method (support vector machines) for two-class AA, and is superior for multi-class AA. Moreover, our proposed method has much lower computational cost and is cheaper to train. Style markers are the key elements of style analysis. We explore several approaches to tokenising documents to extract style markers, examining which marker type works best.
We also propose three systems that boost AA performance by combining evidence from various marker types, motivated by the observation that no one type of marker can satisfy all AA scenarios. To address the scalability of AA, we propose the novel task of authorship search (AS), inspired by document search and intended for large document collections. Our results show that AS is reasonably effective at finding documents by a particular author, even within a collection of half a million documents. Beyond search, we also propose an AS-based method to identify authorship. Our method is substantially more scalable than any method published in prior AA research, in terms of both collection size and the number of candidate authors; discrimination scales up to several hundred authors.
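The Kullback-Leibler approach described above can be sketched in a few lines. This is a minimal illustration with invented token lists, not the thesis's implementation: each author is modelled as a smoothed distribution over style markers, and a text is attributed to the author whose distribution diverges least from the text's own.

```python
# Minimal KL-divergence attribution sketch (assumed setup, not the
# thesis code). Laplace smoothing keeps every probability non-zero so
# that KL(p || q) is always finite.
import math
from collections import Counter

def distribution(tokens, vocab, alpha=1.0):
    """Laplace-smoothed relative frequencies over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) over the shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def attribute(text_tokens, author_tokens):
    """Return the author whose profile is closest in KL divergence."""
    vocab = set(text_tokens)
    for toks in author_tokens.values():
        vocab |= set(toks)
    p = distribution(text_tokens, vocab)
    return min(author_tokens,
               key=lambda a: kl_divergence(p, distribution(author_tokens[a], vocab)))

authors = {
    "X": "the of and to in that it was his he".split(),
    "Y": "a is are this with for on but not by".split(),
}
print(attribute("the of and in he was".split(), authors))  # → X
```

In practice the "tokens" would be whichever style markers (function words, character n-grams, and so on) the tokenisation step produces, and the author profiles would be estimated from much larger training samples.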
Dating Victorians: an experimental approach to stylochronometry
A thesis submitted for the degree of Doctor of Philosophy of the University of Luton. The writing style of a number of authors writing in English was empirically investigated for the purpose of detecting stylistic patterns in relation to advancing age. The aim was to identify which type of stylistic markers among lexical, syntactical, phonemic, entropic, character-based, and content ones would be most able to discriminate between early, middle, and late works of the selected authors, and which classification or prediction algorithm would be best suited for this task. Two pilot studies were initially conducted. The first concentrated on Christina Georgina Rossetti and Edgar Allan Poe, from whom personal letters and poetry were selected as the genres of study, along with a limited selection of variables. Results suggested that authors and genres vary inconsistently. The second pilot study was based on Shakespeare's plays, using a wider selection of variables to assess their discriminating power in relation to a past study. It was observed that the selected variables were of satisfactory predictive power, and hence judged suitable for the task. Subsequently, four experiments were conducted using the variables tested in the second pilot study and personal correspondence and poetry from two additional authors, Edna St Vincent Millay and William Butler Yeats. Stepwise multiple linear regression and regression trees were selected for the first two prediction experiments, and ordinal logistic regression and artificial neural networks for the two classification experiments. The first experiment revealed inconsistency in the accuracy of prediction and the total number of variables in the final models, affected by differences in authorship and genre. The second experiment revealed inconsistencies for the same factors in terms of accuracy only.
The third experiment showed the total number of variables in the model and the error in the final model to be affected to varying degrees by authorship, genre, the different variable types, and the order in which the variables had been calculated. The last experiment had all measurements affected by all four factors.
Examination of whether differences in method within each task play an important part revealed significant influences of method, authorship, and genre for the prediction problems, whereas all factors, including method and various interactions, dominated in the classification problems. Given the current data and methods used, as well as the results obtained, generalizable conclusions about the wider author population have been avoided.
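The prediction experiments above regress a work's date of composition on stylistic variables. A hypothetical illustration of the basic setup, with invented feature values and dates rather than the thesis's corpus, might look like this:

```python
# Toy stylochronometry sketch: ordinary least squares predicting a
# work's composition year from stylistic variables (all values here
# are invented for illustration).
import numpy as np

# Rows: works; columns: e.g. mean sentence length, type/token ratio,
# relative frequency of a function-word group (hypothetical measures).
style = np.array([[18.2, 0.52, 0.41],
                  [19.1, 0.50, 0.43],
                  [21.4, 0.46, 0.47],
                  [22.0, 0.44, 0.49],
                  [24.3, 0.40, 0.53]])
years = np.array([1840.0, 1845.0, 1855.0, 1860.0, 1870.0])

# Add an intercept column and fit by least squares.
X = np.column_stack([np.ones(len(style)), style])
coef, *_ = np.linalg.lstsq(X, years, rcond=None)

def predict_year(features):
    """Predicted composition year for one work's style vector."""
    return float(np.concatenate([[1.0], features]) @ coef)

print(round(predict_year(np.array([20.0, 0.48, 0.45]))))
```

The thesis's stepwise variant would additionally add and drop predictors by significance; regression trees, ordinal logistic regression, and neural networks replace the linear model while keeping the same features-to-date framing.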
Making Machines Learn. Applications of Cultural Analytics to the Humanities
The digitization of several million books by Google in 2011 meant the popularization of a new kind of humanities research powered by the treatment of cultural objects as data. Culturomics, as it is called, was born, and other initiatives resonated with this methodological approach, as is the case with the recently formed Digital Humanities or Cultural Analytics. Intrinsically, these new quantitative approaches to culture all borrow techniques and methods developed under the wing of the exact sciences, such as computer science, machine learning, and statistics. There are numerous examples of studies that take advantage of the possibilities that treating objects as data has to offer for the understanding of the human. This new data science that is now applied to current trends in culture can also be replicated to study the more traditional humanities. Led by proper intellectual inquiry, an adequate use of technology may bring answers to questions intractable by other means, or add evidence to long-held assumptions based on a canon built from few examples. This dissertation argues in favor of such an approach. Three different case studies are considered. First, in the more general sense of big and smart data, we collected and analyzed more than 120,000 pictures of paintings from all periods of art history, to gain clear insight into how the beauty of depicted faces, in the framework of neuroscience and evolutionary theory, has changed over time. A second study covers the nuances of the modes of emotion employed by the Spanish Golden Age playwright Calderón de la Barca to empathize with his audience. By means of sentiment analysis, a technique strongly supported by machine learning, we shed some light on the different fictional characters, how they interact, and how they convey messages otherwise invisible to the public. The last case is a study of non-traditional authorship attribution techniques applied to the forefather of the modern novel, the Lazarillo de Tormes.
In the end, we conclude that the successful application of cultural analytics and computer science techniques to traditional humanistic endeavours has been enriching and validating.
A multi-disciplinary framework for cyber attribution
Effective cyber security is critical to the prosperity of any nation in the modern world. We have become dependent upon this interconnected network of systems for a number of critical functions within society. As our reliance upon this technology has increased, so have the prospective gains for malicious actors who would abuse these systems for their own personal benefit, at the cost of legitimate users. The result has been an explosion of cyber attacks and cyber-enabled crimes. The threat from hackers, organised criminals and even nation states is ever increasing. One of the critical enablers of our cyber security is cyber attribution: the ability to tell who is acting against our systems.
A purely technical approach to cyber attribution has been found to be ineffective in the majority of cases, taking too narrow an approach to the attribution problem. A purely technical approach will provide Indicators of Compromise (IOCs), which are suitable for the immediate recovery and clean-up of a cyber event. It fails, however, to ask the deeper questions about the origin of the attack, which can only be answered by a wider set of analysis and additional sources of data. Unfortunately, due to the wide range of data types and the highly specialist skills required to perform this deeper analysis, there is currently no common framework for analysts to work together towards resolving the attribution problem. This is further exacerbated by a communication barrier between the highly specialised fields and the lack of obviously compatible data types.
The aim of the project is to develop a common framework through which experts from a number of disciplines can contribute to the overall attribution picture. These experts add their input in the form of a library. First, a process was developed to enable the creation of compatible libraries in different specialist fields. A series of libraries can then be used by an analyst to create an overarching attribution picture. The framework highlights any intelligence gaps, and an analyst can use the list of libraries to suggest a tool or method to fill each gap.
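The library idea can be sketched as a simple data structure. The following is a hypothetical illustration only; the attribute names, evidence format, and merging logic are my assumptions, not the thesis's actual framework:

```python
# Hypothetical sketch of a library-based attribution picture: each
# specialist "library" contributes evidence for some attribution
# attributes, and merging them exposes intelligence gaps where no
# library has contributed anything.
ATTRIBUTES = ["origin", "tooling", "infrastructure", "motive"]

# Two example libraries from different disciplines (invented content).
network_lib = {"infrastructure": "traffic routed via bulletproof host",
               "tooling": "custom RAT variant observed"}
linguistics_lib = {"origin": "embedded strings suggest one locale"}

def merge_picture(libraries):
    """Combine per-discipline evidence into one attribution picture."""
    picture = {attr: [] for attr in ATTRIBUTES}
    for lib in libraries:
        for attr, evidence in lib.items():
            picture[attr].append(evidence)
    return picture

def intelligence_gaps(picture):
    """Attributes for which no library has contributed evidence."""
    return [attr for attr, ev in picture.items() if not ev]

picture = merge_picture([network_lib, linguistics_lib])
print(intelligence_gaps(picture))  # → ['motive']
```

The point of the sketch is the workflow, not the data model: compatible libraries slot into one picture, and the empty slots tell the analyst where another discipline's tool or method is needed.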
By the end of the project a working framework had been developed, with a number of libraries from a wide range of technical attribution disciplines. These libraries were used to feed real-time intelligence to both technical and non-technical analysts, who were then able to use this information to perform in-depth attribution analysis. The pictorial format of the framework was found to help break down the communication barrier between disciplines and was suitable as an intelligence product in its own right, providing a useful visual aid for briefings. The simplicity of the library-based system meant that the process was easy to learn, requiring only a short introduction to the framework.
The anonymous 1821 translation of Goethe's Faust: a cluster analytic approach
PhD Thesis. This study tests the hypothesis proposed by Frederick Burwick and James McKusick in 2007 that Samuel Taylor Coleridge was the author of the anonymous translation of Goethe's Faust published by Thomas Boosey in 1821. The approach to hypothesis testing is stylometric. Specifically, function word usage is selected as the stylometric criterion, and 80 function words are used to define a function word frequency profile vector for each text in the corpus of Coleridge's literary works and for a selection of works by a range of contemporary English authors. Each profile vector is a point in 80-dimensional vector space, and cluster analytic methods are used to determine the distribution of profile vectors in the space. If the hypothesis being tested is valid, then the profile for the 1821 translation should be closer in the space to works known to be by Coleridge than to works by the other authors. The cluster analytic results show, however, that this is not the case, and the conclusion is that the Burwick and McKusick hypothesis is falsified relative to the stylometric criterion and analytic methodology used.
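The profile-vector approach described above can be illustrated in miniature. The following sketch uses a handful of function words and invented toy texts rather than the thesis's 80-word list and corpus; it computes relative-frequency profiles and asks which known profile a disputed text lies closest to in the vector space:

```python
# Toy function-word profiling sketch (invented texts, not the thesis
# corpus): each text becomes a vector of function-word relative
# frequencies, and distance in that space stands in for the full
# cluster analysis.
import math

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    return [tokens.count(w) / len(tokens) for w in FUNCTION_WORDS]

def distance(p, q):
    """Euclidean distance between two profile vectors."""
    return math.dist(p, q)

known = {
    "Author A sample": "the wind of the sea and the albatross that it slew",
    "Author B sample": "to go in to the town in that hour he chose",
}
disputed = "to walk in to the square in that evening he went"

profiles = {name: profile(text) for name, text in known.items()}
d = profile(disputed)
nearest = min(profiles, key=lambda name: distance(profiles[name], d))
print(nearest)  # → Author B sample
```

In the thesis itself, hierarchical and other cluster analytic methods group all the profile vectors at once rather than computing pairwise nearest neighbours, but the underlying geometry is the same: authorship is inferred from proximity in function-word frequency space.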
Past, present and future of historical information science
This report evaluates the developments and influence of two decades of research in empirically oriented history and its computer-supported methods, and sets out a research paradigm and research infrastructure for future historical information science. A great deal of historical information research has been done in the past; much of it, however, was done outside the field of history and computing, and not within a community like the Association for History and Computing (AHC). The reason is that the AHC never made a clear statement about which audience it addresses: historians with an interest in computing, or historical information scientists. As a result, neither party has been accommodated, and communication with both 'traditional' history and information science has not been established. The author proposes a research program, based on new developments in information science, that avoids these ambiguities and integrates the approaches in an unambiguous scientific research infrastructure. (author's abstract)
The Computational Attitude in Music Theory
Music studies' turn to computation during the twentieth century has engendered particular habits of thought about music, habits that remain in operation long after the music scholar has stepped away from the computer. The computational attitude is a way of thinking about music that is learned at the computer but can be applied away from it. It may be manifest in actual computer use, or in invocations of computationalism, a theory of mind whose influence on twentieth-century music theory is palpable. It may also be manifest in more informal discussions about music, which make liberal use of computational metaphors. In Chapter 1, I describe this attitude, the stakes of considering the computer as one of its instruments, and the kinds of historical sources and methodologies we might draw on to chart its ascendance. The remainder of this dissertation considers distinct and varied cases from the mid-twentieth century in which computers or computationalist musical ideas were used to pursue new musical objects, to quantify and classify musical scores as data, and to instantiate a generally music-structuralist mode of analysis.
I present an account of the decades-long effort to prepare an exhaustive and accurate catalog of the all-interval twelve-tone series (Chapter 2). This problem was first posed in the 1920s but was not solved until 1959, when the composer Hanns Jelinek collaborated with the computer engineer Heinz Zemanek to jointly develop and run a computer program. Recognizing the transformation wrought on modern statistics and communications technology by information theory, I revisit Abraham Moles's book Information Theory and Esthetic Perception (orig. 1958) and use its vocabulary to contextualize contemporary information-theoretic work on music that variously evokes the computational mind, by John R. Pierce and Mary Shannon, Wilhelm Fucks, and Henry Quastler (Chapter 3). I conclude with a detailed look into a score-segmentation algorithm of the influential American music theorist Allen Forte (Chapter 4). Forte was a skilled programmer who spent several years at MIT in the 1960s, with access to cutting-edge computers and the company of first-rank figures in the nascent fields of computer science and artificial intelligence. Each of the researchers whose work is treated in these case studies adopted, at some stage in their relationship with music, what I call the computational attitude, to varying degrees and for diverse ends. Among the many questions this dissertation seeks to answer: what was gained by adopting such an attitude? What was lost? Having understood these past explorations of the computational attitude to music, we are better suited to ask ourselves the same questions today.
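The catalog problem of Chapter 2 is concrete enough to sketch. An all-interval twelve-tone series is an ordering of all twelve pitch classes in which the eleven successive intervals (mod 12) are all distinct. The backtracking enumeration below is my own straightforward approach, not a reconstruction of Jelinek and Zemanek's 1959 program:

```python
# Enumerate all-interval twelve-tone rows starting on pitch class 0.
# A row is valid when all 12 pitch classes appear and its 11
# successive intervals (mod 12) are pairwise distinct.
def all_interval_rows():
    """Backtracking search over rows beginning with pitch class 0."""
    rows = []

    def extend(row, used_pcs, used_ints):
        if len(row) == 12:
            rows.append(tuple(row))
            return
        for pc in range(12):
            if pc in used_pcs:
                continue
            interval = (pc - row[-1]) % 12
            if interval in used_ints:
                continue
            row.append(pc)
            extend(row, used_pcs | {pc}, used_ints | {interval})
            row.pop()

    extend([0], {0}, set())
    return rows

rows = all_interval_rows()
print(len(rows))  # count of all-interval rows starting on 0
```

One property falls out of the definition: since the eleven distinct intervals are a permutation of 1 through 11, they sum to 66 ≡ 6 (mod 12), so every such row ends a tritone away from its starting pitch. The historical catalog was exactly an exhaustive listing of this kind, beyond the reach of hand computation but well within a computer's.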
Essays in political text: new actors, new data, new challenges
The essays in this thesis explore diverse manifestations and different aspects of political text. The two main methodological contributions are bringing forward novel data on political actors who were overlooked by the existing literature, and applying new approaches in text analysis to address substantive questions about them. On the theoretical side, this thesis contributes to the literatures on lobbying, government transparency, post-conflict studies, and gender in politics. In the first paper, on interest groups in the UK, I argue that, contrary to much of the theoretical and empirical literature, the mechanisms of attaining access to government in pluralist systems critically depend on the presence of limits on campaign spending. When such limits exist, political candidates invest few resources in fund-raising, and thus most organizations make few, if any, political donations. I collect and analyse transparency data on government department meetings and show that economic importance is one of the mechanisms that can explain variation in the level of access attained by different groups. Furthermore, I show that Brexit had a diminishing effect on this relationship between economic importance and the level of access. I also study the reported purpose of meetings and, using dynamic topic models, show temporary shifts in the policy agenda during this period. The second paper argues that civil society in post-conflict settings is capable of high-quality deliberation and that, while differing in their focus, both male and female participants can deliver arguments pertaining to the interests of broader societal groups.
Using the transcripts of civil society public consultation meetings across the former Yugoslavia, I show that the lack of gender-sensitive transitional justice instruments could stem not from a lack of women's physical or verbal participation, but from the dynamics of speech enclaves and the genders' different topical focus on aspects of the transitional justice process. Finally, the third paper maps the challenges that lie ahead with the proliferation of research that relies on multiple datasets. In a simulation study I show that, when the linking information is limited to text, noise can potentially occur at different levels and is often hard to anticipate in practice; the choice of record linkage method thus requires balancing between these different scenarios. Taken together, the papers in this thesis advance the field of “text as data” and contribute to our understanding of multiple political phenomena.
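Text-only record linkage of the kind studied in the third paper can be illustrated with a simple fuzzy-matching sketch. The organisation names and threshold below are invented for illustration; they are not the paper's datasets or its linkage method:

```python
# Toy text-only record linkage: match a noisy record against a list of
# canonical names by string similarity, returning None when no
# candidate clears the threshold.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(record, candidates, threshold=0.8):
    """Return the best-matching candidate above the threshold, else None."""
    best = max(candidates, key=lambda c: similarity(record, c))
    return best if similarity(record, best) >= threshold else None

candidates = ["Ministry of Defence", "Ministry of Justice",
              "Department for Transport"]
print(link("Ministry of Defense", candidates))  # noisy spelling still links
print(link("Home Office", candidates))          # no good match → None
```

The simulation point in the paper maps onto the `threshold` choice here: set it too low and distinct entities are merged, too high and noisy variants of the same entity fail to link, and where the noise enters is hard to anticipate in advance.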