Language Trees and Zipping
In this letter we present a very general method to extract information from a
generic string of characters, e.g. a text, a DNA sequence or a time series.
Based on data-compression techniques, its key point is the computation of a
suitable measure of the remoteness of two bodies of knowledge. We present the
implementation of the method to linguistic motivated problems, featuring highly
accurate results for language recognition, authorship attribution and language
classification.

Comment: 5 pages, RevTeX4, 1 eps figure. In press in Phys. Rev. Lett. (January 2002).
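The compression-based "remoteness" measure described above can be illustrated with the normalized compression distance (NCD), a closely related measure in the same spirit; the exact quantity used in the letter may differ, and the choice of `zlib` here is an assumption for the sketch:

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: two strings that share
    structure (e.g. the same language) compress better together,
    giving a smaller distance."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: same-language pairs should score closer
# than cross-language pairs.
english = b"the quick brown fox jumps over the lazy dog " * 20
italian = b"la volpe veloce salta sopra il cane pigro " * 20
```

For language classification, one would compute such pairwise distances between corpora and build a tree from the resulting distance matrix.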
A multi-input deep learning model for C/C++ source code attribution
Code stylometry is applying analysis techniques to a collection of source code or binaries to determine variations in style. The variations extracted are often used to identify the author of the text or to differentiate one piece from another.
In this research, we were able to create a multi-input deep learning model that could accurately categorize and group code from multiple projects. The model took as input word-based tokenization for code comments, character-based tokenization for the source code text, and the metadata features described by A. Caliskan-Islam et al. Using these three inputs, we achieved 90% validation accuracy with a loss value of 0.1203 on 12 projects comprising 5,877 files. Finally, we analyzed the Bitcoin source code using our model, which showed a high-probability match to the OpenSSL project.
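The three input views described above (word tokens from comments, character tokens from code, and layout metadata) can be sketched as simple feature extractors. This is not the authors' actual model or the full Caliskan-Islam et al. feature set; the metrics below are illustrative stand-ins:

```python
def word_tokens(comment: str) -> list[str]:
    # Word-based tokenization for code comments (lowercased, whitespace split).
    return comment.lower().split()


def char_tokens(code: str) -> list[str]:
    # Character-based tokenization for the source text itself.
    return list(code)


def metadata_features(code: str) -> dict[str, float]:
    # A few layout metrics in the spirit of stylometric metadata features
    # (the real feature set is much larger; these are illustrative).
    lines = code.splitlines() or [""]
    return {
        "avg_line_len": sum(map(len, lines)) / len(lines),
        "tab_ratio": code.count("\t") / max(len(code), 1),
        "brace_ratio": (code.count("{") + code.count("}")) / max(len(code), 1),
    }


snippet = "int main() {\n\treturn 0;\n}"
```

In a multi-input network, each view would feed its own input branch (e.g. an embedding layer per tokenization) before the branches are concatenated for classification.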
Machine Learning Techniques for Topic Detection and Authorship Attribution in Textual Data
The unprecedented expansion of user-generated content in recent years demands more effective information filtering in order to extract high-quality information from the huge amount of available data. In this dissertation, we begin with a focus on topic detection from microblog streams, which is the first step toward monitoring and summarizing social data. Then we shift our focus to the authorship attribution task, which is a sub-area of computational stylometry. It is worth mentioning that determining the style of a document is orthogonal to determining its topic, since the document features which capture the style are mainly independent of its topic.

We initially present a frequent pattern mining approach for topic detection from microblog streams. This approach uses a Maximal Sequence Mining (MSM) algorithm to extract pattern sequences, where each pattern sequence is an ordered set of terms. Then we construct a pattern graph, which is a directed graph representation of the mined sequences, and apply a community detection algorithm to group the mined patterns into different topic clusters. Experiments on Twitter datasets demonstrate that the MSM approach achieves high performance in comparison with the state-of-the-art methods.

For authorship attribution, while previously proposed neural models in the literature mainly focus on lexical features and lack multi-level modeling of writing style, we present a syntactic recurrent neural network to encode the syntactic patterns of a document in a hierarchical structure. The proposed model learns the syntactic representation of sentences from the sequence of part-of-speech tags. Furthermore, we present a style-aware neural model to encode document information from three stylistic levels (lexical, syntactic, and structural) and evaluate it in the domain of authorship attribution.
Our experimental results, based on four authorship attribution benchmark datasets, reveal the benefits of encoding document information from all three stylistic levels when compared to the baseline methods in the literature. We extend this work and adopt a transfer learning approach to measure the impact of lower-level linguistic representations versus higher-level linguistic representations on the task of authorship attribution. Finally, we present a self-supervised framework for learning structural representations of sentences. The self-supervised network is a Siamese network with two components: a lexical sub-network and a syntactic sub-network, which take the sequence of words and their corresponding structural labels as input, respectively. This model is trained with a contrastive loss objective. As a result, each word in the sentence is embedded into a vector representation which mainly carries structural information. The learned structural representations can be concatenated to existing pre-trained word embeddings to create style-aware embeddings that carry both semantic and syntactic information and are well-suited for the domain of authorship attribution.
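The pattern-graph construction described in the abstract (a directed graph built from mined term sequences, later clustered into topics) can be sketched as follows; the MSM mining step and the community detection algorithm are not shown, and the example patterns are hypothetical:

```python
from collections import defaultdict


def build_pattern_graph(sequences):
    """Directed graph over terms: add an edge u -> v for each pair of
    adjacent terms in a mined pattern sequence. Community detection on
    this graph would then group terms into topic clusters."""
    graph = defaultdict(set)
    for seq in sequences:
        for u, v in zip(seq, seq[1:]):
            graph[u].add(v)
    return dict(graph)


# Hypothetical mined pattern sequences (ordered sets of terms).
patterns = [
    ("earthquake", "japan", "tsunami"),
    ("japan", "tsunami", "warning"),
]
g = build_pattern_graph(patterns)
```

Overlapping sequences naturally merge in the graph (both patterns contribute to the `japan -> tsunami` edge), which is what allows a community detection pass to recover coherent topic clusters.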
Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?
Authorship verification is the problem of determining if two distinct writing
samples share the same author and is typically concerned with the attribution
of written text. In this paper, we explore the attribution of transcribed
speech, which poses novel challenges. The main challenge is that many stylistic
features, such as punctuation and capitalization, are not available or
reliable. Therefore, we expect a priori that transcribed speech is a more
challenging domain for attribution. On the other hand, other stylistic
features, such as speech disfluencies, may enable more successful attribution
but, being specific to speech, require special purpose models. To better
understand the challenges of this setting, we contribute the first systematic
study of speaker attribution based solely on transcribed speech. Specifically,
we propose a new benchmark for speaker attribution focused on conversational
speech transcripts. To control for spurious associations of speakers with
topic, we employ both conversation prompts and speakers participating in the
same conversation to construct challenging verification trials of varying
difficulties. We establish the state of the art on this new benchmark by
comparing a suite of neural and non-neural baselines, finding that although
written text attribution models achieve surprisingly good performance in
certain settings, they struggle in the hardest settings we consider.
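A verification trial like those described above can be sketched with a simple non-neural baseline: represent each transcript by its character n-gram profile and compare profiles by cosine similarity. This is in the spirit of the non-neural baselines mentioned, not the paper's exact method, and the decision threshold is illustrative:

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    # Character n-gram profile; lowercasing mimics transcripts, where
    # capitalization (and punctuation) are unreliable style cues.
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def same_speaker(t1: str, t2: str, threshold: float = 0.5) -> bool:
    # Verification decision: similarity above the (hypothetical)
    # threshold is taken as "same speaker".
    return cosine(char_ngrams(t1), char_ngrams(t2)) >= threshold
```

Controlling for topic, as the benchmark does, matters precisely because a profile like this can latch onto content words rather than style.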