Automatic authorship analysis using deep neural networks

Abstract

Authorship analysis studies the characteristics that distinguish how different people write. Writing style can be captured in several ways, for example with bag-of-words strategies or handcrafted features. With the growth of the Internet, the amount of user-generated data on social networks such as Facebook and Twitter has increased, and with it the need for automatic methods that analyze the style of a document for tasks such as determining the age or gender of the author, or determining the author of a document from a set of candidate authors. These tasks are known as author profiling and authorship attribution, respectively. Although capturing the style of an author is challenging, in this thesis we explore representation learning strategies that take advantage of the large amount of data generated on social media. We learn representations of the text inputs that capture patterns specific to a single author (authorship attribution) or to a social group of authors (author profiling). The proposed methods were compared on several publicly available social media datasets. Both tasks are addressed using representation learning techniques such as convolutional neural networks and gated multimodal units. Our unimodal author profiling approach was submitted to the author profiling shared task of the PAN laboratory on digital forensics and stylometry. For authorship attribution, we propose a convolutional neural network that takes character n-grams as input. This approach outperformed standard attribution-based methods as well as word-based convolutional neural networks. For author profiling, we propose a convolutional neural network for the unimodal setting and adapt a gated multimodal unit for the multimodal setting.
User-generated content is multimodal: the social group of an author can be determined not only from his or her written texts but also from the images the user shares on social networks. Gated multimodal units outperformed the standard information fusion strategies of early and late fusion.
