More blogging features for author identification

Mohtasseb, Haytham; Ahmed, Amr

research

oai:eprints.lincoln.ac.uk:1862

Authors: Haytham Mohtasseb, Amr Ahmed
Publication date: 25 December 2009
Abstract

In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background; a collection of syntactic and part-of-speech (POS) features; and misspelling-error features. Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.
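As a rough illustration of the idea in the abstract, the sketch below builds a small hybrid feature vector for a single post from three feature groups and enumerates the group combinations one might compare. It is not the authors' implementation: LIWC is a proprietary lexicon, so `TOY_CATEGORIES` is a hypothetical stand-in; `TOY_VOCABULARY` is a toy word list replacing a real spell checker; the single `misspelled_ratio` feature is an assumption and does not reproduce the paper's categorization of misspellings; and POS features are omitted because they would need a tagger.

```python
import re
from itertools import combinations

# Hypothetical stand-in for a LIWC-style lexicon (illustration only).
TOY_CATEGORIES = {
    "pronoun": {"i", "me", "my", "we", "you"},
    "negative": {"sad", "tired", "angry"},
}
# Toy vocabulary used to flag misspellings; a real system would use a spell checker.
TOY_VOCABULARY = {"i", "was", "so", "tired", "today", "and", "my", "friend", "is", "sad"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def category_features(tokens):
    """Fraction of tokens falling into each toy lexicon category."""
    n = len(tokens) or 1
    return {f"cat_{name}": sum(t in words for t in tokens) / n
            for name, words in TOY_CATEGORIES.items()}


def syntactic_features(text, tokens):
    """A few simple surface features: word length and punctuation counts."""
    n = len(tokens) or 1
    return {
        "avg_word_len": sum(len(t) for t in tokens) / n,
        "exclamations": text.count("!"),
        "commas": text.count(","),
    }


def misspelling_features(tokens):
    """Share of tokens not found in the toy vocabulary (assumed feature)."""
    n = len(tokens) or 1
    unknown = [t for t in tokens if t not in TOY_VOCABULARY]
    return {"misspelled_ratio": len(unknown) / n}


def extract(text, enabled):
    """Concatenate the features of the enabled groups into one dict."""
    tokens = tokenize(text)
    groups = {
        "category": lambda: category_features(tokens),
        "syntactic": lambda: syntactic_features(text, tokens),
        "misspelling": lambda: misspelling_features(tokens),
    }
    features = {}
    for name in enabled:
        features.update(groups[name]())
    return features


def feature_set_combinations(names):
    """All non-empty subsets of the feature groups, for comparison runs."""
    return [combo for r in range(1, len(names) + 1)
            for combo in combinations(names, r)]


GROUPS = ["category", "syntactic", "misspelling"]
post = "I was so tierd today, and my freind is sad!"
features = extract(post, GROUPS)
```

Each subset returned by `feature_set_combinations(GROUPS)` could be passed to `extract` and fed to any classifier, which is one way to measure how much each feature group contributes to the final result.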

University of Lincoln Institutional Repository

Last updated on 28/06/2012

This paper was published in University of Lincoln Institutional Repository.
