More blogging features for author identification

By Haytham Mohtasseb and Amr Ahmed


In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background; a collection of syntactic and part-of-speech (POS) features; and misspelling-error features.

Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents (posts) for each author. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.

Topics: G700 Artificial Intelligence, G760 Machine Learning, G720 Knowledge Representation
Year: 2009

