Mining online diaries for blogger identification

By Haytham Mohtasseb and Amr Ahmed

Abstract

In this paper, we present an investigation of authorship identification on personal blogs or diaries, which differ in their text properties from other types of text such as essays, emails, or articles. The investigation uses a couple of intuitive feature sets and studies various parameters that affect the identification performance.

Many studies have addressed the problem of authorship identification in manually collected corpora, but only a few have used real data from existing blogs. The complexity of the language model in personal blogs motivates the problem of identifying the corresponding author. The main contribution of this work is at least threefold. Firstly, we use the LIWC and MRC feature sets together, which were developed with a background in psychology, for the first time for authorship identification on personal blogs. Secondly, we analyze the effect of various parameters and feature sets on the identification performance. These parameters include the number of authors in the data corpus, the post size (word count), and the number of posts for each author. Finally, we study applying authorship identification over a limited set of users who share common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for such a task, especially in the domain of personal blogs.

The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, larger feature sets. The analysis also confirmed the most effective parameters, their ranges in the data corpus, and the usefulness of the common-users classifier in improving performance on the author identification task.

Topics: G700 Artificial Intelligence, G760 Machine Learning, G400 Computer Science, G720 Knowledge Representation
Year: 2009
OAI identifier: oai:eprints.lincoln.ac.uk:1857

