Two-layer classification and distinguished representations of users and documents for grouping and authorship identification

By Haytham Mohtasseb and Amr Ahmed

Abstract

Most studies on authorship identification report a drop in identification performance once the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper makes at least three novel contributions. First, the two-layer approach allows authorship identification to be applied over a larger number of authors (tested over 100 authors), and it is extendable. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs; the secondary layer then determines the particular author within the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter each author's documents across several clusters instead of giving each author a single cluster membership. Third, the extracted clusters are descriptive and meaningful for their users, as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
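
The two-layer pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the user representation, the clustering algorithm, or the classifier, so the sketch assumes a user vector that is the mean of the author's document vectors, a plain k-means over users, and nearest-centroid classification in both layers. All function names and the toy feature vectors are hypothetical.

```python
from collections import defaultdict
from math import dist

def mean(vectors):
    """Component-wise mean of equal-length feature vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def nearest(point, centroids):
    """Key of the centroid closest to point (Euclidean distance)."""
    return min(centroids, key=lambda key: dist(point, centroids[key]))

def cluster_users(user_vecs, k, iters=10):
    """Plain k-means over user vectors (an assumption); returns {author: group}."""
    authors = sorted(user_vecs)
    seeds = [authors[0]]
    while len(seeds) < k:  # deterministic farthest-first seeding
        seeds.append(max(authors, key=lambda a: min(
            dist(user_vecs[a], user_vecs[s]) for s in seeds)))
    centroids = {g: user_vecs[s] for g, s in enumerate(seeds)}
    for _ in range(iters):
        assign = {a: nearest(user_vecs[a], centroids) for a in authors}
        for g in centroids:
            members = [user_vecs[a] for a in authors if assign[a] == g]
            if members:
                centroids[g] = mean(members)
    return assign

def train(docs, k):
    """docs: list of (author, feature_vector). Builds both layers."""
    by_author = defaultdict(list)
    for author, vec in docs:
        by_author[author].append(vec)
    # Assumed user representation: mean of the author's document vectors.
    user_vecs = {a: mean(vecs) for a, vecs in by_author.items()}
    # Clustering runs over users, so each author lands in exactly one group.
    group_of = cluster_users(user_vecs, k)
    # Primary layer: documents are labelled with their author's group.
    by_group = defaultdict(list)
    for author, vec in docs:
        by_group[group_of[author]].append(vec)
    group_centroids = {g: mean(vecs) for g, vecs in by_group.items()}
    # Secondary layer: one small author model per group.
    author_centroids = defaultdict(dict)
    for author, vec in user_vecs.items():
        author_centroids[group_of[author]][author] = vec
    return group_centroids, author_centroids

def predict(model, vec):
    """Primary layer picks the group, secondary layer picks the author in it."""
    group_centroids, author_centroids = model
    group = nearest(vec, group_centroids)
    return group, nearest(vec, author_centroids[group])

# Toy data: four hypothetical authors with two-dimensional document features.
docs = [
    ("ann", (0.0, 0.0)), ("ann", (0.2, 0.0)),
    ("bob", (1.0, 1.0)), ("bob", (1.2, 1.0)),
    ("cat", (9.0, 9.0)), ("cat", (9.2, 9.0)),
    ("dan", (10.0, 10.0)), ("dan", (10.2, 10.0)),
]
model = train(docs, k=2)
print(predict(model, (1.0, 0.9)))
```

Because the secondary layer only compares a document against the authors inside the predicted group, each individual classifier faces far fewer candidate authors than a single flat classifier would, which is the property the abstract relies on to scale beyond 20-25 authors.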

Topics: G700 Artificial Intelligence, G760 Machine Learning, G720 Knowledge Representation
Year: 2009
OAI identifier: oai:eprints.lincoln.ac.uk:2106

