    Two-layer classification and distinguished representations of users and documents for grouping and authorship identification

    Most studies on authorship identification report a drop in identification performance when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper makes at least three novel contributions. First, the two-layer approach allows authorship identification to be applied to a larger number of authors (tested on 100 authors), and it is extensible. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. The secondary layer then determines the particular author inside the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter each author's documents across several clusters, instead of giving each author a single cluster membership. Third, the extracted clusters are descriptive of their users and meaningful, because the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and used to train classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
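The two-layer idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: nearest-centroid classifiers stand in for the unspecified machine learning models, the feature vectors are toy numbers, and `author_to_group` is a hypothetical mapping that stands in for the result of clustering over user representations.

```python
from collections import defaultdict
from math import dist


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def nearest(label_to_centroid, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(label_to_centroid, key=lambda lbl: dist(label_to_centroid[lbl], x))


class TwoLayerIdentifier:
    """Layer 1 picks a group of authors; layer 2 picks an author in that group."""

    def __init__(self, author_to_group):
        # Assumption: the grouping is given. In the abstract it comes from
        # clustering over user representations, which is not reproduced here.
        self.author_to_group = author_to_group

    def fit(self, docs):
        """docs: iterable of (feature_vector, author) pairs."""
        by_author = defaultdict(list)
        for vec, author in docs:
            by_author[author].append(vec)

        by_group = defaultdict(list)
        self.author_centroids = defaultdict(dict)
        for author, vecs in by_author.items():
            group = self.author_to_group[author]
            by_group[group].extend(vecs)
            self.author_centroids[group][author] = centroid(vecs)

        self.group_centroids = {g: centroid(v) for g, v in by_group.items()}
        return self

    def predict(self, vec):
        """Return (group, author) for one anonymous document vector."""
        group = nearest(self.group_centroids, vec)           # primary layer
        author = nearest(self.author_centroids[group], vec)  # secondary layer
        return group, author


# Toy usage: four authors split into two hypothetical groups.
grouping = {"a1": "g1", "a2": "g1", "a3": "g2", "a4": "g2"}
training = [
    ((0, 0), "a1"), ((0, 1), "a1"),
    ((1, 0), "a2"), ((1, 1), "a2"),
    ((10, 10), "a3"), ((10, 11), "a3"),
    ((11, 10), "a4"), ((11, 11), "a4"),
]
model = TwoLayerIdentifier(grouping).fit(training)
```

Because the secondary layer only compares authors inside the selected group, each classifier faces far fewer classes than a single flat classifier over all authors would. An error at the primary layer cannot be recovered at the secondary layer in this sketch.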

    CEAI: CCM based Email Authorship Identification Model

    In this paper we present a model for email authorship identification (EAI) that employs a Cluster-based Classification (CCM) technique. Traditionally, stylometric features have been successfully employed in various authorship analysis tasks; we extend the traditional feature set with further features that are effective for email authorship identification (e.g. the last punctuation mark used in an email, the tendency of an author to use capitalization at the start of an email, or the punctuation after a greeting or farewell). We also include content features selected by information gain. We observe that these features have a positive impact on the accuracy of authorship identification. We performed experiments to support our arguments and compared the results with other baseline models. Experimental results reveal that the proposed CCM-based email authorship identification model, along with the proposed feature set, outperforms the state-of-the-art support vector machine (SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The proposed model attains an accuracy of 94% for 10 authors, 89% for 25 authors, and 81% for 50 authors on the Enron dataset, and 89.5% on a real email dataset constructed by the authors. The Enron results were obtained with a considerably larger number of authors than was used for the models proposed by Iqbal et al. [1, 2].
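The email-specific features named in this abstract could be extracted along the following lines. This is a hedged sketch only: the abstract does not define the features precisely, so the greeting word list, the regular expression, and the function name `email_style_features` are all assumptions made for illustration.

```python
import re
import string

# Hypothetical greeting words; the paper's actual list is not given.
GREETING = re.compile(
    r"^\s*(?:hi|hello|dear|hey)\b[^\n,.:;!?]*([,.:;!?])?",
    re.IGNORECASE,
)


def email_style_features(text):
    """Return three illustrative stylometric features for one email body."""
    # Last punctuation mark used anywhere in the email, or None.
    last_punct = next(
        (ch for ch in reversed(text) if ch in string.punctuation), None
    )

    # Whether the first alphabetic character of the email is upper case.
    first_alpha = next((ch for ch in text if ch.isalpha()), None)
    starts_capitalized = bool(first_alpha and first_alpha.isupper())

    # Punctuation directly following the greeting line, or None.
    # Limitation: an abbreviation such as "Mr." inside the greeting
    # is captured as the greeting punctuation.
    match = GREETING.match(text)
    greeting_punct = match.group(1) if match else None

    return {
        "last_punct": last_punct,
        "starts_capitalized": starts_capitalized,
        "greeting_punct": greeting_punct,
    }


features = email_style_features("hi John,\nplease send the report.\nThanks!")
```

Categorical values such as the punctuation marks would still need to be encoded (for example one-hot) before being passed to a classifier.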

    Coordination, Division of Labor, and Open Content Communities: Template Messages in Wiki-Based Collections

    In this paper we investigate how, in commons-based peer production, a large community of contributors coordinates its efforts towards the production of high-quality open content. We carry out our empirical analysis at the level of articles and focus on the dynamics surrounding their production. That is, we focus on the continuous process of revision and update that results from the spontaneous and largely uncoordinated sequence of contributions by a multiplicity of individuals. We argue that this loosely regulated process, in which any user can make changes to any entry, allows highly creative contributions but has to come to terms with potential issues of quality and consistency in the output. In this respect, we focus on an emergent, bottom-up organizational practice arising within the Wikipedia community, namely the use of template messages, which seems to act as an effective and parsimonious coordination device for emphasizing quality concerns (in terms of accuracy, consistency, completeness, fragmentation, and so on) or for highlighting other particular issues that need to be addressed. We focus on the template "NPOV", which signals breaches of the fundamental neutrality policy for Wikipedia articles, and we show how and to what extent imposing such a template on a page affects the production process and changes the nature and division of labor among participants. We find that the intensity of editing increases immediately after the "NPOV" template appears. Moreover, the articles that are treated most successfully, in the sense that "NPOV" disappears again relatively soon, are those that receive the attention of a limited group of editors. In this dimension at least, the distribution of tasks in Wikipedia looks quite similar to what is known about the distribution in the FLOSS development process.