45,803 research outputs found
Two-layer classification and distinguished representations of users and documents for grouping and authorship identification
Most studies on authorship identification reported a drop in the identification result when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. There are at least 3 novelties in this paper. First, the two-layer approach allows applying authorship identification over larger number of authors (tested over 100 authors), and it is extendable. The authors are divided into groups that contain smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. Then, the secondary layer determines the particular author inside the selected group. In order to extract the groups linking similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is introducing a new user representation that is different from document representation. Without the proposed user representation, the clustering over documents will result in documents of author(s) distributed over several clusters, instead of a single cluster membership for each author. Third, the extracted clusters are descriptive and meaningful of their users as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predicts the group and author of a given document. The results show that the documents are highly correlated with the extracted corresponding groups, and the proposed model can be accurately trained to determine the group and the author identity
An Email Attachment is Worth a Thousand Words, or Is It?
There is an extensive body of research on Social Network Analysis (SNA) based
on the email archive. The network used in the analysis is generally extracted
either by capturing the email communication in From, To, Cc and Bcc email
header fields or by the entities contained in the email message. In the latter
case, the entities could be, for instance, the bag of words, url's, names,
phones, etc. It could also include the textual content of attachments, for
instance Microsoft Word documents, excel spreadsheets, or Adobe pdfs. The nodes
in this network represent users and entities. The edges represent communication
between users and relations to the entities. We suggest taking a different
approach to the network extraction and use attachments shared between users as
the edges. The motivation for this is two-fold. First, attachments represent
the "intimacy" manifestation of the relation's strength. Second, the
statistical analysis of private email archives that we collected and Enron
email corpus shows that the attachments contribute in average around 80-90% to
the archive's disk-space usage, which means that most of the data is presently
ignored in the SNA of email archives. Consequently, we hypothesize that this
approach might provide more insight into the social structure of the email
archive. We extract the communication and shared attachments networks from
Enron email corpus. We further analyze degree, betweenness, closeness, and
eigenvector centrality measures in both networks and review the differences and
what can be learned from them. We use nearest neighbor algorithm to generate
similarity groups for five Enron employees. The groups are consistent with
Enron's organizational chart, which validates our approach.Comment: 12 pages, 4 figures, 7 tables, IML'17, Liverpool, U
- ā¦