19 research outputs found

    Authorship Attribution on the Enron Email Corpus

    In this paper I present authorship attribution on an email corpus. The source I used was the Enron Email Corpus (Cohen, 2009). The emails were reformatted and sorted into four test sets by length: Tiny (≤ 99 characters), Small (100 to 500 characters), Medium (501 to 999 characters), and Large (≥ 1000 characters). The Java Graphical Authorship Attribution Program (JGAAP software) from our Evaluating Variations in Language Laboratory (EVL Lab) was used to perform these tests. Three analysis methods were used: WEKA RandomForest, WEKA SMO, and Centroid with Cosine Distance. Results showed that the Large test set gave the best authorship classification, followed by the Medium, then the Small and the Tiny test sets. WEKA SMO gave better authorship classification than WEKA RandomForest.
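
    The four length-based test sets described above can be illustrated with a short sketch. The function name `length_bucket` is hypothetical, and counting characters with `len` on the message body is an assumption; the abstract does not say how headers or whitespace were handled.

```python
def length_bucket(body: str) -> str:
    """Assign an email body to one of the four length classes from the abstract."""
    n = len(body)  # assumption: length is the raw character count of the body
    if n <= 99:
        return "Tiny"
    if n <= 500:
        return "Small"
    if n <= 999:
        return "Medium"
    return "Large"


if __name__ == "__main__":
    for sample in ("See you at 3.", "x" * 250, "x" * 750, "x" * 4000):
        print(len(sample), length_bucket(sample))
```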

    CEAI: CCM based Email Authorship Identification Model

    In this paper we present a model for email authorship identification (EAI) by employing a Cluster-based Classification (CCM) technique. Traditionally, stylometric features have been successfully employed in various authorship analysis tasks; we extend the traditional feature set to include further features that are effective for email authorship identification (e.g. the last punctuation mark used in an email, the tendency of an author to use capitalization at the start of an email, or the punctuation after a greeting or farewell). We also included content features chosen by Info Gain feature selection. The use of such features has a positive impact on the accuracy of the authorship identification task. We performed experiments to justify our arguments and compared the results with other baseline models. Experimental results reveal that the proposed CCM-based email authorship identification model, along with the proposed feature set, outperforms the state-of-the-art support vector machine (SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The proposed model attains an accuracy rate of 94% for 10 authors, 89% for 25 authors, and 81% for 50 authors on the Enron dataset, while 89.5% accuracy has been achieved on a real email dataset constructed by the authors. The results on the Enron dataset have been achieved on quite a large number of authors as compared to the models proposed by Iqbal et al. [1, 2].
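
    Two of the email-specific features named above (the last punctuation mark and capitalization at the start of an email) could be extracted roughly as follows. This is only a sketch: the function name `style_features`, the dictionary keys, and the use of `string.punctuation` as the punctuation inventory are assumptions, not the paper's feature definitions.

```python
import string


def style_features(body: str) -> dict:
    """Extract two simple stylometric cues from an email body."""
    text = body.strip()
    # Last punctuation character anywhere in the message, or None if there is none.
    last_punct = next((ch for ch in reversed(text) if ch in string.punctuation), None)
    # First alphabetic character, used to check whether the email opens with a capital.
    first_alpha = next((ch for ch in text if ch.isalpha()), None)
    return {
        "last_punctuation": last_punct,
        "starts_capitalized": bool(first_alpha and first_alpha.isupper()),
    }


if __name__ == "__main__":
    print(style_features("hi team,\nsee you soon!"))
```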

    A Fuzzy approach for detecting anomalous behaviour in e-mail traffic

    This paper investigates the use of fuzzy inference for detection of abnormal changes in email traffic communication behaviour. Several communication behaviour measures and metrics are defined for extracting information on the traffic communication behaviour of email users. The information from these behaviour measures is then combined using a hierarchy of fuzzy inference systems, to provide an abnormality rating for overall changes in communication behaviour of suspect email accounts. The use of fuzzy inference is then demonstrated with a case study investigating the email traffic behaviour of a person’s email accounts from the Enron email corpus.

    A New Methodological Frontier in Entrepreneurship Research: Big Data Studies

    The emergence of 'big data' and related analytic techniques is creating opportunities to advance empirical entrepreneurship theory and practice. This editorial focuses on the implications for the design and execution of empirical studies. It offers guidance on how to navigate related methodological challenges and outlines what editors, professional associations, research-method teachers and administrators can do to enable high-quality big data research.

    AN EXPLORATION OF SOCIAL MEDIA IN EXTREME EVENTS: RUMOR THEORY AND TWITTER DURING THE HAITI EARTHQUAKE 2010

    Due to its rapid speed of information spread, wide user bases, and extreme mobility, Twitter is drawing attention as a potential emergency reporting tool under extreme events. However, at the same time, Twitter is sometimes despised as a citizen-based, non-professional social medium for propagating misinformation, rumors, and, in extreme cases, propaganda. This study explores the working dynamics of the rumor mill by analyzing Twitter data of the Haiti Earthquake in 2010. For this analysis, two key variables of anxiety and informational uncertainty are derived from rumor theory, and their interactive dynamics are measured by both quantitative and qualitative methods. Our research finds that information from credible sources helps suppress the level of anxiety in the Twitter community, which leads to rumor control and high information quality.

    Non‐parametric regression for networks

    Network data are becoming increasingly available, and so there is a need to develop suitable methodology for statistical analysis. Networks can be represented as graph Laplacian matrices, which are a type of manifold-valued data. Our main objective is to estimate a regression curve from a sample of graph Laplacian matrices conditional on a set of Euclidean covariates, for example in dynamic networks where the covariate is time. We develop an adapted Nadaraya-Watson estimator which has uniform weak consistency for estimation using Euclidean and power Euclidean metrics. We apply the methodology to the Enron email corpus to model smooth trends in monthly networks and highlight anomalous networks. Another motivating application is given in corpus linguistics, which explores trends in an author's writing style over time based on word co-occurrence networks.
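
    To make the idea of kernel regression on graph Laplacians concrete, here is a minimal sketch of a Nadaraya-Watson style estimate with time as the covariate. It assumes the plain Euclidean metric, under which the estimate reduces to a kernel-weighted average of the observed Laplacians; the Gaussian kernel, the function names `laplacian` and `nw_estimate`, and the `bandwidth` parameter are illustrative assumptions, and the power Euclidean case from the abstract is not covered.

```python
import math


def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix without self-loops."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def nw_estimate(t, times, laplacians, bandwidth):
    """Kernel-weighted mean of Laplacians observed at `times`, evaluated at time t."""
    # Gaussian kernel weights (an assumed kernel choice).
    weights = [math.exp(-0.5 * ((t - ti) / bandwidth) ** 2) for ti in times]
    total = sum(weights)
    n = len(laplacians[0])
    return [
        [sum(w * L[i][j] for w, L in zip(weights, laplacians)) / total for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two snapshots of a two-node network whose single edge weight grows from 1 to 3.
    snapshots = [laplacian([[0, 1], [1, 0]]), laplacian([[0, 3], [3, 0]])]
    print(nw_estimate(1.0, [0.0, 2.0], snapshots, bandwidth=1.0))
```

    A nonnegative weighted average of Laplacians is itself a valid Laplacian, which is why no projection step appears in this Euclidean sketch.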

    UNDERSTANDING COMMUNICATION NETWORK COHESIVENESS DURING ORGANIZATIONAL CRISIS: EFFECTS OF CLIQUE AND TRANSITIVITY

    Various terms such as organizational mortality, organizational death, bankruptcy, decline, retrenchment and failure have been used in the literature to characterize different forms and facets of organizational crisis. Communication network studies have typically focused on nodes (individuals or organizations), relationships between those nodes, and subsequent effects of these relationships upon the network as a whole. Email networks in contemporary organizations are fairly representative of the underlying communications networks. We show that changes in communication networks and their associated group cohesiveness have implications for studying organizational crisis. In this paper, we analyze the changing communication network structure at Enron Corporation during the period of its crisis (2000-2001). Our goal was to understand how communication patterns and structure were affected by organizational crisis. Drawing on communication network crisis and group cohesiveness theory, we tested several propositions using the Enron email corpus: (1) Number of cliques increases, and (2) Communication network becomes increasingly transitive as organizations experience crisis. The results of the tests and their implications are discussed in this paper.
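
    The transitivity measure in proposition (2) can be sketched for an undirected email network as the fraction of connected triples that close into triangles. This is the standard global clustering coefficient, not necessarily the exact statistic the authors computed; the function name `transitivity` and the edge-list input format are assumptions.

```python
from itertools import combinations


def transitivity(edges):
    """Share of connected triples (a - center - b) in which a and b are also linked."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-addressed mail
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    triples = 0
    closed = 0
    for center, adjacent in neighbours.items():
        for a, b in combinations(adjacent, 2):
            triples += 1
            if b in neighbours[a]:
                closed += 1  # each triangle is counted once per center, i.e. three times
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    # Hypothetical sender/recipient pairs: a triangle plus one peripheral contact.
    mail = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]
    print(transitivity(mail))
```

    Computing this value per month would give a time series whose trend could then be compared against the crisis period.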

    Social Network Analysis and Organizational Disintegration: The Case of Enron Corporation

    Email networks in contemporary organizations are fairly representative of the underlying communications networks. We show that changes in communication networks have implications for studying organization disintegration. In this paper, we analyzed the changing communication network structure at Enron Corporation during the period of its disintegration (2000-2001). Our goal was to understand how communication patterns and structure were affected by organizational disintegration. Drawing on (social) network disintegration theory, we tested several propositions using the Enron email corpus: (1) Number of cliques increases, (2) Communication network becomes increasingly centralized, and (3) Connectedness among the top management executives increases, as organizations move towards disintegration. The results of the tests and their implications are discussed.