
    EMail Data Mining: An Approach to Construct an Organization Position-wise Structure While Performing EMail Analysis

    In this age of social networking, it is necessary to define the relationships among the members of a social network. Various techniques are already available to define user-to-user relationships across the network. Over time, many algorithms and machine learning techniques have been applied to find relationships in social networks, yet very few techniques and little information are available for defining relationships directly from raw email data. A few educational societies have developed ways to mine email log files and have found the inter-relations between users by means of clusters. Still, there is no solid technique that can accurately predict the ranking of each user within an organization by mining their email transaction logs. The author in this report presents a technique to mine email log files in order to work out the position-wise structure of an organization. The author also discusses send-receive analysis, statistical analysis, semantic analysis and temporal analysis of the data, and has applied them to test cases. Throughout the research the author has used the Enron employees' email log files, which were made public in 2001.

    Online Social Networks: Measurements, Analysis and Solutions for Mining Challenges

    In the last decade, online social networks have shown enormous growth. With the rise of these networks and the consequent availability of a wealth of social network data, Social Network Analysis (SNA) has given researchers the opportunity to access, analyse and mine the social behaviour of millions of people and to explore the way they communicate and exchange information. Despite the growing interest in analysing social networks, there are challenges and implications accompanying the analysis and mining of these networks. For example, dealing with large-scale and evolving networks is not yet an easy task and still requires a new mining solution. In addition, finding communities within these networks is a challenging task and could open opportunities to see how people behave in groups on a large scale. Also, validating and optimizing communities without knowing the structure of the network in advance, owing to the lack of ground truth, is yet another barrier to establishing the meaningfulness of the resulting communities. In this thesis, we start by providing an overview of the necessary background and key concepts required in the area of social network analysis. Our main focus is to provide solutions to tackle the key challenges in this area. To do so, first, we introduce a technique for predicting the execution time of analysis tasks on evolving networks by applying predictive modeling techniques to the problem of evolving and large-scale networks. Second, we study the performance of existing community detection approaches in deriving high-quality community structure from a real email network by analysing the exchange of emails and exploring community dynamics. The aim is to study community behavioural patterns and evaluate their quality within an actual network. Finally, we propose an ensemble technique for deriving communities using a rich, real internal enterprise network at IBM that reflects real collaborations and communications between employees. The technique aims to improve the community detection process through the fusion of different algorithms.

    A framework for the forensic investigation of unstructured email relationship data

    Our continued reliance on email communications ensures that it remains a major source of evidence during a digital investigation. Emails comprise both structured and unstructured data. Structured data provides qualitative information to the forensics examiner and is typically viewed through existing tools. Unstructured data is more complex as it comprises information associated with social networks, such as relationships within the network, identification of key actors and power relations, and there are currently no standardised tools for its forensic analysis. Moreover, email investigations may involve many hundreds of actors and thousands of messages. This paper posits a framework for the forensic investigation of email data. In particular, it focuses on the triage and analysis of unstructured data to identify key actors and relationships within an email network. This paper demonstrates the applicability of the approach by applying relevant stages of the framework to the Enron email corpus. The paper illustrates the advantage of triaging this data to identify (and discount) actors and potential sources of further evidence. It then applies social network analysis techniques to key actors within the data set. This paper posits that visualisation of unstructured data can greatly aid the examiner in their analysis of evidence discovered during an investigation

    Predicting Diffusion Reach Probabilities via Representation Learning on Social Networks

    Diffusion reach probability between two nodes on a network is defined as the probability of a cascade originating from one node reaching another node. An infinite number of cascades would enable calculation of the true diffusion reach probabilities between any two nodes. However, only a finite number of cascades exists, and one usually has access to only a small portion of all available cascades. In this work, we address the problem of estimating diffusion reach probabilities given only a limited number of cascades and partial information about the underlying network structure. Our proposed strategy employs node representation learning to generate node embeddings and feed them into machine learning algorithms to create models that predict diffusion reach probabilities. We provide experimental analysis using synthetically generated cascades on two real-world social networks. Results show that the proposed method is superior to using values calculated from the available cascades when the portion of cascades is small.
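
The baseline mentioned in the abstract above, "values calculated from available cascades", can be read as a plain frequency estimate: of the observed cascades that start at one node, the fraction that reach another. The sketch below illustrates that reading only. The input format (a list of `(source, reached_nodes)` pairs) and the function name `empirical_reach_probabilities` are assumptions made for illustration, not taken from the paper, and the embedding-based predictor the authors propose is not shown.

```python
from collections import defaultdict


def empirical_reach_probabilities(cascades):
    """Estimate P(a cascade from u reaches v) as an observed frequency.

    `cascades` is an iterable of (source, reached_nodes) pairs. This
    input format is an assumption made for the sketch.
    """
    started = defaultdict(int)  # source -> number of cascades starting there
    reached = defaultdict(int)  # (source, node) -> cascades that reached node
    for source, nodes in cascades:
        started[source] += 1
        for node in set(nodes):  # count each node once per cascade
            if node != source:
                reached[(source, node)] += 1
    # Pairs never observed together are simply absent (estimate of zero).
    return {pair: count / started[pair[0]] for pair, count in reached.items()}


# Hypothetical toy data: two cascades from "a", one from "b".
cascades = [
    ("a", ["a", "b", "c"]),
    ("a", ["a", "b"]),
    ("b", ["b", "c"]),
]
probs = empirical_reach_probabilities(cascades)
```

With few cascades, such frequencies are coarse and many node pairs never appear at all, which is the regime in which the abstract reports the learned model doing better.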

    POSTER: Evaluating Privacy Metrics for Graph Anonymization and De-anonymization

    Many modern communication systems generate graph data, for example social networks and email networks. Such graph data can be used for recommender systems and data mining. However, because graph data contains sensitive information about individuals, sharing or publishing graph data may pose privacy risks. To protect graph privacy, data anonymization has been proposed to prevent individual users in a graph from being identified by adversaries. The effectiveness of both anonymization and de-anonymization techniques is usually evaluated using the adversary’s success rate. However, the success rate does not measure privacy for individual users in a graph because it is an aggregate per-graph metric. In addition, it is unclear whether the success rate is monotonic, i.e. whether it indicates higher privacy for weaker adversaries, and lower privacy for stronger adversaries. To address these gaps, we propose a methodology to systematically evaluate the monotonicity of graph privacy metrics, and present preliminary results for the monotonicity of 25 graph privacy metrics

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms
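
One way to picture edge sampling combined with graph induction, in the static setting only, is to draw a random subset of edges, collect their endpoints, and then keep every original edge whose two endpoints were both collected. The sketch below is written under that assumption. The function name `edge_sample_with_induction` and its parameters are hypothetical, and it should not be taken as the authors' exact algorithm or as their streaming variant.

```python
import random


def edge_sample_with_induction(edges, fraction, seed=None):
    """Sample edges uniformly, then induce the subgraph on their endpoints.

    `edges` is a list of (u, v) pairs for a static, undirected graph.
    `fraction` is the share of edges drawn in the first step.
    Both the name and the signature are assumptions for this sketch.
    """
    edges = list(edges)
    if not edges:
        return set(), []
    rng = random.Random(seed)
    k = min(len(edges), max(1, round(fraction * len(edges))))
    sampled = rng.sample(edges, k)  # step 1: uniform edge sample
    nodes = {n for edge in sampled for n in edge}  # endpoints of sampled edges
    # Step 2 (induction): keep every original edge between selected nodes.
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, induced


# Hypothetical toy graph: a triangle plus a short tail.
graph = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
nodes, induced = edge_sample_with_induction(graph, fraction=0.4, seed=0)
```

The induction step can only add edges to the initial sample, so the result is at least as dense as the sampled edge set.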