EMail Data Mining: An Approach to Construct an Organization Position-wise Structure While Performing EMail Analysis
In this age of social networking, it is necessary to define the relationships among the members of a social network. Various techniques are already available to define user-to-user relationships across the network. Over time, many algorithms and machine learning techniques have been applied to find relationships in social networks, yet very few techniques and little information are available to define a relation directly over raw email data. A few educational societies have developed ways to mine email log files and have found the inter-relations between users by means of clusters. Still, there is no solid technique available that can accurately predict the ranking of each user within an organization by mining their email transaction logs. The author of this report presents a technique to mine email log files in order to determine the position-wise structure of an organization. The author also discusses send-receive analysis, statistical analysis, semantic analysis and temporal analysis of the data, and has applied them to test cases. Throughout the research the author has used the Enron employees' email log files, which were made public in 2001.
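The abstract does not say how its send-receive analysis is computed. As a rough illustration only, the sketch below counts how many messages each address sent and received. The input format (a list of `(sender, recipients)` pairs) and the function name `send_receive_counts` are assumptions made for this example, not details taken from the report.

```python
from collections import Counter


def send_receive_counts(messages):
    """Count messages sent and received per address.

    `messages` is an assumed format: an iterable of (sender, recipients)
    pairs, where `recipients` is a list of addresses.
    """
    sent = Counter()
    received = Counter()
    for sender, recipients in messages:
        sent[sender] += 1
        # Count each recipient once per message, even if listed twice.
        for recipient in set(recipients):
            received[recipient] += 1
    return sent, received


# Hypothetical toy log, not drawn from the Enron data.
messages = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("bob", ["alice"]),
]
sent, received = send_receive_counts(messages)
```

Counts like these could serve as simple per-user features. Whether they say anything about a person's position in an organization is a separate question that the report addresses with its own methods.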
Online Social Networks: Measurements, Analysis and Solutions for Mining Challenges
In the last decade, online social networks showed enormous growth. With the rise
of these networks and the consequent availability of a wealth of social network data,
Social Network Analysis (SNA) has given researchers the opportunity to access, analyse
and mine the social behaviour of millions of people, and to explore the way they
communicate and exchange information.
Despite the growing interest in analysing social networks, there are some challenges
and implications accompanying the analysis and mining of these networks. For example,
dealing with large-scale and evolving networks is not yet an easy task and still requires
a new mining solution. In addition, finding communities within these networks is a
challenging task and could open opportunities to see how people behave in groups on a
large scale. Also, validating and optimizing communities without knowing the structure
of the network in advance, owing to the lack of ground truth, is yet another barrier to
establishing the meaningfulness of the resulting communities.
In this thesis, we start by providing an overview of the necessary background and key
concepts required in the area of social network analysis. Our main focus is to provide
solutions to tackle the key challenges in this area. To do so, first, we introduce a
technique for predicting the execution time of analysis tasks on evolving, large-scale
networks using predictive modeling. Second, we study the performance of existing community
detection approaches to derive high quality community structure using a real email
network through analysing the exchange of emails and exploring community dynamics.
The aim is to study the community behavioral patterns and evaluate their quality within
an actual network. Finally, we propose an ensemble technique for deriving communities
using a rich, real internal enterprise network at IBM that reflects real collaborations
and communications between employees. The technique aims to improve the community
detection process through the fusion of different algorithms.
A framework for the forensic investigation of unstructured email relationship data
Our continued reliance on email communication ensures that it remains a major source of evidence during a digital investigation. Emails comprise both structured and unstructured data. Structured data provides qualitative information to the forensics examiner and is typically viewed through existing tools. Unstructured data is more complex as it comprises information associated with social networks, such as relationships within the network, identification of key actors and power relations, and there are currently no standardised tools for its forensic analysis. Moreover, email investigations may involve many hundreds of actors and thousands of messages. This paper posits a framework for the forensic investigation of email data. In particular, it focuses on the triage and analysis of unstructured data to identify key actors and relationships within an email network. This paper demonstrates the applicability of the approach by applying relevant stages of the framework to the Enron email corpus. The paper illustrates the advantage of triaging this data to identify (and discount) actors and potential sources of further evidence. It then applies social network analysis techniques to key actors within the data set. This paper posits that visualisation of unstructured data can greatly aid the examiner in their analysis of evidence discovered during an investigation.
Predicting Diffusion Reach Probabilities via Representation Learning on Social Networks
Diffusion reach probability between two nodes on a network is defined as the
probability of a cascade originating from one node reaching another node. An
infinite number of cascades would enable calculation of true diffusion reach
probabilities between any two nodes. However, there exists only a finite number
of cascades and one usually has access only to a small portion of all available
cascades. In this work, we addressed the problem of estimating diffusion reach
probabilities given only a limited number of cascades and partial information
about underlying network structure. Our proposed strategy employs node
representation learning to generate and feed node embeddings into machine
learning algorithms to create models that predict diffusion reach
probabilities. We provide experimental analysis using synthetically generated
cascades on two real-world social networks. Results show that the proposed method
is superior to using values calculated from available cascades when the portion
of cascades is small.
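The baseline mentioned in the abstract, "values calculated from available cascades", can be read as an empirical frequency: among the observed cascades that start at one node, the fraction that reach another. The sketch below illustrates that reading only. It is not the paper's embedding-based method, and both the cascade format (a list of nodes whose first element is the source) and the name `empirical_reach_probabilities` are assumptions made for this example.

```python
from collections import defaultdict


def empirical_reach_probabilities(cascades):
    """Estimate P(cascade from u reaches v) as a simple frequency.

    Assumed format: each cascade is a list of nodes, and the first
    element is the node where the cascade originated.
    """
    started = defaultdict(int)  # number of cascades originating at each node
    reached = defaultdict(int)  # number of cascades from u that included v
    for cascade in cascades:
        if not cascade:
            continue
        source = cascade[0]
        started[source] += 1
        # Count each reached node once per cascade, excluding the source.
        for node in set(cascade[1:]) - {source}:
            reached[(source, node)] += 1
    return {
        (source, node): count / started[source]
        for (source, node), count in reached.items()
    }


# Hypothetical toy cascades for illustration.
cascades = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
probs = empirical_reach_probabilities(cascades)
```

With only a few cascades per source, these frequencies are coarse, and pairs never observed together get no estimate at all. That is the small-sample regime in which the abstract reports the learned models doing better.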
POSTER: Evaluating Privacy Metrics for Graph Anonymization and De-anonymization
Many modern communication systems generate graph data, for
example social networks and email networks. Such graph data
can be used for recommender systems and data mining. However,
because graph data contains sensitive information about individuals,
sharing or publishing graph data may pose privacy risks. To protect
graph privacy, data anonymization has been proposed to prevent
individual users in a graph from being identified by adversaries.
The effectiveness of both anonymization and de-anonymization
techniques is usually evaluated using the adversary’s success rate.
However, the success rate does not measure privacy for individual
users in a graph because it is an aggregate per-graph metric. In
addition, it is unclear whether the success rate is monotonic, i.e.
whether it indicates higher privacy for weaker adversaries, and
lower privacy for stronger adversaries. To address these gaps, we
propose a methodology to systematically evaluate the monotonicity
of graph privacy metrics, and present preliminary results for the
monotonicity of 25 graph privacy metrics.
Network Sampling: From Static to Streaming Graphs
Network sampling is integral to the analysis of social, information, and
biological networks. Since many real-world networks are massive in size,
continuously evolving, and/or distributed in nature, the network structure is
often sampled in order to facilitate study. For these reasons, a more thorough
and complete understanding of network sampling is critical to support the field
of network science. In this paper, we outline a framework for the general
problem of network sampling, by highlighting the different objectives,
population and units of interest, and classes of network sampling methods. In
addition, we propose a spectrum of computational models for network sampling
methods, ranging from the traditionally studied model based on the assumption
of a static domain to a more challenging model that is appropriate for
streaming domains. We design a family of sampling methods based on the concept
of graph induction that generalize across the full spectrum of computational
models (from static to streaming) while efficiently preserving many of the
topological properties of the input graphs. Furthermore, we demonstrate how
traditional static sampling algorithms can be modified for graph streams for
each of the three main classes of sampling methods: node, edge, and
topology-based sampling. Our experimental results indicate that our proposed
family of sampling methods more accurately preserves the underlying properties
of the graph for both static and streaming graphs. Finally, we study the impact
of network sampling algorithms on the parameter estimation and performance
evaluation of relational classification algorithms.
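To make the streaming setting concrete, the sketch below draws a uniform sample of edges from an edge stream in a single pass using standard reservoir sampling. This is a generic textbook technique shown for illustration; it is not the graph-induction family of methods proposed in the paper. The function names and the stream format (an iterable of `(u, v)` tuples) are assumptions made for this example.

```python
import random


def reservoir_sample_edges(edge_stream, k, seed=None):
    """Keep a uniform random sample of k edges from a stream in one pass.

    Assumed format: `edge_stream` is any iterable of (u, v) tuples.
    Memory use is O(k) regardless of the stream length.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # Fill the reservoir with the first k edges.
            reservoir.append(edge)
        else:
            # Replace a stored edge with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = edge
    return reservoir


def sampled_nodes(edges):
    """Return the set of nodes touched by the sampled edges."""
    return {node for edge in edges for node in edge}


# Hypothetical stream: a path graph with 100 edges.
stream = [(i, i + 1) for i in range(100)]
sample = reservoir_sample_edges(stream, 10, seed=0)
nodes = sampled_nodes(sample)
```

Uniform edge sampling of this kind does not by itself preserve topological properties such as clustering, which is the gap the abstract says its methods address.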