Inferring Social Status and Rich Club Effects in Enterprise Communication Networks
Social status, defined as the relative rank or position that an individual
holds in a social hierarchy, is known to be among the most important motivating
forces in social behaviors. In this paper, we consider the notion of status
from the perspective of a position or title held by a person in an enterprise.
We study the intersection of social status and social networks in an
enterprise. We study whether enterprise communication logs can help reveal how
social interactions and individual status manifest themselves in social
networks. To that end, we use two enterprise datasets with three communication
channels --- voice call, short message, and email --- to demonstrate the
social-behavioral differences among individuals of different status. Based on
our findings, we also develop a model to predict social status. On the
individual level, high-status individuals are more likely to span structural
holes, linking to people in parts of the enterprise network that are otherwise
not well connected to one another. On the community level, the principles of
homophily,
social balance and clique theory generally indicate a "rich club" maintained by
high-status individuals, in the sense that this community is much more
connected, balanced and dense. Our model can predict social status of
individuals with 93% accuracy.
Comment: 13 pages, 4 figures
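The structural-hole observation above can be illustrated with a small sketch. It computes Borgatti's simplified form of Burt's effective size (n - 2t/n for an unweighted, undirected graph), which is one common way to quantify how strongly a node spans structural holes. The abstract does not say which measure the paper uses, so the choice of measure, the toy network, and every name below are assumptions made for illustration only.

```python
from itertools import combinations


def effective_size(adj, node):
    """Borgatti's simplification of Burt's effective size for an unweighted,
    undirected graph: n - 2t/n, where n is the number of neighbours of `node`
    and t is the number of ties among those neighbours."""
    neighbours = adj[node]
    n = len(neighbours)
    if n == 0:
        return 0.0
    ties = sum(1 for u, v in combinations(sorted(neighbours), 2) if v in adj[u])
    return n - 2.0 * ties / n


# Hypothetical toy network: "boss" bridges two teams that never talk directly.
edges = [
    ("boss", "a1"), ("boss", "a2"), ("boss", "b1"), ("boss", "b2"),
    ("a1", "a2"), ("b1", "b2"),
]

# Build an undirected adjacency map.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = {node: effective_size(adj, node) for node in adj}
```

In this toy network the bridging node receives the highest effective size, because half of its contacts are not connected to the other half.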
Identifying Graphs from Noisy Observational Data
There is a growing amount of data describing networks -- examples include social networks, communication networks, and biological networks. As the amount of available data increases, so does our interest in analyzing the properties and characteristics of these networks. However, in most cases the data is noisy, incomplete, and the result of passively acquired observational data; naively analyzing these networks without taking these errors into account can result in inaccurate and misleading conclusions. In my dissertation, I study the tasks of entity resolution, link prediction, and collective classification to address these deficiencies. I describe these tasks in detail and discuss my own work on each of them. For entity resolution, I develop a method for resolving the identities of name mentions in email communications. For link prediction, I develop a method for inferring subordinate-manager relationships between individuals in an email communication network. For collective classification, I propose an adaptive active surveying method to address node labeling in a query-driven setting on network data. In many real-world settings, however, these deficiencies are not found in isolation and all need to be addressed to infer the desired complete and accurate network. Furthermore, because of the dependencies typically found in these tasks, the tasks are inherently inter-related and must be performed jointly. I define the general problem of graph identification, which performs these tasks simultaneously, removing the noise and missing values in the observed input network and inferring the complete and accurate output network. I present a novel approach to graph identification using a collection of Coupled Collective Classifiers, C3, which, in addition to capturing the variety of features typically used for each task, can capture the intra- and inter-dependencies required to correctly infer nodes, edges, and labels in the output network.
I discuss variants of C3 using different learning and inference paradigms and show the superior performance of C3, in terms of both prediction quality and runtime performance, over various previous approaches. I then conclude by presenting the Graph Alignment, Identification, and Analysis (GAIA) open-source software library, which provides not only an implementation of C3 but also algorithms for various tasks in network data such as entity resolution, link prediction, collective classification, clustering, active learning, data generation, and analysis.
Characterizing and Detecting Unrevealed Elements of Network Systems
This dissertation addresses the problem of discovering and characterizing unknown elements in network systems. Klir (1985) provides a general definition of a system as “... a set of some things and a relation among the things” (p. 4). A system where the 'things', i.e. nodes, are related through links is a network system (Klir, 1985). The nodes can represent a range of entities such as machines or people (Pearl, 2001; Wasserman & Faust, 1994). Likewise, links can represent abstract relationships such as causal influence or more visible ties such as roads (Pearl, 1988, pp. 50-51; Wasserman & Faust, 1994; Winston, 1994, p. 394). It is not uncommon to have incomplete knowledge of network systems due to either passive circumstances, e.g. limited resources to observe a network, active circumstances, e.g. intentional acts of concealment, or some combination of active and passive influences (McCormick & Owen, 2000, p. 175; National Research Council, 2005, pp. 7, 11). This research provides statistical and graph theoretic approaches for such situations, including those in which nodes are causally related (Geiger & Pearl, 1990, pp. 3, 10; Glymour, Scheines, Spirtes, & Kelly, 1987, pp. 75-86, 178-183; Murphy, 1998; Verma & Pearl, 1991, pp. 257, 260, 264-265). A related aspect of this research is accuracy assessment. An analyst could fail to detect a network element, or be aware of network elements but draw incorrect conclusions about the associated network system structure (Borgatti, Carley, & Krackhardt, 2006). These possibilities require assessment of the accuracy of the observed and conjectured network systems, and this research provides a means to do so (Cavallo & Klir, 1979, p. 143; Kelly, 1957, p. 968).
Reconstructing propagation networks with natural diversity and identifying hidden sources
Our ability to uncover complex network structure and dynamics from data is
fundamental to understanding and controlling collective dynamics in complex
systems. Despite recent progress in this area, reconstructing networks with
stochastic dynamical processes from limited time series remains an
outstanding problem. Here we develop a framework based on compressed sensing to
reconstruct complex networks on which stochastic spreading dynamics take place.
We apply the methodology to a large number of model and real networks, finding
that a full reconstruction of inhomogeneous interactions can be achieved from
small amounts of polarized (binary) data, a virtue of compressed sensing.
Further, we demonstrate that a hidden source that triggers the spreading
process but is externally inaccessible can be ascertained and located with high
confidence in the absence of direct routes of propagation from it. Our approach
thus establishes a paradigm for tracing and controlling epidemic invasion and
information diffusion in complex networked systems.
Comment: 20 pages and 5 figures. For Supplementary information, please see
http://www.nature.com/ncomms/2014/140711/ncomms5323/full/ncomms5323.html#
Detecting hierarchical relationships and roles from online interaction networks
In social networks, analysing the explicit interactions among users can help in
inferring hierarchical relationships and roles that may be implicit. In this thesis,
we focus on two objectives: detecting hierarchical relationships between users and
inferring the hierarchical roles of users interacting via the same online communication
medium. In both cases, we show that considering the temporal dimension of
interaction substantially improves the detection of relationships and roles.
The first focus of this thesis is on the problem of inferring implicit relationships
from interactions between users. Based on promising results obtained by standard
link-analysis methods such as PageRank and Rooted-PageRank (RPR), we introduce
three novel time-based approaches, "Time-F" based on a defined time function,
Filter and Refine (FiRe) which is a hybrid approach based on RPR and Time-F,
and Time-sensitive Rooted-PageRank (T-RPR) which applies RPR in a way that
takes into account the time-dimension of interactions in the process of detecting
hierarchical ties.
We experiment on two datasets: the Enron email dataset, to infer manager-subordinate
relationships from email exchanges, and a scientific publication co-authorship
dataset, to detect PhD advisor-advisee relationships from paper co-authorships.
Our experiments demonstrate that time-based methods perform better in terms of
recall. In particular, T-RPR turns out to be superior to most recent competitor
methods as well as all other approaches we propose.
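As a rough illustration of combining Rooted PageRank with the time dimension of interactions, the sketch below runs a rooted (personalised) PageRank on a graph whose edge weights are sums of exponentially decayed message ages. This is not the thesis's Time-F function or its T-RPR algorithm, neither of which is specified in the abstract. The exponential half-life decay, the function name, the parameters, and the toy message log are all assumptions.

```python
def time_weighted_rooted_pagerank(messages, root, now, half_life=30.0,
                                  alpha=0.15, iters=100):
    """Rooted PageRank by power iteration on time-decayed edge weights.

    messages: iterable of (sender, recipient, timestamp_in_days).
    Each message adds 0.5 ** (age / half_life) to the sender->recipient
    weight (an assumed decay; older messages count less).
    """
    weights = {}
    nodes = {root}
    for sender, recipient, t in messages:
        decay = 0.5 ** ((now - t) / half_life)
        out = weights.setdefault(sender, {})
        out[recipient] = out.get(recipient, 0.0) + decay
        nodes.update((sender, recipient))

    rank = {n: (1.0 if n == root else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, r in rank.items():
            out = weights.get(n)
            if not out:
                nxt[root] += r  # dangling node: all mass returns to the root
                continue
            total = sum(out.values())
            nxt[root] += alpha * r  # restart at the root with probability alpha
            for m, w in out.items():
                nxt[m] += (1.0 - alpha) * r * w / total
        rank = nxt
    return rank


# Hypothetical message log: ann wrote to bob long ago and to cat recently.
messages = [
    ("ann", "bob", 0), ("ann", "bob", 5),
    ("ann", "cat", 95), ("ann", "cat", 98),
    ("bob", "ann", 10), ("cat", "ann", 99),
]
rank = time_weighted_rooted_pagerank(messages, root="ann", now=100)
```

With the same number of messages to each contact, the recent contact scores higher than the stale one, which is the effect a time-sensitive ranking is meant to capture.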
The second focus of this thesis is examining the online communication behaviour
of users working on the same activity in order to identify the different hierarchical
roles played by the users. We propose two approaches. In the first approach, supervised
learning is used to train different classification algorithms. In the second
approach, we address the problem as a sequence classification problem. A novel
sequence classification framework is defined that generates time-dependent
features based on frequent patterns at multiple levels of time granularity.
Our framework is a flexible technique for sequence classification that can be
applied in different domains.
We experiment on an educational dataset collected from an asynchronous communication
tool used by students to accomplish an underlying group project. Our
experimental findings show that the first supervised approach achieves the best mapping
of students to their roles when the individual attributes of the students, information
about the reply relationships among them as well as quantitative time-based
features are considered. Similarly, our multi-granularity pattern-based framework
shows competitive performance in detecting the students' roles. Both approaches
are significantly better than the baselines considered.
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods
Causal relationships are commonly examined in manufacturing processes to
support fault investigations, perform interventions, and make strategic
decisions. Industry 4.0 has made available an increasing amount of data that
enable data-driven Causal Discovery (CD). Considering the growing number of
recently proposed CD methods, it is necessary to introduce strict benchmarking
procedures on publicly available datasets since they represent the foundation
for a fair comparison and validation of different methods. This work introduces
two novel public datasets for CD in continuous manufacturing processes. The
first dataset employs the well-known Tennessee Eastman simulator for fault
detection and process control. The second dataset is extracted from an
ultra-processed food manufacturing plant, and it includes a description of the
plant, as well as multiple ground truths. These datasets are used to propose a
benchmarking procedure based on different metrics and evaluated on a wide
selection of CD algorithms. This work allows testing CD methods in realistic
conditions enabling the selection of the most suitable method for specific
target applications. The datasets are available at the following link:
https://github.com/giovanniMen
Comment: Supplementary Materials at:
https://github.com/giovanniMen/CPCaD-Benc
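The abstract does not list the metrics used in the benchmarking procedure. As a hedged illustration of how a discovered causal graph is commonly scored against a ground truth, the sketch below computes edge precision, recall, F1, and a structural Hamming distance (SHD) in which a reversed edge counts as one error. The function names, the choice of metrics, and the toy graphs are assumptions, not the benchmark's actual procedure.

```python
def edge_metrics(true_edges, predicted_edges):
    """Precision, recall and F1 over directed edges given as (cause, effect)."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    tp = len(true_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


def structural_hamming_distance(true_edges, predicted_edges):
    """Number of node pairs whose edge status differs between the two graphs.
    Missing, extra and reversed edges each count once; self-loops are assumed
    absent."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    pairs = {frozenset(edge) for edge in true_set | pred_set}
    shd = 0
    for pair in pairs:
        a, b = sorted(pair)
        truth = {e for e in ((a, b), (b, a)) if e in true_set}
        pred = {e for e in ((a, b), (b, a)) if e in pred_set}
        if truth != pred:
            shd += 1
    return shd


# Hypothetical ground truth and output of some causal discovery method.
true_graph = [("X1", "X2"), ("X2", "X3"), ("X1", "X4")]
found_graph = [("X1", "X2"), ("X3", "X2"), ("X2", "X4")]

precision, recall, f1 = edge_metrics(true_graph, found_graph)
shd = structural_hamming_distance(true_graph, found_graph)
```

Here one edge is recovered, one is reversed, one is missed, and one is spurious, so the two graphs differ on three node pairs.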