BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking
Data generation is a key issue in big data benchmarking that aims to generate
application-specific data sets to meet the 4V requirements of big data.
Specifically, big data generators need to generate scalable data (Volume) of
different types (Variety) under controllable generation rates (Velocity) while
keeping the important characteristics of raw data (Veracity). This raises new
challenges in designing generators that are both efficient and effective. To date,
most existing techniques can generate only limited types of data and support only
specific big data systems such as Hadoop. Hence we develop
a tool, called Big Data Generator Suite (BDGS), to efficiently generate
scalable big data while employing data models derived from real data to
preserve data veracity. The effectiveness of BDGS is demonstrated by developing
six data generators covering three representative data types (structured,
semi-structured and unstructured) and three data sources (text, graph, and
table data).
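
As an illustration of the generation principle described above (derive a data model from real seed data, then synthesize data at arbitrary volume from that model), the following Python sketch shows a toy generator for unstructured text. It is not the BDGS implementation; all function names and parameters are illustrative assumptions.

# Minimal sketch of the BDGS idea for unstructured text data (not the actual
# BDGS implementation): learn a simple word-frequency model from a small seed
# corpus (veracity), then generate an arbitrary volume of synthetic documents
# from it (volume). All names here are illustrative.
import random
from collections import Counter

def learn_word_model(seed_documents):
    """Estimate an empirical word distribution from real seed data."""
    counts = Counter(word for doc in seed_documents for word in doc.split())
    total = sum(counts.values())
    words = list(counts)
    weights = [counts[w] / total for w in words]
    return words, weights

def generate_documents(model, num_docs, words_per_doc):
    """Sample synthetic documents that preserve the seed word distribution."""
    words, weights = model
    for _ in range(num_docs):
        yield " ".join(random.choices(words, weights=weights, k=words_per_doc))

seed = ["big data benchmarking needs realistic data",
        "data generators must scale to large volumes"]
model = learn_word_model(seed)
for doc in generate_documents(model, num_docs=3, words_per_doc=8):
    print(doc)
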
A Model of Consistent Node Types in Signed Directed Social Networks
Signed directed social networks, in which the relationships between users can
be either positive (indicating relations such as trust) or negative (indicating
relations such as distrust), are increasingly common. Thus the interplay
between positive and negative relationships in such networks has become an
important research topic. Most recent investigations focus upon edge sign
inference using structural balance theory or social status theory. Neither of
these two theories, however, can explain an observed edge sign well when the
two nodes connected by this edge do not share a common neighbor (e.g., common
friend). In this paper we develop a novel approach to handle this situation by
applying a new model for node types. Initially, we analyze the local node
structure in a fully observed signed directed network, inferring underlying
node types. The sign of an edge between two nodes must be consistent with their
types; this explains edge signs well even when there are no common neighbors.
We show, moreover, that our approach can be extended to incorporate directed
triads, when they exist, just as in models based upon structural balance or
social status theory. We compute Bayesian node types within empirical studies
based upon partially observed Wikipedia, Slashdot, and Epinions networks in
which the largest network (Epinions) has 119K nodes and 841K edges. Our
approach yields better performance than state-of-the-art approaches for these
three signed directed networks.
Comment: To appear in the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 201
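
The paper's Bayesian node-type model is not reproduced here, but the following toy Python sketch illustrates the underlying idea: infer a latent type for each node from the signs of its observed incident edges, then predict a missing edge sign from the two endpoint types alone, so no common neighbor is required. The scoring and thresholding rules are illustrative assumptions, not the paper's method.

# Toy sketch of the node-type intuition (not the paper's Bayesian model):
# each node gets a latent "trustworthiness" score estimated from the signs of
# its observed incident edges, and a missing edge sign is predicted from the
# two endpoint scores alone.
from collections import defaultdict

def estimate_node_types(signed_edges):
    """Score each node by the fraction of positive edges it participates in."""
    pos, total = defaultdict(int), defaultdict(int)
    for u, v, sign in signed_edges:          # sign is +1 or -1
        for node in (u, v):
            total[node] += 1
            if sign > 0:
                pos[node] += 1
    return {n: pos[n] / total[n] for n in total}   # score in [0, 1]

def predict_sign(types, u, v, threshold=0.5):
    """Predict +1 if both endpoints look trustworthy enough, else -1."""
    score = types.get(u, threshold) * types.get(v, threshold)
    return 1 if score >= threshold * threshold else -1

edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", -1), ("d", "e", -1)]
types = estimate_node_types(edges)
print(predict_sign(types, "a", "e"))   # a and e share no common neighbor
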
Entrograms and coarse graining of dynamics on complex networks
Using an information theoretic point of view, we investigate how a dynamics
acting on a network can be coarse grained through the use of graph partitions.
Specifically, we are interested in how aggregating the state space of a Markov
process according to a partition affects the resulting lower-dimensional
dynamics. We highlight that for a dynamics on a particular graph there may be
multiple coarse grained descriptions that capture different, incomparable
features of the original process. For instance, a coarse graining induced by
one partition may be commensurate with a time-scale separation in the dynamics,
while another coarse graining may correspond to a different lower-dimensional
dynamics that preserves the Markov property of the original process. Taking
inspiration from the literature of Computational Mechanics, we find that a
convenient tool to summarise and visualise such dynamical properties of a
coarse grained model (partition) is the entrogram. The entrogram gathers
certain information-theoretic measures, which quantify how information flows
across time steps. These information theoretic quantities include the entropy
rate, as well as a measure for the memory contained in the process, i.e., how
well the dynamics can be approximated by a first order Markov process. We use
the entrogram to investigate how specific macro-scale connection patterns in
the state-space transition graph of the original dynamics result in desirable
properties of coarse grained descriptions. We thereby provide a fresh
perspective on the interplay between structure and dynamics in networks, and
the process of partitioning from an information-theoretic perspective. We focus
on networks that may be approximated by either a core-periphery or a clustered
organization, and highlight that each of these coarse grained descriptions can
capture different aspects of a Markov process acting on the network.
Comment: 17 pages, 6 figures
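
For readers who want the basic construction in code, the sketch below (assuming NumPy) shows how a row-stochastic transition matrix can be lumped according to a state partition and how the entropy rate of the aggregated chain is computed. The full entrogram collects several such information-theoretic quantities across time lags and is not reproduced here; the example network and partition are illustrative.

# Minimal sketch: coarse grain a Markov chain by a state partition and compute
# the entropy rate of the aggregated process.
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of the row-stochastic matrix P for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def coarse_grain(P, pi, partition):
    """Aggregate states into blocks, weighting rows by their stationary mass."""
    k = len(partition)
    Q = np.zeros((k, k))
    pi_block = np.array([pi[block].sum() for block in partition])
    for a, A in enumerate(partition):
        for b, B in enumerate(partition):
            Q[a, b] = np.sum(pi[A, None] * P[np.ix_(A, B)]) / pi_block[a]
    return Q, pi_block

def entropy_rate(Q, pi_block):
    """H = -sum_A pi_A sum_B Q_AB log2 Q_AB (0 log 0 taken as 0)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(Q > 0, Q * np.log2(Q), 0.0)
    return float(-(pi_block @ terms.sum(axis=1)))

# Two weakly coupled two-state clusters, lumped into two blocks.
P = np.array([[0.45, 0.45, 0.05, 0.05],
              [0.45, 0.45, 0.05, 0.05],
              [0.05, 0.05, 0.45, 0.45],
              [0.05, 0.05, 0.45, 0.45]])
partition = [np.array([0, 1]), np.array([2, 3])]
pi = stationary_distribution(P)
Q, pi_block = coarse_grain(P, pi, partition)
print(Q, entropy_rate(Q, pi_block))
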
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, as well as a host
of more specialized professional network communities, have intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics.
Comment: 96 pages, 14 figures, 333 references
Transposable regularized covariance models with an application to missing data imputation
Missing data estimation is an important challenge with high-dimensional data
arranged in the form of a matrix. Typically this data matrix is transposable,
meaning that the rows, the columns, or both can be treated as features. To
model transposable data, we present a modification of the matrix-variate
normal, the mean-restricted matrix-variate normal, in which the rows and
columns each have a separate mean vector and covariance matrix. By placing
additive penalties on the inverse covariance matrices of the rows and columns,
these so-called transposable regularized covariance models allow for maximum
likelihood estimation of the mean and nonsingular covariance matrices. Using
these models, we formulate EM-type algorithms for missing data imputation in
both the multivariate and transposable frameworks. We present theoretical
results exploiting the structure of our transposable models that allow these
models and imputation methods to be applied to high-dimensional data.
Simulations and results on microarray data and the Netflix data show that these
imputation techniques often outperform existing methods and offer a greater
degree of flexibility.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS314 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
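
The transposable regularized covariance model itself is not reproduced here; the sketch below (assuming NumPy) only illustrates the classical multivariate-normal EM-style imputation idea it builds on, in which missing entries are replaced by their conditional means given the observed entries and the mean and covariance are re-estimated iteratively. The small ridge term loosely stands in for the paper's additive inverse-covariance penalties, and all names and parameters are illustrative.

# Minimal sketch of EM-style missing data imputation under a multivariate
# normal model: impute missing entries by their conditional mean given the
# observed entries, re-estimate the mean and covariance, and iterate. This is
# not the transposable model of the paper.
import numpy as np

def em_impute(X, n_iter=50, ridge=1e-3):
    X = X.copy()
    missing = np.isnan(X)
    # Start from column means.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # Conditional mean of missing coordinates given observed ones.
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + s_mo @ np.linalg.solve(s_oo, X[i, o] - mu[o])
    return X

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0, 0],
                               [[1, .8, .3], [.8, 1, .3], [.3, .3, 1]], 200)
data[rng.random(data.shape) < 0.1] = np.nan   # introduce missing entries
print(em_impute(data)[:3])
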