    BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking

    Data generation is a key issue in big data benchmarking that aims to generate application-specific data sets to meet the 4V requirements of big data. Specifically, big data generators need to generate scalable data (Volume) of different types (Variety) under controllable generation rates (Velocity) while keeping the important characteristics of raw data (Veracity). This gives rise to various new challenges about how we design generators efficiently and successfully. To date, most existing techniques can only generate limited types of data and support specific big data systems such as Hadoop. Hence we develop a tool, called Big Data Generator Suite (BDGS), to efficiently generate scalable big data while employing data models derived from real data to preserve data veracity. The effectiveness of BDGS is demonstrated by developing six data generators covering three representative data types (structured, semi-structured and unstructured) and three data sources (text, graph, and table data).

    A Model of Consistent Node Types in Signed Directed Social Networks

    Signed directed social networks, in which the relationships between users can be either positive (indicating relations such as trust) or negative (indicating relations such as distrust), are increasingly common. Thus the interplay between positive and negative relationships in such networks has become an important research topic. Most recent investigations focus upon edge sign inference using structural balance theory or social status theory. Neither of these two theories, however, can explain an observed edge sign well when the two nodes connected by this edge do not share a common neighbor (e.g., common friend). In this paper we develop a novel approach to handle this situation by applying a new model for node types. Initially, we analyze the local node structure in a fully observed signed directed network, inferring underlying node types. The sign of an edge between two nodes must be consistent with their types; this explains edge signs well even when there are no common neighbors. We show, moreover, that our approach can be extended to incorporate directed triads, when they exist, just as in models based upon structural balance or social status theory. We compute Bayesian node types within empirical studies based upon partially observed Wikipedia, Slashdot, and Epinions networks in which the largest network (Epinions) has 119K nodes and 841K edges. Our approach yields better performance than state-of-the-art approaches for these three signed directed networks. Comment: To appear in the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 201
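The limitation described in this abstract (structural balance says nothing about an edge whose endpoints share no neighbor) can be illustrated with a minimal sketch. This is not the paper's node-type model; it is a simplified, undirected structural-balance vote, and the function name `predict_sign_by_balance` and the toy edge set are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, Optional, Set

# Signs: +1 for a positive (trust) edge, -1 for a negative (distrust) edge.
# Edges are treated as undirected here, which is a simplifying assumption.
EdgeSigns = Dict[FrozenSet[str], int]


def predict_sign_by_balance(signs: EdgeSigns, u: str, v: str) -> Optional[int]:
    """Predict the sign of edge (u, v) from triads through common neighbours.

    A triad is balanced when the product of its three signs is positive, so
    each common neighbour w votes for sign(u, w) * sign(w, v). Returns None
    when there is no common neighbour or the votes tie.
    """
    neighbours: Dict[str, Set[str]] = {}
    for edge in signs:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    common = (neighbours.get(u, set()) & neighbours.get(v, set())) - {u, v}
    if not common:
        return None  # balance theory has no triad to reason about

    vote = sum(signs[frozenset((u, w))] * signs[frozenset((w, v))] for w in common)
    if vote == 0:
        return None
    return 1 if vote > 0 else -1


# Hypothetical toy network: a trusts c, b distrusts c, d trusts e.
toy_signs: EdgeSigns = {
    frozenset(("a", "c")): 1,
    frozenset(("b", "c")): -1,
    frozenset(("d", "e")): 1,
}
```

With these toy edges, the pair (a, b) has the common neighbor c and receives a prediction, whereas the pair (a, d) has none and the function returns `None`, which is the case the abstract's node-type approach is meant to cover.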

    Entrograms and coarse graining of dynamics on complex networks

    Using an information theoretic point of view, we investigate how a dynamics acting on a network can be coarse grained through the use of graph partitions. Specifically, we are interested in how aggregating the state space of a Markov process according to a partition affects the resulting lower-dimensional dynamics. We highlight that for a dynamics on a particular graph there may be multiple coarse grained descriptions that capture different, incomparable features of the original process. For instance, a coarse graining induced by one partition may be commensurate with a time-scale separation in the dynamics, while another coarse graining may correspond to a different lower-dimensional dynamics that preserves the Markov property of the original process. Taking inspiration from the literature of Computational Mechanics, we find that a convenient tool to summarise and visualise such dynamical properties of a coarse grained model (partition) is the entrogram. The entrogram gathers certain information-theoretic measures, which quantify how information flows across time steps. These information theoretic quantities include the entropy rate, as well as a measure for the memory contained in the process, i.e., how well the dynamics can be approximated by a first order Markov process. We use the entrogram to investigate how specific macro-scale connection patterns in the state-space transition graph of the original dynamics result in desirable properties of coarse grained descriptions. We thereby provide a fresh perspective on the interplay between structure and dynamics in networks, and the process of partitioning from an information theoretic perspective. We focus on networks that may be approximated by either a core-periphery or a clustered organization, and highlight that each of these coarse grained descriptions can capture different aspects of a Markov process acting on the network. Comment: 17 pages, 6 figures
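Two of the quantities this abstract mentions can be sketched with standard textbook definitions: the entropy rate of a stationary Markov chain, and a check of whether a partition preserves the Markov property (strong lumpability). This is not the entrogram itself, which the abstract does not define in enough detail to reproduce; the function names and the example transition matrix are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def stationary_distribution(P: Matrix, iterations: int = 10_000,
                            tol: float = 1e-12) -> List[float]:
    """Approximate the stationary distribution by power iteration.

    Assumes P is row-stochastic, irreducible and aperiodic.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def entropy_rate(P: Matrix) -> float:
    """Entropy rate in bits: h = -sum_i pi_i sum_j P_ij log2 P_ij."""
    pi = stationary_distribution(P)
    return -sum(pi[i] * p * math.log2(p)
                for i, row in enumerate(P) for p in row if p > 0)


def is_lumpable(P: Matrix, partition: Sequence[Sequence[int]],
                tol: float = 1e-9) -> bool:
    """Strong lumpability: every state in a block must send the same total
    probability mass to each block of the partition."""
    for block in partition:
        for target in partition:
            mass = [sum(P[i][j] for j in target) for i in block]
            if max(mass) - min(mass) > tol:
                return False
    return True


# Hypothetical 3-state chain: states 1 and 2 behave alike towards state 0.
P_example: Matrix = [
    [0.5, 0.25, 0.25],
    [0.2, 0.4, 0.4],
    [0.2, 0.3, 0.5],
]
```

For `P_example`, the partition `[[0], [1, 2]]` passes the lumpability check while `[[0, 1], [2]]` does not, which mirrors the abstract's point that different partitions of the same chain can yield coarse grained dynamics with different properties.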

    A survey of statistical network models

    Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics. Comment: 96 pages, 14 figures, 333 references

    Transposable regularized covariance models with an application to missing data imputation

    Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and nonsingular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility. Comment: Published at http://dx.doi.org/10.1214/09-AOAS314 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)