122,870 research outputs found

    Web Information Systems: Usage, Content, and Functionally Modelling

    Get PDF
    The design of large-scale data-intensive web information systems (WIS) requires a clear picture of the intended users and their behaviour in using the system, a support of various access channels and the technology used with them, and an integration of traditional methods for the design of data-intensive information systems with new methods that address the challenges arising from the web-presentation and the open access. This paper presents the conceptual modelling parts of a methodology for the design of WISs that is based on an abstract abstraction layer model (ALM). It concentrates on the two most important layers in this model: a business layer and a conceptual layer. The major activities on the business layer deal with user profiling and storyboarding, which addresses the design of an underlying application story. The core of such a story can be expressed by a directed multi-graph, in which the vertices represent scenes and the edges actions by the users including navigation. This leads to story algebras which can then be used to personalise the WIS to the needs of a user with a particular profile. The major activities on the conceptual layer address the support of scenes by modelling media types, which combine links to databases via extended views with the generation of navigation structures, operations supporting the activities in the storyboard, hierarchical presentations, and adaptivity to users, end-devices and channels. Adding presentation style options this can be used to generate the web-pages that will be presented to the WIS users

    Consistent community detection in uni-layer and multi-layer networks

    Get PDF
    Over the last two decades, we have witnessed a massive explosion of our data collection abilities and the birth of a "big data" age. This has led to an enormous interest in statistical inference of a new type of complex data structure, a graph or network. The surge in interdisciplinary interest on statistical analysis of network data has been driven by applications in Neuroscience, Genetics, Social sciences, Computer science, Economics and Marketing. A network consists of a set of nodes or vertices, representing a set of entities, and a set of edges, representing the relations or interactions among the entities. Networks are flexible frameworks that can model many complex systems. In the majority of the network examples dealt with in the literature, the relations between nodes are assumed to be of the same type such as web page linkage, friendship, co-authorship or protein-protein interaction. However, the complex networks in many modern applications are often multi-layered in the sense that they consist of multiple types of edges/relations among a group of entities. Each of those different types of relations can be viewed as creating its own network, called a layer of the multi-layer network. Multi-layer networks are a more accurate representation of many complex systems since many entities in those systems are involved simultaneously in multiple interactions. In this dissertation we view multi-layer networks in the broad sense that includes multiple types of relations as well as multiple information sources on the same set of nodes (e.g., multiple trials or multiple subjects). The problem of detecting communities or clusters of nodes in a network has received considerable attention in literature. As with uni-layer networks, community detection is an important task in multi-layer networks. This dissertation aims to develop new methods and theory for community detection in both uni-layer and multi-layer networks that can be used to answer scientific questions from experimental data. For community detection in uni and multi-layer graphs, we take three approaches - (1) based on statistical random graph models, (2) based on maximizing quality functions, e.g., the modularity score and (3) based on spectral and matrix factorization methods. In Chapter 2 we consider two random graph models for community detection in multi-layer networks, the multi-layer stochastic block model (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic block model (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compared MLEs in the two models among themselves and with other baseline approaches both theoretically and through simulations. We also derived minimax error rates and thresholds for achieving consistency of community detection in MLSBM, which were then used to show the advantage of the multi-layer model over a traditional alternative, the aggregate stochastic block model. In simulations RMLSBM is shown to have advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low. A popular method of community detection in uni-layer networks is maximization of a partition quality function called modularity. In Chapter 3 we introduce several multi-layer network modularity measures based on different random graph null models, motivated by empirical observations from a diverse field of applications. In particular, we derived different modularities by defining the multi-layer configuration model, the multi-layer expected degree model and their various modifications as null models for multi-layer networks. These measures are then optimized to detect the optimal community assignment of nodes. We apply the methods to five real multi-layer networks - three social networks from the website Twitter, a complete neuronal network of a nematode, C-elegans and a classroom friendship network of 7th-grade students. In Chapter 4 we present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While the exact factorization of a given order may not exist and is NP hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing parallel with the spectral clustering. Using such factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix with the blocks representing the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods under a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art in popular benchmark network datasets, e.g., the political web blogs and the karate club data. In Chapter 5 we once again consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network or multiple snapshots of a time-varying network. Numerous methods have been proposed in the literature for the more general problem of multi-view clustering in the past decade based on the spectral clustering or a low-rank matrix factorization. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. Such methods can be adapted for community detection in multi-layer networks with minimal modifications. However, the theoretical properties of these methods remain largely unexplored and most authors have relied on performance in synthetic and real data to assess the goodness of the procedures. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objective will return a good community structure. We apply some of these methods for consensus community detection in multi-layer networks and investigate the consistency properties of the global optimizer of the objective functions under the multi-layer stochastic block model. We derive several new asymptotic results showing consistency of the intermediate fusion techniques along with the spectral clustering of mean adjacency matrix under a high dimensional setup where both the number of nodes and the number of layers of the multi-layer graph grow. We complement the asymptotic analysis with a thorough numerical study to compare the finite sample performance of the methods. Motivated by multi-subject and multi-trial experiments in neuroimaging studies, in Chapter 6 we develop a modeling framework for joint community detection in a group of related networks. The proposed model, which we call the random effects stochastic block model facilitates the study of group differences and subject specific variations in the community structure. In contrast to the previously proposed multi-layer stochastic block models, our model allows community memberships of nodes to vary in each component network or layer with a transition probability matrix, thus modeling the variation in community structure across a group of subjects or trials. We propose two methods to estimate the parameters of the model, a variational-EM algorithm and two non-parametric "two-step" methods based on spectral and matrix factorization respectively. We also develop several hypothesis tests with p-values obtained through resampling (permutation test) for differences in community structure in two groups of subjects both at the whole network level and node level. The methodology is applied to publicly available fMRI datasets from multi-subject experiments involving schizophrenia patients along with healthy controls. Our methods reveal an overall putative community structure representative of the groups as well as subject-specific variations within each group. Using our network level hypothesis tests we are able to ascertain statistically significant difference in community structure between the two groups, while our node level tests help determine the nodes that are driving the difference

    Multi-layer model for the web graph

    Get PDF
    This paper studies stochastic graph models of the WebGraph. We present a new model that describes the WebGraph as an ensemble of different regions generated by independent stochastic processes (in the spirit of a recent paper by Dill et al. [VLDB 2001]). Models such as the Copying Model [17] and Evolving Networks Model [3] are simulated and compared on several relevant measures such as degree and clique distribution

    When is a Network a Network? Multi-Order Graphical Model Selection in Pathways and Temporal Networks

    Full text link
    We introduce a framework for the modeling of sequential data capturing pathways of varying lengths observed in a network. Such data are important, e.g., when studying click streams in information networks, travel patterns in transportation systems, information cascades in social networks, biological pathways or time-stamped social interactions. While it is common to apply graph analytics and network analysis to such data, recent works have shown that temporal correlations can invalidate the results of such methods. This raises a fundamental question: when is a network abstraction of sequential data justified? Addressing this open question, we propose a framework which combines Markov chains of multiple, higher orders into a multi-layer graphical model that captures temporal correlations in pathways at multiple length scales simultaneously. We develop a model selection technique to infer the optimal number of layers of such a model and show that it outperforms previously used Markov order detection techniques. An application to eight real-world data sets on pathways and temporal networks shows that it allows to infer graphical models which capture both topological and temporal characteristics of such data. Our work highlights fallacies of network abstractions and provides a principled answer to the open question when they are justified. Generalizing network representations to multi-order graphical models, it opens perspectives for new data mining and knowledge discovery algorithms.Comment: 10 pages, 4 figures, 1 table, companion python package pathpy available on gitHu

    Sparse Allreduce: Efficient Scalable Communication for Power-Law Data

    Full text link
    Many large datasets exhibit power-law statistics: The web graph, social networks, text data, click through data etc. Their adjacency graphs are termed natural graphs, and are known to be difficult to partition. As a consequence most distributed algorithms on these graphs are communication intensive. Many algorithms on natural graphs involve an Allreduce: a sum or average of partitioned data which is then shared back to the cluster nodes. Examples include PageRank, spectral partitioning, and many machine learning algorithms including regression, factor (topic) models, and clustering. In this paper we describe an efficient and scalable Allreduce primitive for power-law data. We point out scaling problems with existing butterfly and round-robin networks for Sparse Allreduce, and show that a hybrid approach improves on both. Furthermore, we show that Sparse Allreduce stages should be nested instead of cascaded (as in the dense case). And that the optimum throughput Allreduce network should be a butterfly of heterogeneous degree where degree decreases with depth into the network. Finally, a simple replication scheme is introduced to deal with node failures. We present experiments showing significant improvements over existing systems such as PowerGraph and Hadoop
    • …
    corecore