198 research outputs found

    Consistent community detection in uni-layer and multi-layer networks

    Get PDF
    Over the last two decades, we have witnessed a massive explosion of our data collection abilities and the birth of a "big data" age. This has led to an enormous interest in statistical inference of a new type of complex data structure, a graph or network. The surge in interdisciplinary interest on statistical analysis of network data has been driven by applications in Neuroscience, Genetics, Social sciences, Computer science, Economics and Marketing. A network consists of a set of nodes or vertices, representing a set of entities, and a set of edges, representing the relations or interactions among the entities. Networks are flexible frameworks that can model many complex systems. In the majority of the network examples dealt with in the literature, the relations between nodes are assumed to be of the same type such as web page linkage, friendship, co-authorship or protein-protein interaction. However, the complex networks in many modern applications are often multi-layered in the sense that they consist of multiple types of edges/relations among a group of entities. Each of those different types of relations can be viewed as creating its own network, called a layer of the multi-layer network. Multi-layer networks are a more accurate representation of many complex systems since many entities in those systems are involved simultaneously in multiple interactions. In this dissertation we view multi-layer networks in the broad sense that includes multiple types of relations as well as multiple information sources on the same set of nodes (e.g., multiple trials or multiple subjects). The problem of detecting communities or clusters of nodes in a network has received considerable attention in literature. As with uni-layer networks, community detection is an important task in multi-layer networks. This dissertation aims to develop new methods and theory for community detection in both uni-layer and multi-layer networks that can be used to answer scientific questions from experimental data. For community detection in uni and multi-layer graphs, we take three approaches - (1) based on statistical random graph models, (2) based on maximizing quality functions, e.g., the modularity score and (3) based on spectral and matrix factorization methods. In Chapter 2 we consider two random graph models for community detection in multi-layer networks, the multi-layer stochastic block model (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic block model (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compared MLEs in the two models among themselves and with other baseline approaches both theoretically and through simulations. We also derived minimax error rates and thresholds for achieving consistency of community detection in MLSBM, which were then used to show the advantage of the multi-layer model over a traditional alternative, the aggregate stochastic block model. In simulations RMLSBM is shown to have advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low. A popular method of community detection in uni-layer networks is maximization of a partition quality function called modularity. In Chapter 3 we introduce several multi-layer network modularity measures based on different random graph null models, motivated by empirical observations from a diverse field of applications. In particular, we derived different modularities by defining the multi-layer configuration model, the multi-layer expected degree model and their various modifications as null models for multi-layer networks. These measures are then optimized to detect the optimal community assignment of nodes. We apply the methods to five real multi-layer networks - three social networks from the website Twitter, a complete neuronal network of a nematode, C-elegans and a classroom friendship network of 7th-grade students. In Chapter 4 we present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While the exact factorization of a given order may not exist and is NP hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing parallel with the spectral clustering. Using such factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix with the blocks representing the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods under a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art in popular benchmark network datasets, e.g., the political web blogs and the karate club data. In Chapter 5 we once again consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network or multiple snapshots of a time-varying network. Numerous methods have been proposed in the literature for the more general problem of multi-view clustering in the past decade based on the spectral clustering or a low-rank matrix factorization. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. Such methods can be adapted for community detection in multi-layer networks with minimal modifications. However, the theoretical properties of these methods remain largely unexplored and most authors have relied on performance in synthetic and real data to assess the goodness of the procedures. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objective will return a good community structure. We apply some of these methods for consensus community detection in multi-layer networks and investigate the consistency properties of the global optimizer of the objective functions under the multi-layer stochastic block model. We derive several new asymptotic results showing consistency of the intermediate fusion techniques along with the spectral clustering of mean adjacency matrix under a high dimensional setup where both the number of nodes and the number of layers of the multi-layer graph grow. We complement the asymptotic analysis with a thorough numerical study to compare the finite sample performance of the methods. Motivated by multi-subject and multi-trial experiments in neuroimaging studies, in Chapter 6 we develop a modeling framework for joint community detection in a group of related networks. The proposed model, which we call the random effects stochastic block model facilitates the study of group differences and subject specific variations in the community structure. In contrast to the previously proposed multi-layer stochastic block models, our model allows community memberships of nodes to vary in each component network or layer with a transition probability matrix, thus modeling the variation in community structure across a group of subjects or trials. We propose two methods to estimate the parameters of the model, a variational-EM algorithm and two non-parametric "two-step" methods based on spectral and matrix factorization respectively. We also develop several hypothesis tests with p-values obtained through resampling (permutation test) for differences in community structure in two groups of subjects both at the whole network level and node level. The methodology is applied to publicly available fMRI datasets from multi-subject experiments involving schizophrenia patients along with healthy controls. Our methods reveal an overall putative community structure representative of the groups as well as subject-specific variations within each group. Using our network level hypothesis tests we are able to ascertain statistically significant difference in community structure between the two groups, while our node level tests help determine the nodes that are driving the difference

    Sparse machine learning models in bioinformatics

    Get PDF
    The meaning of parsimony is twofold in machine learning: either the structure or (and) the parameter of a model can be sparse. Sparse models have many strengths. First, sparsity is an important regularization principle to reduce model complexity and therefore avoid overfitting. Second, in many fields, for example bioinformatics, many high-dimensional data may be generated by a very few number of hidden factors, thus it is more reasonable to use a proper sparse model than a dense model. Third, a sparse model is often easy to interpret. In this dissertation, we investigate the sparse machine learning models and their applications in high-dimensional biological data analysis. We focus our research on five types of sparse models as follows. First, sparse representation is a parsimonious principle that a sample can be approximated by a sparse linear combination of basis vectors. We explore existing sparse representation models and propose our own sparse representation methods for high dimensional biological data analysis. We derive different sparse representation models from a Bayesian perspective. Two generic dictionary learning frameworks are proposed. Also, kernel and supervised dictionary learning approaches are devised. Furthermore, we propose fast active-set and decomposition methods for the optimization of sparse coding models. Second, gene-sample-time data are promising in clinical study, but challenging in computation. We propose sparse tensor decomposition methods and kernel methods for the dimensionality reduction and classification of such data. As the extensions of matrix factorization, tensor decomposition techniques can reduce the dimensionality of the gene-sample-time data dramatically, and the kernel methods can run very efficiently on such data. Third, we explore two sparse regularized linear models for multi-class problems in bioinformatics. Our first method is called the nearest-border classification technique for data with many classes. Our second method is a hierarchical model. It can simultaneously select features and classify samples. Our experiment, on breast tumor subtyping, shows that this model outperforms the one-versus-all strategy in some cases. Fourth, we propose to use spectral clustering approaches for clustering microarray time-series data. The approaches are based on two transformations that have been recently introduced, especially for gene expression time-series data, namely, alignment-based and variation-based transformations. Both transformations have been devised in order to take into account temporal relationships in the data, and have been shown to increase the ability of a clustering method in detecting co-expressed genes. We investigate the performances of these transformations methods, when combined with spectral clustering on two microarray time-series datasets, and discuss their strengths and weaknesses. Our experiments on two well known real-life datasets show the superiority of the alignment-based over the variation-based transformation for finding meaningful groups of co-expressed genes. Fifth, we propose the max-min high-order dynamic Bayesian network (MMHO-DBN) learning algorithm, in order to reconstruct time-delayed gene regulatory networks. Due to the small sample size of the training data and the power-low nature of gene regulatory networks, the structure of the network is restricted by sparsity. We also apply the qualitative probabilistic networks (QPNs) to interpret the interactions learned. Our experiments on both synthetic and real gene expression time-series data show that, MMHO-DBN can obtain better precision than some existing methods, and perform very fast. The QPN analysis can accurately predict types of influences and synergies. Additionally, since many high dimensional biological data are subject to missing values, we survey various strategies for learning models from incomplete data. We extend the existing imputation methods, originally for two-way data, to methods for gene-sample-time data. We also propose a pair-wise weighting method for computing kernel matrices from incomplete data. Computational evaluations show that both approaches work very robustly

    Non-negative Matrix factorization:Theory and Methods

    Get PDF

    Learning by Fusing Heterogeneous Data

    Get PDF
    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop interesting and accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model the epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of our Thesis is also the analysis of large data compendia as predictions about certain phenomena, such as associations between diseases and involvement of genes in a certain phenotype, are only possible when dealing with lots of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model up to date. We also make interesting observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets

    WELLNESS PROFILING ON SOCIAL NETWORKS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    A mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraints

    Get PDF
    One of the first and most fundamental tasks in machine learning is to group observations within a dataset. Given a notion of similarity, finding those instances which are outstandingly similar to each other has manifold applications. Recommender systems and topic analysis in text data are examples which are most intuitive to grasp. The interpretation of the groups, called clusters, is facilitated if the assignment of samples is definite. Especially in high-dimensional data, denoting a degree to which an observation belongs to a specified cluster requires a subsequent processing of the model to filter the most important information. We argue that a good summary of the data provides hard decisions on the following question: how many groups are there, and which observations belong to which clusters? In this work, we contribute to the theoretical and practical background of clustering tasks, addressing one or both aspects of this question. Our overview of state-of-the-art clustering approaches details the challenges of our ambition to provide hard decisions. Based on this overview, we develop new methodologies for two branches of clustering: the one concerns the derivation of nonconvex clusters, known as spectral clustering; the other addresses the identification of biclusters, a set of samples together with similarity defining features, via Boolean matrix factorization. One of the main challenges in both considered settings is the robustness to noise. Assuming that the issue of robustness is controllable by means of theoretical insights, we have a closer look at those aspects of established clustering methods which lack a theoretical foundation. In the scope of Boolean matrix factorization, we propose a versatile framework for the optimization of matrix factorizations subject to binary constraints. Especially Boolean factorizations have been computed by intuitive methods so far, implementing greedy heuristics which lack quality guarantees of obtained solutions. In contrast, we propose to build upon recent advances in nonconvex optimization theory. This enables us to provide convergence guarantees to local optima of a relaxed objective, requiring only approximately binary factor matrices. By means of this new optimization scheme PAL-Tiling, we propose two approaches to automatically determine the number of clusters. The one is based on information theory, employing the minimum description length principle, and the other is a novel statistical approach, controlling the false discovery rate. The flexibility of our framework PAL-Tiling enables the optimization of novel factorization schemes. In a different context, where every data point belongs to a pre-defined class, a characterization of the classes may be obtained by Boolean factorizations. However, there are cases where this traditional factorization scheme is not sufficient. Therefore, we propose the integration of another factor matrix, reflecting class-specific differences within a cluster. Our theoretical considerations are complemented by empirical evaluations, showing how our methods combine theoretical soundness with practical advantages

    Cell Type-specific Analysis of Human Interactome and Transcriptome

    Get PDF
    Cells are the fundamental building block of complex tissues in higher-order organisms. These cells take different forms and shapes to perform a broad range of functions. What makes a cell uniquely eligible to perform a task, however, is not well-understood; neither is the defining characteristic that groups similar cells together to constitute a cell type. Even for known cell types, underlying pathways that mediate cell type-specific functionality are not readily available. These functions, in turn, contribute to cell type-specific susceptibility in various disorders
    corecore