
    Online Spectral Clustering on Network Streams

    Graphs are an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various types of networks, significant research interest has arisen in spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the analysis of this new type of problem may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, designing a spectral clustering method that satisfies these requirements is non-trivial. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods that use standard eigensystem solvers such as the Lanczos method and need to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. Parallelizing such algorithms is non-trivial since standard eigensolvers are iterative algorithms and the number of iterations cannot be predetermined. The third challenge is the very limited existing work. In addition, there exist multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of a sound theoretical basis, and the lack of an effective way to handle accumulated approximation errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy.
We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well-established matrix perturbation theory forms a solid theoretical foundation for our online clustering method. We equipped our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary processes, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling.
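
The idea of incrementally updating eigenpairs rather than recomputing them can be illustrated with the standard first-order result from matrix perturbation theory: for a symmetric matrix with unit eigenvector x and eigenvalue λ, a small symmetric change Δ shifts the eigenvalue to approximately λ + xᵀΔx. The sketch below is only that textbook formula, not the GEPT or EEPT algorithms from the thesis, whose details are not given in the abstract. The function names and the two-node example graph are assumptions made for illustration.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def first_order_eigenvalue_update(eigenvalue, eigenvector, delta):
    """Textbook first-order estimate: lambda + x^T * delta * x.

    Assumes `eigenvector` has unit length and `delta` is symmetric.
    Hypothetical helper, not an algorithm taken from the thesis.
    """
    return eigenvalue + dot(eigenvector, mat_vec(delta, eigenvector))


# Hypothetical example: Laplacian of a two-node graph with edge weight 1.0,
# L = [[1, -1], [-1, 1]], which has the eigenpair (2, [1, -1] / sqrt(2)).
s = 1.0 / math.sqrt(2.0)
old_eigenvalue = 2.0
old_eigenvector = [s, -s]

# The edge weight grows from 1.0 to 1.1, so the Laplacian changes by delta.
delta = [[0.1, -0.1], [-0.1, 0.1]]

updated = first_order_eigenvalue_update(old_eigenvalue, old_eigenvector, delta)
```

In this small example the eigenvector does not change, so the first-order estimate coincides with the exact eigenvalue of the updated Laplacian. In general the estimate carries an error that grows with the size of the change, which is why tracking accumulated approximation error matters in an online setting.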

    Data Integration for Regulatory Module Discovery

    Genomic data relating to the functioning of individual genes and their products are rapidly being produced using many diverse experimental techniques. Each piece of data provides information on a specific aspect of the cell regulation process. Integration of these diverse types of data is essential in order to identify biologically relevant regulatory modules. In this thesis, we address this challenge by analyzing the nature of these datasets and proposing new techniques for data integration. Since microarray data is not available in the quantities required for valid inference, many researchers have taken the blind integrative approach, in which data from diverse microarray experiments are merged. In order to understand the validity of this approach, we start this thesis by studying the heterogeneity of microarray datasets. We have used KL divergence between individual dataset distributions, as well as an empirical technique proposed by us, to calculate functional similarity between the datasets. Our results indicate that we should not use a blind integration of datasets and that much care should be taken to ensure that we mix only similar types of data. We should also be careful about the choice of normalization method. Next, we propose a semi-supervised spectral clustering method which integrates two diverse types of data for the task of gene regulatory module discovery. The technique uses constraints derived from DNA-binding, PPI and TF-gene interaction datasets to guide the spectral clustering of microarray experiments. Our results on yeast stress and cell-cycle microarray data indicate that the integration leads to more biologically significant results. Finally, we propose a technique that integrates datasets under the principle of maximum entropy. We argue that this is the most valid approach in an unsupervised setting where we have no other evidence regarding the weights to be assigned to individual datasets.
Our experiments with yeast microarray, PPI, DNA-binding and TF-gene interaction datasets show improved biological significance of results.
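
As a rough illustration of comparing two datasets by KL divergence, the sketch below computes D(p || q) for two discrete distributions. The abstract does not say how the dataset distributions were estimated, so treating them as normalized histograms over shared bins is an assumption, as are the function names and the example counts.

```python
import math


def normalize(counts):
    """Turn non-negative histogram counts into a probability distribution."""
    total = float(sum(counts))
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) in nats.

    Terms with p_i == 0 contribute nothing; q_i is floored at `eps`
    to avoid division by zero (an assumption of this sketch).
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


# Hypothetical expression-level histograms for two microarray datasets,
# binned over the same four intervals.
dataset_a = normalize([10, 40, 40, 10])
dataset_b = normalize([25, 25, 25, 25])

divergence_ab = kl_divergence(dataset_a, dataset_b)
divergence_ba = kl_divergence(dataset_b, dataset_a)
```

KL divergence is not symmetric, so `divergence_ab` and `divergence_ba` generally differ. A symmetrized variant, such as the sum of the two, may be a more natural similarity score when neither dataset is the reference.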

    Computational Methods for Knowledge Integration in the Analysis of Large-scale Biological Networks

    As we migrate into an era of personalized medicine, understanding how bio-molecules interact with one another to form cellular systems is one of the key focus areas of systems biology. Several challenges such as the dynamic nature of cellular systems, uncertainty due to environmental influences, and the heterogeneity between individual patients render this a difficult task. In the last decade, several algorithms have been proposed to elucidate cellular systems from data, resulting in numerous data-driven hypotheses. However, due to the large number of variables involved in the process, many of which are unknown or not measurable, such computational approaches often lead to a high proportion of false positives. This renders interpretation of the data-driven hypotheses extremely difficult. Consequently, a dismal proportion of these hypotheses are subject to further experimental validation, eventually limiting their potential to augment existing biological knowledge. This dissertation develops a framework of computational methods for the analysis of such data-driven hypotheses leveraging existing biological knowledge. Specifically, I show how biological knowledge can be mapped onto these hypotheses and subsequently augmented through novel hypotheses. Biological hypotheses are learnt in three levels of abstraction (individual interactions, functional modules and relationships between pathways), corresponding to three complementary aspects of biological systems. The computational methods developed in this dissertation are applied to high throughput cancer data, resulting in novel hypotheses with potentially significant biological impact. Dissertation/Thesis. Ph.D. Computer Science 201

    An Integrative Computational Framework for Defining Asthma Endotypes

    The rapid pace of drug development in recent years has led to the recognition that new pharmacotherapies do not have the same effect on all patients. This is particularly true in the case of complex common diseases such as hypertension, diabetes and asthma, where a diversity of pathogenetic factors may interact to produce the same disease, resulting in a large degree of heterogeneity in the response to medical therapy. For this reason, the ability to differentiate between different disease endotypes is of increasing importance to clinical medicine. In the case of asthma, initial studies have hinted at the presence of multiple disease endotypes with different clinical characteristics. Additional studies have identified novel genetic risk factors and differences in gene expression among asthmatic patients with different disease endotypes. Despite the presence of large-scale clinical and molecular datasets from asthmatic patients, limited efforts have been made to integrate these different formats to develop a systems-level understanding of disease mechanism. In this thesis, we develop a computational framework for addressing the problem of disease heterogeneity by integrating data from multiple sources, including the genome, phenome and transcriptome, in order to define clinically relevant disease subtypes, and we demonstrate its application in a cohort of asthmatic children. First, we perform a cluster analysis of clinical phenotypic data and detect the presence of multiple disease endotypes in a cohort of children with mild-to-moderate asthma. We evaluate the clinical significance of these endotypes by demonstrating their longitudinal stability and association with differential response to pharmacotherapy. Next, we develop a transcriptional network from the gene expression profiles of these patients and identify the relationship between discrete patterns of expression and asthma endotypes.
Finally, we combine longitudinally derived clinical phenotypes with genetic data to uncover novel genetic associations corresponding to changes in gene expression and the expression of longitudinal clinical traits.

    Finding the pathology of major depression through effects on gene interaction networks

    The disease signature of major depressive disorder is distributed across multiple physical scales and investigative specialties, including genes, cells and brain regions. No single mechanism or pathway currently implicated in depression can reproduce its diverse clinical presentation, which compounds the difficulty in finding consistently disrupted molecular functions. We confront these key roadblocks to depression research - multi-scale and multi-factor pathology - by conducting parallel investigations at the levels of genes, neurons and brain regions, using transcriptome networks to identify collective patterns of dysfunction. Our findings highlight how the collusion of multi-system deficits can form a broad-based, yet variable pathology behind the depressed phenotype. For instance, in a variant of the classic lethality-centrality relationship, we show that in neuropsychiatric disorders including major depression, differentially expressed genes are pushed out to the periphery of gene networks. At the level of cellular function, we develop a molecular signature of depression based on cross-species analysis of human and mouse microarrays from depression-affected areas, and show that these genes form a tight module related to oligodendrocyte function and neuronal growth/structure. At the level of brain-region communication, we find a set of genes and hormones associated with the loss of feedback between the amygdala and anterior cingulate cortex, based on a novel assay of interregional expression synchronization termed "gene coordination". These results indicate that in the absence of a single pathology, depression may be created by dysynergistic effects among genes, cell-types and brain regions, in what we term the "floodgate" model of depression. 
Beyond our specific biological findings, these studies indicate that gene interaction networks are a coherent framework in which to understand the faint expression changes found in depression and complex neuropsychiatric disorders.

    Data based identification and prediction of nonlinear and complex dynamical systems

    We thank Dr. R. Yang (formerly at ASU), Dr. R.-Q. Su (formerly at ASU), and Mr. Zhesi Shen for their contributions to a number of original papers on which this Review is partly based. This work was supported by ARO under Grant No. W911NF-14-1-0504. W.-X. Wang was also supported by NSFC under Grants No. 61573064 and No. 61074116, as well as by the Fundamental Research Funds for the Central Universities, Beijing Nova Programme. Peer reviewed. Postprint.

    Network-Based Biomarker Discovery : Development of Prognostic Biomarkers for Personalized Medicine by Integrating Data and Prior Knowledge

    Advances in genome science and technology offer a deeper understanding of biology while at the same time improving the practice of medicine. The expression profiling of some diseases, such as cancer, allows for identifying marker genes, which may be able to diagnose a disease or predict future disease outcomes. Marker genes (biomarkers) are selected by scoring how well their expression levels can discriminate between different classes of disease or between groups of patients with different clinical outcomes (e.g. therapy response, survival time, etc.). A current challenge is to identify new markers that are directly related to the underlying disease mechanism.