
    Graph set data mining

    Graphs are among the most versatile abstract data types in computer science. This versatility has led to wide adoption in application fields such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacity of digital storage, collecting large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step in extracting knowledge from these datasets and generating value. This thesis is dedicated to concurrent, scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering uses graph-theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes, and with very few exceptions prior algorithms are suitable only for very small datasets. This thesis presents StruClus, the first truly scalable structural clustering algorithm, with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. Furthermore, a novel concept of cluster homogeneity balancing using dynamically sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization, a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.
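
    Since representative selection here reduces to cardinality-constrained maximum coverage, the greedy rule of Nemhauser, Wolsey, and Fisher gives the usual (1 - 1/e)-approximation that the sampling approach builds on. A minimal sketch of that greedy rule (not the thesis' algorithm; the pattern names and coverage sets below are hypothetical):

```python
# Minimal sketch: greedy (1 - 1/e)-approximation for cardinality-constrained
# maximum coverage (Nemhauser, Wolsey, and Fisher). Not the thesis' algorithm;
# `candidates` maps a subgraph pattern to the set of dataset-graph ids it covers.

def greedy_representatives(candidates: dict[str, set[int]], k: int) -> list[str]:
    covered: set[int] = set()
    chosen: list[str] = []
    for _ in range(k):
        # pick the pattern covering the most not-yet-represented graphs
        best = max(candidates, key=lambda p: len(candidates[p] - covered))
        if not candidates[best] - covered:
            break  # no remaining pattern adds coverage
        covered |= candidates[best]
        chosen.append(best)
    return chosen

# toy example: three patterns competing to represent five graphs
patterns = {"C-C-O": {0, 1, 2}, "C=C": {2, 3}, "N-H": {3, 4}}
print(greedy_representatives(patterns, k=2))  # ['C-C-O', 'N-H']
```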

    Frameworks to Investigate Robustness and Disease Characterization/Prediction Utility of Time-Varying Functional Connectivity State Profiles of the Human Brain at Rest

    Neuroimaging technologies aim at delineating the highly complex structural and functional organization of the human brain. In recent years, several unimodal as well as multimodal analyses of structural MRI (sMRI) and functional MRI (fMRI) neuroimaging modalities, leveraging advanced signal processing and machine learning based feature extraction algorithms, have opened new avenues in the diagnosis of complex brain syndromes and neurocognitive disorders. Regarding these neuroimaging modalities generically as filtered, complementary insights into the brain’s anatomical and functional organization, multimodal data fusion efforts could enable a more comprehensive mapping of brain structure and function. Large-scale functional organization of the brain is often studied by viewing the brain as a complex, integrative network composed of spatially distributed, but functionally interacting, sub-networks that continually share and process information. Such whole-brain functional interactions, also referred to as patterns of functional connectivity (FC), are typically examined as levels of synchronous co-activation in the different functional networks of the brain. More recently, there has been a major paradigm shift from measuring whole-brain FC in an oversimplified, time-averaged manner to additionally exploring time-varying mechanisms that identify recurring, transient brain configurations or brain states, referred to as time-varying FC state profiles in this dissertation. Notably, prior studies based on time-varying FC approaches have used these relatively lower-dimensional fMRI features to characterize pathophysiology, and these features have also been reported to relate to demographic characterization, consciousness levels, and cognition. In this dissertation, we corroborate the efficacy of time-varying FC state profiles of the human brain at rest by implementing statistical frameworks to evaluate their robustness and statistical significance through an in-depth, novel evaluation on multiple, independent partitions of a very large rest-fMRI dataset, as well as extensive validation testing on surrogate rest-fMRI datasets. Next, we present a novel data-driven, blind source separation based multimodal (sMRI-fMRI) data fusion framework that uses the time-varying FC state profiles as features from the fMRI modality to characterize diseased brain conditions and substantiate brain structure-function relationships. Finally, we present a novel data-driven, deep learning based multimodal (sMRI-fMRI) data fusion framework that examines the degree of diagnostic and prognostic performance improvement based on time-varying FC state profiles as features from the fMRI modality. The approaches developed and tested in this dissertation evince high levels of robustness and highlight the utility of time-varying FC state profiles as potential biomarkers to characterize, diagnose, and predict diseased brain conditions. As such, the findings in this work argue in favor of FC investigations of the brain that are centered on time-varying approaches, and also highlight the benefits of combining multiple neuroimaging data modalities via data fusion.
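
    For context, time-varying FC state profiles are commonly estimated by sliding-window correlation followed by clustering of the windowed connectivity patterns. The sketch below shows that generic pipeline only, not the dissertation's specific frameworks; the window length, state count, and array shapes are illustrative assumptions:

```python
# Illustrative sliding-window + k-means pipeline for time-varying FC states.
# Assumes `ts` holds network time courses of shape (timepoints, networks).
import numpy as np
from sklearn.cluster import KMeans

def fc_state_profiles(ts: np.ndarray, win: int = 30, k: int = 5):
    t, n = ts.shape
    iu = np.triu_indices(n, k=1)
    # windowed FC: vectorized upper triangle of each window's correlation matrix
    windows = np.array([np.corrcoef(ts[s:s + win].T)[iu] for s in range(t - win + 1)])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(windows)
    return km.cluster_centers_, km.labels_  # recurring FC states, state time course

rng = np.random.default_rng(0)
states, labels = fc_state_profiles(rng.standard_normal((300, 10)))
print(states.shape, labels.shape)  # (5, 45) (271,)
```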

    Bayesian methodologies for constrained spaces.

    Due to advances in technology, directional data now arise in a wide variety of fields. Distributions for modeling directional data are often defined on manifolds or otherwise constrained spaces. Regular statistical methods applied to data defined on special geometries can give misleading results, and this demands new statistical theory. This dissertation addresses two such problems and develops Bayesian methodologies to improve inference in these arenas. It consists of two projects: 1. A Bayesian Methodology for Estimation for Sparse Canonical Correlation, and 2. Bayesian Analysis of Finite Mixture Model for Spherical Data. In principle, it can be challenging to integrate data measured on the same individuals but arising from different experiments and to model it jointly to gain a larger understanding of the problem. Canonical Correlation Analysis (CCA) provides a useful tool for establishing relationships between such data sets. When dealing with high-dimensional data sets, Structured Sparse CCA (ScSCCA) is a rapidly developing methodological area that seeks to represent the interrelations using sparse direction vectors for CCA. Bayesian methodology is less developed in this area. We propose a novel Bayesian ScSCCA method using a Bayesian infinite factor model. Using a multiplicative half-Cauchy prior process, we introduce sparsity at the level of the projection matrix. Additionally, we promote further sparsity in the covariance matrix by using a graphical horseshoe prior or a diagonal structure. We compare the results for our proposed model with competing frequentist and Bayesian methods and apply the developed method to omics data arising from a breast cancer study. In the second project, we perform Bayesian analysis for the von Mises-Fisher (vMF) distribution on the sphere, a common and important distribution for directional data. In the first part of this project, we propose a new conjugate prior for the mean vector and concentration parameter of the vMF distribution. Further, we prove properties of this prior such as finiteness and unimodality, and provide interpretations of its hyperparameters. In the second part, we utilize a popular prior structure for a mixture of vMF distributions. In this case, the posterior of the concentration parameter involves an intractable Bessel function of the first kind. We propose a novel Data Augmentation Strategy (DAS) using a Negative Binomial distribution that removes this intractable Bessel function. Furthermore, we apply the developed methodology to Diffusion Tensor Imaging (DTI) data for clustering to explore voxel connectivity in the human brain.
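
    For reference, the vMF density whose normalizing constant produces the Bessel intractability mentioned above has the standard form:

```latex
% von Mises-Fisher density on the unit sphere S^{p-1} (textbook form):
f(x \mid \mu, \kappa) = C_p(\kappa)\, \exp\!\left(\kappa\, \mu^\top x\right),
\qquad
C_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2}\, I_{p/2 - 1}(\kappa)},
```

    where the mean direction mu is a unit vector, kappa >= 0 is the concentration, and I_{p/2-1} is the modified Bessel function of the first kind; it is this Bessel factor that makes the concentration-parameter posterior intractable in the mixture setting.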

    Neural function approximation on graphs: shape modelling, graph discrimination & compression

    Graphs serve as a versatile mathematical abstraction of real-world phenomena in numerous scientific disciplines. This thesis belongs to the subject area of Geometric Deep Learning, a family of learning paradigms that capitalise on the increasing volume of non-Euclidean data to solve real-world tasks in a data-driven manner. In particular, we focus on the topic of graph function approximation using neural networks, which lies at the heart of many relevant methods. In the first part of the thesis, we contribute to the understanding and design of Graph Neural Networks (GNNs). Initially, we investigate the problem of learning on signals supported on a fixed graph. We show that treating graph signals as general graph spaces is restrictive and that conventional GNNs have limited expressivity. Instead, we expose a more enlightening perspective by drawing parallels between graph signals and signals on Euclidean grids, such as images and audio. Accordingly, we propose a permutation-sensitive GNN based on an operator analogous to shifts in grids and instantiate it on 3D meshes for shape modelling (Spiral Convolutions). Subsequently, we focus on learning on general graph spaces, and in particular on functions that are invariant to graph isomorphism. We identify a fundamental trade-off between invariance, expressivity, and computational complexity, which we address with a symmetry-breaking mechanism based on substructure encodings (Graph Substructure Networks). Substructures are shown to be a powerful tool that provably improves expressivity while controlling computational complexity, and a useful inductive bias in network science and chemistry. In the second part of the thesis, we discuss the problem of graph compression, where we analyse the information-theoretic principles and the connections with graph generative models. We show that another inevitable trade-off surfaces, now between computational complexity and compression quality, due to graph isomorphism. We propose a substructure-based dictionary coder, Partition and Code (PnC), with theoretical guarantees, which can be adapted to different graph distributions by estimating its parameters from observations. Additionally, contrary to the majority of neural compressors, PnC is parameter- and sample-efficient and is therefore of wide practical relevance. Finally, within this framework, substructures are further illustrated as a decisive archetype for learning problems on graph spaces.
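
    To make the substructure-encoding idea concrete, here is a toy illustration (triangle counts via networkx as a stand-in substructure; this is not the Graph Substructure Networks implementation) of the per-node counts that would be appended to initial node features before message passing:

```python
# Toy sketch of substructure encoding: count a rooted substructure (triangles)
# per node; such counts break symmetries that limit plain message-passing GNNs.
import networkx as nx

def triangle_counts(g: nx.Graph) -> dict:
    return nx.triangles(g)  # node -> number of triangles containing it

g = nx.complete_graph(4)
print(triangle_counts(g))  # {0: 3, 1: 3, 2: 3, 3: 3}
# These counts would be concatenated to the node features before message passing.
```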

    Representation Learning for Words and Entities

    This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview LSA (MVLSA). Through experiments on close to 50 different views, I show that MVLSA outperforms other state-of-the-art word embedding models. After that, I focus on learning entity representations for search and recommendation and present the second algorithm of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data to structured knowledge graphs and present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.
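
    As background, the single-view building block that MVLSA generalizes across views is an LSA-style truncated SVD of a reweighted word-context matrix. Below is a minimal sketch under that assumption; the PPMI weighting, matrix sizes, and embedding dimension are illustrative, not the thesis' exact setup:

```python
# Minimal single-view LSA sketch: PPMI reweighting of a word-context
# co-occurrence matrix followed by truncated SVD. Illustrative only.
import numpy as np

def lsa_embeddings(cooc: np.ndarray, dim: int) -> np.ndarray:
    total = cooc.sum()
    rows = cooc.sum(1, keepdims=True)
    cols = cooc.sum(0, keepdims=True)
    # PMI = log( p(w,c) / (p(w) p(c)) ); small epsilon avoids log(0)
    pmi = np.log((cooc * total + 1e-12) / (rows * cols + 1e-12))
    ppmi = np.maximum(pmi, 0.0)
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * s[:dim]  # one embedding row per word

rng = np.random.default_rng(1)
vecs = lsa_embeddings(rng.integers(0, 5, size=(100, 50)).astype(float), dim=16)
print(vecs.shape)  # (100, 16)
```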

    Kernel Feature Extraction Methods for Remote Sensing Data Analysis

    Technological advances in recent decades have improved our capabilities for collecting and storing high data volumes. However, in some fields, such as remote sensing, the peculiar characteristics of the data create several problems in data processing. High data volume, high dimensionality, heterogeneity, and nonlinearity can make the analysis and extraction of relevant information from these images a bottleneck for many real applications. Research applying image processing and machine learning techniques together with feature extraction allows the dimensionality of the data to be reduced while retaining the maximum information, and developments and applications of such feature extraction methodologies have consequently increased exponentially in remote sensing. This improves data visualization and knowledge discovery. Several feature extraction methods have been addressed in the literature depending on data availability; they can be classified as supervised, semisupervised, and unsupervised. In particular, feature extraction can be used in combination with (nonlinear) kernel methods. This combination facilitates obtaining a space that retains greater information content, and one of its most important properties is that it can be used directly for general tasks including classification, regression, clustering, ranking, compression, or data visualization. In this Thesis, we address the problems of different nonlinear feature extraction approaches based on kernel methods for remote sensing data analysis. Several improvements to current feature extraction methods are proposed to transform the data in order to make high-dimensional data tasks easier, such as classification or biophysical parameter estimation. This Thesis focuses on three main objectives to reach these improvements. The first objective is to include invariances in supervised kernel feature extraction methods. Through these invariances it is possible to generate virtual samples that help to mitigate the problem of the reduced number of samples in supervised methods. The proposed algorithm is a simple method that essentially generates new (synthetic) training samples from the available labeled samples (a sketch follows this abstract). These samples, used alongside the original samples in feature extraction methods, yield features that are more independent of one another than those obtained without virtual samples. Introducing prior knowledge by means of virtual samples can make classification and biophysical parameter estimation methods more robust. The second objective is to use generative kernels, i.e., probabilistic kernels, that learn directly from the original data by means of clustering techniques, finding local-to-global similarities along the manifold. The proposed kernel is useful for general feature extraction purposes. Furthermore, the kernel attempts to improve on current methods because it not only contains labeled-data information but also uses the unlabeled information of the manifold. Moreover, the proposed kernel is parameter-free, in contrast with parameterized functions such as the radial basis function (RBF). Probabilistic kernels are used to obtain new unsupervised and semisupervised methods in order to reduce the number and cost of labeled samples in remote sensing.
    The third objective is to develop new kernel feature extraction methods that improve the features obtained by current methods. Optimizing the functional can yield improved algorithms, for instance the Optimized Kernel Entropy Component Analysis (OKECA) method. The method is based on the Independent Component Analysis (ICA) framework and is more efficient than the standard Kernel Entropy Component Analysis (KECA) method in terms of dimensionality reduction (the entropy estimate these methods rank components by is given below). In this Thesis, the methods are focused on remote sensing data analysis. Nevertheless, feature extraction methods can be used to analyze multidimensional data in many research fields. For these reasons, the results are illustrated in an experimental sequence: first, the projections are analyzed by means of toy examples; then the algorithms are tested on standard databases with supervised information; and finally, remote sensing images are analyzed with the proposed methods.
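
    A hedged sketch of the virtual-sample idea in the first objective above: synthetic labeled samples are generated from the available ones and appended to the training set. Here a small additive perturbation stands in for the actual invariances used in the Thesis; the function and parameter names are illustrative, not the proposed algorithm:

```python
# Illustrative virtual-sample generation: perturb randomly chosen labeled
# samples and reuse their labels. The noise is a stand-in for a domain invariance.
import numpy as np

def add_virtual_samples(X: np.ndarray, y: np.ndarray, n_virtual: int, scale: float = 0.01):
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(X), size=n_virtual)            # samples to perturb
    Xv = X[idx] + scale * rng.standard_normal((n_virtual, X.shape[1]))
    return np.vstack([X, Xv]), np.concatenate([y, y[idx]])   # augmented training set

X = np.random.default_rng(1).standard_normal((20, 5))
y = np.arange(20) % 2
Xa, ya = add_virtual_samples(X, y, n_virtual=40)
print(Xa.shape, ya.shape)  # (60, 5) (60,)
```

    For the third objective, the quantity that KECA ranks components by (and within which OKECA optimizes) is the standard Renyi quadratic entropy estimate computed from the kernel matrix:

```latex
% Renyi quadratic entropy estimate from the kernel matrix
% K = \sum_i \lambda_i \mathbf{e}_i \mathbf{e}_i^\top (standard KECA formulation):
\hat{V}(p) = \frac{1}{N^2}\,\mathbf{1}^\top K\,\mathbf{1}
           = \frac{1}{N^2}\sum_{i=1}^{N}\left(\sqrt{\lambda_i}\,\mathbf{e}_i^\top \mathbf{1}\right)^2 ,
```

    so KECA retains the eigendirections with the largest entropy contributions rather than the largest eigenvalues.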