115 research outputs found

    Advances in Learning and Understanding with Graphs through Machine Learning

    Graphs have increasingly become a crucial way of representing large, complex and disparate datasets from a range of domains, including many scientific disciplines. Graphs are particularly useful for capturing complex relationships or interdependencies within or even between datasets, and enable unique insights which are not possible with other data formats. Over recent years, significant improvements have been made in the ability of machine learning approaches to automatically learn from and identify patterns in datasets. However, due to the unique nature of graphs, and of the data they are used to represent, employing machine learning with graphs has thus far proved challenging. A review of relevant literature has revealed that key challenges include issues arising with macro-scale graph learning, interpretability of machine-learned representations and a failure to incorporate the temporal dimension present in many datasets. Thus, the work and contributions presented in this thesis primarily investigate how modern machine learning techniques can be adapted to tackle key graph mining tasks, with a particular focus on optimal macro-level representation, interpretability and incorporating temporal dynamics into the learning process. The majority of methods employed are novel approaches centered on using artificial neural networks to learn from graph datasets. Firstly, by devising a novel graph fingerprint technique, it is demonstrated that this can successfully be applied to two different tasks, namely graph comparison and classification, whilst out-performing established baselines. Secondly, it is shown that a mapping can be found between certain topological features and graph embeddings. This, for perhaps the first time, suggests that machines may be learning something analogous to human knowledge acquisition, thus bringing interpretability to the graph embedding process.
    Thirdly, in exploring two new models for incorporating temporal information into the graph learning process, it is found that including such information is crucial to predictive performance in certain key tasks, such as link prediction, where state-of-the-art baselines are out-performed. The overall contribution of this work is to provide greater insight into, and explanation of, the ways in which machine learning on graphs is emerging as a crucial set of techniques for understanding complex datasets. This is important as these techniques can potentially be applied to a broad range of scientific disciplines. The thesis concludes with an assessment of limitations and recommendations for future research.

    Mean curvature flow

    Mean curvature flow is the negative gradient flow of volume, so any hypersurface flows through hypersurfaces in the direction of steepest descent for volume and eventually becomes extinct in finite time. Before it becomes extinct, topological changes can occur as it goes through singularities. If the hypersurface is in general or generic position, then we explain what singularities can occur under the flow, what the flow looks like near these singularities, and what this implies for the structure of the singular set. At the end, we will briefly discuss how one may be able to use the flow in low-dimensional topology.

    National Science Foundation (U.S.) (Grant DMS 11040934)
    National Science Foundation (U.S.) (Grant DMS 0906233)
    National Science Foundation (U.S.). Focused Research Group (Grant DMS 0854774)
    National Science Foundation (U.S.). Focused Research Group (Grant DMS 0853501)
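    The sense in which the flow is the negative gradient flow of volume can be sketched with two standard formulas (general facts about mean curvature flow, not specific to this abstract): each point moves in the normal direction with speed equal to the mean curvature, and the first variation of hypersurface volume along the flow is non-positive.

```latex
% Mean curvature flow: a family of hypersurfaces x(., t) moving in the
% direction of the unit normal \nu with speed given by the mean curvature H:
\[
  \frac{\partial x}{\partial t} \;=\; -\,H\,\nu \;=\; \Delta_{\Sigma_t}\, x .
\]
% First variation of hypersurface volume along the flow, showing that
% the flow is steepest descent for volume:
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\operatorname{Vol}(\Sigma_t)
  \;=\; -\int_{\Sigma_t} H^{2}\, \mathrm{d}\mu \;\le\; 0 .
\]
```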

    Effective and Trustworthy Dimensionality Reduction Approaches for High Dimensional Data Understanding and Visualization

    In recent years, the huge expansion of digital technologies has vastly increased the volume of data to be explored. Reducing the dimensionality of data is an essential step in data exploration and visualisation. The integrity of a dimensionality reduction technique relates to how well it maintains the data structure. The visualisation of low dimensional data that has not captured the high dimensional space data structure is untrustworthy. The extent to which a method maintains the data structure depends on several factors, such as the type of data considered and tuning parameters. The type of the data includes linear and nonlinear data, and the tuning parameters include the number of neighbours and perplexity. In reality, most of the data under consideration are nonlinear, and the process to tune parameters can be costly since it depends on the number of data samples considered. Currently, existing dimensionality reduction approaches suffer from the following problems: 1) they only work well with linear data, 2) the extent of the maintained data structure depends on the number of data samples considered, and/or 3) the tear problem and the false neighbours problem. To deal with all the above-mentioned problems, this research has developed the Same Degree Distribution (SDD), multi-SDD (MSDD) and parameter-free SDD approaches, which 1) save computational time, because their tuning does not depend on the number of data samples considered, 2) produce more trustworthy visualisation by using a degree-distribution that is smooth enough to capture local and global data structure, and 3) do not suffer from the tear and false neighbours problems, owing to using the same degree-distribution in the high and low dimensional spaces to calculate the similarities between data samples. The developed dimensionality reduction methods are tested with several popular synthetic and real datasets.
    The extent of the maintained data structure is evaluated using different quality metrics, i.e., Kendall’s Tau coefficient, Trustworthiness, Continuity, LCMC, and the Co-ranking matrix. Also, the theoretical analysis of the impact of the dissimilarity measure on structure capturing has been supported by simulation results on two different datasets, evaluated by Kendall’s Tau and the Co-ranking matrix. The SDD, MSDD, and parameter-free SDD methods do not outperform other global methods such as Isomap on data with a large fraction of large pairwise distances; this remains a task for further work. Reducing the computational complexity is another objective for further work.

    Scalable Probabilistic Model Selection for Network Representation Learning in Biological Network Inference

    A biological system is a complex network of heterogeneous molecular entities and their interactions contributing to various biological characteristics of the system. Although biological networks not only provide an elegant theoretical framework but also offer a mathematical foundation to analyze, understand, and learn from complex biological systems, the reconstruction of biological networks is an important and unsolved problem. Current biological networks are noisy, sparse and incomplete, limiting the ability to create a holistic view of the biological reconstructions and thus failing to provide a system-level understanding of the biological phenomena. Experimental identification of missing interactions is both time-consuming and expensive. Recent advancements in high-throughput data generation and significant improvement in computational power have led to novel computational methods to predict missing interactions. However, these methods still suffer from several unresolved challenges. It is challenging to extract information about interactions and incorporate that information into the computational model. Furthermore, the biological data are not only heterogeneous but also high-dimensional and sparse, presenting the difficulty of modeling from indirect measurements. The heterogeneous nature and sparsity of biological data pose significant challenges to the design of deep neural network structures, which rely on essentially empirical or heuristic model selection methods. These unscalable methods depend heavily on expertise and experimentation, which is a time-consuming and error-prone process, and are prone to overfitting. Furthermore, complex deep networks tend to be poorly calibrated, with high confidence on incorrect predictions. In this dissertation, we describe novel algorithms that address these challenges.
    In Part I, we design novel neural network structures to learn representations for biological entities and further expand the model to integrate heterogeneous biological data for biological interaction prediction. In Part II, we develop a novel Bayesian model selection method to infer the most plausible network structures warranted by the data. We demonstrate that our methods achieve state-of-the-art performance on tasks across various domains, including interaction prediction. Experimental studies on various interaction networks show that our method makes accurate and calibrated predictions. Our novel probabilistic model selection approach enables the network structures to dynamically evolve to accommodate incrementally available data. In conclusion, we discuss the limitations of and future directions for the proposed work.

    Generalized Surgery on Riemannian Manifolds of Positive Ricci Curvature

    The surgery theorem of Wraith states that the existence of metrics of positive Ricci curvature is preserved under surgery if certain metric and dimensional conditions are satisfied. We generalize this theorem by relaxing the conditions on the dimensions involved and by generalizing the surgery construction itself. As applications we construct metrics of positive Ricci curvature on manifolds obtained by plumbing. Specifically, this construction provides an extension of a result of Burdick on the existence of metrics of positive Ricci curvature on connected sums of linear sphere bundles, and, moreover, it yields infinite families of new examples of manifolds with a metric of positive Ricci curvature in all dimensions divisible by 6.

    The Horizontal Tunnelability Graph is Dual to Level Set Trees

    Time series data, reflecting phenomena like climate patterns and stock prices, offer key insights for prediction and trend analysis. Contemporary research has independently developed disparate geometric approaches to time series analysis. These include tree methods, visibility algorithms, as well as persistence-based barcodes common to topological data analysis. This thesis enhances time series analysis by innovatively combining these perspectives through our concept of horizontal tunnelability. We prove that the level set tree obtained from its Harris path (a time series) is dual to the time series' horizontal tunnelability graph, itself a subgraph of the more common horizontal visibility graph. This technique extends previous work by relating Merge, Chiral Merge, and Level Set Trees together along with visibility and persistence methodologies. Our method promises significant computational advantages and illuminates the threads tying together previously unconnected work. To facilitate its implementation, we provide accompanying empirical code and discuss its advantages.
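    The horizontal visibility criterion underlying this construction is standard: two samples of a series see each other horizontally when every sample strictly between them lies below both. A minimal sketch of the horizontal visibility graph follows (the tunnelability graph is the thesis' own subgraph construction and is not reproduced here):

```python
def horizontal_visibility_graph(series):
    """Edges (i, j), i < j, of the horizontal visibility graph:
    points i and j see each other horizontally iff every sample
    strictly between them is lower than both endpoints."""
    n = len(series)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            bar = min(series[i], series[j])
            if all(series[k] < bar for k in range(i + 1, j)):
                edges.append((i, j))
    return edges
```

    Consecutive samples are always mutually visible, so the graph is connected; on the series [3, 1, 2, 4] the edges are (0, 1), (0, 2), (0, 3), (1, 2) and (2, 3).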

    Persistent Homology in Multivariate Data Visualization

    Technological advances of recent years have changed the way research is done. When describing complex phenomena, it is now possible to measure and model a myriad of different aspects pertaining to them. This increasing number of variables, however, poses significant challenges for the visual analysis and interpretation of such multivariate data. Yet, the effective visualization of structures in multivariate data is of paramount importance for building models, forming hypotheses, and understanding intrinsic properties of the underlying phenomena. This thesis provides novel visualization techniques that advance the field of multivariate visual data analysis by helping represent and comprehend the structure of high-dimensional data. In contrast to approaches that focus on visualizing multivariate data directly or by means of their geometrical features, the methods developed in this thesis focus on their topological properties. More precisely, these methods provide structural descriptions that are driven by persistent homology, a technique from the emerging field of computational topology. Such descriptions are developed in two separate parts of this thesis. The first part deals with the qualitative visualization of topological features in multivariate data. It presents novel visualization methods that directly depict topological information, thus permitting the comparison of structural features in a qualitative manner. The techniques described in this part serve as low-dimensional representations that make the otherwise high-dimensional topological features accessible. We show how to integrate them into data analysis workflows based on clustering in order to obtain more information about the underlying data. The efficacy of such combined workflows is demonstrated by analysing complex multivariate data sets from cultural heritage and political science, for example, whose structures are hidden from common visualization techniques.
    The second part of this thesis is concerned with the quantitative visualization of topological features. It describes novel methods that measure different aspects of multivariate data in order to provide quantifiable information about them. Here, the topological characteristics serve as a feature descriptor. Using these descriptors, the visualization techniques in this part focus on augmenting and improving existing data analysis processes. Among others, they deal with the visualization of high-dimensional regression models, the visualization of errors in embeddings of multivariate data, as well as the assessment and visualization of the results of different clustering algorithms. All the methods presented in this thesis are evaluated and analysed on different data sets in order to show their robustness. This thesis demonstrates that the combination of geometrical and topological methods may support, complement, and surpass existing approaches for multivariate visual data analysis.

    Graphlet based network analysis

    The majority of existing works on network analysis study properties that are related to the global topology of a network. Examples of such properties include the diameter, the power-law exponent, and spectra of graph Laplacians. Such works enhance our understanding of real-life networks, or enable us to generate synthetic graphs with real-life graph properties. However, many of the existing problems on networks require the study of local topological structures of a network. Graphlets, which are small induced subgraphs, capture the local topological structure of a network effectively, and in recent years they have become increasingly popular for characterizing large networks. Graphlet based network analysis can vary based on the types of topological structures considered and the kinds of analysis tasks. For example, one of the most popular and earliest graphlet analyses is based on triples (triangles or paths of length two). Graphlet analysis based on cycles and cliques is also explored in several recent works. Another, more comprehensive class of graphlet analysis methods works with graphlets of specific sizes; graphlets with three, four or five nodes ({3, 4, 5}-Graphlets) are particularly popular. For all the above analysis tasks, excessive computational cost is a major challenge, which becomes severe for analyzing large networks with millions of vertices. To overcome this challenge, effective methodologies are urgently needed. Furthermore, the existence of efficient methods for graphlet analysis will encourage more works broadening the scope of graphlet analysis. For graphlet counting, we propose edge iteration based methods (ExactTC and ExactGC) for efficiently computing triple and graphlet counts. The proposed methods compute local graphlet statistics in the neighborhood of each edge in the network and then aggregate the local statistics to give a global characterization (transitivity, graphlet frequency distribution (GFD), etc.) of the network.
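    The edge-iteration idea for triangle counting can be sketched as follows (an illustration of the general approach, not the thesis' ExactTC code): the triangles containing an edge (u, v) are exactly the common neighbors of u and v, and summing this local statistic over all edges counts each triangle three times, once per edge.

```python
def triangle_count(adj):
    """Edge-iteration triangle counting (a sketch of the general idea,
    not the thesis' ExactTC implementation).  adj maps each node to its
    set of neighbors.  The triangles containing edge (u, v) are the
    common neighbors of u and v; the sum over all edges counts each
    triangle three times."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                total += len(adj[u] & adj[v])
    return total // 3
```

    The same per-edge statistics, restricted to a sampled set of edges and rescaled, give an ApproxTC-style estimate; and because each edge's statistic is computed independently, the loop parallelizes trivially.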
    The scalability of the proposed methods is further improved by iterating over a sampled set of edges and estimating the triangle count (ApproxTC) and graphlet count (Graft) by approximate rescaling of the aggregated statistics. The independence of the local feature vector construction corresponding to each edge makes the methods embarrassingly parallelizable. We show this by giving a parallel edge iteration method, ParApproxTC, for triangle counting. For graphlet sampling, we propose Markov Chain Monte Carlo (MCMC) sampling based methods for triple and graphlet analysis. The proposed triple analysis methods, Vertex-MCMC and Triple-MCMC, estimate the triangle count and network transitivity. Vertex-MCMC samples triples in two steps. First, the method selects a node (using the MCMC method) with probability proportional to the number of triples of which the node is the center. Then Vertex-MCMC samples uniformly from the triples centered at the selected node. The method Triple-MCMC samples triples by performing an MCMC walk in a triple sample space, which consists of all the possible triples in a network. The MCMC method performs triple sampling by walking from one triple to one of its neighboring triples in the triple space. We design the triple space in such a way that two triples are neighbors only if they share exactly two nodes. The proposed triple sampling algorithms Vertex-MCMC and Triple-MCMC are able to sample triples from any arbitrary distribution, as long as the weight of each triple is locally computable. The proposed methods are able to sample triples without knowledge of the complete network structure. Information regarding only the local neighborhood structure of the currently observed node or triple is enough to walk to the next node or triple. This gives the proposed methods a significant advantage: the capability to sample triples from networks with restricted access, for which a direct sampling based method is simply not applicable.
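    The two-step sampling idea behind Vertex-MCMC can be illustrated in miniature. In this sketch the center node is drawn directly with probability proportional to the number of triples it centers (the thesis uses an MCMC walk for this step, which avoids needing global access to the network); the fraction of sampled centered triples whose endpoints are adjacent is then, in expectation, 3 × #triangles / #triples, i.e. the transitivity.

```python
import random
from math import comb

def estimate_transitivity(adj, samples=20000, seed=0):
    """Sampling estimator for network transitivity (a sketch inspired
    by the two-step Vertex-MCMC idea, with direct weighted sampling of
    the center node in place of the MCMC walk).  A node v centers
    comb(deg(v), 2) triples; a sampled triple 'closes' when its two
    endpoints are adjacent, i.e. when it is part of a triangle."""
    rng = random.Random(seed)
    nodes = [v for v in adj if len(adj[v]) >= 2]
    weights = [comb(len(adj[v]), 2) for v in nodes]
    closed = 0
    for _ in range(samples):
        c = rng.choices(nodes, weights)[0]    # center, p(v) ~ #triples at v
        a, b = rng.sample(sorted(adj[c]), 2)  # uniform pair of c's neighbors
        closed += b in adj[a]                 # does the triple close?
    return closed / samples
```

    On a triangle with a pendant vertex there are five centered triples, three of which close, so the estimate converges to the exact transitivity of 0.6.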
    The proposed methods are also suitable for dynamic and large networks. Similar to the concept of Triple-MCMC, we propose Guise for sampling graphlets of sizes three, four and five ({3, 4, 5}-Graphlets). Guise samples graphlets by performing an MCMC walk on a graphlet sample space containing all the graphlets of sizes three, four and five in the network. Despite the proven utility of graphlets in static network analysis, works harnessing the ability of graphlets for dynamic network analysis are yet to come. Dynamic networks contain additional time information for their edges. With time, the topological structure of a dynamic network changes: edges can appear, disappear and reappear over time. In this direction, predicting the link state of a network at a future time, given a collection of link states at earlier times, is an important task with many real-life applications. In the existing literature, this task is known as link prediction in dynamic networks. Performing this task is more difficult than its counterpart in static networks because an effective feature representation of node-pair instances is hard to obtain for a dynamic network. We design a novel graphlet transition based feature embedding for node-pair instances of a dynamic network. Our proposed method, GraTFEL, uses automatic feature learning methodologies on such graphlet transition based features to give a low-dimensional feature embedding of unlabeled node-pair instances. The feature learning task is modeled as an optimal coding task where the objective is to minimize the reconstruction error. GraTFEL solves this optimization task by using a gradient descent method. We validate the effectiveness of the learned optimal feature embedding by utilizing it for link prediction in real-life dynamic networks. Specifically, we show that GraTFEL, which uses the extracted feature embedding of graphlet transition events, outperforms existing methods that use well-known link prediction features.
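    The optimal-coding objective can be illustrated with a minimal linear autoencoder trained by gradient descent (an assumption-laden sketch, not GraTFEL itself; the graphlet transition features and the actual architecture are described in the thesis, and the encoder/decoder matrices here are hypothetical):

```python
import numpy as np

def learn_embedding(X, dim=2, lr=0.01, epochs=500, seed=0):
    """Minimal linear autoencoder minimizing the squared reconstruction
    error ||X W V - X||^2 by gradient descent (an illustration of the
    optimal-coding objective only; W is a hypothetical encoder and V a
    hypothetical decoder, not GraTFEL's model)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, dim))  # encoder weights
    V = rng.normal(scale=0.1, size=(dim, d))  # decoder weights
    for _ in range(epochs):
        Z = X @ W                    # low-dimensional codes
        R = Z @ V - X                # reconstruction residual
        gV = Z.T @ R                 # gradient of 0.5*||R||^2 w.r.t. V
        gW = X.T @ (R @ V.T)         # gradient of 0.5*||R||^2 w.r.t. W
        V -= lr * gV
        W -= lr * gW
    return X @ W                     # learned feature embedding
```

    In GraTFEL the rows of X would be graphlet-transition feature vectors of node-pair instances, and the returned codes serve as the low-dimensional embedding fed to the link predictor.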