115 research outputs found

    Overcoming uncertainty for within-network relational machine learning

    People increasingly communicate through email and social networks to maintain friendships and conduct business, as well as share online content such as pictures, videos and products. Relational machine learning (RML) utilizes a set of observed attributes and network structure to predict corresponding labels for items; for example, to predict individuals engaged in securities fraud, we can utilize phone calls and workplace information to make joint predictions over the individuals. However, in large-scale and partially observed network domains, missing labels and edges can significantly impact standard relational machine learning methods by introducing bias into the learning and inference processes. In this thesis, we identify the effects on parameter estimation, correct the biases, and model the uncertainty of the missing data to improve predictive performance. In particular, we investigate this issue on a variety of modeling scenarios and prediction problems. First, we introduce the Transitive Chung-Lu (TCL) random graph model for modeling the conditional distribution of edges given a partially observed network. This model fits within a class of scalable generative graph models with scalable sampling processes that we generalize to model distributions of networks with correlated attribute variables via Attributed Graph Models. Second, we utilize TCL to incorporate edge probabilities into relational learning and inference models for partially observed network domains. As part of this work, we give a linear-time algorithm to perform variational inference over a squared network. We apply the resulting semi-supervised model, Probabilistic Relational EM (PR-EM), to the Active Exploration domain to iteratively locate positive examples in partially observed networks. Due to the sampling process, this domain exhibits extreme bias for learning and inference: we show that PR-EM operates with high accuracy despite the difficult domain.
Third, we investigate the performance of applying Relational EM methods for semi-supervised relational learning in partially labeled networks and find that fixed point estimates have considerable approximation errors during learning and inference. To solve this, we propose the Relational Stochastic EM and Relational Data Augmentation methods for semi-supervised relational learning and demonstrate that these approaches improve over the Relational EM method. Fourth, we improve on existing semi-supervised learning methods by imposing hard constraints on the inference steps, allowing semi-supervised methods to learn using better approximations during learning and inference for partially labeled networks. In particular, we find that we can correct for the approximated parameter learning errors during the collective inference step by imposing a Maximum Entropy constraint. We find that this correction allows us to utilize a better approximation over the unlabeled data. In addition, we prove that given an allowable error, this method adds only a constant overhead to the original collective inference method. Overall, all of the methods presented in this thesis have provable subquadratic runtimes. We demonstrate each on large-scale networks, in some cases including networks with millions of vertices and/or edges. Across all these approaches, we show that incorporating the uncertainty into the modeling process improves modeling and predictive performance.
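
For orientation, the basic (non-transitive) Chung-Lu model that the Transitive Chung-Lu model builds on can be sketched as follows. This is a minimal illustration, not the thesis's TCL implementation: the function name `chung_lu_edge_probabilities` is hypothetical, and the sketch only computes the standard edge probability p_ij = d_i * d_j / (2m), capped at 1, from a list of target degrees.

```python
from itertools import combinations


def chung_lu_edge_probabilities(degrees):
    """Return {(i, j): p_ij} for the basic Chung-Lu model.

    Assumption: p_ij = min(1, d_i * d_j / (2m)), where 2m is the sum of
    all target degrees. Self-loops are ignored in this sketch.
    """
    total = sum(degrees)  # equals 2m, twice the target number of edges
    probs = {}
    for i, j in combinations(range(len(degrees)), 2):
        if total == 0:
            probs[(i, j)] = 0.0  # no edges expected in an empty graph
        else:
            probs[(i, j)] = min(1.0, degrees[i] * degrees[j] / total)
    return probs


# Example: four nodes with target degrees 3, 2, 2, 1 (so 2m = 8).
example_probs = chung_lu_edge_probabilities([3, 2, 2, 1])
```

Because every pair of nodes is visited, this sketch is quadratic in the number of nodes; the scalable sampling processes mentioned in the abstract avoid that cost and are not reproduced here.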

    Block-Approximated Exponential Random Graphs

    An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs. By utilizing fast matrix block-approximation techniques, we propose an approximate framework for such non-trivial ERGs that results in dyadic independence (i.e., edge independent) distributions, while being able to meaningfully model both local information of the graph (e.g., degrees) as well as global information (e.g., clustering coefficient, assortativity, etc.) if desired. This allows one to efficiently generate random networks with similar properties as an observed network, and the models can be used for several downstream tasks such as link prediction. Our methods are scalable to sparse graphs consisting of millions of nodes. Empirical evaluation demonstrates competitiveness in terms of both speed and accuracy with state-of-the-art methods -- which are typically based on embedding the graph into some low-dimensional space -- for link prediction, showcasing the potential of a more direct and interpretable probabilistic model for this task. Comment: Accepted for DSAA 2020 conference

    Bioinformatics Tools for RNA-seq Data Analysis

    RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. The availability of RNA-seq data encouraged computational biologists to develop algorithms to process the data in a statistically disciplined manner to generate biologically meaningful results. Clustering viral sequences allows us to characterize the composition and structure of intrahost and interhost viral populations, which play a crucial role in disease progression and epidemic spread. In this research, we propose and validate a new entropy-based method for clustering aligned viral sequences considered as categorical data. The method finds a homogeneous clustering by minimizing information entropy rather than the distance between sequences in the same cluster. Moreover, we present a novel pathway analysis method based on the Expectation-Maximization (EM) algorithm to study enzyme expression and pathway activity using meta-transcriptomic data. We will also discuss our approaches to generating unique gene signatures to understand the role of sensory nerve interference in the anti-melanoma immune response and to study the racial disparity in triple-negative breast cancer. Finally, we present our method to detect retained introns in RNA-seq data to develop a vaccine against cancers having p53 mutations. In summary, this research provides novel approaches to exploring RNA-seq data and their application to real-world biological research.
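
The entropy criterion described above can be illustrated with a small sketch. It is an assumption-laden example, not the authors' method: the function names are hypothetical, and the objective shown (per-position Shannon entropy summed over alignment columns, weighted by cluster size) is one plausible reading of "minimizing information entropy" for aligned sequences treated as categorical data.

```python
from collections import Counter
from math import log2


def column_entropy(symbols):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def cluster_entropy(sequences):
    """Sum of column entropies for equal-length aligned sequences."""
    return sum(column_entropy(column) for column in zip(*sequences))


def clustering_entropy(clusters):
    """Size-weighted total entropy of a clustering (assumed objective)."""
    return sum(len(c) * cluster_entropy(c) for c in clusters if c)


# A homogeneous clustering scores lower than a mixed one.
homogeneous = [["AAC", "AAC"], ["GGT", "GGT"]]
mixed = [["AAC", "GGT"], ["AAC", "GGT"]]
assert clustering_entropy(homogeneous) < clustering_entropy(mixed)
```

A clustering procedure would search over assignments to reduce this score; the search strategy itself is not specified in the abstract and is left out here.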

    Mining and modeling graphs using patterns and priors


    Signed Network Modeling Based on Structural Balance Theory

    The modeling of networks, specifically with generative models, has been shown to provide a plethora of information about the underlying network structures, as well as many other benefits behind their construction. Recently there has been a considerable increase in interest in better understanding and modeling networks, but the vast majority of this work has been for unsigned networks. However, many networks can have positive and negative links (signed networks), especially in online social media, and they inherently have properties not found in unsigned networks due to the added complexity. Specifically, the positive to negative link ratio and the distribution of signed triangles in the networks are properties that are unique to signed networks and would need to be explicitly modeled. This is because their underlying dynamics are not random, but controlled by social theories, such as Structural Balance Theory, which loosely states that users in social networks will prefer triadic relations that involve less tension. Therefore, we propose a model based on Structural Balance Theory and the unsigned Transitive Chung-Lu model for the modeling of signed networks. Our model introduces two parameters that help maintain the positive link ratio and the proportion of balanced triangles. Empirical experiments on three real-world signed networks demonstrate the importance of designing models specific to signed networks based on social theories to obtain better performance in maintaining signed network properties while generating synthetic networks. Comment: CIKM 2018: https://dl.acm.org/citation.cfm?id=327174
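
The two signed-network properties named above can be measured with a short sketch. It is illustrative only and does not reproduce the proposed generative model: the function names and the edge representation (a dict from `frozenset({u, v})` to +1 or -1) are assumptions, and a triangle is counted as balanced when the product of its three edge signs is positive, the usual reading of Structural Balance Theory.

```python
from itertools import combinations


def positive_link_ratio(signs):
    """Fraction of edges with a positive sign."""
    if not signs:
        return 0.0
    return sum(1 for s in signs.values() if s > 0) / len(signs)


def balanced_triangle_fraction(signs):
    """Fraction of triangles whose three edge signs multiply to +1."""
    nodes = sorted({node for edge in signs for node in edge})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in signs for edge in edges):
            total += 1
            product = signs[edges[0]] * signs[edges[1]] * signs[edges[2]]
            if product > 0:
                balanced += 1
    return balanced / total if total else 0.0


# Hypothetical toy network: triangle 1-2-3 is unbalanced, 1-2-4 is balanced.
toy = {
    frozenset((1, 2)): +1,
    frozenset((2, 3)): +1,
    frozenset((1, 3)): -1,
    frozenset((2, 4)): -1,
    frozenset((1, 4)): -1,
}
```

The triangle enumeration is cubic in the number of nodes, which is fine for a toy example but not for the real-world networks the paper evaluates.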

    Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications (Extended Version)

    Although the "scale-free" literature is large and growing, it gives neither a precise definition of scale-free graphs nor rigorous proofs of many of their claimed properties. In fact, it is easily shown that the existing theory has many inherent contradictions and verifiably false claims. In this paper, we propose a new, mathematically precise, and structural definition of the extent to which a graph is scale-free, and prove a series of results that recover many of the claimed properties while suggesting the potential for a rich and interesting theory. With this definition, scale-free (or its opposite, scale-rich) is closely related to other structural graph properties such as various notions of self-similarity (or respectively, self-dissimilarity). Scale-free graphs are also shown to be the likely outcome of random construction processes, consistent with the heuristic definitions implicit in existing random graph approaches. Our approach clarifies much of the confusion surrounding the sensational qualitative claims in the scale-free literature, and offers rigorous and quantitative alternatives. Comment: 44 pages, 16 figures. The primary version is to appear in Internet Mathematics (2005)

    Dynamical effects of degree correlations in networks of type I model neurons : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Mathematics at Massey University, Auckland, New Zealand

    The complex behaviour of human brains arises from the complex interconnection of the well-known building blocks -- neurons. With novel imaging techniques it is possible to monitor firing patterns and link them to brain function or dysfunction. How the network structure affects neuronal activity is, however, poorly understood. In this thesis we study the effects of degree correlations in recurrent neuronal networks on self-sustained activity patterns. Firstly, we focus on correlations between the in- and out-degrees of individual neurons. By using Theta Neurons and Ott/Antonsen theory, we can derive a set of coupled differential equations for the expected dynamics of neurons with equal in-degree. A Gaussian copula is used to introduce correlations between a neuron’s in- and out-degree, and numerical bifurcation analysis is used to determine the effects of these correlations on the network's dynamics. We find that positive correlations increase the mean firing rate, while negative correlations have the opposite effect. Secondly, we turn to degree correlations between neurons -- often referred to as degree assortativity -- which describe the increased or decreased probability of connecting two neurons based on their in- or out-degrees, relative to what would be expected by chance. We present an alternative derivation of coarse-grained degree mean field equations, again utilising Theta Neurons and the Ott/Antonsen ansatz, but incorporating actual adjacency matrices. Families of degree connectivity matrices are parametrised by assortativity coefficients and subsequently reduced by singular value decomposition. Thus, we efficiently perform numerical bifurcation analysis on a set of coarse-grained equations. To the best of our knowledge, this is the first study to examine the four possible types of degree assortativity separately, showing that two have no effect on the networks' dynamics, while the other two can have a significant effect.
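
The Gaussian copula step mentioned above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the thesis's code: the function name `correlated_degree_pairs`, the use of empirical degree samples as marginals, and the parameter `rho` (the correlation of the underlying normal variables) are all hypothetical choices.

```python
import random
from statistics import NormalDist


def correlated_degree_pairs(in_degrees, out_degrees, rho, n, seed=0):
    """Draw n (in-degree, out-degree) pairs coupled by a Gaussian copula.

    Marginals are the empirical distributions of the supplied degree
    samples; rho in [-1, 1] is the correlation of the latent normals.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    in_sorted, out_sorted = sorted(in_degrees), sorted(out_degrees)

    def quantile(sorted_values, u):
        # Empirical inverse CDF: map u in [0, 1] to an observed value.
        index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
        return sorted_values[index]

    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal with correlation rho to z1.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        pairs.append((quantile(in_sorted, u1), quantile(out_sorted, u2)))
    return pairs


# Example: positively correlated in- and out-degrees from the same marginal.
degree_sample = list(range(1, 51))
pairs = correlated_degree_pairs(degree_sample, degree_sample, rho=0.9, n=2000)
```

The copula changes only the dependence between the two degrees; each marginal distribution is left as supplied, which is what makes it suitable for isolating the effect of the correlation.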

    Mining complex trees for hidden fruit : a graph–based computational solution to detect latent criminal networks : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Technology at Massey University, Albany, New Zealand.

    The detection of crime is a complex and difficult endeavour. Public and private organisations – focusing on law enforcement, intelligence, and compliance – commonly apply the rational isolated actor approach premised on observability and materiality. This is manifested largely as conducting entity-level risk management, sourcing ‘leads’ from reactive covert human intelligence sources and/or proactive sources by applying simple rules-based models. Focusing on discrete observable and material actors simply ignores that criminal activity exists within a complex system deriving its fundamental structural fabric from the complex interactions between actors - with those most unobservable likely to be both criminally proficient and influential. The graph-based computational solution developed to detect latent criminal networks is a response to the inadequacy of the rational isolated actor approach that ignores the connectedness and complexity of criminality. The core computational solution, written in the R language, consists of novel entity resolution, link discovery, and knowledge discovery technology. Entity resolution enables the fusion of multiple datasets with high accuracy (mean F-measure of 0.986 versus competitors' 0.872), generating a graph-based expressive view of the problem. Link discovery comprises link prediction and link inference, enabling the high-performance detection (accuracy of ~0.8 versus ~0.45 for relevant published models) of unobserved relationships such as identity fraud. Knowledge discovery uses the fused graph generated and applies the “GraphExtract” algorithm to create a set of subgraphs representing latent functional criminal groups, and a mesoscopic graph representing how this set of criminal groups is interconnected. Latent knowledge is generated from a range of metrics including the “Super-broker” metric and attitude prediction.
The computational solution has been evaluated on a range of datasets that mimic an applied setting, demonstrating a scalable (tested on ~18 million node graphs) and performant (~33 hours runtime on a non-distributed platform) solution that successfully detects relevant latent functional criminal groups in around 90% of cases sampled and enables the contextual understanding of the broader criminal system through the mesoscopic graph and associated metadata. The augmented data assets generated provide a multi-perspective systems view of criminal activity that enables advanced informed decision making across the microscopic-mesoscopic-macroscopic spectrum.

    Disentangling ecological networks in marine microbes

    There is a myriad of microorganisms on Earth contributing to global biogeochemical cycles, and their interactions are considered pivotal for ecosystem function. Previous studies have already determined relationships between a limited number of microorganisms. Yet, we still need to understand a large number of interactions to increase our knowledge of complex microbiomes. This is challenging because of the vast number of possible interactions. Thus, microbial interactions remain barely known to date. Networks are a great tool to handle the vast number of microorganisms and their connections, explore potential microbial interactions, and elucidate patterns of microbial ecosystems. This thesis is located at the intersection of network inference and network analysis. The presented methodology aims to support and advance marine microbial investigations by reducing noise and elucidating patterns in inferred association networks for subsequent biological downstream analyses. This thesis’s main contribution to marine microbial interaction studies is the development of the program EnDED (Environmentally-Driven Edge Detection), a computational framework to identify environmentally-driven associations inside microbial association networks inferred from omics datasets. We applied the methodology to a model marine microbial ecosystem at the Blanes Bay Microbial Observatory (BBMO) in the North-Western Mediterranean Sea (ten years of monthly sampling). We also applied the methodology to a dataset compilation covering six global-ocean regions from the surface (3 m) to the deep ocean (down to 4539 m). Thus, our methodology provided a step towards understanding marine microbial temporal patterns and studying the marine microbial distribution in space via the horizontal (ocean regions) and vertical (water column) axes. To arrive at accurate interaction hypotheses, it is important to determine, quantify, and remove environmentally-driven associations in marine microbial association networks. Furthermore, our results underscored the need to study the dynamic nature of networks, in contrast to using single static networks aggregated over time or space. Our new methodologies can be used by a wide range of researchers investigating networks and interactions in diverse microbiomes. Postprint (published version)