
    Overcoming uncertainty for within-network relational machine learning

    People increasingly communicate through email and social networks to maintain friendships and conduct business, as well as to share online content such as pictures, videos, and products. Relational machine learning (RML) utilizes a set of observed attributes and the network structure to predict corresponding labels for items; for example, to predict individuals engaged in securities fraud, we can utilize phone calls and workplace information to make joint predictions over the individuals. However, in large-scale and partially observed network domains, missing labels and edges can significantly impact standard relational machine learning methods by introducing bias into the learning and inference processes. In this thesis, we identify the effects on parameter estimation, correct the biases, and model the uncertainty of the missing data to improve predictive performance. In particular, we investigate this issue across a variety of modeling scenarios and prediction problems. First, we introduce the Transitive Chung Lu (TCL) random graph model for modeling the conditional distribution of edges given a partially observed network. This model fits within a class of generative graph models with scalable sampling processes, which we generalize to model distributions of networks with correlated attribute variables via Attributed Graph Models. Second, we utilize TCL to incorporate edge probabilities into relational learning and inference models for partially observed network domains. As part of this work, we give a linear-time algorithm to perform variational inference over a squared network. We apply the resulting semi-supervised model, Probabilistic Relational EM (PR-EM), to the Active Exploration domain to iteratively locate positive examples in partially observed networks. Due to the sampling process, this domain exhibits extreme bias for learning and inference; we show that PR-EM operates with high accuracy despite the difficult domain. Third, we investigate the performance of Relational EM methods for semi-supervised relational learning in partially labeled networks and find that fixed-point estimates incur considerable approximation errors during learning and inference. To address this, we propose the Relational Stochastic EM and Relational Data Augmentation methods for semi-supervised relational learning and demonstrate that both improve over the Relational EM method. Fourth, we improve on existing semi-supervised learning methods by imposing hard constraints on the inference steps, allowing semi-supervised methods to learn from better approximations over partially labeled networks. In particular, we find that we can correct for approximate parameter learning during the collective inference step by imposing a Maximum Entropy constraint, which allows us to utilize a better approximation over the unlabeled data. In addition, we prove that, for a given allowable error, this correction adds only constant overhead to the original collective inference method. Overall, all of the methods presented in this thesis have provable subquadratic runtimes. We demonstrate each on large-scale networks, in some cases on networks with millions of vertices and/or edges. Across all of these approaches, we show that incorporating the uncertainty of the missing data into the modeling process improves modeling and predictive performance.
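    The TCL model extends the classic Chung-Lu random graph model, in which the probability of an edge is roughly proportional to the product of the endpoint degrees, with a transitive-closure step that creates triangle-closing edges. As a point of reference, the sketch below implements only the plain Chung-Lu part; the function names are mine, and the TCL closure step (picking an endpoint via a two-hop walk with some probability) is described in the thesis rather than implemented here.

        # Minimal Chung-Lu sketch (illustrative; not the thesis code).
        import numpy as np

        def chung_lu_edge_prob(degrees: np.ndarray) -> np.ndarray:
            """P(i~j) ~= d_i * d_j / (2m), capped at 1, for an undirected graph."""
            two_m = degrees.sum()                   # 2m = sum of degrees
            p = np.outer(degrees, degrees) / two_m  # expected-degree model
            np.fill_diagonal(p, 0.0)                # no self-loops
            return np.minimum(p, 1.0)

        def sample_graph(degrees, rng=np.random.default_rng(0)):
            """Sample one network whose expected degrees match `degrees`."""
            p = chung_lu_edge_prob(np.asarray(degrees, dtype=float))
            upper = np.triu(rng.random(p.shape) < p, k=1)
            return (upper | upper.T).astype(int)    # symmetric adjacency matrix

        adj = sample_graph([3, 2, 2, 1, 1, 1])

    Sampling of this form stays cheap per edge, which is consistent with the scalable sampling processes the abstract emphasizes as a prerequisite for feeding edge probabilities into relational learning at scale.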

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse types. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods, accomplished by identifying a number of key properties that distinguish such methods and determine their suitability for different types of problems. Finally, we discuss a few challenges for future research.
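    In this unifying view, the targets of an MTP problem can be arranged as an n-by-m matrix Y whose entry types (real-valued, binary) and observation pattern (fully observed, partially missing) determine the subfield: multivariate regression, multi-label classification, or matrix completion. The sketch below is a hypothetical illustration of that framing with the simplest possible baseline, one independent ridge model per target column; it is not a method from the paper.

        # Hypothetical MTP framing: X is n x d, Y is n x m with NaN = unobserved.
        import numpy as np
        from dataclasses import dataclass

        @dataclass
        class MTPProblem:
            X: np.ndarray   # instance features, shape (n, d)
            Y: np.ndarray   # target matrix, shape (n, m); NaN marks missing entries

        def independent_ridge(prob: MTPProblem, lam: float = 1e-3) -> np.ndarray:
            """Fit one ridge regressor per target column on its observed entries."""
            n, d = prob.X.shape
            W = np.zeros((d, prob.Y.shape[1]))
            for j in range(prob.Y.shape[1]):
                mask = ~np.isnan(prob.Y[:, j])          # use observed targets only
                Xo, yo = prob.X[mask], prob.Y[mask, j]
                W[:, j] = np.linalg.solve(Xo.T @ Xo + lam * np.eye(d), Xo.T @ yo)
            return W                                     # predictions: X @ W

    Methods surveyed in the paper improve on this kind of baseline precisely by sharing information across the columns of Y instead of treating them independently.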

    Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

    This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. Such a problem-oriented taxonomy has allowed us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. This comprehensive problem-oriented review of the advances in transfer learning has not only revealed the challenges in transfer learning for visual recognition, but also identified the problems (eight of the seventeen) that have scarcely been studied. This survey not only presents an up-to-date technical review for researchers, but also offers a systematic approach and a reference for machine learning practitioners to categorise a real problem and look up a possible solution accordingly.
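    To make the attribute-driven taxonomy concrete, the hypothetical helper below maps two common attributes (availability of target labels, and whether source and target share a feature space) to standard problem names. It illustrates the categorise-then-look-up workflow the survey recommends; it is not a reproduction of the paper's actual seventeen-problem taxonomy.

        # Hypothetical, simplified taxonomy lookup; the survey's real taxonomy
        # uses more attributes and distinguishes seventeen problems.
        def categorise(target_labels: str, same_feature_space: bool) -> str:
            """target_labels: 'none' | 'few' | 'full'."""
            if not same_feature_space:
                return "heterogeneous transfer learning"
            if target_labels == "none":
                return "unsupervised domain adaptation"
            if target_labels == "few":
                return "semi-supervised domain adaptation"
            return "supervised domain adaptation / fine-tuning"

        print(categorise("none", True))   # -> unsupervised domain adaptation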

    Modeling Structural Brain Connectivity


    Learning Collective Behavior in Multi-relational Networks

    With the rapid expansion of the Internet and the WWW, the problem of analyzing social media data has received an increasing amount of attention over the past decade. The boom in social media platforms offers many possibilities to study human collective behavior and interactions on an unprecedented scale. In the past, much work has been done on the problem of learning from networked data with homogeneous topologies, where instances are explicitly or implicitly inter-connected by a single type of relationship. In contrast to traditional content-only classification methods, relational learning succeeds in improving classification performance by leveraging the correlation of the labels between linked instances. However, networked data extracted from social media, web pages, and bibliographic databases can contain entities of multiple classes linked for various causal reasons, so treating all links in a homogeneous way can limit the performance of relational classifiers, and learning collective behavior and interactions in heterogeneous networks becomes much more complex. The contributions of this dissertation include 1) two classification frameworks for identifying human collective behavior in multi-relational social networks, and 2) unsupervised and supervised learning models for relationship prediction in multi-relational collaborative networks. Our methods improve the performance of homogeneous predictive models by differentiating heterogeneous relations and capturing the prominent interaction patterns underlying the network structure. The work has been evaluated on various real-world social networks. We believe this study will be useful for analyzing human collective behavior and interactions, specifically in scenarios where the heterogeneous relationships in the network arise from various causal reasons.
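    A simple way to see how differentiating relation types helps is a weighted-vote relational-neighbour style classifier in which each relation type carries its own weight. The sketch below is illustrative only: the weights are given rather than learned, and it is not one of the dissertation's models.

        # Illustrative multi-relational label propagation; not the dissertation's method.
        import numpy as np

        def multirelational_wvrn(adj_by_rel, labels, rel_weights, n_iter=10):
            """adj_by_rel: dict rel -> (n, n) adjacency; labels: float array with
            NaN for unlabeled nodes; rel_weights: dict rel -> relation importance.
            Returns estimated P(label = 1) for every node."""
            known = ~np.isnan(labels)
            p = np.where(known, labels, 0.5)             # neutral prior for unknowns
            for _ in range(n_iter):
                num = sum(rel_weights[r] * (A @ p) for r, A in adj_by_rel.items())
                den = sum(rel_weights[r] * A.sum(axis=1) for r, A in adj_by_rel.items())
                new_p = np.divide(num, den, out=np.full_like(p, 0.5), where=den > 0)
                p = np.where(known, labels, new_p)       # clamp observed labels
            return p

    Collapsing all relations into one adjacency matrix corresponds to setting every weight equal, which is exactly the homogeneous treatment the dissertation argues against.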

    Transfer learning for multicenter classification of chronic obstructive pulmonary disease

    Chronic obstructive pulmonary disease (COPD) is a lung disease which can be quantified using chest computed tomography (CT) scans. Recent studies have shown that COPD can be automatically diagnosed using weakly supervised learning of intensity and texture distributions. However, until now such classifiers have only been evaluated on scans from a single domain, and it is unclear whether they would generalize across domains, such as different scanners or scanning protocols. To address this problem, we investigate classification of COPD in a multi-center dataset with a total of 803 scans from three different centers and four different scanners, with heterogeneous subject distributions. Our method is based on Gaussian texture features and a weighted logistic classifier, which increases the weights of samples similar to the test data. We show that Gaussian texture features outperform intensity features previously used in multi-center classification tasks. We also show that a weighting strategy based on a classifier trained to discriminate between scans from different domains can further improve the results. To encourage further research into transfer learning methods for classification of COPD, upon acceptance of the paper we will release two feature datasets used in this study at http://bigr.nl/research/projects/copd
    Comment: Accepted at Journal of Biomedical and Health Informatics.
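    The weighting strategy described, training a classifier to discriminate between scans from different domains and reweighting accordingly, is an instance of importance weighting for covariate shift. A minimal sketch under that reading follows; details such as feature extraction and the exact weight formula are assumptions, not the paper's code.

        # Sketch of domain-discriminator importance weighting; assumed details.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def domain_weighted_classifier(X_src, y_src, X_tgt_unlabeled):
            # 1) Discriminator between domains: 0 = source scans, 1 = target scans.
            X_dom = np.vstack([X_src, X_tgt_unlabeled])
            y_dom = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt_unlabeled))])
            disc = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)

            # 2) Upweight source samples that look like the target domain:
            #    w(x) ~ P(target | x) / P(source | x).
            p_tgt = disc.predict_proba(X_src)[:, 1]
            w = p_tgt / np.clip(1.0 - p_tgt, 1e-6, None)

            # 3) Weighted logistic classifier for the actual COPD label.
            return LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=w)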

    MACHINE LEARNING APPLICATIONS TO DATA RECONSTRUCTION IN MARINE BIOGEOCHEMISTRY.

    Driven by the increase of greenhouse gas emissions, climate change is causing significant shifts in the Earth's climatic patterns, profoundly affecting our oceans. In recent years, our capacity to monitor and understand the state and variability of the ocean has been significantly enhanced thanks to improved observational capacity, new data-driven approaches, and advanced computational capabilities. Contemporary marine analyses typically integrate multiple data sources: numerical models, satellite data, autonomous instruments, and ship-based measurements. Temperature, salinity, and several other essential ocean variables, such as oxygen, chlorophyll, and nutrients, are among the most frequently monitored variables. Each of these sources and variables, while providing valuable insights, has distinct limitations in terms of uncertainty, spatial and temporal coverage, and resolution. The application of deep learning offers a promising avenue for addressing challenges in data prediction, notably in data reconstruction and interpolation, thus enhancing our ability to monitor and understand the ocean. This thesis proposes and evaluates the performance of a variety of neural network architectures, examining the intricate relationship between methods, ocean data sources, and challenges. A special focus is given to the biogeochemistry of the Mediterranean Sea. A primary objective is predicting low-sampled biogeochemical variables from high-sampled ones. For this purpose, two distinct deep learning models have been developed, each specifically tailored to the dataset used for training. Addressing this challenge not only boosts our capability to predict biogeochemical variables in the highly heterogeneous Mediterranean Sea region but also increases the usefulness of observational systems such as the BGC-Argo floats. Additionally, a method is introduced to integrate BGC-Argo float observations with outputs from an existing deterministic marine ecosystem model, refining our ability to interpolate and reconstruct biogeochemical variables in the Mediterranean Sea. As the development of novel neural network methods progresses rapidly, the task of establishing benchmarks for data-driven ocean modeling is far from complete. This work offers insights into various applications, highlighting their strengths and limitations, and underscores the important relationship between methods and datasets.
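    The core prediction task, regressing a sparsely sampled biogeochemical variable from frequently sampled ones, can be sketched with a small feed-forward network on synthetic data. The architecture and the choice of inputs below are assumptions for illustration, not the thesis's tailored models.

        # Illustrative regressor on synthetic stand-ins for (T, S, O2, depth) -> nitrate.
        import numpy as np
        from sklearn.neural_network import MLPRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 4))                       # synthetic predictors
        y = X @ np.array([0.5, -0.3, 0.8, -1.2]) + 0.1 * rng.normal(size=2000)

        model = make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
        )
        model.fit(X[:1500], y[:1500])
        print("held-out R^2:", model.score(X[1500:], y[1500:]))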

    Model Selection for Stochastic Block Models

    As a flexible representation for complex systems, networks (graphs) model entities and their interactions as nodes and edges. In many real-world networks, nodes divide naturally into functional communities, where nodes in the same group connect to the rest of the network in similar ways. Discovering such communities is an important part of modeling networks, as community structure offers clues to the processes which generated the graph. The stochastic block model is a popular network model based on community structure. It splits nodes into blocks, within which all nodes are stochastically equivalent in terms of how they connect to the rest of the network. As a generative model, it has a well-defined likelihood function with consistent parameter estimates. It is also highly flexible, capable of modeling a wide variety of community structures, including degree-specific and overlapping communities. The performance of different block models varies across scenarios, so picking the right model is crucial for successful network modeling. A good model choice should balance the trade-off between complexity and fit; the task of model selection is to automatically choose such a model given the data and the inference task. As a problem of wide interest, numerous statistical model selection techniques have been developed for classic independent data. Unfortunately, it has been a common mistake to use these techniques on block models without rigorously examining their derivations, ignoring the fact that some of their fundamental assumptions are violated when moving into the domain of relational data sets such as networks. In this dissertation, I thoroughly examine the literature on statistical model selection techniques, including both frequentist and Bayesian approaches. My goal is to develop principled statistical model selection criteria for block models by adapting classic methods to network data. I do this by running bootstrap simulations with an efficient algorithm and correcting classic model selection theories for block models based on the simulation data. The new model selection methods are verified on both synthetic and real-world data sets.
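    The object being selected over is the Bernoulli stochastic block model, whose profile log-likelihood has a simple closed form once nodes are assigned to blocks. A minimal sketch follows; it is illustrative only, since the dissertation's contribution is the bootstrap correction built around such likelihoods, not this function itself.

        # Profile log-likelihood of a Bernoulli SBM for a given block assignment.
        import numpy as np

        def sbm_loglik(A, z, K):
            """A: symmetric 0/1 adjacency matrix; z: array of block labels in {0..K-1}."""
            ll = 0.0
            for r in range(K):
                for s in range(r, K):
                    mask = np.outer(z == r, z == s)
                    if r == s:
                        mask = np.triu(mask, k=1)            # count each dyad once
                    n_pairs = mask.sum()
                    if n_pairs == 0:
                        continue
                    n_edges = A[mask].sum()
                    p = np.clip(n_edges / n_pairs, 1e-9, 1 - 1e-9)  # block density MLE
                    ll += n_edges * np.log(p) + (n_pairs - n_edges) * np.log(1 - p)
            return ll

    A parametric bootstrap for model selection then simulates networks from the fitted K-block model, refits the candidate models on each simulated network, and compares the observed likelihood gap against that simulated null distribution rather than against a classic chi-squared reference.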

    Doctor of Philosophy

    It is imperative to obtain a complete network graph of at least one representative retina if we are to fully understand vertebrate vision, and synaptic connectomics endeavors to construct such graphs. Though previously prevented by hardware and software limitations, connectome assembly and analysis are now permitted by the creation of customized viewing and analysis software, affordable data storage, and advances in electron imaging platform control. The optimal strategy for building complete connectomes utilizes automated transmission electron imaging at 2 nm or better resolution, molecular tags for cell identification, open-access data volumes for navigation, and annotation with open-source tools to build three-dimensional cell libraries, complete network diagrams, and connectivity databases. Within a few years, the first retinal connectome analyses have revealed that many well-studied cells participate in much richer networks than expected. Collectively, these results impel a refactoring of the inner plexiform layer, while providing proof of concept for connectomics as a game-changing approach for a new era of scientific discovery.