Bayesian network structure learning in the presence of data noise

Abstract

A Bayesian Network (BN) is a type of probabilistic graphical model that captures conditional and marginal independencies between variables. These models are generally represented by a Directed Acyclic Graph (DAG) composed of nodes and arcs, where the nodes represent variables and the absence of arcs represents conditional or marginal independencies. When BNs are applied to real-world problems, the structure of these models is often assumed to be causal (often referred to as a causal BN) and is typically constructed from either expert knowledge or Randomised Controlled Trials (RCTs). However, these two approaches can be time-consuming and expensive, and it might not always be possible or ethical to perform RCTs. As a result, structure learning algorithms that recover graphical structures from observational data, which in turn could be used to inform causal structures, have received increasing attention over the past few decades. To guarantee the correctness of a structure learnt from data, a structure learning algorithm must rely on assumptions that may not hold in practice. One such crucial and commonly used assumption is that the observed data are independently and identically sampled from the underlying distribution, such that all statistical quantities of the distribution can be recovered without bias from the observed data as the sample size goes to infinity. While such assumptions are often needed to devise theoretical guarantees, the impact of violating them when working with real data tends to be overlooked. Empirical investigations show that structure learning algorithms perform considerably worse on noisy data that violate many of their theoretical assumptions, relative to how they perform on clean synthetic data that satisfy their data-generating assumptions. However, there has been limited research on how to deal with these problems effectively and efficiently.
This thesis investigates this research direction and primarily focuses on improving structure learning in the presence of measurement error and systematic missing data, two of the most common types of data noise present in real data sets.
