Appropriate, accessible and appealing probabilistic graphical models
Appropriate - Many multivariate probabilistic models either use independent distributions or dependent Gaussian distributions. Yet many real-world datasets contain count-valued or non-negative skewed data, e.g. bag-of-words text data and biological sequencing data. Thus, we develop novel probabilistic graphical models for count-valued and non-negative data, including Poisson graphical models and multinomial graphical models. We develop one generalization that allows for triple-wise or k-wise interactions, going beyond the usual pairwise formulation. Furthermore, we also explore Gaussian-copula graphical models and derive closed-form solutions for the conditional and marginal distributions (both before and after conditioning). Finally, we derive mixture and admixture (topic model) generalizations of these graphical models to add modeling power and interpretability.
Accessible - Previous multivariate models, especially for text data, often have complex dependencies without a closed form and require complex inference algorithms with limited theoretical justification. For example, hierarchical Bayesian models often require marginalizing over many latent variables. We show that our novel graphical models (even the k-wise interaction models) have simple and intuitive estimation procedures based on node-wise regressions that likely enjoy theoretical guarantees similar to those established in previous work on graphical models. For the copula-based graphical models, we show that simple approximations can still provide useful models; these copula models also come with closed-form conditional and marginal distributions, which make them amenable to exploratory inspection and manipulation. The parameters of these models are easy to interpret and thus may be accessible to a wide audience.
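The node-wise estimation idea can be made concrete. The sketch below regresses each count-valued node on all the others with a Poisson GLM (log link) fitted by plain gradient ascent, reading edge weights off the coefficients; the function name, the standardisation step, and the step sizes are illustrative assumptions, not the thesis's actual estimator.

```python
import numpy as np

def nodewise_poisson_fit(X, lr=0.05, n_iter=2000):
    """Node-wise regression for a count-data graphical model (sketch):
    for each node j, fit a Poisson GLM of X[:, j] on all other columns
    and store the coefficients as candidate edge weights in row j of B.
    """
    n, p = X.shape
    B = np.zeros((p, p))
    for j in range(p):
        others = [k for k in range(p) if k != j]
        Z = X[:, others]
        Z = (Z - Z.mean(0)) / (Z.std(0) + 1e-12)   # standardise predictors
        y = X[:, j]
        w, b = np.zeros(p - 1), float(np.log(y.mean() + 1e-12))
        for _ in range(n_iter):                     # gradient ascent on log-likelihood
            mu = np.exp(np.clip(Z @ w + b, -20.0, 20.0))
            w += lr * Z.T @ (y - mu) / n
            b += lr * float(np.mean(y - mu))
        B[j, others] = w                            # diagonal stays zero
    return B
```

A strongly dependent pair of columns should receive a much larger weight than an independent pair, which gives a quick sanity check of the procedure.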
Appealing - High-level visualization and interpretation of graphical models with as few as 100 variables have often been difficult even for a graphical model expert---despite visualization being one of the original motivators for graphical models. This difficulty is likely due to the lack of collaboration between graphical model experts and visualization experts. To begin bridging this gap, we develop a novel "what if?" interaction that manipulates and leverages the probabilistic power of graphical models. Our approach defines: the probabilistic mechanism via conditional probability; the query language to map text input to a conditional probability query; and the formal underlying probabilistic model. We then propose to visualize these query-specific probabilistic graphical models by combining the intuitiveness of force-directed layouts with the beauty and readability of word clouds, which pack many words into valuable screen space while ensuring words do not overlap via pixel-level collision detection. Although both the force-directed layout and the pixel-level packing problems are challenging in their own right, we approximate both simultaneously via adaptive simulated annealing starting from careful initialization. For visualizing mixture distributions, we also design a meaningful mapping from the properties of the mixture distribution to a color in the perceptually uniform CIELUV color space. Finally, we demonstrate our approach via illustrative visualizations of several real-world datasets.
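As a minimal illustration of the closed-form conditionals that power a "what if?" query, the sketch below conditions a multivariate Gaussian (e.g. the latent layer of a Gaussian-copula graphical model) on observed components using the standard Schur-complement formulas; the helper name and interface are hypothetical, not the thesis's implementation.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_obs, x_obs):
    """Closed-form conditional of a multivariate Gaussian given
    x[idx_obs] = x_obs: returns the indices of the free variables and
    their conditional mean and covariance (Schur complement)."""
    idx_free = [i for i in range(len(mu)) if i not in idx_obs]
    S_ff = Sigma[np.ix_(idx_free, idx_free)]
    S_fo = Sigma[np.ix_(idx_free, idx_obs)]
    S_oo = Sigma[np.ix_(idx_obs, idx_obs)]
    K = S_fo @ np.linalg.inv(S_oo)                 # regression coefficients
    mu_cond = mu[idx_free] + K @ (x_obs - mu[idx_obs])
    Sigma_cond = S_ff - K @ S_fo.T
    return idx_free, mu_cond, Sigma_cond
```

In a bivariate standard Gaussian with correlation 0.5, observing the second variable at 2 shifts the first variable's conditional mean to 1 and shrinks its variance to 0.75, matching the familiar regression formulas.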
Minimax rate for multivariate data under componentwise local differential privacy constraints
Our research delves into the balance between maintaining privacy and
preserving statistical accuracy when dealing with multivariate data that is
subject to componentwise local differential privacy (CLDP). With CLDP,
each component of the private data is made public through a separate privacy
channel. This allows for varying levels of privacy protection for different
components or for the privatization of each component by different entities,
each with their own distinct privacy policies. We develop general techniques
for establishing minimax bounds that shed light on the statistical cost of
privacy in this context, as a function of the privacy levels of the components. We demonstrate the versatility and efficiency
of these techniques by presenting various statistical applications.
Specifically, we examine nonparametric density and covariance estimation under
CLDP, providing upper and lower bounds that match up to constant factors, as
well as an associated data-driven adaptive procedure. Furthermore, we quantify
the probability of extracting sensitive information from one component by
exploiting the fact that, on another component which may be correlated with the
first, a smaller degree of privacy protection is guaranteed.
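A minimal concrete instance of the CLDP setting, assuming Laplace channels (the paper's channels need not be Laplace): each component of a record is released through its own channel, with noise scale calibrated to that component's privacy level, so a smaller epsilon means stronger privacy and more noise.

```python
import numpy as np

def cldp_release(x, epsilons, sensitivity=1.0, seed=0):
    """Componentwise local differential privacy via Laplace channels
    (illustrative sketch): component k of the record(s) x is perturbed
    with Laplace noise of scale sensitivity / epsilons[k], so different
    components -- possibly held by different entities -- can enforce
    different privacy policies."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    scales = sensitivity / np.asarray(epsilons, dtype=float)
    return x + rng.laplace(scale=scales, size=x.shape)
```

Privatising the same data under epsilon = 0.1 for one component and epsilon = 10 for another makes the statistical cost of the tighter privacy level directly visible in the per-component error.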
Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations
Institute for Adaptive and Neural Computation
Non-parametric models and techniques enjoy a growing popularity in the field of
machine learning, and among these Bayesian inference for Gaussian process (GP)
models has recently received significant attention. We feel that GP priors should
be part of the standard toolbox for constructing models relevant to machine
learning in the same way as parametric linear models are, and the results in this
thesis help to remove some obstacles on the way towards this goal.
In the first main chapter, we provide a distribution-free finite sample bound
on the difference between generalisation and empirical (training) error for GP
classification methods. While the general theorem (the PAC-Bayesian bound)
is not new, we give a much simplified and somewhat generalised derivation and
point out the underlying core technique (convex duality) explicitly. Furthermore,
the application to GP models is novel (to our knowledge). A central feature of
this bound is that its quality depends crucially on task knowledge being encoded
faithfully in the model and prior distributions, so there is a mutual benefit between
a sharp theoretical guarantee and empirically well-established statistical
practices. Extensive simulations on real-world classification tasks indicate an impressive
tightness of the bound, in spite of the fact that many previous bounds
for related kernel machines fail to give non-trivial guarantees in this practically
relevant regime.
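For orientation, a PAC-Bayesian bound of the common Langford--Seeger form can be inverted numerically to turn an empirical error and a KL divergence into a high-probability bound on generalisation error; this is a textbook variant, not the thesis's exact GP-classification statement.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_bound(emp_err, kl_qp, n, delta=0.05):
    """Invert the Langford--Seeger PAC-Bayesian inequality
        kl(emp_err || gen_err) <= (KL(Q||P) + ln((n+1)/delta)) / n
    by bisection (kl is increasing in its second argument above emp_err),
    yielding an upper bound on the generalisation error that holds with
    probability at least 1 - delta."""
    rhs = (kl_qp + math.log((n + 1) / delta)) / n
    lo, hi = emp_err, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_err, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi
```

As expected, the bound exceeds the empirical error and tightens as the sample size grows, which is the behaviour the simulations in the chapter quantify for GP classifiers.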
In the second main chapter, sparse approximations are developed to address
the problem of the unfavourable scaling of most GP techniques with large training
sets. Due to its high importance in practice, this problem has received a lot of attention
recently. We demonstrate the tractability and usefulness of simple greedy
forward selection with information-theoretic criteria previously used in active
learning (or sequential design) and develop generic schemes for automatic model
selection with many (hyper)parameters. We suggest two new generic schemes and
evaluate some of their variants on large real-world classification and regression
tasks. These schemes and their underlying principles (which are clearly stated
and analysed) can be applied to obtain sparse approximations for a wide regime
of GP models far beyond the special cases we studied here.
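The greedy forward-selection idea can be sketched with a simple entropy-style score: repeatedly add the input whose current GP posterior variance is largest. The RBF kernel, the score, and the function names below are simplifying assumptions standing in for the information-theoretic criteria developed in the thesis.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix between row sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def greedy_active_set(X, m, noise=1e-6):
    """Greedy forward selection of an m-point active set for a sparse GP:
    at each step, pick the input with the largest posterior variance given
    the points chosen so far (an entropy-style criterion), then update all
    posterior variances by conditioning on the enlarged set."""
    n = X.shape[0]
    K = rbf_kernel(X, X) + noise * np.eye(n)
    var = np.diag(K).copy()
    chosen = []
    for _ in range(m):
        j = int(np.argmax(var))
        chosen.append(j)
        S = K[np.ix_(chosen, chosen)]
        cross = K[:, chosen]
        var = np.diag(K) - np.einsum('ij,jk,ik->i',
                                     cross, np.linalg.inv(S), cross)
        var[chosen] = -np.inf        # never pick the same point twice
    return chosen
```

On a toy 1-D dataset with a tight cluster and one far-away point, the second selection jumps to the outlier, since conditioning on any cluster member collapses the variance of its neighbours but not of the distant input.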
Design of Physical System Experiments Using Bayes Linear Emulation and History Matching Methodology with Application to Arabidopsis Thaliana
There are many physical processes within our world which scientists aim to understand. Computer models representing these processes are fundamental to achieving such understanding. Bayes linear emulation is a powerful tool for comprehensively exploring the behaviour of computationally intensive models. History matching is a method for finding the set of inputs to a computer model for which the corresponding model outputs give acceptable matches to observed data, given our state of uncertainty regarding the model itself, the measurements, and, if used, the emulators representing the model. This thesis provides three major developments to the current methodology in this area. First, we develop sequential history matching methodology by splitting the available data into groups and gaining insight about the information obtained from each group; such insight is then realised through a wide array of novel visualisations. Second, we develop emulation techniques for the case when there are hypersurfaces of input space across which we have essentially perfect knowledge about the model's behaviour. Finally, we develop the use of history matching methodology as a criterion for the design of physical system experiments. We outline the general framework for design in a history matching setting before discussing many extensions, including a comprehensive robustness analysis of our design choice. We demonstrate our methodology on a model of hormonal crosstalk in the roots of an Arabidopsis plant.
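History matching is typically driven by an implausibility measure; the sketch below uses its textbook form, with a hypothetical one-dimensional emulator, to cut input space down to a non-implausible set. All names and the cutoff of 3 are conventional choices, not taken from the thesis.

```python
import numpy as np

def implausibility(z, emu_mean, emu_var, obs_var, disc_var):
    """Standard history-matching implausibility
        I(x) = |z - E[f(x)]| / sqrt(Var_emulator + Var_obs + Var_discrepancy);
    inputs x with I(x) above a cutoff (conventionally 3) are ruled out."""
    return np.abs(z - emu_mean) / np.sqrt(emu_var + obs_var + disc_var)

# Toy wave: a hypothetical emulator E[f(x)] = 3x on [0, 1], observation z = 1.5.
x = np.linspace(0.0, 1.0, 101)
emu_mean = 3.0 * x
emu_var = 0.01 * np.ones_like(x)
I = implausibility(z=1.5, emu_mean=emu_mean, emu_var=emu_var,
                   obs_var=0.01, disc_var=0.01)
non_implausible = x[I < 3.0]   # the inputs that survive this wave
```

Successive waves repeat this cut with refitted emulators, which is the mechanism the sequential methodology in the thesis builds upon.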
Fuelling the zero-emissions road freight of the future: routing of mobile fuellers
The future of zero-emissions road freight is closely tied to the sufficient availability of new and clean fuel options such as electricity and hydrogen. In goods distribution using Electric Commercial Vehicles (ECVs) and Hydrogen Fuel Cell Vehicles (HFCVs), a major challenge during the transition period is their limited autonomy together with scarce and unevenly distributed refuelling stations. One viable solution to facilitate and speed up the adoption of ECVs/HFCVs by logistics companies, however, is to bring the fuel to the point where it is needed (instead of diverting delivery vehicles to refuelling stations) using "Mobile Fuellers (MFs)". These are mobile battery-swapping/recharging vans or mobile hydrogen fuellers that can travel to a running ECV/HFCV at a rendezvous time and place to provide the fuel it requires to complete its delivery route. This presentation introduces new vehicle routing models for a third-party company that provides MF services. In the proposed problem variant, the MF provider receives the routing plans of multiple customer companies and must design routes for a fleet of capacitated MFs that synchronise with the running vehicles to deliver the required amount of fuel on the fly. The presentation concludes by discussing and comparing several mathematical models based on different business models and collaborative logistics scenarios.
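As a toy, single-period stand-in for the synchronised MF routing problem, the brute-force sketch below assigns each mobile fueller to one rendezvous point so that total straight-line travel is minimised; the function names, the one-rendezvous-per-MF simplification, and the Euclidean-distance assumption are all illustrative, and the actual models are capacitated, time-synchronised vehicle routing formulations.

```python
from itertools import permutations
from math import hypot

def assign_mobile_fuellers(mf_depots, rendezvous):
    """Assign each mobile fueller (one per rendezvous request) to a
    rendezvous point, minimising total Euclidean travel by brute force
    over all assignments -- only sensible for a handful of MFs."""
    best_cost, best_plan = float('inf'), None
    for perm in permutations(range(len(rendezvous))):
        cost = sum(hypot(mx - rendezvous[ri][0], my - rendezvous[ri][1])
                   for (mx, my), ri in zip(mf_depots, perm))
        if cost < best_cost:
            best_cost, best_plan = cost, perm
    return best_plan, best_cost
```

With two MFs at (0, 0) and (10, 0) and rendezvous points at (9, 0) and (1, 0), the optimal plan crosses the naive pairing and sends each MF to the nearer point, at a total cost of 2.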