Approximate inference methods in probabilistic machine learning and Bayesian statistics
This thesis develops new methods for efficient approximate inference in probabilistic models. Such models are routinely used across many fields, yet they remain computationally challenging because they involve high-dimensional integrals. We propose several approximate inference approaches that address key challenges in probabilistic machine learning and Bayesian statistics. First, we present a Bayesian framework for genome-wide inference of DNA methylation levels and devise an efficient particle filtering and smoothing algorithm that can be used to identify differentially methylated regions between case and control groups. Second, we present a scalable inference approach for state space models that combines variational methods with sequential Monte Carlo sampling. The method is applied to self-exciting point process models that allow for flexible dynamics in the latent intensity function. Third, a new variational density motivated by copulas is developed. This new variational family can be beneficial compared with Gaussian approximations, as illustrated on examples with Bayesian neural networks. Lastly, we make progress on gradient-based adaptation of Hamiltonian Monte Carlo samplers by maximizing an approximation of the proposal entropy.
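The particle filtering component can be illustrated in miniature. The sketch below is a generic bootstrap particle filter for a simple linear-Gaussian state space model, not the thesis's methylation model; the model, parameters, and function name are invented for the example.

```python
import numpy as np

def bootstrap_particle_filter(ys, n_particles=500, phi=0.9, q=0.5, r=1.0, seed=0):
    """Bootstrap particle filter for x_t = phi*x_{t-1} + N(0, q^2),
    y_t = x_t + N(0, r^2): propagate, weight, resample."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n_particles)              # initial particle cloud
    means = []
    for y in ys:
        x = phi * x + rng.normal(0.0, q, n_particles)  # propagate through dynamics
        logw = -0.5 * ((y - x) / r) ** 2               # Gaussian observation log-weights
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * x))                    # filtering mean estimate
        x = rng.choice(x, size=n_particles, p=w)       # multinomial resampling
    return np.array(means)

# simulate observations from the same model and run the filter
rng = np.random.default_rng(1)
x_true, ys = 0.0, []
for _ in range(50):
    x_true = 0.9 * x_true + rng.normal(0.0, 0.5)
    ys.append(x_true + rng.normal(0.0, 1.0))
filtered = bootstrap_particle_filter(np.array(ys))
```

Smoothing variants run a backward pass over the stored particles; the forward filter above is the common core.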
Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain
The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units located in Portugal is established using Stochastic Frontier Analysis. This methodology allows us to discriminate between measurement error and systematic inefficiency in the estimation process, making it possible to investigate the main causes of inefficiency. Several suggestions for improving efficiency are offered for each hotel studied.
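The key feature of Stochastic Frontier Analysis, decomposing the error into symmetric noise and one-sided inefficiency, can be sketched with the classic half-normal frontier model of Aigner, Lovell and Schmidt, fitted by maximum likelihood on simulated data (an illustrative sketch in Python; the data and parameters are invented, not the hotel data from the paper):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(params, y, X):
    """Negative log-likelihood of the half-normal stochastic frontier model
    y = X @ beta + v - u, with v ~ N(0, sigma_v^2) and u ~ |N(0, sigma_u^2)|."""
    k = X.shape[1]
    beta, sv, su = params[:k], np.exp(params[k]), np.exp(params[k + 1])
    sigma = np.sqrt(sv ** 2 + su ** 2)
    lam = su / sv                                # inefficiency-to-noise ratio
    eps = y - X @ beta                           # composed error
    ll = (np.log(2.0 / sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))
    return -ll.sum()

# simulate a small production frontier and recover its slope by ML
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = np.abs(rng.normal(0.0, 0.6, n))              # one-sided inefficiency
v = rng.normal(0.0, 0.3, n)                      # symmetric measurement noise
y = X @ np.array([1.0, 0.8]) + v - u
res = minimize(neg_loglik, x0=np.array([0.0, 0.0, -1.0, -1.0]),
               args=(y, X), method="Nelder-Mead", options={"maxiter": 4000})
beta_hat = res.x[:2]
```

The estimated ratio of the two variance components is what lets SFA attribute part of each unit's shortfall from the frontier to systematic inefficiency rather than noise.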
Machine learning approach to reconstructing signalling pathways and interaction networks in biology
In this doctoral thesis, I present my research into applying machine learning techniques
for reconstructing species interaction networks in ecology, reconstructing molecular
signalling pathways and gene regulatory networks in systems biology, and inferring
parameters in ordinary differential equation (ODE) models of signalling pathways.
Together, the methods I have developed for these applications demonstrate the usefulness
of machine learning for reconstructing networks and inferring network parameters
from data.
The thesis consists of three parts. The first part is a detailed comparison of applying
static Bayesian networks, relevance vector machines, and linear regression with L1
regularisation (LASSO) to the problem of reconstructing species interaction networks
from species absence/presence data in ecology (Faisal et al., 2010). I describe how I
generated data from a stochastic population model to test the different methods and
how the simulation study led us to introduce spatial autocorrelation as an important
covariate. I also show how we used the results of the simulation study to apply the
methods to presence/absence data of bird species from the European Bird Atlas.
The second part of the thesis describes a time-varying, non-homogeneous dynamic
Bayesian network model for reconstructing signalling pathways and gene regulatory
networks, based on Lèbre et al. (2010). I show how my work has extended this model
to incorporate different types of hierarchical Bayesian information sharing priors and
different coupling strategies among nodes in the network. The introduction of these
priors reduces the inference uncertainty by putting a penalty on the number of structure
changes among network segments separated by inferred changepoints (Dondelinger
et al., 2010; Husmeier et al., 2010; Dondelinger et al., 2012b). Using both synthetic
and real data, I demonstrate that using information sharing priors leads to a better reconstruction
accuracy of the underlying gene regulatory networks, and I compare the
different priors and coupling strategies. I show the results of applying the model to
gene expression datasets from Drosophila melanogaster and Arabidopsis thaliana, as
well as to a synthetic biology gene expression dataset from Saccharomyces cerevisiae.
In each case, the underlying network is time-varying; for Drosophila melanogaster, as
a consequence of measuring gene expression during different developmental stages;
for Arabidopsis thaliana, as a consequence of measuring gene expression for circadian
clock genes under different conditions; and for the synthetic biology dataset, as
a consequence of changing the growth environment. I show that in addition to inferring
sensible network structures, the model also successfully predicts the locations of changepoints.
The third and final part of this thesis is concerned with parameter inference in
ODE models of biological systems. This problem is of interest to systems biology
researchers, as kinetic reaction parameters can often not be measured, or can only be
estimated imprecisely from experimental data. Due to the cost of numerically solving
the ODE system after each parameter adaptation, this is a computationally challenging
problem. Gradient matching techniques circumvent this problem by directly fitting the
derivatives of the ODE to the slope of an interpolant. I present an inference procedure
for a model using nonparametric Bayesian statistics with Gaussian processes, based
on Calderhead et al. (2008). I show that the new inference procedure improves on
the original formulation in Calderhead et al. (2008) and I present the result of applying
it to ODE models of predator-prey interactions, a circadian clock gene, a signal
transduction pathway, and the JAK/STAT pathway.
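The gradient matching idea can be shown on a toy predator-prey model. The sketch below substitutes a cubic-spline interpolant for the Gaussian processes used in the thesis (a deliberate simplification): fit an interpolant to noisy observations, differentiate it, and choose ODE parameters so the right-hand side matches the interpolant's slopes, with no ODE solves inside the optimisation loop. All data and parameters are invented for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.interpolate import CubicSpline
from scipy.optimize import least_squares

def lv(t, z, a, b, c, d):
    """Lotka-Volterra predator-prey right-hand side."""
    x, y = z
    return [a * x - b * x * y, c * x * y - d * y]

# simulate noisy observations from known parameters
true = (1.0, 0.5, 0.3, 0.8)
ts = np.linspace(0.0, 10.0, 60)
sol = solve_ivp(lv, (0.0, 10.0), [2.0, 1.0], t_eval=ts, args=true, rtol=1e-8)
rng = np.random.default_rng(0)
obs = sol.y + rng.normal(0.0, 0.01, sol.y.shape)

# interpolate each state and differentiate the interpolant
splines = [CubicSpline(ts, obs[i]) for i in range(2)]
states = np.array([s(ts) for s in splines])
dstates = np.array([s(ts, 1) for s in splines])      # interpolant slopes

def residual(theta):
    """Mismatch between interpolant slopes and the ODE right-hand side."""
    rhs = np.array([lv(t, states[:, i], *theta) for i, t in enumerate(ts)]).T
    return (dstates - rhs).ravel()

fit = least_squares(residual, x0=[0.5, 0.5, 0.5, 0.5])
```

Replacing the spline with a Gaussian process, as in Calderhead et al. (2008) and this thesis, additionally propagates the interpolation uncertainty into the parameter posterior.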
Scalable Tools for Information Extraction and Causal Modeling of Neural Data
Over the past 20 years, systems neuroscience has entered an era that one might call "large-scale systems neuroscience". From tuning curves and single-neuron recordings, there has been a conceptual shift towards a more holistic understanding of how neural circuits work and, as a result, how their representations produce neural tunings.
With the introduction of a plethora of datasets across scales, modalities, animals, and systems, we as a community have witnessed invaluable insights gained from the collective view of a neural circuit that were not possible with small-scale experimentation. The concurrence of advances in neural recording, such as wide-field imaging technologies and Neuropixels probes, with developments in statistical machine learning, and specifically deep learning, has brought systems neuroscience one step closer to data science. With this abundance of data, the need to develop computational models has become crucial. We need to make sense of the data, and thus we need to build models that are constrained by an appropriate amount of biological detail and probe those models in search of neural mechanisms.
This thesis consists of sections covering a wide range of ideas from computer vision, statistics, machine learning, and dynamical systems. All of these ideas share a common purpose: to help automate the neuroscientific experimentation process at different levels. In chapters 1, 2, and 3, I develop tools that automate the extraction of useful information from raw neuroscience data in the model organism C. elegans. The goal is to avoid manual labor and pave the way for high-throughput data collection, aiming at better quantification of variability across the population of worms. Due to its high level of structural and functional stereotypy, and its relative simplicity, the nematode C. elegans has been an attractive model organism for systems and developmental research. With 383 neurons in males and 302 neurons in hermaphrodites, the positions and functions of neurons are remarkably conserved across individuals. Furthermore, C. elegans remains the only organism for which a complete cellular, lineage, and anatomical map of the entire nervous system has been described for both sexes. Here, I describe the analysis pipeline that we developed for the recently proposed NeuroPAL technique in C. elegans. Our pipeline consists of atlas building (chapter 1), registration, segmentation, neural tracking (chapter 2), and signal extraction (chapter 3). I emphasize that organizing the analysis techniques as a pipeline consisting of the above steps is general and can be applied to virtually every animal model and emerging imaging modality. I use the language of probabilistic generative modeling and graphical models to communicate the ideas in a rigorous form, so some familiarity with those concepts will help the reader navigate the chapters of this thesis more easily.
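A toy version of the atlas-plus-labelling idea: model each neuron's canonical position as a Gaussian, then identify detected cells in a new animal by a globally optimal assignment under the atlas likelihood. Everything here (positions, covariance, sizes) is synthetic and only illustrates the probabilistic-atlas principle, not the NeuroPAL pipeline itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import multivariate_normal

# hypothetical atlas: one Gaussian position distribution per neuron
rng = np.random.default_rng(0)
n_neurons = 30
atlas_means = rng.uniform(0.0, 100.0, size=(n_neurons, 3))  # canonical positions
atlas_cov = 4.0 * np.eye(3)                                 # shared positional variability

# a new "worm": the atlas positions, jittered and randomly permuted
perm = rng.permutation(n_neurons)
detected = atlas_means[perm] + rng.normal(0.0, 1.0, size=(n_neurons, 3))

# cost(i, j) = negative log-likelihood of detection i under atlas neuron j
cost = np.array([[-multivariate_normal.logpdf(d, mean=m, cov=atlas_cov)
                  for m in atlas_means] for d in detected])
rows, cols = linear_sum_assignment(cost)   # globally optimal joint labelling
accuracy = np.mean(cols == perm)
```

Solving the assignment jointly, rather than labelling each cell by its nearest atlas neuron independently, is what keeps the labelling consistent when neighbouring position distributions overlap.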
In chapters 4 and 5, I build models that aim to automate hypothesis testing and causal interrogation of neural circuits. The notion of functional connectivity (FC) has been instrumental in our understanding of how information propagates in a neural circuit. An important limitation, however, is that current techniques do not distinguish causal connections from purely functional connections with no mechanistic correspondence. I start chapter 4 by introducing causal inference as a unifying language for the following chapters. In chapter 4, I define the notion of interventional connectivity (IC) as a way to summarize the effect of stimulation on a neural circuit, providing a more mechanistic description of information flow. I then investigate which functional connectivity metrics are most predictive of IC in simulations and real data. Following this framework, I discuss how stimulations and interventions can be used to improve the fitting and generalization properties of time series models. Building on the literature on model identification and active causal discovery, I develop a switching time series model and a method for finding stimulation patterns that help the model generalize to the vicinity of the observed neural trajectories. Finally, in chapter 5, I develop a new FC metric that separates the information transferred from one variable to another into unique and synergistic sources.
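The FC-versus-IC distinction can be seen in a three-node toy network (a linear system invented for this illustration, unrelated to the thesis's circuit models): zero-lag correlation reports a strong 0-2 "connection" that exists only through the intermediate node, whereas a paired pulse intervention recovers the true direct weights.

```python
import numpy as np

rng = np.random.default_rng(0)
# ground-truth linear network: 0 -> 1 -> 2, no direct 0 -> 2 connection
A = np.array([[0.9, 0.0, 0.0],
              [0.8, 0.5, 0.0],
              [0.0, 0.8, 0.5]])

# observational recording: x_{t+1} = A x_t + noise
T, x = 20000, np.zeros(3)
xs = np.empty((T, 3))
for t in range(T):
    x = A @ x + rng.normal(0.0, 0.1, 3)
    xs[t] = x
fc = np.corrcoef(xs.T)          # correlation-based functional connectivity

def interventional_effect(j, amp=5.0, reps=3000):
    """Average response one step after a pulse on node j; the paired
    baseline run cancels the shared trajectory, leaving ~A[:, j]."""
    diffs = np.zeros(3)
    for _ in range(reps):
        x0 = rng.normal(0.0, 1.0, 3)
        u = np.zeros(3); u[j] = amp
        x1s = A @ x0 + u + rng.normal(0.0, 0.1, 3)   # stimulated step
        x1b = A @ x0 + rng.normal(0.0, 0.1, 3)       # baseline step
        x2s = A @ x1s + rng.normal(0.0, 0.1, 3)
        x2b = A @ x1b + rng.normal(0.0, 0.1, 3)
        diffs += (x2s - x2b) / amp
    return diffs / reps

ic = np.column_stack([interventional_effect(j) for j in range(3)])
```

Here `fc[0, 2]` is large even though `A[2, 0]` is zero, while the interventional estimate `ic` approximates the columns of `A`: the same dissociation that motivates defining IC alongside FC.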
In all projects, I have abstracted out concepts that are specific to the datasets at hand and developed the methods in the most general form. This makes the presented methods applicable to a broad range of datasets, potentially leading to new findings. In addition, all projects are accompanied with extensible and documented code packages, allowing theorists to repurpose the modules for novel applications and experimentalists to run analysis on their datasets efficiently and scalably.
In summary, my main contributions in this thesis are the following:
1) Building the first atlases of hermaphrodite and male C. elegans and developing a generic statistical framework for constructing atlases for a broad range of datasets.
2) Developing a semi-automated analysis pipeline for neural registration, segmentation, and tracking in C. elegans.
3) Extending the framework of non-negative matrix factorization to datasets with deformable motion and developing algorithms for joint tracking and signal demixing from videos of semi-immobilized C. elegans.
4) Defining the notion of interventional connectivity (IC) as a way to summarize the effect of stimulation in a neural circuit and investigating which functional connectivity metrics are best predictive of IC in simulations and real data.
5) Developing a switching time series model and a method for finding stimulation patterns that help the model to generalize to the vicinity of the observed neural trajectories.
6) Developing a new functional connectivity metric that separates the transferred information from one variable to the other into unique and synergistic sources.
7) Implementing extensible, well-documented, open-source code packages for each of the above contributions.
Advances in approximate Bayesian computation and trans-dimensional sampling methodology
Bayesian statistical models continue to grow in complexity, driven
in part by a few key factors: the massive computational resources
now available to statisticians; the substantial gains made in
sampling methodology and algorithms such as Markov chain
Monte Carlo (MCMC), trans-dimensional MCMC (TDMCMC), sequential
Monte Carlo (SMC), adaptive algorithms, stochastic
approximation methods, and approximate Bayesian computation (ABC);
and development of more realistic models for real world phenomena
as demonstrated in this thesis for financial models and
telecommunications engineering. Sophisticated statistical models
are increasingly proposed for practical solutions to real world problems in order to better capture salient features of
increasingly more complex data. With sophistication comes a
parallel requirement for more advanced and automated statistical
computational methodologies.
The key focus of this thesis revolves around innovation related to
the following three significant Bayesian research questions.
1. How can one develop practically useful Bayesian models and corresponding computationally efficient sampling methodology, when the likelihood model is intractable?
2. How can one develop methodology in order to automate Markov chain Monte Carlo sampling approaches to efficiently explore the support of a posterior distribution, defined across multiple Bayesian statistical models?
3. How can these sophisticated Bayesian modelling frameworks and sampling methodologies be utilized to solve practically relevant and important problems in the research fields of financial risk modeling and telecommunications engineering?
This thesis is split into three bodies of work represented in
three parts. Each part contains journal papers with novel
statistical model and sampling methodological development. The
coherent link between each part involves the novel
sampling methodologies developed in Part I and utilized in Part II and Part III. Papers contained in
each part make progress at addressing the core research
questions posed.
Part I of this thesis presents generally applicable key
statistical sampling methodologies that will be utilized and
extended in the subsequent two parts. In particular it presents
novel developments in statistical methodology pertaining to
likelihood-free or ABC and TDMCMC methodology.
The TDMCMC methodology focuses on several aspects of automation
in the between model proposal construction, including
approximation of the optimal between model proposal kernel via a
conditional path sampling density estimator. Then this methodology
is explored for several novel Bayesian model selection
applications, including cointegrated vector autoregression (CVAR)
models and mixture models in which there is an unknown number of
mixture components. The second area relates to development of
ABC methodology with particular focus
on SMC Samplers methodology in an ABC context via Partial
Rejection Control (PRC). In addition to novel algorithmic
development, key theoretical properties are also studied for the
classes of algorithms developed. Then this methodology is
developed for a highly challenging and practically significant
application relating to multivariate Bayesian α-stable
models.
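The likelihood-free setting these methods target can be illustrated with the simplest member of the ABC family, plain rejection ABC, on a toy problem where the "intractable" likelihood is replaced by a simulator (an illustrative Python sketch; the SMC-sampler and partial rejection control machinery developed in Part I builds on this basic accept/reject step):

```python
import numpy as np

def abc_rejection(y_obs, prior_draws, simulate, summary, eps):
    """Rejection ABC: keep the prior draws whose simulated summary lands
    within eps of the observed summary; no likelihood evaluations needed."""
    s_obs = summary(y_obs)
    kept = [th for th in prior_draws if abs(summary(simulate(th)) - s_obs) < eps]
    return np.array(kept)

rng = np.random.default_rng(0)
y_obs = rng.normal(2.0, 1.0, 100)                # data generated with mu = 2
prior = rng.uniform(-5.0, 5.0, 20000)            # flat prior on mu
simulate = lambda mu: rng.normal(mu, 1.0, 100)   # simulator replaces the likelihood
posterior = abc_rejection(y_obs, prior, simulate, np.mean, eps=0.2)
```

The accepted draws approximate the posterior up to the tolerance `eps`; SMC-based ABC samplers sharpen this by moving a particle population through a decreasing sequence of tolerances instead of rejecting from the prior directly.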
Then Part II focuses on novel statistical model development
in the areas of financial risk and non-life insurance claims
reserving. In each of the papers in this part the focus is on
two aspects: foremost the development of novel statistical models
to improve the modeling of risk and insurance; and then the
associated problem of how to fit and sample from such statistical
models efficiently. In particular novel statistical models are
developed for Operational Risk (OpRisk) under a Loss Distributional
Approach (LDA) and for claims reserving in Actuarial non-life
insurance modelling. In each case the models developed include an
additional level of complexity which adds flexibility to the model
in order to better capture salient features observed in real data.
The consequence of the additional complexity comes at the cost
that standard fitting and sampling methodologies are generally not
applicable, as a result one is required to develop and apply the
methodology from Part I.
Part III focuses on novel statistical model development
in the area of statistical signal processing for wireless
communications engineering. Statistical models will be developed
or extended for two general classes of wireless communications
problem: the first relates to detection of transmitted symbols and
joint channel estimation in Multiple Input Multiple Output (MIMO)
systems coupled with Orthogonal Frequency Division Multiplexing
(OFDM); the second relates to co-operative wireless communications
relay systems in which the key focus is on detection of
transmitted symbols. Both these areas will require advanced
sampling methodology developed in Part I to find solutions to
these real-world engineering problems.
Mathematical modelling of the floral transition — with a Bayesian flourish —
Flowering plants are abundant on Earth. In the model dicot plant species, Arabidopsis thaliana, multiple endogenous and exogenous signals converge to initiate a change from vegetative to reproductive growth in optimal environmental conditions. Much genetic and experimental research has gone into elucidating the biological mechanisms
controlling the floral transition. However, there has been little mathematical modelling of this process.
The aim of this thesis was to gain an understanding of the essential features and dynamic properties underlying this developmental phase change from a systems and computational biology perspective. Combining mathematical modelling with experimental results, a core regulatory network was defined with multiple feedback loops. Simplified models inevitably miss finer details of the biological system, yet they provide a route to understanding the overall system behaviour. This reductionist path allowed a tractable genetic regulatory network to be investigated without large numbers of parameters.
Avoiding overfitting to data and inferring parameters are two current challenges in systems biology. Treating all unknowns as probabilities within the statistical framework of Bayes' theorem offers a solution to both of these issues. This thesis investigates the use of a contemporary Bayesian inference algorithm, nested sampling, for inference problems typically found in systems biology, where the data are few and noisy. Nested sampling simultaneously calculates the key term for model comparison and produces parameter inferences, allowing uncertainty in models and predictions to be robustly quantified.
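Nested sampling itself can be sketched in a few lines. The toy below (a hypothetical one-parameter problem, not one of the thesis's network models) uses naive rejection from the prior for the likelihood-constrained draw, which practical implementations replace with more efficient constrained samplers; the evidence of a sharply peaked Gaussian likelihood under a U(-1, 1) prior is known analytically (Z ≈ 0.5), giving a check on the estimate.

```python
import numpy as np
from scipy.special import logsumexp

def log_like(theta):
    """Sharply peaked Gaussian likelihood on a U(-1, 1) prior."""
    return -0.5 * (theta / 0.1) ** 2 - 0.5 * np.log(2 * np.pi * 0.1 ** 2)

def nested_sampling(n_live=200, n_iter=1200, seed=0):
    """Minimal nested sampling: discard the worst live point, credit its
    likelihood with the prior volume shed (X shrinks by ~e^{-1/n_live}),
    and replace it with a prior draw above the current likelihood floor."""
    rng = np.random.default_rng(seed)
    live = rng.uniform(-1.0, 1.0, n_live)
    logl = log_like(live)
    log_z, log_x = -np.inf, 0.0
    for _ in range(n_iter):
        worst = np.argmin(logl)
        log_x_new = log_x - 1.0 / n_live
        log_w = np.log(np.exp(log_x) - np.exp(log_x_new)) + logl[worst]
        log_z = np.logaddexp(log_z, log_w)   # accumulate evidence
        log_x = log_x_new
        floor = logl[worst]
        while True:                          # naive rejection from the prior
            cand = rng.uniform(-1.0, 1.0)
            if log_like(cand) > floor:
                live[worst], logl[worst] = cand, log_like(cand)
                break
    # credit the remaining live points with the leftover volume
    return np.logaddexp(log_z, log_x + logsumexp(logl) - np.log(n_live))

log_z = nested_sampling()
# analytic check: Z = integral of N(theta; 0, 0.1) * (1/2) over [-1, 1], about 0.5
```

The by-product of the same run is the sequence of discarded points with their weights, which is exactly the posterior sample used for parameter inference, the dual role noted above.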
Network models are developed that accurately reproduce experimental leaf-number data, exhibit important properties of the floral transition such as the ability to filter environmental noise, and provide a clue to the spatial patterning of the Arabidopsis shoot apex. Incorporating network knowledge into plant breeding programmes is an exciting goal for future developments addressing global food security.