154 research outputs found
Automatic Annotation of Spatial Expression Patterns via Sparse Bayesian Factor Models
Advances in reporters for gene expression have made it possible to document and quantify expression patterns in 2D–4D. In contrast to microarrays, which provide data for many genes but averaged and/or at low resolution, images reveal the high spatial dynamics of gene expression. Developing computational methods to compare, annotate, and model gene expression based on images is imperative, considering that available data are rapidly increasing. We have developed a sparse Bayesian factor analysis model in which the observed expression diversity among a large set of high-dimensional images is modeled by a small number of hidden common factors. We apply this approach to embryonic expression patterns from a Drosophila RNA in situ image database, and show that the automatically inferred factors provide a meaningful decomposition and represent common co-regulation or biological functions. The low-dimensional set of factor mixing weights is further used as features by a classifier to annotate expression patterns with functional categories. On human-curated annotations, our sparse approach reaches similar or better classification of expression patterns at different developmental stages, when compared to other automatic image annotation methods using thousands of hard-to-interpret features. Our study therefore outlines a general framework for large microscopy data sets, in which both the generative model itself and its application to analysis tasks such as automated annotation can provide insight into biological questions.
Sparse graphical models for cancer signalling
Protein signalling networks play a key role in cellular function, and their dysregulation is central to many diseases, including cancer. Recent advances in biochemical technology have begun to allow high-throughput, data-driven studies of signalling. In this thesis, we investigate multivariate statistical methods, rooted in sparse graphical models, aimed at probing questions in cancer signalling.
First, we propose a Bayesian variable selection method for identifying subsets of proteins that jointly influence an output of interest, such as drug response. Ancillary biological information is incorporated into inference using informative prior distributions. Prior information is selected and weighted in an automated manner using an empirical Bayes formulation. We present examples of informative pathway and network-based priors, and illustrate the proposed method on both synthetic and drug response data.
Second, we use dynamic Bayesian networks to perform structure learning of context-specific signalling network topology from proteomic time-course data. We exploit a connection between variable selection and network structure learning to efficiently carry out exact inference. Existing biology is incorporated using informative network priors, weighted automatically by an empirical Bayes approach. The overall approach is computationally efficient and essentially free of user-set parameters.
We show results from an empirical investigation, comparing the approach to several existing methods, and from an application to breast cancer cell line data. Hypotheses are generated regarding novel signalling links, some of which are validated by independent experiments.
Third, we describe a network-based clustering approach for the discovery of cancer subtypes that differ in terms of subtype-specific signalling network structure.
Model-based clustering is combined with penalised likelihood estimation of undirected graphical models to allow simultaneous learning of cluster assignments and cluster-specific network structure. Results are shown from an empirical investigation comparing several penalisation regimes, and an application to breast cancer proteomic data.
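The link between variable selection and network structure learning mentioned in the abstract above can be illustrated with a small sketch. It is not the thesis's method: it assumes a first-order linear-Gaussian dynamic Bayesian network, a flat prior over parent sets (the thesis uses informative, empirically weighted priors), a cap on in-degree, and a BIC approximation in place of a closed-form marginal likelihood. The names `bic_score`, `edge_posteriors`, and `max_parents` are hypothetical.

```python
import itertools
import numpy as np

def bic_score(y, x_parents):
    """BIC approximation to the log marginal likelihood of a linear-Gaussian
    regression of y on the given parent columns (plus an intercept)."""
    n = y.shape[0]
    design = np.column_stack([np.ones(n), x_parents])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    k = design.shape[1]
    return -0.5 * n * np.log(rss / n) - 0.5 * k * np.log(n)

def edge_posteriors(series, max_parents=2):
    """series: array of shape (T, p), one row per time point.
    Returns a (p, p) matrix whose entry [i, j] approximates the posterior
    probability of the edge i(t-1) -> j(t), averaging over all parent sets
    of size <= max_parents under a flat prior (an assumption of this sketch)."""
    past, present = series[:-1], series[1:]
    p = series.shape[1]
    subsets = [c for k in range(max_parents + 1)
               for c in itertools.combinations(range(p), k)]
    probs = np.zeros((p, p))
    for j in range(p):
        # Each target variable is an independent variable selection problem.
        scores = np.array([bic_score(present[:, j], past[:, list(c)])
                           for c in subsets])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        for w, c in zip(weights, subsets):
            for i in c:
                probs[i, j] += w
    return probs
```

Because the score factorises over target variables, the enumeration is exact for the restricted model space and grows polynomially in the number of variables for a fixed in-degree cap.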
Graphical Models for Multivariate Time-Series
Gaussian graphical models have received much attention in recent years, due
to their flexibility and expressive power. In particular, much interest has
been devoted to graphical models for temporal data, or dynamical graphical
models, to understand the relation of variables evolving in time. While powerful
in modelling complex systems, such models suffer from computational
issues both in terms of convergence rates and memory requirements, and may
fail to detect temporal patterns in case the information on the system is partial.
This thesis comprises two main contributions in the context of dynamical
graphical models, tackling these two aspects: the need for reliable and fast
optimisation methods and for greater modelling power, so that the model can
be retrieved in practical applications. The first contribution consists of a
forward-backward splitting (FBS) procedure for Gaussian graphical modelling
of multivariate time-series which relies on recent theoretical studies ensuring
global convergence under mild assumptions. Indeed, such FBS-based implementation
achieves, with fast convergence rates, optimal results with respect
to ground truth and standard methods for dynamical network inference. The
second main contribution focuses on the problem of latent factors that influence
the system while remaining hidden or unobservable. This thesis proposes the novel
latent variable time-varying graphical lasso method, which is able to take into
account both temporal dynamics in the data and latent factors influencing
the system. This is fundamental for the practical use of graphical models,
where the information on the data is partial. Indeed, extensive validation of
the method on both synthetic and real applications shows the effectiveness of
considering latent factors to deal with incomplete information.
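As a rough illustration of the forward-backward splitting idea mentioned in the abstract above, the sketch below applies it to the static (single time point) graphical lasso, not to the time-varying or latent-variable models of the thesis. The forward step is a gradient step on the smooth negative log-likelihood, and the backward step is soft-thresholding of the off-diagonal entries. The function names, the backtracking rule, and the default parameters are assumptions made for this example.

```python
import numpy as np

def soft_threshold(a, t):
    """Proximal operator of t * |.|, applied elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def smooth_loss(theta, s):
    """Negative Gaussian log-likelihood up to constants (theta must be PD)."""
    _, logdet = np.linalg.slogdet(theta)
    return -logdet + np.trace(s @ theta)

def fbs_graphical_lasso(s, lam, n_iter=200, step=1.0):
    """Minimise -logdet(theta) + tr(s @ theta) + lam * sum_{i != j} |theta_ij|
    by forward-backward splitting with backtracking on the step size."""
    p = s.shape[0]
    off = ~np.eye(p, dtype=bool)
    theta = np.diag(1.0 / np.diag(s))  # positive definite starting point
    for _ in range(n_iter):
        inv = np.linalg.inv(theta)
        grad = s - 0.5 * (inv + inv.T)  # gradient of the smooth part
        base = smooth_loss(theta, s)
        t = step
        while True:
            # Forward (gradient) step, then backward (proximal) step.
            cand = theta - t * grad
            cand[off] = soft_threshold(cand[off], t * lam)
            d = cand - theta
            bound = base + np.sum(grad * d) + np.sum(d * d) / (2.0 * t) + 1e-12
            # Accept only positive definite iterates with sufficient decrease.
            if np.all(np.linalg.eigvalsh(cand) > 0) and smooth_loss(cand, s) <= bound:
                break
            t *= 0.5
        theta = cand
    return theta
```

Here `s` is an empirical covariance matrix and `lam` controls sparsity: larger values zero out more off-diagonal entries of the estimated precision matrix, that is, more edges of the graph.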
Knowledge-Informed Machine Learning for Cancer Diagnosis and Prognosis: A review
Cancer remains one of the most challenging diseases to treat in the medical
field. Machine learning has enabled in-depth analysis of rich multi-omics
profiles and medical imaging for cancer diagnosis and prognosis. Despite these
advancements, machine learning models face challenges stemming from limited
labeled sample sizes, the intricate interplay of high-dimensionality data
types, the inherent heterogeneity observed among patients and within tumors,
and concerns about interpretability and consistency with existing biomedical
knowledge. One approach to surmount these challenges is to integrate biomedical
knowledge into data-driven models, which has proven potential to improve the
accuracy, robustness, and interpretability of model results. Here, we review
the state-of-the-art machine learning studies that adopted the fusion of
biomedical knowledge and data, termed knowledge-informed machine learning, for
cancer diagnosis and prognosis. Emphasizing the properties inherent in four
primary data types including clinical, imaging, molecular, and treatment data,
we highlight modeling considerations relevant to these contexts. We provide an
overview of diverse forms of knowledge representation and current strategies of
knowledge integration into machine learning pipelines with concrete examples.
We conclude the review article by discussing future directions to advance
cancer research through knowledge-informed machine learning.
Comment: 41 pages, 4 figures, 2 tables
Applying the Free-Energy Principle to Complex Adaptive Systems
The free energy principle is a mathematical theory of the behaviour of self-organising systems that originally gained prominence as a unified model of the brain. Since then, the theory has been applied to a plethora of biological phenomena, extending from single-celled and multicellular organisms through to niche construction and human culture, and even the emergence of life itself. The free energy principle tells us that perception and action operate synergistically to minimize an organism's exposure to surprising biological states, which are more likely to lead to decay. A key corollary of this hypothesis is active inference: the idea that all behavior involves the selective sampling of sensory data so that we experience what we expect to (in order to avoid surprises). Simply put, we act upon the world to fulfill our expectations. It is now widely recognized that the implications of the free energy principle for our understanding of the human mind and behavior are far-reaching and profound. To date, however, its capacity to extend beyond our brain, to more generally explain living and other complex adaptive systems, has only just begun to be explored. The aim of this collection is to showcase the breadth of the free energy principle as a unified theory of complex adaptive systems: conscious, social, living, or not.
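Regularized dependency modelling between gene expression and metabolomics data in the study of metabolic regulation (Regularisoitu riippuvuuksien mallintaminen geeniekspressio- ja metabolomiikkadatan välillä metabolian säätelyn tutkimuksessa)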
Fusing different high-throughput data sources is an effective way to reveal functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in the different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges for data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA has been adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors.
In this thesis, the performance of a new variant of Bayesian CCA, group-wise sparse CCA (gsCCA), is compared with that of classical ridge-regression-regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. gsCCA produces a regularization effect partly similar to that of rrCCA and, in addition, introduces a new type of regularization of the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from the oxidative-stress-tolerant Arabidopsis thaliana ecotype Col-0 and the oxidative-stress-sensitive mutant rcd1, collected as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in oxidative stress signalling in plants. For both methods, rrCCA and gsCCA, the thesis illustrates their potential to reveal both already known and new regulatory mechanisms in Arabidopsis thaliana oxidative stress signalling.
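To make the ridge-regularized CCA mentioned in this abstract more concrete, here is a minimal sketch of one common formulation: a ridge term is added to each within-set covariance matrix before whitening, so the problem stays well-posed even when there are fewer samples than features. This is a generic textbook-style construction, not the thesis's code, and it does not implement gsCCA. The function name `ridge_cca` and the parameters `reg_x`, `reg_y`, and `n_components` are assumptions for this example.

```python
import numpy as np

def ridge_cca(x, y, reg_x, reg_y, n_components=1):
    """Ridge-regularized CCA between data matrices x (n, p) and y (n, q).
    Returns canonical weights for x and y and the regularized correlations."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Ridge terms keep the within-set covariances invertible when n < p or n < q.
    cxx = x.T @ x / (n - 1) + reg_x * np.eye(x.shape[1])
    cyy = y.T @ y / (n - 1) + reg_y * np.eye(y.shape[1])
    cxy = x.T @ y / (n - 1)
    # Whiten with Cholesky factors: k = Lx^{-1} Cxy Ly^{-T}.
    lx = np.linalg.cholesky(cxx)
    ly = np.linalg.cholesky(cyy)
    k = np.linalg.solve(lx, cxy)
    k = np.linalg.solve(ly, k.T).T
    u, sv, vt = np.linalg.svd(k, full_matrices=False)
    # Map singular vectors back to the original feature spaces.
    wx = np.linalg.solve(lx.T, u[:, :n_components])
    wy = np.linalg.solve(ly.T, vt.T[:, :n_components])
    return wx, wy, sv[:n_components]
```

In a setting like the one described, `x` could hold gene expression measurements and `y` metabolite concentrations for the same samples; the ridge parameters would typically be chosen by cross-validation, which this sketch leaves out.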
Recipes for calibration and validation of agent-based models in cancer biomedicine
Computational models and simulations are not just appealing because of their
intrinsic characteristics across spatiotemporal scales, scalability, and
predictive power, but also because the set of problems in cancer biomedicine
that can be addressed computationally exceeds the set of those amenable to
analytical solutions. Agent-based models and simulations are especially
interesting candidates among computational modelling strategies in cancer
research due to their capabilities to replicate realistic local and global
interaction dynamics at a convenient and relevant scale. Yet, the absence of
methods to validate the consistency of the results across scales can hinder
adoption by turning fine-tuned models into black boxes. This review compiles
relevant literature to explore strategies to leverage high-fidelity simulations
of multi-scale, or multi-level, cancer models with a focus on validation
approached as simulation calibration. We argue that simulation calibration goes
beyond parameter optimization by embedding informative priors to generate
plausible parameter configurations across multiple dimensions.