21 research outputs found

    Modelling Transcriptional Regulation with a Mixture of Factor Analyzers and Variational Bayesian Expectation Maximization

    Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression; and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters, and for estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference.
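The abstract above does not include code; as a rough, hedged illustration of the model class it describes (not the authors' VBEM implementation), the Python sketch below simulates data from a small mixture of factor analyzers and computes component responsibilities, the quantity a (variational) E-step estimates. All dimensions and parameter values are invented for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Toy mixture of factor analyzers: x = Lambda_k z + mu_k + eps,
# z ~ N(0, I), eps ~ N(0, Psi). The latent factors play the role of
# (unobserved) transcription factor activities grouped into modules.
n_genes, n_factors, n_components = 20, 3, 2
Lambda = rng.normal(size=(n_components, n_genes, n_factors))  # factor loadings
mu = rng.normal(size=(n_components, n_genes))                 # component offsets
Psi = 0.1 * np.eye(n_genes)                                   # observation noise
weights = np.array([0.6, 0.4])                                # mixing proportions

# Marginal covariance of each component: Lambda Lambda^T + Psi
covs = [Lambda[k] @ Lambda[k].T + Psi for k in range(n_components)]

# Simulate a few expression profiles from the mixture
n_samples = 5
ks = rng.choice(n_components, size=n_samples, p=weights)
X = np.stack([rng.multivariate_normal(mu[k], covs[k]) for k in ks])

# Responsibilities r_nk proportional to pi_k * N(x_n | mu_k, covs[k]):
# the posterior over component membership used for gene clustering.
log_r = np.stack([
    np.log(weights[k]) + multivariate_normal.logpdf(X, mu[k], covs[k])
    for k in range(n_components)
], axis=1)
resp = np.exp(log_r - log_r.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)
print(resp.round(3))
```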

    Modelling transcriptional regulation with Gaussian processes

    A challenging problem in systems biology is the quantitative modelling of transcriptional regulation. Transcription factors (TFs), which are the key proteins at the centre of the regulatory processes, may be subject to post-translational modification, rendering them unobservable at the mRNA level, or they may be controlled outside of the subsystem being modelled. In both cases, a mechanistic model description of the regulatory system needs to be able to deal with latent activity profiles of the key regulators. A promising approach to dealing with these difficulties is to use Gaussian processes to define a prior distribution over the latent TF activity profiles. Inference is based on the principles of non-parametric Bayesian statistics, consistently inferring the posterior distribution of the unknown TF activities from the observed expression levels of potential target genes. The present work provides explicit solutions to the differential equations needed to model the data in this manner, as well as the derivatives needed for effective optimisation. The work further explores identifiability issues not fully examined in previous work and looks at how these can cause difficulties for inference. We subsequently examine how the method performs on two different TFs, including how the model behaves with a more biologically realistic mechanistic model. Finally, we analyse the effect of more biologically realistic non-Gaussian noise on this model, showing how it can reduce the accuracy of the inference.
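As a hedged illustration of the latent-activity setup described above (not the paper's exact equations), the sketch below draws a TF activity profile from a Gaussian process prior and numerically integrates a standard linear activation ODE, dx_j/dt = B_j + S_j f(t) - D_j x_j, for two hypothetical target genes; the kernel and kinetic parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(1)

# Draw a latent TF activity profile f(t) from a GP prior with a
# squared-exponential covariance (lengthscale/variance are illustration values).
t = np.linspace(0, 10, 100)
lengthscale, variance = 1.5, 1.0
K = variance * np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / lengthscale ** 2)
f = rng.multivariate_normal(np.zeros_like(t), K + 1e-8 * np.eye(t.size))

# Linear-activation response model for target gene j:
#   dx_j/dt = B_j + S_j * f(t) - D_j * x_j
# (basal rate B, sensitivity S, decay D). Values below are made up.
B = np.array([0.1, 0.2])
S = np.array([1.0, 0.5])
D = np.array([0.4, 0.8])

def rhs(time, x):
    ft = np.interp(time, t, f)   # latent TF activity at this time point
    return B + S * ft - D * x

sol = solve_ivp(rhs, (t[0], t[-1]), y0=np.zeros(2), t_eval=t)
print(sol.y.shape)  # (2 genes, 100 time points): noiseless target expression
```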

    Regularized modelling of dependencies between gene expression and metabolomics data in the study of metabolic regulation

    Fusing different high-throughput data sources is an effective way to reveal functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in the different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges for data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA has been adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors. In this thesis, the performance of a new variant of Bayesian CCA, group-sparse CCA (gsCCA), is compared to a classical ridge-regression-regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. gsCCA produces a regularization effect partly similar to that of rrCCA but, in addition, introduces a new type of regularization of the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from the oxidative-stress-tolerant Arabidopsis thaliana ecotype Col-0 and the oxidative-stress-sensitive mutant rcd1, measured as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in oxidative stress signalling in plants. For both methods, rrCCA and gsCCA, the thesis illustrates their potential to reveal both already known and new regulatory mechanisms in Arabidopsis thaliana oxidative stress signalling.
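As a minimal sketch of the ridge-regularized CCA (rrCCA) baseline discussed above, assuming synthetic stand-ins for the expression and metabolite matrices, the code below adds a ridge term to each view's covariance and solves the resulting generalized eigenproblem; it does not reproduce the thesis's gsCCA method.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)

# Toy stand-ins for the two views: gene expression (n x p) and
# metabolite concentrations (n x q). Dimensions are arbitrary.
n, p, q = 30, 50, 20
Z = rng.normal(size=(n, 3))                                   # shared latent signal
X = Z @ rng.normal(size=(3, p)) + 0.5 * rng.normal(size=(n, p))
Y = Z @ rng.normal(size=(3, q)) + 0.5 * rng.normal(size=(n, q))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

def ridge_cca(X, Y, reg=1e-1, n_comp=2):
    """Ridge-regularized CCA sketch: adds reg*I to each view covariance
    so it stays invertible when n < p, as is typical for omics data."""
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # X-side canonical directions from the generalized eigenproblem
    #   Cxy Cyy^{-1} Cyx a = rho^2 Cxx a
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = eigh(M, Cxx)
    order = np.argsort(vals)[::-1][:n_comp]
    A = vecs[:, order]                          # X-side weights
    B = np.linalg.solve(Cyy, Cxy.T @ A)         # Y-side weights (up to scale)
    return np.sqrt(np.clip(vals[order], 0, 1)), A, B

rho, A, B = ridge_cca(X, Y)
print("canonical correlations:", rho.round(3))
```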

    Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery

    The discovery of transcription factor binding site (TFBS) motifs remains an important and challenging problem in computational biology. This thesis presents MITSU, a novel algorithm for TFBS motif discovery which exploits stochastic methods both as a means of overcoming optimality limitations in current algorithms and as a framework for incorporating relevant prior knowledge in order to improve results. The current state of the TFBS motif discovery field is surveyed, with a focus on probabilistic algorithms that typically take the promoter regions of coregulated genes as input. A case is made for an approach based on the stochastic Expectation-Maximisation (sEM) algorithm, and its position amongst existing probabilistic algorithms for motif discovery is shown. The algorithm developed in this thesis is unique amongst existing motif discovery algorithms in that it combines the sEM algorithm with a derived data set, which leads to an improved approximation to the likelihood function. This likelihood function is unconstrained with regard to the distribution of motif occurrences within the input dataset. MITSU also incorporates a novel heuristic to automatically determine TFBS motif width. This heuristic, known as MCOIN, is shown to outperform current methods for determining motif width. MITSU is implemented in Java and an executable is available for download. MITSU is evaluated quantitatively using realistic synthetic data and several collections of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates that MITSU improves on a deterministic EM-based motif discovery algorithm and an alternative sEM-based algorithm, in terms of previously established metrics. The ability of the sEM algorithm to escape stable fixed points of the EM algorithm, which trap deterministic motif discovery algorithms, and the ability of MITSU to discover multiple motif occurrences within a single input sequence are also demonstrated. MITSU is validated using previously characterised Alphaproteobacterial motifs, before being applied to motif discovery in uncharacterised Alphaproteobacterial data. A number of novel results from this analysis are presented and motivate two extensions of MITSU: a strategy for the discovery of multiple different motifs within a single dataset and a higher-order Markov background model. The effects of incorporating these extensions within MITSU are evaluated quantitatively using previously characterised prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs. Finally, an information-theoretic measure of motif palindromicity is presented and its advantages over existing approaches for discovering palindromic motifs are discussed.
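The following is a hedged toy sketch of stochastic EM for a position weight matrix (PWM) motif model, the general technique MITSU builds on, not MITSU itself: the E-step samples one motif start per sequence from its posterior rather than averaging over positions, and the M-step re-estimates the PWM from the sampled occurrences. The planted motif, motif width and sequence lengths are arbitrary choices for illustration (MITSU's MCOIN heuristic would choose the width automatically).

```python
import numpy as np

rng = np.random.default_rng(3)
ALPHA = "ACGT"
W = 6  # assumed motif width for this toy example

# Toy input: promoter-like sequences with a planted motif "TTGACA"
def plant(seq, motif="TTGACA"):
    pos = rng.integers(0, len(seq) - len(motif))
    return seq[:pos] + motif + seq[pos + len(motif):]

seqs = [plant("".join(rng.choice(list(ALPHA), 40))) for _ in range(20)]
X = [np.array([ALPHA.index(c) for c in s]) for s in seqs]

pwm = np.full((W, 4), 0.25)   # position weight matrix, uniform start
bg = np.full(4, 0.25)         # 0th-order background model

for it in range(50):
    # Stochastic E-step: sample one motif start per sequence from its
    # posterior over positions (deterministic EM would average instead).
    starts = []
    for x in X:
        scores = np.array([
            np.sum(np.log(pwm[np.arange(W), x[i:i + W]]) - np.log(bg[x[i:i + W]]))
            for i in range(len(x) - W + 1)
        ])
        post = np.exp(scores - scores.max())
        post /= post.sum()
        starts.append(rng.choice(len(post), p=post))
    # M-step: re-estimate the PWM from the sampled occurrences (+ pseudocounts)
    counts = np.ones((W, 4))
    for x, s in zip(X, starts):
        for j in range(W):
            counts[j, x[s + j]] += 1
    pwm = counts / counts.sum(axis=1, keepdims=True)

print("".join(ALPHA[i] for i in pwm.argmax(axis=1)))  # consensus of learned motif
```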

    Integrating Physics Modelling with Machine Learning for Remote Sensing

    Earth observation through satellite sensors, models and in situ measurements provides a way to monitor our planet with unprecedented spatial and temporal resolution. The amount and diversity of the data which is recorded and made available is ever-increasing. This data allows us to perform crop yield prediction, track land-use change such as deforestation, monitor and respond to natural disasters, and predict and mitigate climate change. The last two decades have seen a large increase in the application of machine learning algorithms in Earth observation in order to make efficient use of the growing data stream. Machine learning algorithms, however, are typically model-agnostic and too flexible, and so end up not respecting fundamental laws of physics. On the other hand, there has in recent years been an increase in research attempting to embed physics knowledge in machine learning algorithms in order to obtain interpretable and physically meaningful solutions. The main objective of this thesis is to explore different ways of encoding physical knowledge to provide machine learning methods tailored for specific problems in remote sensing. Ways of expressing expert knowledge about the relevant physical systems in remote sensing abound, ranging from simple relations between reflectance indices and biophysical parameters to complex models that compute the radiative transfer of electromagnetic radiation through our atmosphere, and differential equations that explain the dynamics of key parameters. This thesis focuses on inversion problems, emulation of radiative transfer models, and incorporation of the aforementioned domain knowledge in machine learning algorithms for remote sensing applications. We explore new methods that can optimally model simulated and in-situ data jointly, incorporate differential equations in machine learning algorithms, handle more complex inversion problems and large-scale data, obtain accurate and computationally efficient emulators that are consistent with physical models, and efficiently perform approximate Bayesian inversion over radiative transfer models.
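As a hedged sketch of the emulation idea described above, the code below fits a Gaussian process regressor to input-output pairs from a cheap stand-in function playing the role of a radiative transfer model; the stand-in function, parameter ranges and design size are invented for illustration and are not the thesis's actual simulators or emulators.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(4)

# Stand-in for an expensive radiative transfer model: maps two biophysical
# parameters (e.g. leaf area index, chlorophyll content) to a reflectance
# value. A real emulator would be trained on runs of a code such as PROSAIL.
def simulator(theta):
    lai, chl = theta[:, 0], theta[:, 1]
    return np.exp(-0.4 * lai) + 0.1 * np.sin(3.0 * chl)

# A small design of simulator runs (random design here for simplicity)
theta_train = rng.uniform([0.0, 0.0], [6.0, 2.0], size=(80, 2))
y_train = simulator(theta_train)

# GP emulator: a cheap surrogate that stays consistent with the simulator
gp = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0]),
    normalize_y=True,
)
gp.fit(theta_train, y_train)

theta_test = rng.uniform([0.0, 0.0], [6.0, 2.0], size=(5, 2))
mean, std = gp.predict(theta_test, return_std=True)
print(np.c_[simulator(theta_test), mean, std].round(3))  # truth vs emulator
```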

    Enhanced label-free discovery proteomics through improved data analysis and knowledge enrichment

    Mass spectrometry (MS)-based proteomics has evolved into an important tool applied in fundamental biological research as well as biomedicine and medical research. The rapid development of the technology has required the establishment of data processing algorithms, protocols and workflows. The successful application of such software tools allows for the maturation of instrumental raw data into biological and medical knowledge. However, as the choice of algorithms is vast, the selection of suitable processing tools for various data types and research questions is not trivial. In this thesis, MS data processing related to the label-free technology is systematically considered. Essential questions, such as normalization, choice of preprocessing software, missing values and imputation, are reviewed in depth. Considerations related to preprocessing of the raw data are complemented with an exploration of methods for analyzing the processed data into practical knowledge. In particular, longitudinal differential expression is reviewed in detail, and a novel approach well suited for noisy longitudinal high-throughput data with missing values is suggested. Knowledge enrichment through integrated functional enrichment and network analysis is introduced for intuitive and information-rich delivery of the results. Effective visualization of such integrated networks enables fast screening of the results for the most promising candidates (e.g. clusters of co-expressing proteins with disease-related functions) for further validation and research. Finally, conclusions related to the preprocessing of the raw data are combined with considerations regarding longitudinal differential expression and integrated knowledge enrichment into guidelines for a potential label-free discovery proteomics workflow. The proposed data processing workflow, with practical suggestions for each distinct step, can act as a basis for transforming label-free raw MS data into applicable knowledge.
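As a hedged illustration of two early steps in a label-free workflow discussed above, the sketch below applies median normalization and a simple left-censored imputation to a toy log2 intensity matrix; these specific choices are common defaults used only for illustration, not the thesis's recommended settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Toy label-free intensity matrix (proteins x samples) on a log2 scale,
# with some values missing, as is typical for discovery proteomics data.
data = pd.DataFrame(
    rng.normal(25, 2, size=(8, 4)),
    index=[f"prot{i}" for i in range(8)],
    columns=[f"sample{j}" for j in range(4)],
)
data.iloc[[1, 4], [0, 2]] = np.nan

# Median normalization: align sample medians to remove systematic
# between-run intensity differences (one of several options to compare).
medians = data.median(axis=0, skipna=True)
normalized = data - medians + medians.mean()

# Simple left-censored imputation: replace missing values with the protein's
# observed minimum minus 1 log2 unit (i.e. half the intensity on the raw
# scale), reflecting that low-abundance values are more likely to be missing.
imputed = normalized.apply(
    lambda row: row.fillna(row.min(skipna=True) - 1.0), axis=1
)
print(imputed.round(2))
```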

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications, of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and model cancer patient survival. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially non-identical data distributions and its application to cancer genomic data. We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of this thesis is also the analysis of large data compendia, since predictions about certain phenomena, such as associations between diseases or the involvement of genes in a certain phenotype, are only possible when large amounts of data are available. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.
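As a hedged sketch of the collective latent factor idea described above, the code below factorizes two toy relation matrices that share a "gene" mode with a common factor matrix, fitted by plain gradient descent; matrix sizes, rank and learning rate are arbitrary, and this is a simplified stand-in rather than the authors' models.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data fusion setting: two relation matrices sharing the "gene" mode,
# e.g. gene-disease associations and gene-chemical responses. Sizes arbitrary.
n_genes, n_dis, n_chem, k = 40, 15, 10, 4
G_true = rng.normal(size=(n_genes, k))
R1 = G_true @ rng.normal(size=(k, n_dis)) + 0.1 * rng.normal(size=(n_genes, n_dis))
R2 = G_true @ rng.normal(size=(k, n_chem)) + 0.1 * rng.normal(size=(n_genes, n_chem))

# Collective factorization sketch: one shared gene factor matrix G and one
# relation-specific factor per matrix, fitted by gradient descent on the
# squared reconstruction error of both relations jointly.
G = rng.normal(scale=0.1, size=(n_genes, k))
D = rng.normal(scale=0.1, size=(k, n_dis))
C = rng.normal(scale=0.1, size=(k, n_chem))
lr = 1e-3
for step in range(2000):
    E1, E2 = G @ D - R1, G @ C - R2
    G -= lr * (E1 @ D.T + E2 @ C.T)   # shared factor pools evidence from both relations
    D -= lr * (G.T @ E1)
    C -= lr * (G.T @ E2)

print("R1 relative error:", np.linalg.norm(G @ D - R1) / np.linalg.norm(R1))
print("R2 relative error:", np.linalg.norm(G @ C - R2) / np.linalg.norm(R2))
```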