175 research outputs found

    A sparse Bayesian learning method for structural equation model-based gene regulatory network inference

    Gene regulatory networks (GRNs) are networks defined by the regulatory interactions between genes. Reconstructing GRNs from large-scale genetic data is important for understanding gene functions and biological mechanisms, and can support medical treatment and genetic research. A series of artificial intelligence based methods have been proposed to infer GRNs from both gene expression data and genetic perturbations. Such algorithms can be more accurate than models that consider gene expression data alone. A structural equation model (SEM), which provides a systematic framework for integrating both types of data, is commonly used for GRN inference. Considering the sparsity of GRNs, in this paper we develop a novel sparse Bayesian inference algorithm based on a Normal-Exponential-Gamma (NEG) type hierarchical prior (BaNEG) to infer GRNs modeled with SEMs more accurately. First, we reparameterize an SEM as a linear model by integrating the endogenous and exogenous variables; then, a Bayesian adaptive lasso with a three-level NEG prior is applied to derive the corresponding posterior mode and estimate the parameters. Simulations on synthetic data compare the performance of BaNEG with some state-of-the-art algorithms; the results demonstrate that the proposed algorithm visibly outperforms the others. Furthermore, BaNEG is applied to a real data set of 47 yeast genes from Saccharomyces cerevisiae to infer the underlying GRN and discover potential relationships between genes.
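
The reparameterization step described in this abstract can be illustrated with a minimal sketch. It assumes the common SEM form Y = BY + FX + E with one perturbation variable per gene (F diagonal), which the abstract does not spell out, and it swaps in a plain lasso solved by coordinate descent in place of the paper's NEG-prior Bayesian adaptive lasso. All function names and the penalty value `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W, y, lam, n_iter=200):
    """Plain lasso by coordinate descent:
    minimise (1/2n) * ||y - W beta||^2 + lam * ||beta||_1.
    This is a stand-in for the NEG-prior estimator, not the paper's method."""
    n, p = W.shape
    beta = np.zeros(p)
    col_sq = (W ** 2).sum(axis=0) / n
    resid = y.astype(float)              # residual for beta = 0 (a copy of y)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += W[:, j] * beta[j]   # remove column j's contribution
            rho = W[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= W[:, j] * beta[j]   # add the updated contribution back
    return beta

def sem_row_design(Y, X, i):
    """Linear-model design matrix for gene i: expression of all other genes
    (endogenous variables) plus gene i's own perturbation (exogenous variable).
    Y and X are genes x samples arrays."""
    others = np.delete(Y, i, axis=0).T   # samples x (genes - 1)
    return np.column_stack([others, X[i]])

def infer_grn(Y, X, lam=0.05):
    """Estimate the SEM coefficients gene by gene.
    Returns B (genes x genes, zero diagonal) and f (diagonal of F)."""
    g = Y.shape[0]
    B = np.zeros((g, g))
    f = np.zeros(g)
    for i in range(g):
        beta = lasso_cd(sem_row_design(Y, X, i), Y[i], lam)
        B[i, np.arange(g) != i] = beta[:-1]
        f[i] = beta[-1]
    return B, f
```

Note that this sketch penalizes the perturbation coefficient along with the gene-gene coefficients; an implementation closer to the paper would place the hierarchical NEG prior on the coefficients instead of a single fixed penalty.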

    Integration and visualisation of clinical-omics datasets for medical knowledge discovery

    In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult. Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration. Modern omics datasets are extremely feature rich and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical metadata driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research. Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. 
To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.

    Optimization of logical networks for the modelling of cancer signalling pathways

    Cancer is one of the main causes of death throughout the world. The survival of patients diagnosed with various cancer types remains low despite the considerable progress of recent decades. Some of the reasons for this unmet clinical need are the high heterogeneity between patients, the differentiation of cancer cells within a single tumor, the persistence of cancer stem cells, and the high number of possible clinical phenotypes arising from the combination of the genetic and epigenetic insults that confer on cells the functional characteristics enabling them to proliferate, evade the immune system and programmed cell death, and give rise to neoplasms. To identify new therapeutic options, a better understanding of the mechanisms that generate and maintain these functional characteristics is needed. As many of the alterations that characterize cancerous lesions relate to the signaling pathways that ensure the adequacy of cellular behavior in a specific micro-environment and in response to molecular cues, it is likely that increased knowledge about these signaling pathways will result in the identification of new pharmacological targets towards which new drugs can be designed. As such, the modeling of cellular regulatory networks can play a prominent role in this understanding, as computational modeling allows the integration of large quantities of data and the simulation of large systems. Logical modeling is well adapted to the large-scale modeling of regulatory networks. Different types of logical network modeling have been used successfully to study cancer signaling pathways and investigate specific hypotheses. In this work we propose a Dynamic Bayesian Network framework to contextualize network models of signaling pathways. We implemented FALCON, a Matlab toolbox to formulate the parametrization of a prior-knowledge interaction network given a set of biological measurements under different experimental conditions. 
The FALCON toolbox allows a systems-level analysis of the model with the aim of identifying the most sensitive nodes and interactions of the inferred regulatory network and pointing to possible ways to modify its functional properties. The resulting hypotheses can be tested in the form of virtual knock-out experiments. We also propose a series of regularization schemes, materializing biological assumptions, to incorporate relevant research questions in the optimization procedure. These questions include the detection of the active signaling pathways in a specific context, the identification of the most important differences within a group of cell lines, or the time-frame of network rewiring. We used the toolbox and its extensions on a series of toy models and biological examples. We showed that our pipeline is able to identify cell type-specific parameters that are predictive of drug sensitivity, using a regularization scheme based on local parameter densities in the parameter space. We applied FALCON to the analysis of the resistance mechanism in A375 melanoma cells adapted to low doses of a TNFR agonist, and we accurately predicted the re-sensitization and successful induction of apoptosis in the adapted cells via the silencing of XIAP and the down-regulation of NFkB. We further point to specific drug combinations that could be applied in the clinic. Overall, we demonstrate that our approach is able to identify the most relevant changes between sensitive and resistant cancer clones.

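    Regularized dependency modelling between gene expression and metabolomics data in the study of metabolic regulation (Regularisoitu riippuvuuksien mallintaminen geeniekpressio- ja metabolomiikkadatan välillä metabolian säätelyn tutkimuksessa)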

    Fusing different high-throughput data sources is an effective way to reveal functions of unknown genes, as well as regulatory relationships between biological components such as genes and metabolites. Dependencies between biological components functioning in the different layers of biological regulation can be investigated using canonical correlation analysis (CCA). However, the properties of high-throughput bioinformatics data pose many challenges to data analysis: the sample size is often insufficient compared to the dimensionality of the data, and the data exhibit multicollinearity due to, for example, co-expressed and co-regulated genes. Therefore, a regularized version of classical CCA has been adopted. An alternative way of introducing regularization to statistical models is to perform Bayesian data analysis with suitable priors. In this thesis, the performance of a new variant of Bayesian CCA called gsCCA is compared to a classical ridge regression regularized CCA (rrCCA) in revealing relevant information shared between two high-throughput data sets. The gsCCA produces a regularization effect partly similar to that of rrCCA but, in addition, introduces a new type of regularization to the data covariance matrices. Both CCA methods are applied to gene expression and metabolite concentration measurements obtained from an oxidative-stress-tolerant Arabidopsis thaliana ecotype Col-0 and an oxidative-stress-sensitive mutant rcd1 as time series under ozone exposure and in a control condition. The aim of this work is to reveal new regulatory mechanisms in the oxidative stress signalling in plants. 
For both methods, rrCCA and gsCCA, the thesis illustrates their potential to reveal both already known and new regulatory mechanisms in Arabidopsis thaliana oxidative stress signalling.
In bioinformatics, combining different types of measurement data is an effective way to uncover the functions of unknown genes and the regulatory interactions between biological components such as genes and metabolites. Dependencies between components acting at different levels of biological regulation can be studied with canonical correlation analysis (CCA). Bioinformatics datasets nevertheless pose many challenges for data analysis: the number of samples is often insufficient relative to the number of features, and the data are multicollinear owing to, for example, co-regulated and co-expressed genes. For this reason, a regularized version of canonical correlation analysis is often used for the statistical analysis of the data. An alternative to regularized analysis is a Bayesian approach with suitable prior assumptions. This thesis studies and compares the ability of a new Bayesian CCA and a classical ridge regression regularized CCA to find the relevant shared information between two bioinformatics datasets. The new Bayesian method is called group-wise sparse canonical correlation analysis. Group-wise sparse CCA produces a regularization effect similar to that of ridge regression CCA, but in addition it regularizes the covariance matrices of the datasets in a new way. Both CCA methods are applied to gene expression data and metabolite concentration data measured as time series from the oxidative-stress-tolerant Arabidopsis thaliana ecotype Col-0 and the oxidative-stress-sensitive rcd1 mutant, under both ozone exposure and control conditions. The thesis illustrates the ability of ridge regression CCA and group-wise sparse CCA to reveal already known and possibly new regulatory mechanisms between genes and metabolites in plant cell signalling during oxidative stress.
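
The ridge regression regularized CCA (rrCCA) named in this abstract can be sketched as follows. This is a minimal illustration under assumptions, not the thesis implementation: it adds a ridge term to each within-set covariance matrix, whitens, and takes a singular value decomposition of the cross-covariance. The function names and the penalty values `lam_x` and `lam_y` are hypothetical, and the Bayesian gsCCA variant is not covered.

```python
import numpy as np

def inv_sqrt_psd(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def ridge_cca(X, Y, lam_x=0.1, lam_y=0.1, n_components=1):
    """Ridge-regularized CCA for samples x features arrays X and Y.
    Returns the weight matrices for X and Y and the regularized
    canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # The ridge term keeps the covariances invertible even when the number
    # of features exceeds the number of samples.
    Cxx = X.T @ X / (n - 1) + lam_x * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + lam_y * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Wx = inv_sqrt_psd(Cxx)
    Wy = inv_sqrt_psd(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :n_components]
    B = Wy @ Vt.T[:, :n_components]
    return A, B, s[:n_components]
```

With the penalties set to zero the function would reduce to classical CCA, which fails in the setting the abstract describes (fewer samples than features) because the sample covariance matrices become singular.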

    ANALYSIS AND SIMULATION OF TANDEM MASS SPECTROMETRY DATA

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes), and their interactions (e.g. proteins binding together to form complex molecular machines) to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundance for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations on present algorithms, simulating mass spectroscopy instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment. 
Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community.
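
The idea of deconvolving a chimeric spectrum with non-negative least squares can be sketched as below. This is an illustration under assumptions rather than the dissertation's model: the observed isotope pattern is treated as a non-negative mixture of per-peptide template patterns, and the mixture weights are found by projected gradient descent. The template intensities are made-up numbers and the function name is hypothetical.

```python
import numpy as np

def nnls_projected_gradient(A, b, n_iter=500):
    """Solve min ||A x - b||^2 subject to x >= 0 by projected gradient descent."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    # Step size 1/L, where L = ||A||_2^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = np.maximum(x - step * grad, 0.0)   # project onto x >= 0
    return x

# Hypothetical isotope templates for two co-fragmented peptides
# (one column per peptide, one row per m/z bin).
templates = np.array([[0.60, 0.30, 0.10, 0.00, 0.00],
                      [0.00, 0.00, 0.50, 0.35, 0.15]]).T

# Synthetic chimeric spectrum: a mixture of the two templates.
observed = templates @ np.array([2.0, 0.5])

abundances = nnls_projected_gradient(templates, observed)
```

The non-negativity constraint reflects that a peptide cannot contribute negative intensity to a spectrum. A library routine such as `scipy.optimize.nnls` could replace the hand-written solver.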

    Modelling genetic and genomic interactions underlying gene expression and complex traits

    This study focuses on integrating and applying computational techniques for modelling quantitative traits and complex diseases, such as hypertension and diabetes, using the rat model system and translating the findings to humans. Complex disease traits are heritable, highly polygenic, and influenced by environmental factors. Human studies, such as Genome-Wide Association Studies (GWAS), have identified many genetic determinants underlying these traits but have provided little information about the functional effects of these variants and the mechanisms regulating the disease. This study takes a systems-level approach to the genetic regulation of complex traits in the rat by analysing multiple phenotypes, genome-wide genetic variation and gene expression data in multiple tissues. I integrated these multi-modality datasets in the BXH/HXB rat Recombinant Inbred (RI) lines, an established model of the human metabolic syndrome, to identify candidate genes, pathways and networks associated with complex disease phenotypes. I evaluated methods for Expression Quantitative Trait Locus (eQTL) analysis and used sparse Bayesian regression approaches to map eQTLs in the RI lines, delineating a new, large eQTL data resource for the rat genetics community. I also developed and applied signal processing and time series analysis methods to physiological traits to extract more detailed indices of blood pressure, and integrated these with genetic, expression and eQTL data to inform on the regulation of these traits. Then, using publicly available data, I used comparative genomics approaches to elucidate a set of genes and pathways that may play a role in human diseases. This study has provided a valuable resource for future work in the rat, by means of new eQTLs in multiple tissues, and physiological time series phenotypes and approaches. This has enabled an integrative analysis of these data to give new insights into the regulation of complex traits in rats and humans.

    Data analytics 2016: proceedings of the fifth international conference on data analytics
