60 research outputs found

    How to Improve Postgenomic Knowledge Discovery Using Imputation

    While microarrays make it feasible to rapidly investigate many complex biological problems, their multistep fabrication is prone to error at every stage. The standard tactic has been either to ignore erroneous gene readings or to treat them as missing values, though either choice can exert a major influence on postgenomic knowledge discovery methods such as gene selection and gene regulatory network (GRN) reconstruction. This has been the catalyst for a raft of new flexible imputation algorithms, including local least squares imputation and the recent heuristic collateral missing value imputation, which exploit the correlated expression behaviour of functionally related genes to afford accurate missing value estimation. This paper examines the influence of missing value imputation techniques upon postgenomic knowledge inference methods, with results for various algorithms consistently corroborating that, instead of ignoring missing values, recycling microarray data through flexible and robust imputation can provide substantial performance benefits for subsequent downstream procedures.
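    The local least squares idea can be sketched compactly: for each gene with missing readings, regress it on its most similar fully observed genes over the shared samples, then predict the gaps from that fit. Below is a minimal illustrative NumPy sketch of this scheme, assuming a genes-by-samples matrix with NaNs for missing values; the function name, the correlation-based neighbour selection and the choice of k are assumptions for illustration, not the paper's implementation.

        import numpy as np

        def lls_impute(X, k=10):
            # X: genes x samples matrix with NaN marking missing readings
            X = X.copy()
            complete = ~np.isnan(X).any(axis=1)      # fully observed genes
            donors = X[complete]                     # candidate neighbours
            for i in np.where(~complete)[0]:
                obs = ~np.isnan(X[i])                # observed sample positions
                target = X[i, obs]
                # rank donor genes by |Pearson correlation| on observed samples
                c = np.nan_to_num([abs(np.corrcoef(target, d[obs])[0, 1])
                                   for d in donors])
                nn = donors[np.argsort(c)[-k:]]      # k best-matching donors
                # least squares: target ~ w^T * neighbours on observed columns
                w, *_ = np.linalg.lstsq(nn[:, obs].T, target, rcond=None)
                X[i, ~obs] = w @ nn[:, ~obs]         # fill the missing columns
            return X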

    Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics

    Recently, new kinds of high-throughput measurement techniques have emerged, enabling biological research to focus on the fundamental building blocks of living organisms such as genes, proteins, and lipids. Alongside this new type of data, referred to as omics data, modern data analysis techniques have also emerged. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status, as well as on learning unobservable network structures that represent the functional associations of biological regulatory systems. Omics data have certain specific qualities, such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations and very large dimensionality, and the interest often lies in the connections between the large number of variables. This thesis has two major aims. The first is to provide efficient methodology for dealing with various types of missing or censored omics data, which can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum likelihood based covariance estimation method for data with censored values is developed, and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions that display functional associations in large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied to network construction as well as to differential network analyses, both on multiply imputed censored data and on next-generation sequencing count data.
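    The core censoring idea behind the maximum likelihood machinery can be illustrated in one dimension: observed values contribute their density to the likelihood, while values below the limit of detection contribute the cumulative probability of falling below it. The sketch below is a simplification of the thesis's multivariate covariance estimator: it fits a univariate normal by maximum likelihood to left-censored data recorded as NaN and imputes the censored entries from the fitted truncated distribution. The function names and the known-LOD assumption are illustrative.

        import numpy as np
        from scipy import stats
        from scipy.optimize import minimize

        def fit_censored_normal(x, lod):
            # ML fit of N(mu, sigma^2); NaN entries are left-censored at lod
            obs = x[~np.isnan(x)]
            n_cens = np.isnan(x).sum()

            def nll(theta):
                mu, log_sigma = theta
                sigma = np.exp(log_sigma)            # keep sigma positive
                ll = stats.norm.logpdf(obs, mu, sigma).sum()
                ll += n_cens * stats.norm.logcdf(lod, mu, sigma)
                return -ll

            res = minimize(nll, x0=[obs.mean(), np.log(obs.std())])
            return res.x[0], np.exp(res.x[1])

        def impute_censored(x, lod, seed=0):
            # draw censored entries from the fitted normal truncated at lod;
            # repeating with different seeds gives multiple imputations
            mu, sigma = fit_censored_normal(x, lod)
            b = (lod - mu) / sigma                   # standardised upper bound
            out = x.copy()
            out[np.isnan(x)] = stats.truncnorm.rvs(
                -np.inf, b, loc=mu, scale=sigma,
                size=np.isnan(x).sum(), random_state=seed)
            return out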

    Immersive analytics for oncology patient cohorts

    This thesis proposes a novel interactive immersive analytics tool and methods to interrogate cancer patient cohorts in an immersive virtual environment, namely Virtual Reality to Observe Oncology data Models (VROOM). The overall objective is to develop an immersive analytics platform comprising a data analytics pipeline from raw gene expression data to immersive visualisation on virtual and augmented reality platforms utilising a game engine; Unity3D has been used to implement the visualisation. Work in this thesis could provide oncologists and clinicians with an interactive visualisation and visual analytics platform that helps them drive their analysis of treatment efficacy and achieve the goal of evidence-based personalised medicine. The thesis integrates the latest discoveries and developments in cancer patient prognosis, immersive technologies, machine learning, decision support systems and interactive visualisation to form an immersive analytics platform for complex genomic data. The experimental paradigm followed is understanding transcriptomics in cancer samples: the thesis specifically investigates gene expression data to determine the biological similarity revealed by the transcriptomic profiles of patients' tumour samples, which indicate the genes active in different patients. In summary, the thesis contributes: i) a novel immersive analytics platform for patient cohort data interrogation in a similarity space based on the patients' biological and genomic similarity; ii) an effective immersive environment optimisation design based on a usability study of exocentric and egocentric visualisation, audio and sound design optimisation; iii) an integration of trusted and familiar 2D biomedical visual analytics methods into the immersive environment; iv) a novel use of game theory as the engine of the decision-making system to support the analytics process, and an application of optimal transport theory to missing data imputation to ensure the preservation of the data distribution; and v) case studies showcasing the real-world application of the visualisation and its effectiveness.
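    The optimal transport component can be illustrated in one dimension, where the OT map onto an empirical distribution reduces to its quantile function: imputing missing entries by transporting uniform ranks through the observed quantiles guarantees the fills follow the observed marginal. This is a deliberately simplified sketch of the distribution-preserving idea, not the platform's actual imputation routine, and the function name is illustrative.

        import numpy as np

        def ot_impute_1d(x, seed=0):
            # fill NaNs so imputed values follow the observed marginal:
            # in 1-D the optimal transport map from uniform ranks to the
            # data distribution is the empirical quantile function
            rng = np.random.default_rng(seed)
            out = x.copy()
            obs = np.sort(x[~np.isnan(x)])           # empirical distribution
            miss = np.isnan(x)
            u = rng.uniform(0, 1, miss.sum())        # ranks to transport
            out[miss] = np.quantile(obs, u)          # quantile = 1-D OT map
            return out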

    Modelling Transcriptional Regulation with a Mixture of Factor Analyzers and Variational Bayesian Expectation Maximization

    Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression; and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters, and estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference.
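    The generative model at the heart of the method is easy to state: pick a module, draw latent factor activities (the unobserved transcription factor activities), and emit expression as a noisy linear map of those activities. A minimal sketch of this mixture-of-factor-analyzers generative process follows; the VBEM inference itself is omitted, and all dimensions and parameter values are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)

        def sample_mfa(n, pi, lambdas, mus, psi):
            # mixture of factor analyzers: z ~ Categorical(pi) picks a TF
            # module; f ~ N(0, I_q) are latent TF activities; observed
            # expression x = Lambda_z f + mu_z + eps, eps ~ N(0, diag(psi))
            X, Z = [], []
            for _ in range(n):
                z = rng.choice(len(pi), p=pi)
                f = rng.standard_normal(lambdas[z].shape[1])
                x = lambdas[z] @ f + mus[z] + rng.normal(0, np.sqrt(psi))
                X.append(x); Z.append(z)
            return np.array(X), np.array(Z)

        # toy setup: 2 modules, 5 genes, 2 latent factors per module
        pi = [0.6, 0.4]
        lambdas = [rng.standard_normal((5, 2)) for _ in range(2)]
        mus = [rng.standard_normal(5) for _ in range(2)]
        X, Z = sample_mfa(200, pi, lambdas, mus, psi=0.1 * np.ones(5))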

    Non-Coding RNAs Improve the Predictive Power of Network Medicine

    Network Medicine has improved the mechanistic understanding of disease, offering quantitative insights into disease mechanisms, comorbidities, and novel diagnostic tools and therapeutic treatments. Yet, most network-based approaches rely on a comprehensive map of protein-protein interactions, ignoring interactions mediated by non-coding RNAs (ncRNAs). Here, we systematically combine experimentally confirmed binding interactions mediated by ncRNA with protein-protein interactions, constructing the first comprehensive network of all physical interactions in the human cell. We find that the inclusion of ncRNA expands the number of genes in the interactome by 46% and the number of interactions by 107%, significantly enhancing our ability to identify disease modules. Indeed, we find that 132 diseases lacked a statistically significant disease module in the protein-based interactome but have one after inclusion of ncRNA-mediated interactions, making these diseases accessible to the tools of network medicine. We show that the inclusion of ncRNAs helps unveil disease-disease relationships that were not detectable before and expands our ability to predict comorbidity patterns between diseases. Taken together, we find that including non-coding interactions improves both the breadth and the predictive accuracy of network medicine.
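    The disease-module criterion the abstract relies on can be sketched as follows: take a disease's genes, measure the largest connected component (LCC) they induce in the interactome, and compare it against random gene sets of the same size. The sketch below, assuming an undirected networkx graph as the interactome, draws the null sets uniformly for brevity, whereas published network medicine pipelines typically use degree-preserving nulls; the function names are illustrative.

        import numpy as np
        import networkx as nx

        def lcc_size(G, genes):
            # size of the largest connected component induced by `genes`
            sub = G.subgraph(g for g in genes if g in G)
            return max((len(c) for c in nx.connected_components(sub)), default=0)

        def disease_module_significance(G, disease_genes, n_perm=1000, seed=0):
            # empirical p-value: how often does a random gene set of equal
            # size produce an LCC at least as large as the disease's?
            rng = np.random.default_rng(seed)
            genes = [g for g in disease_genes if g in G]
            observed = lcc_size(G, genes)
            nodes = np.array(G.nodes)
            null = [lcc_size(G, rng.choice(nodes, len(genes), replace=False))
                    for _ in range(n_perm)]
            p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
            return observed, p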

    Prognostic Methods for Integrating Data from Complex Diseases

    Statistics in medical research gained a vast surge with the development of high-throughput biotechnologies that provide thousands of measurements for each patient. These multi-layered data have the clear potential to improve disease prognosis. Data integration is increasingly becoming essential in this context, to address problems such as increasing statistical power, resolving inconsistencies between studies, obtaining more reliable biomarkers and gaining a broader understanding of the disease. This thesis focuses on addressing the challenges in the development of statistical methods for this field while contributing to its methodological advancement. We propose a clinical data analysis framework to obtain a model with good prediction accuracy that addresses missing data and model instability. A detailed pre-processing pipeline is proposed for miRNA data that removes unwanted noise and offers improved concordance with qRT-PCR data. Platform-specific models are developed to uncover biomarkers from mRNA, protein and miRNA data, and to identify the source with the most important prognostic information. This thesis explores two types of data integration: horizontal, the integration of the same type of data from multiple studies; and vertical, the integration of data from different platforms for the same patient. We use multiple miRNA datasets to develop a meta-analysis framework addressing the challenges in horizontal data integration using a multi-step validation protocol. In the vertical data integration, we extend the pre-validation principle and derive platform-dependent weights for use with the weighted Lasso. Our study revealed that integration of multi-layered data is instrumental in improving prediction accuracy and in obtaining more biologically relevant biomarkers. A novel visualisation technique examining prediction accuracy at the patient level revealed vital findings with translational impact in personalised medicine.
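    The weighted Lasso step admits a compact sketch via a standard equivalence: penalising coefficient j by weight w_j is the same as running a plain Lasso after dividing column j of the design matrix by w_j, then mapping the coefficients back. The code below illustrates that device with scikit-learn; the function name and the idea of encoding platform membership in the weights are assumptions for illustration, not the thesis's exact estimator.

        import numpy as np
        from sklearn.linear_model import Lasso

        def weighted_lasso(X, y, weights, alpha=0.1):
            # minimise ||y - X beta||^2 + alpha * sum_j w_j |beta_j|
            # via rescaling: Lasso on X / w, then beta_j = gamma_j / w_j
            w = np.asarray(weights, dtype=float)     # e.g. one weight per
            Xw = X / w                               # platform's features
            model = Lasso(alpha=alpha).fit(Xw, y)
            beta = model.coef_ / w                   # undo the rescaling
            return beta, model.intercept_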