23 research outputs found

    Network-guided data integration and gene prioritization

    Get PDF

    Modularity-based approaches to community detection in multilayer networks with applications toward precision medicine

    Get PDF
    Networks have become an important tool for the analysis of complex systems across many different disciplines including computer science, biology, chemistry, social sciences, and importantly, cancer medicine. Networks in the real world typically exhibit many forms of higher order organization. The subfield of networks analysis known as community detection aims to provide tools for discovering and interpreting the global structure of a networks-based on the connectivity patterns of its edges. In this thesis, we provide an overview of the methods for community detection in networks with an emphasis on modularity-based approaches. We discuss several caveats and drawbacks of currently available methods. We also review the success that network analyses have had in interpreting large scale 'omics' data in the context of cancer biology. In the second and third chapters, we present CHAMP and multimodbp, two useful community detection tools that seek to overcome several of the deficiencies in modularity-based community detection. In the final chapter, we develop a networks-based significance test for addressing an important question in the field of oncology: are mutations in DNA damage repair genes associated with elevated levels of tumor mutational burden. We apply the tools of network analysis to this question and showcase how this approach yields new insight into the structure of the problem, revealing what we call the TMB Paradox. We close by demonstrating the clinical utility of our findings in predicting patient response to novel immunotherapies.Doctor of Philosoph

    Network models of stochastic processes in cancer

    Get PDF
    Complex systems which can be modelled as networks are ubiquitous. Well-known examples include social and economic networks, as well as many examples in cell biology such as gene regulatory and protein signalling networks. Many cell biological processes are inherently stochastic and non-stationary, and this is the perspective from which I have developed novel mathematical and computational statistical models, focusing particularly on network models. These models are primarily motivated by cell biological processes relating to DNA methylation and stem cell and cancer biology, but can be generalised to other systems and domains. I have used these and other models to identify and analyse novel DNA-based cancer biomarkers

    Sparse multivariate models for pattern detection in high-dimensional biological data

    No full text
    Recent advances in technology have made it possible and affordable to collect biological data of unprecedented size and complexity. While analysing such data, traditional statistical methods and machine learning algorithms suffer from the curse of dimensionality. Parsimonious models, which may refer to parsimony in model structure and/or model parameters, have been shown to improve both biological interpretability of the model and the generalisability to new data. In this thesis we are concerned with model selection in both supervised and unsupervised learning tasks. For supervised learnings, we propose a new penalty called graphguided group lasso (GGGL) and employ this penalty in penalised linear regressions. GGGL is able to integrate prior structured information with data mining, where variables sharing similar biological functions are collected into groups and the pairwise relatedness between groups are organised into a network. Such prior information will guide the selection of variables that are predictive to a univariate response, so that the model selects variable groups that are close in the network and important variables within the selected groups. We then generalise the idea of incorporating network-structured prior knowledge to association studies consisting of multivariate predictors and multivariate responses and propose the network-driven sparse reduced-rank regression (NsRRR). In NsRRR, pairwise relatedness between predictors and between responses are represented by two networks, and the model identifies associations between a subnetwork of predictors and a subnetwork of responses such that both subnetworks tend to be connected. For unsupervised learning, we are concerned with a multi-view learning task in which we compare the variance of high-dimensional biological features collected from multiple sources which are referred as “views”. We propose the sparse multi-view matrix factorisation (sMVMF) which is parsimonious in both model structure and model parameters. sMVMF can identify latent factors that regulate variability shared across all views and the variability which is characteristic to a specific view, respectively. For each novel method, we also present simulation studies and an application on real biological data to illustrate variable selection and model interpretability perspectives.Open Acces

    Novel Semi-Supervised Learning Models to Balance Data Inclusivity and Usability in Healthcare Applications

    Get PDF
    abstract: Semi-supervised learning (SSL) is sub-field of statistical machine learning that is useful for problems that involve having only a few labeled instances with predictor (X) and target (Y) information, and abundance of unlabeled instances that only have predictor (X) information. SSL harnesses the target information available in the limited labeled data, as well as the information in the abundant unlabeled data to build strong predictive models. However, not all the included information is useful. For example, some features may correspond to noise and including them will hurt the predictive model performance. Additionally, some instances may not be as relevant to model building and their inclusion will increase training time and potentially hurt the model performance. The objective of this research is to develop novel SSL models to balance data inclusivity and usability. My dissertation research focuses on applications of SSL in healthcare, driven by problems in brain cancer radiomics, migraine imaging, and Parkinson’s Disease telemonitoring. The first topic introduces an integration of machine learning (ML) and a mechanistic model (PI) to develop an SSL model applied to predicting cell density of glioblastoma brain cancer using multi-parametric medical images. The proposed ML-PI hybrid model integrates imaging information from unbiopsied regions of the brain as well as underlying biological knowledge from the mechanistic model to predict spatial tumor density in the brain. The second topic develops a multi-modality imaging-based diagnostic decision support system (MMI-DDS). MMI-DDS consists of modality-wise principal components analysis to incorporate imaging features at different aggregation levels (e.g., voxel-wise, connectivity-based, etc.), a constrained particle swarm optimization (cPSO) feature selection algorithm, and a clinical utility engine that utilizes inverse operators on chosen principal components for white-box classification models. The final topic develops a new SSL regression model with integrated feature and instance selection called s2SSL (with “s2” referring to selection in two different ways: feature and instance). s2SSL integrates cPSO feature selection and graph-based instance selection to simultaneously choose the optimal features and instances and build accurate models for continuous prediction. s2SSL was applied to smartphone-based telemonitoring of Parkinson’s Disease patients.Dissertation/ThesisDoctoral Dissertation Industrial Engineering 201

    EMT Network-based Lung Cancer Prognosis Prediction

    Get PDF
    Network-based feature selection methods on omics data have been developed in recent years. Their performance gain, however, is shown to be affected by the datasets, networks, and evaluation metrics. The reproducibility and robustness of biomarkers await to be improved. In this endeavor, one of the major challenges is the curse of dimensionality. To mitigate this issue, we proposed the Phenotype Relevant Network-based Feature Selection (PRNFS) framework. By employing a much smaller but phenotype relevant network, we could avoid irrelevant information and select robust molecular signatures. The advantages of PRNFS were demonstrated with the application of lung cancer prognosis prediction. Specifically, we constructed epithelial mesenchymal transition (EMT) networks and employed them for feature selection. We mapped multiple types of omics data on it alternatively to select single-omics signatures and further integrated them into multi-omics signatures. Then we introduced a multiplex network-based feature selection method to directly select multi-omics signatures. Both single-omics and multi-omics EMT signatures were evaluated on TCGA data as well as an independent multi-omics dataset. The results showed that EMT signatures achieved significant performance gain, although EMT networks covered less than 2.5% of the original data dimensions. Frequently selected EMT features achieved average AUC values of 0.83 on TCGA data. Employing EMT signatures on the independent dataset stratified the patients into significantly different prognostic groups. Multi-omics features showed superior performance over single-omics features on both TCGA data and the independent data. Additionally, we tested the performance of a few relational and non-relational databases for storing and retrieving omics data. Since biological data have large volume, high velocity, and wide varieties, it is necessary to have database systems that meet the need of integrative omics data analysis. Based on the results, we provided a few advices on building scalable omics data infrastructures

    Development of unsupervised learning methods with applications to life sciences data

    Get PDF
    Machine Learning makes computers capable of performing tasks typically requiring human intelligence. A domain where it is having a considerable impact is the life sciences, allowing to devise new biological analysis protocols, develop patients’ treatments efficiently and faster, and reduce healthcare costs. This Thesis work presents new Machine Learning methods and pipelines for the life sciences focusing on the unsupervised field. At a methodological level, two methods are presented. The first is an “Ab Initio Local Principal Path” and it is a revised and improved version of a pre-existing algorithm in the manifold learning realm. The second contribution is an improvement over the Import Vector Domain Description (one-class learning) through the Kullback-Leibler divergence. It hybridizes kernel methods to Deep Learning obtaining a scalable solution, an improved probabilistic model, and state-of-the-art performances. Both methods are tested through several experiments, with a central focus on their relevance in life sciences. Results show that they improve the performances achieved by their previous versions. At the applicative level, two pipelines are presented. The first one is for the analysis of RNA-Seq datasets, both transcriptomic and single-cell data, and is aimed at identifying genes that may be involved in biological processes (e.g., the transition of tissues from normal to cancer). In this project, an R package is released on CRAN to make the pipeline accessible to the bioinformatic Community through high-level APIs. The second pipeline is in the drug discovery domain and is useful for identifying druggable pockets, namely regions of a protein with a high probability of accepting a small molecule (a drug). Both these pipelines achieve remarkable results. Lastly, a detour application is developed to identify the strengths/limitations of the “Principal Path” algorithm by analyzing Convolutional Neural Networks induced vector spaces. This application is conducted in the music and visual arts domains

    Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries

    Get PDF
    This two-volume set LNCS 12962 and 12963 constitutes the thoroughly refereed proceedings of the 7th International MICCAI Brainlesion Workshop, BrainLes 2021, as well as the RSNA-ASNR-MICCAI Brain Tumor Segmentation (BraTS) Challenge, the Federated Tumor Segmentation (FeTS) Challenge, the Cross-Modality Domain Adaptation (CrossMoDA) Challenge, and the challenge on Quantification of Uncertainties in Biomedical Image Quantification (QUBIQ). These were held jointly at the 23rd Medical Image Computing for Computer Assisted Intervention Conference, MICCAI 2020, in September 2021. The 91 revised papers presented in these volumes were selected form 151 submissions. Due to COVID-19 pandemic the conference was held virtually. This is an open access book

    Generalised temporal network inference

    Get PDF
    openNetwork inference is becoming increasingly central in the analysis of complex phenomena as it allows to obtain understandable models of entities interactions. Among the many possible graphical models, Markov Random Fields are widely used as they are strictly connected to a probability distribution assumption that allow to model a variety of different data. The inference of such models can be guided by two priors: sparsity and non-stationarity. In other words, only few connections are necessary to explain the phenomenon under observation and, as the phenomenon evolves, the underlying connections that explain it may change accordingly. This thesis contains two general methods for the inference of temporal graphical models that deeply rely on the concept of temporal consistency, i.e., the underlying structure of the system is similar (i.e., consistent) in time points that model the same behaviour (i.e., are dependent). The first contribution is a model that allows to be flexible in terms of probability assumption, temporal consistency, and dependency. The second contribution studies the previously introduces model in the presence of Gaussian partially un-observed data. Indeed, it is necessary to explicitly tackle the presence of un-observed data in order to avoid introducing misrepresentations in the inferred graphical model. All extensions are coupled with fast and non-trivial minimisation algorithms that are extensively validate on synthetic and real-world data. Such algorithms and experiments are implemented in a large and well-designed Python library that comprehends many tools for the modelling of multivariate data. Lastly, all the presented models have many hyper-parameters that need to be tuned on data. On this regard, we analyse different model selection strategies showing that a stability-based approach performs best in presence of multi-networks and multiple hyper-parameters.openXXXII CICLO - INFORMATICA E INGEGNERIA DEI SISTEMI/ COMPUTER SCIENCE AND SYSTEMS ENGINEERING - InformaticaTozzo, Veronic
    corecore