
    Vertical integration of multiple high-dimensional datasets

    Research in genomics and related fields now often requires the analysis of "multi-block" data, in which multiple high-dimensional types of data are available for a common set of objects. We introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of multi-block datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across datatypes, low-rank approximations for structured variation individual to each datatype, and residual noise. JIVE quantifies the amount of joint variation between datatypes, reduces the dimensionality of the data, and allows for the visual exploration of joint and individual structure. JIVE is an extension of Principal Components Analysis and has clear advantages over popular two-block methods such as Canonical Correlation and Partial Least Squares. Research in a number of fields also requires the analysis of "multi-way" data. Multi-way data take the form of a three (or higher) dimensional array. We compare several existing factorization methods for multi-way data, and we show that these methods belong to the same unified framework. The final portion of this dissertation concerns biclustering. We introduce an approach to biclustering a binary data matrix, and discuss the application of biclustering to classification problems
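The three-term decomposition described above can be illustrated with plain truncated SVDs: joint structure is estimated from the concatenated blocks, and individual structure from each block's residual. This is a simplified sketch, not the actual JIVE algorithm (which selects ranks and enforces orthogonality between joint and individual terms); the simulated blocks and the rank-1 choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 50, 40, 30

# Simulated two-block data: one score shared by both blocks (joint variation)
# and one block-specific score each (individual variation), plus noise.
joint_score = rng.normal(size=(n, 1))
ind1, ind2 = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))
X1 = joint_score @ rng.normal(size=(1, p1)) + ind1 @ rng.normal(size=(1, p1)) + 0.1 * rng.normal(size=(n, p1))
X2 = joint_score @ rng.normal(size=(1, p2)) + ind2 @ rng.normal(size=(1, p2)) + 0.1 * rng.normal(size=(n, p2))

def low_rank(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

# Joint term: low-rank structure of the concatenated blocks (rank 1 here).
J = low_rank(np.hstack([X1, X2]), 1)
J1, J2 = J[:, :p1], J[:, p1:]

# Individual terms: low-rank structure of each block's residual.
A1 = low_rank(X1 - J1, 1)
A2 = low_rank(X2 - J2, 1)

# Each block then decomposes as X_i = J_i + A_i + noise.
print(np.linalg.norm(X1 - J1 - A1) / np.linalg.norm(X1))
```

Because the joint score drives both blocks while each individual score drives only one, the leading direction of the concatenated matrix tends to capture the shared variation, which is the intuition behind the joint term.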

    Mass spectral imaging of clinical samples using deep learning

    A better interpretation of tumour heterogeneity and variability is vital for the improvement of novel diagnostic techniques and personalized cancer treatments. Tumour tissue heterogeneity is characterized by biochemical heterogeneity, which can be investigated by unsupervised metabolomics. Mass Spectrometry Imaging (MSI) combined with machine learning techniques has generated increasing interest as an analytical and diagnostic tool for the analysis of spatial molecular patterns in tissue samples. Considering the high complexity of data produced by MSI, which can consist of many thousands of spectral peaks, statistical analysis, and in particular machine learning and deep learning, have been investigated as novel approaches to deduce the relationships between the measured molecular patterns and the local structural and biological properties of the tissues. Machine learning methods have historically been divided into two main categories: supervised and unsupervised learning. In MSI, supervised learning methods may be used to segment tissues into histologically relevant areas, e.g. the classification of tissue regions in H&E (Haematoxylin and Eosin) stained samples. Initial classification by an expert histopathologist through visual inspection enables the development of univariate or multivariate models, based on tissue regions that have significantly up- or down-regulated ions. However, complex data may result in underdetermined models, and alternative methods that can cope with high dimensionality and noisy data are required. Here, we describe, apply, and test a novel diagnostic procedure built using a combination of MSI and deep learning with the objective of delineating and identifying biochemical differences between cancerous and non-cancerous tissue in metastatic liver cancer and epithelial ovarian cancer. 
The workflow investigates the robustness of single (1D) to multidimensional (3D) tumour analyses and also highlights possible biomarkers which are not accessible from classical visual analysis of the H&E images. The identification of key molecular markers may provide a deeper understanding of tumour heterogeneity and potential targets for intervention.
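The supervised route described above, classifying expert-annotated tissue regions from their peak intensities, can be sketched with a plain linear classifier on simulated spectra. The pixel counts, the handful of shifted "discriminative ion" peaks, and the use of logistic regression here are illustrative assumptions, not the deep learning workflow of the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n_pixels, n_peaks = 300, 50

# Simulated MSI pixels: "tumour" pixels (label 1) have a few up-regulated
# peak intensities; everything else is noise.
labels = rng.integers(0, 2, n_pixels)
spectra = rng.normal(size=(n_pixels, n_peaks))
spectra[labels == 1, :5] += 1.5  # hypothetical discriminative ions

# Train on annotated pixels, evaluate on held-out pixels.
X_tr, X_te, y_tr, y_te = train_test_split(
    spectra, labels, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```

In real MSI data the number of peaks can far exceed the number of annotated pixels, which is exactly the underdetermined regime the abstract warns about and the motivation for the deep learning approach.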

    Machine Learning/Deep Learning in Medical Image Processing

    Many recent studies on medical image processing have involved the use of machine learning (ML) and deep learning (DL). This special issue, “Machine Learning/Deep Learning in Medical Image Processing”, has been launched to provide an opportunity for researchers in the area of medical image processing to highlight recent developments made in their fields with ML/DL. Seven excellent papers that cover a wide variety of medical and clinical aspects have been selected for this special issue

    Assisted Network Analysis in Cancer Genomics

    Cancer is a molecular disease. In the past two decades, we have witnessed a surge of high-throughput profiling in cancer research and a corresponding development of high-dimensional statistical techniques. In this dissertation, the focus is on gene expression, which has played a uniquely important role in cancer research. Compared to some other types of molecular measurements, for example, DNA changes, gene expressions are “closer” to cancer outcomes. In addition, processed gene expression data have good statistical properties, in particular, continuity. In “early” cancer gene expression data analysis, attention was on marginal properties such as mean and variance. Genes function in a coordinated way. As such, techniques that take a systems perspective have been developed to also take into account the interconnections among genes. Among such techniques, graphical models, with lucid biological interpretations and satisfactory statistical properties, have attracted special attention. Graphical model-based analysis can not only lead to a deeper understanding of genes’ properties but also serve as a basis for other analyses, for example, regression and clustering. Cancer molecular studies usually have limited sample sizes, while in graphical model-based analysis the number of parameters to be estimated grows quadratically with the number of genes. Combined, these factors lead to a serious lack of information. The overarching goal of this dissertation is to conduct more effective graphical model analysis for cancer gene expression studies. One literature review and three methodological projects have been conducted. The overall strategy is to borrow strength from additional information so as to assist gene expression graphical model estimation. In the first chapter, the literature review is conducted. The methods developed in Chapter 2 and Chapter 4 take advantage of information on regulators of gene expressions (such as methylation, copy number variation, microRNA, and others). 
As they belong to the vertical data integration framework, we first provide a review of such data integration for gene expression data in Chapter 1. Additionally, graphical model-based analysis for gene expression data is reviewed. Research reported in this chapter has led to a paper published in Briefings in Bioinformatics. In Chapters 2-4, to accommodate the extreme complexity of information-borrowing for graphical models, three different approaches have been proposed. In Chapter 2, two graphical models, a gene-expression-only one and a gene-expression-regulator one, are simultaneously considered. A biologically sensible hierarchy between the sparsity structures of these two networks is developed, which is the first of its kind. This hierarchy is then used to link the estimation of the two graphical models. This work has led to a paper published in Genetic Epidemiology. In Chapter 3, additional information is mined from published literature, for example, studies deposited at PubMed. The consideration is that published studies have been based on many independent experiments and can contain valuable information on genes’ interconnections. The challenge is to recognize that such information can be partial or even wrong. A two-step approach, consisting of information-guided and information-incorporated estimations, is developed. This work has led to a paper published in Biometrics. In Chapter 4, we slightly shift attention and examine differences in graphs, which have important implications for understanding cancer development and progression. Our strategy is to link changes in gene expression graphs with those in regulator graphs, which provides additional information for estimation. It is noted that, to make individual chapters stand-alone, there can be minor overlap in descriptions. All methodological developments in this research fit the advanced penalization paradigm, which has been popular for cancer gene expression and other molecular data analysis. 
This methodological coherence is highly desirable. For the methods described in Chapters 2-4, we have developed new penalized estimations which have lucid interpretations and can directly lead to variable selection (and so sparse and interpretable graphs). We have also developed effective computational algorithms and R code, which have been made publicly available at Dr. Shuangge Ma’s GitHub software repository. For the methods described in Chapters 2 and 3, statistical properties under ultrahigh-dimensional settings and mild regularity conditions have been established, providing the proposed methods with a uniquely strong theoretical grounding. Statistical properties for the method developed in Chapter 4 are relatively straightforward and hence are omitted. For all the proposed methods, we have conducted extensive simulations, comparisons with the most relevant competitors, and data analysis. The practical advantages are fully established. Overall, this research has delivered a practically sensible information-incorporating strategy for improving graphical model-based analysis of cancer gene expression data, multiple highly competitive methods, R programs that can have broad utilization, and new findings for multiple cancer types
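The penalized graphical model estimation that this line of work builds on can be illustrated with the standard graphical lasso, an l1-penalized estimator of a sparse precision matrix whose non-zero off-diagonal entries define the gene network's edges. The chain-graph simulation and the penalty value below are assumptions for illustration, not the dissertation's information-incorporating estimators.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
n, p = 200, 10

# Ground truth: a sparse precision matrix encoding a chain graph
# (gene i is conditionally dependent only on genes i-1 and i+1).
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.4
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# l1-penalized precision estimation; a larger alpha gives a sparser graph.
model = GraphicalLasso(alpha=0.1).fit(X)
edges = np.abs(model.precision_) > 1e-4
np.fill_diagonal(edges, False)
print(edges.sum() // 2, "edges recovered")
```

The penalty directly performs variable (edge) selection, which is the "lucid interpretation and sparse, interpretable graph" property the abstract highlights; the dissertation's methods additionally borrow strength from regulators or published literature when setting up the estimation.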

    Statistical tools for variable selection and the integration of “omics” data

    Recent advances in biotechnology allow the monitoring of large quantities of biological data of various types, such as genomics, proteomics, metabolomics and phenotypes, that are often characterized by a small number of samples or observations. The aim of this thesis was to develop, or adapt, appropriate statistical methodologies to analyse high-dimensional data, and to present efficient tools to biologists for selecting the most biologically relevant variables. In the first part, we focus on microarray data in a classification framework, and on the selection of discriminative genes. In the second part, in the context of data integration, we focus on the selection of different types of variables with two-block omics data. Firstly, we propose a wrapper method, which aggregates two classifiers (CART or SVM) to select genes that discriminate between two or more biological conditions. Secondly, we develop a PLS variant called sparse PLS that adapts l1 penalization and allows for the selection of a subset of variables, which are measured from the same biological samples. Either a regression or a canonical analysis framework is proposed to answer the biological question appropriately. We assess each of the proposed approaches by comparing them to similar methods known in the literature on numerous real datasets. The statistical criteria that we use are often limited by the small number of samples. We therefore always try to combine statistical assessments with a thorough biological interpretation of the results. The approaches that we propose are easy to apply and give relevant results that answer the biologists' needs
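A one-component sparse PLS of the kind described can be sketched as an l1-penalized power iteration on the cross-covariance matrix X'Y, where soft-thresholding zeroes out the loadings of uninformative variables. The relative thresholding rule and the simulated two-block data below are illustrative assumptions, not the exact algorithm of the thesis.

```python
import numpy as np

def soft_threshold(w, lam):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def sparse_pls_component(X, Y, frac=0.5, n_iter=50):
    """One sparse PLS component via penalized power iteration on M = X'Y.
    `frac` sets the soft threshold as a fraction of the largest entry."""
    M = X.T @ Y
    u = M[:, 0] / (np.linalg.norm(M[:, 0]) + 1e-12)
    v = np.zeros(M.shape[1])
    for _ in range(n_iter):
        w = M.T @ u
        v = soft_threshold(w, frac * np.abs(w).max())
        v /= np.linalg.norm(v) + 1e-12
        w = M @ v
        u = soft_threshold(w, frac * np.abs(w).max())
        u /= np.linalg.norm(u) + 1e-12
    return u, v

# Two omics blocks measured on the same samples; only the first few
# variables of each block carry the shared latent signal t.
rng = np.random.default_rng(2)
n = 100
t = rng.normal(size=(n, 1))
X = np.hstack([t + 0.1 * rng.normal(size=(n, 5)), rng.normal(size=(n, 20))])
Y = np.hstack([t + 0.1 * rng.normal(size=(n, 3)), rng.normal(size=(n, 10))])

u, v = sparse_pls_component(X, Y)
print("selected X variables:", np.flatnonzero(u))
print("selected Y variables:", np.flatnonzero(v))
```

The non-zero entries of u and v are the selected variables in each block, which is the joint variable selection across two omics tables that the thesis targets.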

    Meta-analysis of Incomplete Microarray Studies

    Meta-analysis of microarray studies to produce an overall gene list is relatively straightforward when complete data are available. When some studies lack information, providing only a ranked list of genes, for example, it is common to reduce all studies to ranked lists prior to combining them. Since this entails a loss of information, we consider a hierarchical Bayes approach to meta-analysis that uses different types of information from different studies: the full data matrix, summary statistics, or ranks. The model uses an informative prior for the parameter of interest to aid the detection of differentially expressed genes. Simulations show that the new approach can give substantial power gains compared to classical meta-analysis and list aggregation methods. A meta-analysis of 11 published ovarian cancer studies with different data types identifies genes known to be involved in ovarian cancer and shows significant enrichment, while controlling the number of false positives. Independence of genes is a common assumption in microarray data analysis, including in the model above, although it does not hold in practice. Indeed, genes are activated in groups called modules: sets of co-regulated genes. These modules are usually defined by biologists, based on the position of the genes on the chromosome or on known biological pathways (for example, KEGG or GO). Our goal in the second part of this work is to define modules common to several studies in an automatic way. We use an empirical Bayes approach to estimate a sparse correlation matrix common to all studies, and identify modules by clustering. Simulations show that our approach performs as well as or better than existing methods in terms of detection of modules across several datasets. We also develop a method based on extreme value theory to detect scattered genes, which do not belong to any module. This automatic module detection is very fast and produces accurate modules in our simulation studies. 
Application to real data results in a huge dimension reduction, which allows us to fit the hierarchical Bayesian model to modules without the computational burden. Differentially expressed modules identified by this analysis present significant enrichment, indicating promising results of the method for future applications
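As a point of reference for the classical meta-analysis that the Bayesian model is compared against, a fixed-effect combination of per-study z-scores for a single gene (Stouffer's weighted-z method) can be computed as follows; the z-scores and sample sizes are made up for illustration.

```python
import math
import numpy as np

def stouffer(z_scores, weights=None):
    """Stouffer's weighted-z method: combine per-study z-scores for one gene."""
    z = np.asarray(z_scores, float)
    w = np.ones_like(z) if weights is None else np.asarray(weights, float)
    return float((w * z).sum() / math.sqrt((w ** 2).sum()))

def z_to_p(z):
    """One-sided p-value for a combined z-score (standard normal tail)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Three studies report z-scores for the same gene; a common weighting
# choice is the square root of each study's sample size.
z_combined = stouffer([2.1, 1.8, 2.5],
                      weights=[math.sqrt(n) for n in (40, 25, 60)])
print(z_combined, z_to_p(z_combined))
```

Such methods require every study to supply a comparable summary statistic, which is exactly what incomplete studies (rank lists only) cannot provide, motivating the hierarchical model of the abstract.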

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example, transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and the unintentional learning of process-independent patterns can appear, which can lead to insignificant results in applications. A common way of counteracting this problem is the application of dimension reduction methods and subsequent analysis of the resulting low-dimensional representation, which has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concept of Dictionary learning, which is an unsupervised dimension reduction approach. Unlike many dimension reduction approaches that are widely applied for transcriptomic data analysis, Dictionary learning does not impose constraints on the components that are to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly structured, owing, for example, to the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has so far been mainly restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach. Interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes. 
Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups of samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods
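The core Dictionary learning step, representing each sample as a sparse combination of learned components ("atoms"), can be sketched with scikit-learn's standard implementation; the simulated expression matrix, the number of atoms, and the penalty strength are assumptions for illustration, not the thesis's tailored methods.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(3)
n_samples, n_genes, n_atoms = 60, 100, 5

# Simulated expression matrix: each sample sparsely mixes a few
# "pathway-like" atoms, plus measurement noise.
atoms = rng.normal(size=(n_atoms, n_genes))
codes = rng.normal(size=(n_samples, n_atoms)) * (rng.random((n_samples, n_atoms)) < 0.3)
X = codes @ atoms + 0.05 * rng.normal(size=(n_samples, n_genes))

# X is approximated as sparse_codes @ dictionary, where the l1 penalty
# (alpha) drives most code coefficients to zero.
dl = DictionaryLearning(n_components=n_atoms, alpha=0.5,
                        max_iter=100, random_state=0)
sparse_codes = dl.fit_transform(X)
print(sparse_codes.shape, dl.components_.shape)
```

The learned atoms (rows of `components_`) play the role of interpretable gene-level patterns, while the sparse codes give each sample a low-dimensional representation, the object that the thesis's subgroup identification and pseudotime estimation operate on.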

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work, we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop interesting and accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications, of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. 
We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of our thesis is also the analysis of large data compendia, as predictions about certain phenomena, such as associations between diseases or the involvement of genes in a certain phenotype, are only possible when dealing with large amounts of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make interesting observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets
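The shared-latent-factor idea underlying such collective models can be sketched as a minimal collective matrix factorization: two data views over the same entities are factorized with one shared sample-factor matrix, fitted by alternating ridge-regularized least squares. The simulated views, the rank, and the regularization are assumptions for illustration, not the thesis's factorization models.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p1, p2, k = 80, 30, 20, 4

# Two heterogeneous views of the same samples, driven by shared factors.
U_true = rng.normal(size=(n, k))
X1 = U_true @ rng.normal(size=(k, p1)) + 0.1 * rng.normal(size=(n, p1))
X2 = U_true @ rng.normal(size=(k, p2)) + 0.1 * rng.normal(size=(n, p2))

# Alternating least squares with one shared sample-factor matrix U.
U = rng.normal(size=(n, k))
lam = 1e-3
for _ in range(30):
    # Per-view loadings given U (ridge-regularized least squares).
    G = U.T @ U + lam * np.eye(k)
    V1 = np.linalg.solve(G, U.T @ X1)
    V2 = np.linalg.solve(G, U.T @ X2)
    # Shared factors given both views' loadings: this step fuses the views.
    H = V1 @ V1.T + V2 @ V2.T + lam * np.eye(k)
    U = np.linalg.solve(H, (X1 @ V1.T + X2 @ V2.T).T).T

err1 = np.linalg.norm(X1 - U @ V1) / np.linalg.norm(X1)
err2 = np.linalg.norm(X2 - U @ V2) / np.linalg.norm(X2)
print(err1, err2)
```

Because U appears in both reconstruction terms, information flows between the views through the shared factors, which is the mechanism that lets collective latent factor models retain the relational structure of a multi-source data system.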