
    Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

    Over the decades, statistical learning techniques such as supervised learning, unsupervised learning, and dimension reduction have played groundbreaking roles in biomedical research. More recently, multi-omics data integration analysis has become increasingly popular as a way to answer otherwise intractable biomedical questions, to improve statistical power by exploiting large sample sizes and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks for tackling practical problems in multi-omics data integration analysis. Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis, or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a robust, simple rank-based procedure to identify rank-altered gene pairs in case/control classes. TSP usually suffers greatly reduced accuracy in inter-study prediction (i.e., when the prediction model is built on a training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies. One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group-structured integrative clustering method that incorporates a sparse overlapping group lasso technique and tight clustering via regularization to integrate inter-omics regulation flow and to encourage outlier samples to scatter away from tight clusters. We show on two real examples and simulated data that the proposed methods improve existing integrative clustering in clustering accuracy and biological interpretation, and are able to generate coherent tight clusters. Principal component analysis (PCA) is commonly used for projection to a low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks for PCA (Meta-PCA) that analyze multiple high-dimensional studies in a common principal component space. Meta-PCA identifies the meta principal component (Meta-PC) space (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data show that Meta-PCA identifies the true principal component space well and remains robust to noise features and outlier samples. We also propose sparse Meta-PCA, which penalizes principal components in order to selectively accommodate significant principal component projections. In several simulated and real data applications, we found Meta-PCA efficient at detecting significant transcriptomic features and at recognizing visual patterns in multi-omics data sets. In the future, the success of data integration analysis will play an important role in revealing the molecular and cellular processes underlying multiple data types, and will facilitate disease subtype discovery and characterization, improving hypothesis generation towards precision medicine and potentially advancing public health research.
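    As a hedged illustration of the rank-based rule the first part builds on: classic TSP scores each gene pair (i, j) by how differently gene i ranks below gene j in cases versus controls, and picks the pair with the largest difference. The sketch below is a minimal reimplementation of that scoring, not the authors' MetaTSP code; the toy data and variable names are assumptions.

```python
import numpy as np

def tsp_scores(X, y):
    """Top scoring pair (TSP): score each gene pair (i, j) by
    |P(X_i < X_j | case) - P(X_i < X_j | control)|.
    X: samples x genes expression matrix; y: binary labels (0/1)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    case, ctrl = X[y == 1], X[y == 0]
    n_genes = X.shape[1]
    scores = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            p_case = np.mean(case[:, i] < case[:, j])
            p_ctrl = np.mean(ctrl[:, i] < ctrl[:, j])
            scores[i, j] = abs(p_case - p_ctrl)
    return scores

# Toy example (hypothetical data): 6 samples x 4 genes, labels 0/1.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
scores = tsp_scores(X, y)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(f"top scoring pair: genes {i} and {j}, score {scores[i, j]:.2f}")
```

    Because the rule depends only on within-sample gene ranks, it is invariant to monotone normalization differences between studies, which is the property a cross-study extension like MetaTSP can exploit.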

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
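    As a minimal, hypothetical illustration of one of those five challenges (class imbalance), the sketch below reweights classes inversely to their frequency in a plain logistic regression; the simulated data, feature counts, and signal strength are assumptions for illustration, not material from the review.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 500 samples, 20 features, ~10% positives
# (e.g., a rare disease subtype in a clinical-omics cohort).
rng = np.random.default_rng(0)
y = (rng.random(500) < 0.1).astype(int)
X = rng.normal(size=(500, 20)) + y[:, None] * 0.5  # weak class signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" rescales each class by n_samples / (2 * n_class),
# so mistakes on the rare class cost more during fitting.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```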

    A primer on correlation-based dimension reduction methods for multi-omics analysis

    Continuing advances in omic technologies mean that it is now feasible to measure the numerous features that collectively reflect the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network-based methods, applicable when three or more omic datasets are available, which complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunity of population multi-omics.
    Comment: 30 pages, 2 figures, 6 tables
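    The pairs-of-datasets methods the primer surveys are typified by canonical correlation analysis (CCA). As a hedged sketch (the primer itself links to R packages; here scikit-learn's CCA stands in, and the paired toy matrices are assumptions), two omics blocks measured on the same samples are projected onto maximally correlated components:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical paired omics on the same 100 samples:
# X = transcriptomics (200 genes), Y = metabolomics (50 metabolites),
# sharing a 2-dimensional latent biological signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 200)) + rng.normal(size=(100, 200))
Y = latent @ rng.normal(size=(2, 50)) + rng.normal(size=(100, 50))

# CCA finds pairs of projections maximizing correlation across blocks.
cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)
for k in range(2):
    r = np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.2f}")
```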

    Network inference from sparse single-cell transcriptomics data: Exploring, exploiting, and evaluating the single-cell toolbox

    Large-scale transcriptomics studies have revolutionised the fields of systems biology and medicine, allowing researchers to generate deeper mechanistic insights into biological pathways and molecular functions. However, conventional bulk RNA-sequencing results in the analysis of an averaged signal from many input cells, which are homogenised during the experimental procedure. Hence, those insights represent only a coarse-grained picture, potentially missing information from rare or unidentified cell types. Allowing for an unprecedented level of resolution, single-cell transcriptomics may help to identify and characterise new cell types, unravel developmental trajectories, and facilitate inference of cell type-specific networks. Despite all these tempting promises, there is one main limitation that currently hampers many downstream tasks: single-cell RNA-sequencing data is characterised by a high degree of sparsity. Due to this limitation, network inference tools have so far been unable to reliably disentangle the hidden information in single-cell data. Single-cell correlation networks likely hold previously masked information and could yield new insights into cell type-specific regulation. To harness the potential of single-cell transcriptomics data, this dissertation sought to evaluate the influence of data dropout on network inference and how it might be alleviated. Two premises must be met to fulfil the promise of cell type-specific networks: (I) cell type annotation and (II) reliable network inference. Since any experimentally generated scRNA-seq data is associated with an unknown degree of dropout, a benchmarking framework was set up using a synthetic gold-standard data set, which was subsequently perturbed with different defined degrees of dropout. Aiming to desparsify the dropout-afflicted data, the influence of various imputation tools on the network structure was further evaluated. The results highlighted that, for moderate dropout levels, a deep count autoencoder (DCA) was able to outperform the other tools and the unimputed data. To fulfil the premise of cell type annotation, the impact of data imputation on cell-cell correlations was investigated using a human retina organoid data set. The results highlighted that no imputation tool interfered with cell cluster annotation. Based on the encouraging results of the benchmarking analysis, a window of opportunity was identified that allows for meaningful network inference from imputed single-cell RNA-seq data. Therefore, the inference of cell type-specific networks subsequent to DCA imputation was evaluated in a human retina organoid data set. To understand the differences and commonalities of cell type-specific networks, these networks were analysed for cones and rods, two closely related photoreceptor cell types of the retina. Comparing the importance of marker genes for rods and cones between their respective cell type-specific networks showed that these genes were of high importance, i.e. had hub-gene-like properties, in one module of the corresponding network but were of less importance in the opposing network. Furthermore, it was analysed how many hub genes preserved their status across cell type-specific networks and whether they associate with similar or diverging sub-networks. While a set of preserved hub genes was identified, a few were linked to completely different network structures.
    One candidate was EIF4EBP1, a eukaryotic translation initiation factor binding protein, which is associated with a retinal pathology called age-related macular degeneration (AMD). These results suggest that, given well-defined prerequisites, data imputation via DCA can indeed facilitate cell type-specific network inference, delivering promising biological insights. Returning to AMD, a major cause of central vision loss in patients older than 65: neither the precise mechanisms of pathogenesis nor treatment options are at hand. However, light can be shed on this disease through organoid model systems, since they resemble the in vivo organ composition while reducing complexity and ethical concerns. Therefore, a recently developed human retina organoid system (HRO) was investigated using the single-cell toolbox to evaluate whether it provides a useful basis for studying the onset and progression of AMD in the future. In particular, different workflows for a robust and in-depth annotation of cell types were used, including literature-based and transfer learning approaches. These showed that the organoid system reproduces hallmarks of a more central retina, an important determinant of AMD pathogenesis. Also, using trajectory analysis, it could be detected that the organoids in part reproduce major developmental hallmarks of the retina, but that different HRO samples exhibited developmental differences pointing to different degrees of maturation. Altogether, this analysis provided a deep characterisation of a human retinal organoid system, revealing in vivo-like outcomes and features as well as pinpointing discrepancies. These results could be used to refine culture conditions during organoid differentiation to optimise its utility as a disease model. In summary, this dissertation describes a workflow that, in contrast to the current state of the art in the literature, enables the inference of cell type-specific gene regulatory networks. The thesis illustrated that such networks indeed differ even between closely related cells. Thus, single-cell transcriptomics can yield unprecedented insights into previously inaccessible cell regulatory principles, particularly for rare cell types that are hardly reflected in bulk-derived RNA-seq data.
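    A minimal sketch of the kind of correlation-based network inference evaluated here: compute gene-gene Spearman correlations on an (imputed) expression matrix and keep edges above a cutoff. The cutoff, gene count, and toy counts are assumptions; the dissertation itself works with DCA-imputed data and dedicated benchmarks rather than this toy pipeline.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical (imputed) expression matrix: 300 cells x 20 genes.
rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(300, 20)).astype(float)
expr[:, 1] = expr[:, 0] + rng.normal(scale=0.5, size=300)  # correlated pair

# Spearman is rank-based, hence robust to the skewed scale of count data.
rho, _ = spearmanr(expr)  # 20 x 20 gene-gene correlation matrix

# Keep edges whose |rho| exceeds an (assumed) threshold.
threshold = 0.5
adj = (np.abs(rho) > threshold) & ~np.eye(20, dtype=bool)
edges = list(zip(*np.where(np.triu(adj, 1))))
print(f"{len(edges)} edges with |rho| > {threshold}: {edges}")
```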

    Deep Learning for Genomics: A Concise Overview

    Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics poses unique challenges to deep learning, since we expect from it a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective, so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, and point out potential opportunities and obstacles for future genomics applications.
    Comment: Invited chapter for the Springer book Handbook of Deep Learning Applications
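    As a hedged sketch of the kind of task-specific architecture such overviews discuss, the toy model below runs a one-dimensional convolution over one-hot-encoded DNA, where the first layer's filters act as learnable motif scanners; the layer sizes, sequence length, and output head are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

def one_hot(seq: str) -> torch.Tensor:
    """One-hot encode DNA: length-L string -> tensor of shape (4, L)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for pos, base in enumerate(seq):
        x[idx[base], pos] = 1.0
    return x

# Tiny CNN: the conv filters play the role of learnable sequence motifs.
model = nn.Sequential(
    nn.Conv1d(in_channels=4, out_channels=16, kernel_size=8),  # motif scan
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),  # "was this motif seen anywhere?" pooling
    nn.Flatten(),
    nn.Linear(16, 1),         # e.g., score that the region is regulatory
    nn.Sigmoid(),
)

# Hypothetical batch of two 50-bp sequences.
batch = torch.stack([one_hot("ACGT" * 12 + "AC"), one_hot("TTGA" * 12 + "GG")])
print(model(batch).shape)  # torch.Size([2, 1])
```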

    DEVELOPMENT OF BIOINFORMATICS TOOLS AND ALGORITHMS FOR IDENTIFYING PATHWAY REGULATORS, INFERRING GENE REGULATORY RELATIONSHIPS AND VISUALIZING GENE EXPRESSION DATA

    In the era of genetics and genomics, the advent of big data is transforming the field of biology into a data-intensive discipline. Novel computational algorithms and software tools are in demand to address the data analysis challenges in this growing field. This dissertation comprises the development of a novel algorithm, web-based data analysis tools, and a data visualization platform. The Triple Gene Mutual Interaction (TGMI) algorithm, presented in Chapter 2, is an innovative approach for identifying key regulatory transcription factors (TFs) that govern a particular biological pathway or process, based on interactions within triple gene blocks, each consisting of a pair of pathway genes and a TF. The identification of key TFs controlling a biological pathway or process allows biologists to understand the complex regulatory mechanisms in living organisms. TF-Miner, presented in Chapter 3, is a high-throughput gene expression data analysis web application that was developed by integrating two highly efficient algorithms: TF-Cluster and TF-Finder. TF-Cluster can be used to obtain collaborative TFs that coordinately control a biological pathway or process using genome-wide expression data. TF-Finder, on the other hand, can identify regulatory TFs involved in or associated with a specific biological pathway or process using Adaptive Sparse Canonical Correlation Analysis (ASCCA). Chapter 4 presents ExactSearch, a suffix-tree-based motif search algorithm implemented as a web-based tool. This tool can identify the locations of a set of motif sequences in a set of target promoter sequences. ExactSearch also provides functionality to search for motif sequences in flanking regions from 50 plant genomes, which have been incorporated into the web tool. Chapter 5 presents STTM JBrowse, a web-based RNA-Seq data visualization system built on the JBrowse open-source platform. STTM JBrowse is a unified repository for sharing and producing visualizations created from large RNA-Seq datasets generated from a variety of model and crop plants in which miRNAs were destroyed using Short Tandem Target Mimic (STTM) technology.
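    The motif-search task from Chapter 4 is easy to show in miniature. ExactSearch builds a suffix tree for speed; as a hedged stand-in, the sketch below finds all occurrences of each motif in each promoter with plain substring scanning, which gives the same answers on small inputs. The motifs and promoter sequences are hypothetical.

```python
def find_motif(promoter: str, motif: str) -> list[int]:
    """Return all 0-based start positions of motif in promoter
    (overlapping occurrences included)."""
    hits, start = [], promoter.find(motif)
    while start != -1:
        hits.append(start)
        start = promoter.find(motif, start + 1)
    return hits

# Hypothetical promoters and motifs (e.g., a TATA-box-like motif).
promoters = {"geneA": "GGGTATAAAGGCTATAAA", "geneB": "CCCGGGTTTACG"}
motifs = ["TATAAA", "GGG"]

for name, seq in promoters.items():
    for motif in motifs:
        print(name, motif, find_motif(seq, motif))
```

    A suffix tree makes repeated queries fast after one preprocessing pass over the sequences, which matters at the scale of 50 plant genomes.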