122 research outputs found

    Structured sparse CCA for brain imaging genetics via graph OSCAR

    Recently, structured sparse canonical correlation analysis (SCCA) has received increased attention in brain imaging genetics studies. It can identify bi-multivariate imaging genetic associations as well as select relevant features with desired structure information. These SCCA methods either use the fused lasso regularizer to induce smoothness between ordered features, or use the signed pairwise difference, which depends on the estimated sign of the sample correlation. In addition, several other structured SCCA models use the group lasso or graph fused lasso to encourage group structure, but they require the structure or group information to be provided in advance, which is sometimes not available.

    Identifying Associations Between Brain Imaging Phenotypes and Genetic Factors via A Novel Structured SCCA Approach

    Brain imaging genetics is attracting increasing attention because it can reveal associations between genetic factors and the structures or functions of the human brain. Sparse canonical correlation analysis (SCCA) is a powerful bi-multivariate association identification technique in imaging genetics. Many SCCA methods have been proposed to capture different types of structured imaging genetic relationships. These methods either use the group lasso to recover the group structure, or employ the graph/network guided fused lasso to find the network structure. However, the group lasso methods generalize poorly because prior knowledge is often incomplete or unavailable in practice. The graph/network guided methods are sensitive to the sign of the sample correlation, which may be incorrectly estimated. We introduce a new SCCA model using a novel graph guided pairwise group lasso penalty, and propose an efficient optimization algorithm. The proposed method has a strong upper bound for the grouping effect for both positively and negatively correlated variables. We show that our method performs as well as or better than two state-of-the-art SCCA methods on both synthetic and real neuroimaging genetics data. In particular, our method identifies stronger canonical correlations and captures better canonical loading profiles, showing its promise for revealing biologically meaningful imaging genetic associations.

    Structured Sparse Methods for Imaging Genetics

    Imaging genetics is an emerging and promising technique that investigates how genetic variations affect brain development, structure, and function. By exploiting disorder-related neuroimaging phenotypes, this class of studies provides a novel direction to reveal and understand the complex genetic mechanisms. Imaging genetics studies are often challenging because of the relatively small number of subjects and the extremely high dimensionality of both imaging data and genomic data. In this dissertation, I carry on my research on imaging genetics with particular focus on two tasks: building predictive models between neuroimaging data and genomic data, and identifying disorder-related genetic risk factors through image-based biomarkers. To this end, I consider a suite of structured sparse methods, which can produce interpretable models and are robust to overfitting, for imaging genetics. With carefully designed sparsity-inducing regularizers, different biological priors are incorporated into learning models. More specifically, in the Allen brain image-gene expression study, I adopt an advanced sparse coding approach for image feature extraction and employ a multi-task learning approach for multi-class annotation. Moreover, I propose a label structure-based two-stage learning framework, which utilizes the hierarchical structure among labels, for multi-label annotation. In the Alzheimer's Disease Neuroimaging Initiative (ADNI) imaging genetics study, I employ Lasso together with EDPP (enhanced dual polytope projections) screening rules to quickly identify Alzheimer's disease risk SNPs. I also adopt the tree-structured group Lasso with MLFre (multi-layer feature reduction) screening rules to incorporate linkage disequilibrium information into modeling. Moreover, I propose a novel absolute fused Lasso model for ADNI imaging genetics. This method utilizes SNP spatial structure and is robust to the choice of reference alleles of genotype coding. In addition, I propose a two-level structured sparse model that incorporates gene-level networks through a graph penalty into SNP-level model construction. Lastly, I explore a convolutional neural network approach for accurately predicting Alzheimer's disease related imaging phenotypes. Experimental results on real-world imaging genetics applications demonstrate the efficiency and effectiveness of the proposed structured sparse methods. Dissertation/Thesis. Doctoral Dissertation, Computer Science, 201

    Sparse multivariate models for pattern detection in high-dimensional biological data

    Recent advances in technology have made it possible and affordable to collect biological data of unprecedented size and complexity. While analysing such data, traditional statistical methods and machine learning algorithms suffer from the curse of dimensionality. Parsimonious models, which may refer to parsimony in model structure and/or model parameters, have been shown to improve both biological interpretability of the model and the generalisability to new data. In this thesis we are concerned with model selection in both supervised and unsupervised learning tasks. For supervised learning, we propose a new penalty called graph-guided group lasso (GGGL) and employ this penalty in penalised linear regressions. GGGL is able to integrate prior structured information with data mining, where variables sharing similar biological functions are collected into groups and the pairwise relatedness between groups is organised into a network. Such prior information guides the selection of variables that are predictive of a univariate response, so that the model selects variable groups that are close in the network and important variables within the selected groups. We then generalise the idea of incorporating network-structured prior knowledge to association studies consisting of multivariate predictors and multivariate responses and propose the network-driven sparse reduced-rank regression (NsRRR). In NsRRR, pairwise relatedness between predictors and between responses is represented by two networks, and the model identifies associations between a subnetwork of predictors and a subnetwork of responses such that both subnetworks tend to be connected. For unsupervised learning, we are concerned with a multi-view learning task in which we compare the variance of high-dimensional biological features collected from multiple sources, which are referred to as “views”. We propose the sparse multi-view matrix factorisation (sMVMF), which is parsimonious in both model structure and model parameters. sMVMF can identify latent factors that regulate the variability shared across all views and the variability that is characteristic of a specific view, respectively. For each novel method, we also present simulation studies and an application on real biological data to illustrate variable selection and model interpretability perspectives. Open Access

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

    IUPUI Open Access Publishing Fund: July 2013 – June 2018


    IUPUI Open Access Publishing Fund: July 2013 – June 2020


    Multi-Objective Optimization in Metabolomics/Computational Intelligence

    The development of reliable computational models for detecting non-linear patterns encased in throughput datasets and characterizing them into phenotypic classes has been of particular interest and comprises dynamic studies in metabolomics and other disciplines that are encompassed within the omics science. Some of the clinical conditions that have been associated with these studies include metabotypes in cancer, inflammatory bowel disease (IBD), asthma, diabetes, traumatic brain injury (TBI), metabolic syndrome, and Parkinson's disease, just to mention a few. The traction in this domain is attributable to the advancements in the procedures involved in 1H NMR-linked datasets acquisition, which have fuelled the generation of a wide abundance of datasets. Throughput datasets generated by modern 1H NMR spectrometers are often characterized with features that are uninformative, redundant and inherently correlated. This renders it difficult for conventional multivariate analysis techniques to efficiently capture important signals and patterns. Therefore, the work covered in this research thesis provides novel alternative techniques to address the limitations of current analytical pipelines. This work delineates 13 variants of population-based nature inspired metaheuristic optimization algorithms which were further developed in this thesis as wrapper-based feature selection optimizers. The optimizers were then evaluated and benchmarked against each other through numerical experiments. Large-scale 1H NMR-linked datasets emerging from three disease studies were employed for the evaluations. The first is a study in patients diagnosed with Malan syndrome; an autosomal dominant inherited disorder marked by a distinctive facial appearance, learning disabilities, and gigantism culminating in tall stature and macrocephaly, also referred to as cerebral gigantism. Another study involved Niemann-Pick Type C1 (NP-C1), a rare progressive neurodegenerative condition marked by intracellular accrual of cholesterol and complex lipids including sphingolipids and phospholipids in the endosomal/lysosomal system. The third study involved sore throat investigation in human (also known as 'pharyngitis'); an acute infection of the upper respiratory tract that affects the respiratory mucosa of the throat. In all three cases, samples from pathologically-confirmed cohorts with corresponding controls were acquired, and metabolomics investigations were performed using 1H NMR technique. Thereafter, computational optimizations were conducted on all three high-dimensional datasets that were generated from the disease studies outlined, so that key biomarkers and most efficient optimizers were identified in each study. The clinical and biochemical significance of the results arising from this work were discussed and highlighted.