122 research outputs found
Structured sparse CCA for brain imaging genetics via graph OSCAR
Recently, structured sparse canonical correlation analysis (SCCA) has received increased attention in brain imaging genetics studies. It can identify bi-multivariate imaging genetic associations as well as select relevant features with desired structure information. These SCCA methods either use the fused lasso regularizer to induce smoothness between ordered features, or use the signed pairwise difference, which depends on the estimated sign of the sample correlation. In addition, several other structured SCCA models use the group lasso or graph fused lasso to encourage group structure, but they require the structure/group information to be provided in advance, which is sometimes not available.
Identifying Associations Between Brain Imaging Phenotypes and Genetic Factors via A Novel Structured SCCA Approach
Brain imaging genetics is attracting growing attention because it can reveal associations between genetic factors and the structures or functions of the human brain. Sparse canonical correlation analysis (SCCA) is a powerful bi-multivariate association identification technique in imaging genetics. Many SCCA methods have been proposed to capture different types of structured imaging genetic relationships. These methods either use the group lasso to recover the group structure, or employ the graph/network guided fused lasso to find the network structure. However, the group lasso methods generalize poorly because prior knowledge is often incomplete or unavailable in practice. The graph/network guided methods are sensitive to the sign of the sample correlation, which may be incorrectly estimated. We introduce a new SCCA model using a novel graph guided pairwise group lasso penalty, and propose an efficient optimization algorithm. The proposed method has a strong upper bound for the grouping effect for both positively and negatively correlated variables. We show that our method performs as well as or better than two state-of-the-art SCCA methods on both synthetic and real neuroimaging genetics data. In particular, our method identifies stronger canonical correlations and captures better canonical loading profiles, showing its promise for revealing biologically meaningful imaging genetic associations.
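As a rough illustration of the SCCA idea described in the two abstracts above, the following Python sketch estimates one pair of sparse canonical weight vectors by alternating soft-thresholded updates on the sample cross-covariance. It uses a plain L1 penalty as a stand-in and is not the graph OSCAR or graph guided pairwise group lasso penalty proposed in these papers. The names `soft_threshold`, `sparse_cca`, `lam_u` and `lam_v` are assumptions made for this example.

```python
import numpy as np


def soft_threshold(a, lam):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def _unit(v):
    """Scale v to unit Euclidean norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def sparse_cca(X, Y, lam_u=0.1, lam_v=0.1, n_iter=100):
    """Return one pair of sparse canonical weights (u, v).

    Illustrative assumption: a plain L1 penalty with alternating
    soft-thresholded updates, not the structured penalties of the papers.
    X is (n, p), for example imaging features; Y is (n, q), for example SNPs.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y / X.shape[0]  # sample cross-covariance, shape (p, q)

    # Start from the leading right singular vector of C.
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[0]
    u = np.zeros(X.shape[1])

    for _ in range(n_iter):
        u = _unit(soft_threshold(C @ v, lam_u))
        v = _unit(soft_threshold(C.T @ u, lam_v))
    return u, v
```

Larger values of `lam_u` and `lam_v` zero out more weights; the canonical correlation of the selected features can then be checked with `np.corrcoef(X @ u, Y @ v)`.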
Structured Sparse Methods for Imaging Genetics
Imaging genetics is an emerging and promising technique that investigates how genetic variations affect brain development, structure, and function. By exploiting disorder-related neuroimaging phenotypes, this class of studies provides a novel direction to reveal and understand the complex genetic mechanisms. Imaging genetics studies are often challenging because of the relatively small number of subjects and the extremely high dimensionality of both imaging data and genomic data. In this dissertation, I pursue research on imaging genetics with a particular focus on two tasks: building predictive models between neuroimaging data and genomic data, and identifying disorder-related genetic risk factors through image-based biomarkers. To this end, I consider a suite of structured sparse methods for imaging genetics that can produce interpretable models and are robust to overfitting. With carefully designed sparsity-inducing regularizers, different biological priors are incorporated into learning models. More specifically, in the Allen brain image--gene expression study, I adopt an advanced sparse coding approach for image feature extraction and employ a multi-task learning approach for multi-class annotation. Moreover, I propose a label structure-based two-stage learning framework, which utilizes the hierarchical structure among labels, for multi-label annotation. In the Alzheimer's disease neuroimaging initiative (ADNI) imaging genetics study, I employ Lasso together with EDPP (enhanced dual polytope projections) screening rules to quickly identify Alzheimer's disease risk SNPs. I also adopt the tree-structured group Lasso with MLFre (multi-layer feature reduction) screening rules to incorporate linkage disequilibrium information into modeling. Moreover, I propose a novel absolute fused Lasso model for ADNI imaging genetics. This method utilizes SNP spatial structure and is robust to the choice of reference alleles of genotype coding.
In addition, I propose a two-level structured sparse model that incorporates gene-level networks through a graph penalty into SNP-level model construction. Lastly, I explore a convolutional neural network approach for accurately predicting Alzheimer's disease-related imaging phenotypes. Experimental results on real-world imaging genetics applications demonstrate the efficiency and effectiveness of the proposed structured sparse methods.
Dissertation/Thesis, Doctoral Dissertation, Computer Science 201
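The abstract above states that the absolute fused Lasso is robust to the choice of reference allele. A minimal Python sketch of why that can hold is given below, assuming the penalty compares the absolute values of neighbouring coefficients; the exact form used in the dissertation is not given in this listing, so both function names and the penalty formula are assumptions made for illustration. Recoding a SNP against the other reference allele flips the sign of its coefficient, which changes the ordinary fused penalty but leaves the absolute version unchanged.

```python
def fused_lasso_penalty(beta):
    """Sum of |beta[i+1] - beta[i]| over neighbouring coefficients."""
    return sum(abs(b - a) for a, b in zip(beta, beta[1:]))


def absolute_fused_lasso_penalty(beta):
    """Sum of ||beta[i+1]| - |beta[i]|| over neighbouring coefficients.

    Assumed form for illustration only: it penalises differences in
    magnitude, so flipping the sign of any coefficient has no effect.
    """
    return sum(abs(abs(b) - abs(a)) for a, b in zip(beta, beta[1:]))


# Hypothetical coefficients for four neighbouring SNPs.
beta = [0.5, 0.5, 0.4, 0.0]
# Same effects, but the second SNP is coded against the other reference allele.
flipped = [0.5, -0.5, 0.4, 0.0]

print(fused_lasso_penalty(beta), fused_lasso_penalty(flipped))
print(absolute_fused_lasso_penalty(beta), absolute_fused_lasso_penalty(flipped))
```

The first printed pair differs, while the second pair is identical, which is the sign invariance the abstract refers to.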
Sparse multivariate models for pattern detection in high-dimensional biological data
Recent advances in technology have made it possible and affordable to collect biological
data of unprecedented size and complexity. While analysing such data, traditional statistical
methods and machine learning algorithms suffer from the curse of dimensionality.
Parsimonious models, which may refer to parsimony in model structure and/or model parameters, have been shown to improve both biological interpretability of the model and the
generalisability to new data.
In this thesis we are concerned with model selection in both supervised and unsupervised
learning tasks. For supervised learning, we propose a new penalty called graph-guided
group lasso (GGGL) and employ this penalty in penalised linear regressions. GGGL
is able to integrate prior structured information with data mining, where variables sharing
similar biological functions are collected into groups and the pairwise relatedness between
groups is organised into a network. Such prior information will guide the selection of
variables that are predictive of a univariate response, so that the model selects variable
groups that are close in the network and important variables within the selected groups.
We then generalise the idea of incorporating network-structured prior knowledge to association
studies consisting of multivariate predictors and multivariate responses and propose
the network-driven sparse reduced-rank regression (NsRRR). In NsRRR, pairwise relatedness between predictors and between responses are represented by two networks, and
the model identifies associations between a subnetwork of predictors and a subnetwork of
responses such that both subnetworks tend to be connected. For unsupervised learning,
we are concerned with a multi-view learning task in which we compare the variance of
high-dimensional biological features collected from multiple sources, which are referred
to as “views”. We propose the sparse multi-view matrix factorisation (sMVMF), which is parsimonious in both model structure and model parameters. sMVMF can identify latent
factors that regulate variability shared across all views and the variability which is characteristic
of a specific view, respectively. For each novel method, we also present simulation
studies and an application on real biological data to illustrate variable selection and model
interpretability perspectives.
Open Access
Deep Learning in Single-Cell Analysis
Single-cell technologies are revolutionizing the entire field of biology. The
large volumes of data generated by single-cell technologies are
high-dimensional, sparse, heterogeneous, and have complicated dependency
structures, making analyses using conventional machine learning approaches
challenging and impractical. In tackling these challenges, deep learning often
demonstrates superior performance compared to traditional machine learning
methods. In this work, we give a comprehensive survey on deep learning in
single-cell analysis. We first introduce background on single-cell technologies
and their development, as well as fundamental concepts of deep learning
including the most popular deep architectures. We present an overview of the
single-cell analytic pipeline pursued in research applications while noting
divergences due to data sources or specific applications. We then review seven
popular tasks spanning different stages of the single-cell analysis
pipeline, including multimodal integration, imputation, clustering, spatial
domain identification, cell-type deconvolution, cell segmentation, and
cell-type annotation. Under each task, we describe the most recent developments
in classical and deep learning methods and discuss their advantages and
disadvantages. Deep learning tools and benchmark datasets are also summarized
for each task. Finally, we discuss the future directions and the most recent
challenges. This survey will serve as a reference for biologists and computer
scientists, encouraging collaborations.
Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis
Multi-Objective Optimization in Metabolomics/Computational Intelligence
The development of reliable computational models for detecting non-linear patterns
encased in throughput datasets and characterizing them into phenotypic classes
has been of particular interest and comprises dynamic studies in metabolomics
and other disciplines that are encompassed within the omics science. Some of the
clinical conditions that have been associated with these studies include metabotypes
in cancer, inflammatory bowel disease (IBD), asthma, diabetes, traumatic brain
injury (TBI), metabolic syndrome, and Parkinson's disease, just to mention a few.
The traction in this domain is attributable to the advancements in the procedures
involved in 1H NMR-linked datasets acquisition, which have fuelled the generation of
a wide abundance of datasets. Throughput datasets generated by modern 1H NMR
spectrometers are often characterized with features that are uninformative, redundant
and inherently correlated. This renders it difficult for conventional multivariate
analysis techniques to efficiently capture important signals and patterns. Therefore,
the work covered in this research thesis provides novel alternative techniques to
address the limitations of current analytical pipelines. This work delineates 13 variants
of population-based nature inspired metaheuristic optimization algorithms which
were further developed in this thesis as wrapper-based feature selection optimizers.
The optimizers were then evaluated and benchmarked against each other through
numerical experiments. Large-scale 1H NMR-linked datasets emerging from three
disease studies were employed for the evaluations. The first is a study in patients
diagnosed with Malan syndrome; an autosomal dominant inherited disorder marked
by a distinctive facial appearance, learning disabilities, and gigantism culminating
in tall stature and macrocephaly, also referred to as cerebral gigantism. Another
study involved Niemann-Pick Type C1 (NP-C1), a rare progressive neurodegenerative
condition marked by intracellular accrual of cholesterol and complex lipids including
sphingolipids and phospholipids in the endosomal/lysosomal system. The third
study involved sore throat investigation in humans (also known as 'pharyngitis'); an
acute infection of the upper respiratory tract that affects the respiratory mucosa
of the throat. In all three cases, samples from pathologically-confirmed cohorts
with corresponding controls were acquired, and metabolomics investigations were
performed using 1H NMR technique. Thereafter, computational optimizations were
conducted on all three high-dimensional datasets that were generated from the disease
studies outlined, so that key biomarkers and the most efficient optimizers were identified
in each study. The clinical and biochemical significance of the results arising from
this work was discussed and highlighted.
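To make the notion of a wrapper-based feature selection optimizer concrete, the Python sketch below wraps a small genetic algorithm around a leave-one-out nearest-centroid classifier. It is a generic stand-in chosen for brevity and is not one of the 13 metaheuristic variants developed in the thesis; the function names, the classifier, the size penalty and all parameter values are assumptions made for this example.

```python
import random


def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Accumulate per-class sums over every sample except the held-out one.
        sums, counts = {}, {}
        for k, (row, label) in enumerate(zip(X, y)):
            if k == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(cols))
            for c, j in enumerate(cols):
                acc[c] += row[j]
            counts[label] = counts.get(label, 0) + 1
        # Squared distance from the held-out sample to each class centroid.
        dists = {
            label: sum((X[i][j] - s[c] / counts[label]) ** 2
                       for c, j in enumerate(cols))
            for label, s in sums.items()
        }
        correct += min(dists, key=dists.get) == y[i]
    return correct / len(X)


def fitness(X, y, mask, size_penalty=0.01):
    """Wrapper fitness: classifier accuracy minus a small cost per selected feature."""
    return nearest_centroid_accuracy(X, y, mask) - size_penalty * sum(mask)


def ga_feature_selection(X, y, pop_size=20, generations=30,
                         mutation_rate=0.1, seed=0):
    """Elitist genetic algorithm over boolean feature masks (illustrative only)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]

    def score(mask):
        return fitness(X, y, mask)

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = elite + children
    return max(population, key=score)
```

Because the best half of each generation is carried over unchanged, the best fitness never decreases between generations. In practice the classifier, the fitness function and the search operators would be replaced by those appropriate to the dataset at hand.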