2,209 research outputs found

    Discovering Patient Phenotypes Using Generalized Low Rank Models

    The practice of medicine is predicated on discovering commonalities or distinguishing characteristics among patients to inform corresponding treatment. Given a patient grouping (hereafter referred to as a phenotype), clinicians can implement a treatment pathway accounting for the underlying cause of disease in that phenotype. Traditionally, phenotypes have been discovered by intuition, experience in practice, and advancements in basic science, but these approaches are often heuristic, labor intensive, and can take decades to produce actionable knowledge. Although our understanding of disease has progressed substantially in the past century, there are still important domains in which our phenotypes are murky, such as in behavioral health or in hospital settings. To accelerate phenotype discovery, researchers have used machine learning to find patterns in electronic health records, but have often been thwarted by missing data, sparsity, and data heterogeneity. In this study, we use a flexible framework called Generalized Low Rank Modeling (GLRM) to overcome these barriers and discover phenotypes in two sources of patient data. First, we analyze data from the 2010 Healthcare Cost and Utilization Project National Inpatient Sample (NIS), which contains upwards of 8 million hospitalization records consisting of administrative codes and demographic information. Second, we analyze a small (N=1746), local dataset documenting the clinical progression of autism spectrum disorder patients using granular features from the electronic health record, including text from physician notes. We demonstrate that low rank modeling successfully captures known and putative phenotypes in these vastly different datasets.
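
To make the low rank idea in the abstract above concrete, here is a minimal sketch of fitting a low rank model to a table with missing entries. It is not the authors' implementation: it assumes a purely numeric table, a quadratic loss, and ridge regularization, which is only the simplest special case of a generalized low rank model. The function name `fit_low_rank` and all parameter names are hypothetical.

```python
import numpy as np

def fit_low_rank(X, rank, n_iters=100, reg=1e-3, seed=0):
    """Approximate X (n x m, NaN = missing) by A @ B with A: n x rank, B: rank x m.

    Alternating ridge regressions over the observed entries only
    (quadratic loss; an assumption, not the loss used in the paper).
    """
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                      # True where an entry is observed
    n, m = X.shape
    rng = np.random.default_rng(seed)
    A = np.zeros((n, rank))
    B = 0.5 + rng.random((rank, m))          # positive random start
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each row of A using only the columns observed in that row.
        for i in range(n):
            obs = mask[i]
            Bo = B[:, obs]
            A[i] = np.linalg.solve(Bo @ Bo.T + ridge, Bo @ X[i, obs])
        # Update each column of B using only the rows observed in that column.
        for j in range(m):
            obs = mask[:, j]
            Ao = A[obs]
            B[:, j] = np.linalg.solve(Ao.T @ Ao + ridge, Ao.T @ X[obs, j])
    return A, B
```

In this sketch the rows of `A` play the role of low-dimensional patient representations, and `A @ B` fills in the missing cells. Handling categorical or ordinal columns would require swapping the quadratic loss for a column-specific one.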

    SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors

    Existing tensor factorization methods assume that the input tensor follows some specific distribution (e.g., Poisson, Bernoulli, or Gaussian) and solve the factorization by minimizing an empirical loss function defined for that distribution. However, this approach has several drawbacks: 1) In reality, the underlying distributions are complicated and unknown, making them infeasible to approximate with a simple distribution. 2) The correlation across dimensions of the input tensor is not well utilized, leading to sub-optimal performance. Although heuristics have been proposed to incorporate such correlation as side information under a Gaussian distribution, they cannot easily be generalized to other distributions. Thus, a more principled way of utilizing the correlation in tensor factorization models is still an open challenge. Without assuming any explicit distribution, we formulate the tensor factorization as an optimal transport problem with Wasserstein distance, which can handle non-negative inputs. We introduce SWIFT, which minimizes the Wasserstein distance between the input tensor and its reconstruction. In particular, we define the N-th order tensor Wasserstein loss for the widely used tensor CP factorization and derive the optimization algorithm that minimizes it. By leveraging sparsity structure and different equivalent formulations for optimizing computational efficiency, SWIFT is as scalable as other well-known CP algorithms. Using the factor matrices as features, SWIFT achieves up to 9.65% and 11.31% relative improvement over baselines for downstream prediction tasks. Under noisy conditions, SWIFT achieves up to 15% and 17% relative improvements over the best competitors for the prediction tasks. Comment: Accepted by AAAI-2
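
As a rough illustration of the Wasserstein distance that the abstract above builds on, the sketch below computes an entropy-regularized transport cost between two non-negative histograms using Sinkhorn iterations. This is an assumption made for brevity: the abstract does not say which solver SWIFT uses, and the sketch covers only the distance between two vectors, not the tensor loss or the CP factorization. The function name `sinkhorn_distance` and its parameters are hypothetical.

```python
import numpy as np

def sinkhorn_distance(a, b, cost, eps=0.05, n_iters=500):
    """Entropy-regularized optimal transport cost between histograms a and b.

    a, b : non-negative vectors with equal total mass
    cost : matrix where cost[i, j] is the cost of moving mass from bin i to bin j
    eps  : regularization strength (smaller is closer to exact transport)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.asarray(cost, dtype=float)
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                    # rescale to match column marginals
        u = a / (K @ v)                      # rescale to match row marginals
    plan = u[:, None] * K * v[None, :]       # transport plan
    return float(np.sum(plan * cost)), plan
```

Unlike an element-wise loss, this cost depends on how far mass has to move between bins, which is how correlation across a dimension can enter through the `cost` matrix.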

    Identification of gene pathways implicated in Alzheimer's disease using longitudinal imaging phenotypes with sparse regression

    We present a new method for the detection of gene pathways associated with a multivariate quantitative trait, and use it to identify causal pathways associated with an imaging endophenotype characteristic of longitudinal structural change in the brains of patients with Alzheimer's disease (AD). Our method, known as pathways sparse reduced-rank regression (PsRRR), uses group lasso penalised regression to jointly model the effects of genome-wide single nucleotide polymorphisms (SNPs), grouped into functional pathways using prior knowledge of gene-gene interactions. Pathways are ranked in order of importance using a resampling strategy that exploits finite sample variability. Our application study uses whole genome scans and MR images from 464 subjects in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. In total, 66,182 SNPs are mapped to 185 gene pathways from the KEGG pathways database. Voxel-wise imaging signatures characteristic of AD are obtained by analysing 3D patterns of structural change at 6, 12 and 24 months relative to baseline. High-ranking, AD endophenotype-associated pathways in our study include those describing chemokine, JAK-STAT and insulin signalling pathways, and tight junction interactions. All of these have been previously implicated in AD biology. In a secondary analysis, we investigate SNPs and genes that may be driving pathway selection, and identify a number of previously validated AD genes including CR1, APOE and TOMM40.
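
The group lasso penalty mentioned in the abstract above selects or discards whole groups of coefficients at once. A minimal sketch of its core step, the group soft-thresholding (proximal) operator, is given below. It assumes non-overlapping groups and a single coefficient vector, so it is a simplification and not the PsRRR algorithm itself; the function name `group_soft_threshold` and its parameters are hypothetical.

```python
import math

def group_soft_threshold(coefs, groups, lam):
    """Shrink each group of coefficients towards zero by lam in Euclidean norm.

    coefs  : list of floats (e.g. SNP effects)
    groups : list of index lists, one per group (e.g. one per pathway);
             assumed not to overlap
    lam    : penalty strength; groups with norm <= lam are set to zero
    """
    out = list(coefs)
    for idx in groups:
        norm = math.sqrt(sum(coefs[i] ** 2 for i in idx))
        # Whole group is zeroed when its norm does not exceed lam.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * coefs[i]
    return out
```

For example, with `coefs = [3.0, 4.0, 0.1, -0.1]`, `groups = [[0, 1], [2, 3]]` and `lam = 1.0`, the first group (norm 5) is scaled by 0.8 and the second group (norm below 1) is removed entirely, which is the group-level sparsity that makes a ranking of pathways possible.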

    Integrative omics data analysis to discover novel signatures in complex diseases

    Apart from diseases caused by the defect of a single gene, most diseases are highly complex and are usually caused by a combination of biological and environmental factors. In the biological context, cellular processes are often tightly connected across molecular layers of the central dogma of biology, so the examination of a single layer is not sufficient to address disease pathology, and the conclusions drawn from it can be limited. Combining biological observations from multiple layers or angles would greatly broaden our perspectives on the disease in question and may lead to novel discoveries which would not be possible to deduce from a single-omics perspective. In this thesis, we focused on method development for single-cell transcriptomics to address the prime bias problem introduced by the new droplet-based technologies; integrative omics discovery of genomic signatures specific to different brain regions in normal individuals; and the utilization of multiple omics to identify potential biomarkers specific to amyotrophic lateral sclerosis (ALS) disease prognosis and diagnosis. Research has been revolutionized by the advent of single-cell omics technologies in the past few decades, and new methods and tools have been developed to accommodate such scientific accelerations. These innovations, however, posed new challenges and could potentially introduce bias and unforeseeable circumstances if left unaddressed. Specifically, to resolve the prime-based problem introduced by the currently popular droplet-based single-cell sequencing technologies, which may lead to biased quantification, in Study I we presented a novel transcript quantification tool for droplet-based single-cell RNA-Sequencing (scRNA-Seq) technologies and benchmarked our tool against other popular transcript and gene quantification tools. Our tool outperformed currently popular tools in terms of transcript- and gene-level quantification.
    In Study II, we investigated the association of splicing variants with the genetic patterns from different regions of the brain in normal individuals to identify quantitative trait loci (QTL) associated with ratios of isoform expression in genes. We carried out genome-wide association studies (GWAS) on isoform ratios from 13 brain regions and identified isoform-ratio QTL (irQTL) specific to each brain region, along with their associated traits, which could have been missed by expression QTL derived from gene expression. In Study III, we further looked into the utilization of proteomics and genomics data for ALS to understand disease pathology from multiple perspectives, and to identify potential protein biomarkers and protein QTL (pQTL) specific to different stages of the disease and tissue sites. In terms of proteomics, for each tissue site, we identified potential protein biomarkers specific to disease prognosis, survival of ALS patients, the functional decline among ALS patients, and longitudinal changes after disease diagnosis. In terms of integrative omics, we performed GWAS of protein expression with genotyping data and identified tissue-site-specific pQTL signatures for ALS patients. All in all, our studies developed a single-cell transcript quantification tool that addresses potential bias problems with improved performance; identified novel irQTL signatures specific to various brain regions using an integrative omics approach; and discovered potential protein and genetic signatures for different tissue sites and pathological stages in ALS using multiple omics. We hope our work can enhance the research process in various omics in terms of methods development, and that the novel signatures can act as valuable resources for fostering further research ideas and potential experimental validations.