Discovering Patient Phenotypes Using Generalized Low Rank Models
The practice of medicine is predicated on discovering commonalities or distinguishing characteristics among patients
to inform corresponding treatment. Given a patient grouping (hereafter referred to as a phenotype), clinicians can
implement a treatment pathway accounting for the underlying cause of disease in that phenotype. Traditionally,
phenotypes have been discovered by intuition, experience in practice, and advancements in basic science, but these
approaches are often heuristic, labor intensive, and can take decades to produce actionable knowledge. Although our
understanding of disease has progressed substantially in the past century, there are still important domains in which
our phenotypes are murky, such as in behavioral health or in hospital settings. To accelerate phenotype discovery,
researchers have used machine learning to find patterns in electronic health records, but have often been thwarted by
missing data, sparsity, and data heterogeneity. In this study, we use a flexible framework called Generalized Low
Rank Modeling (GLRM) to overcome these barriers and discover phenotypes in two sources of patient data. First, we
analyze data from the 2010 Healthcare Cost and Utilization Project National Inpatient Sample (NIS), which contains
upwards of 8 million hospitalization records consisting of administrative codes and demographic information. Second,
we analyze a small (N=1746), local dataset documenting the clinical progression of autism spectrum disorder patients using granular features from the electronic health record, including text from physician notes. We demonstrate that
low rank modeling successfully captures known and putative phenotypes in these vastly different datasets.
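The core idea of fitting a low rank model to incomplete data can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a quadratic loss on every observed entry with ridge regularization, which is only one special case of a generalized low rank model, and the function name `fit_low_rank` and its parameters are hypothetical.

```python
import numpy as np

def fit_low_rank(A, mask, k, reg=0.1, n_iters=50, seed=0):
    """Fit A ~ X @ Y using only the entries where mask is True.

    Illustrative assumptions: quadratic loss, ridge penalty `reg` on both
    factors, alternating least squares. Missing entries are simply skipped.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.normal(scale=0.1, size=(m, k))
    Y = rng.normal(scale=0.1, size=(k, n))
    ridge = reg * np.eye(k)
    for _ in range(n_iters):
        # Update each row of X by ridge regression over its observed columns.
        for i in range(m):
            obs = mask[i]
            Yo = Y[:, obs]
            X[i] = np.linalg.solve(Yo @ Yo.T + ridge, Yo @ A[i, obs])
        # Update each column of Y by ridge regression over its observed rows.
        for j in range(n):
            obs = mask[:, j]
            Xo = X[obs]
            Y[:, j] = np.linalg.solve(Xo.T @ Xo + ridge, Xo.T @ A[obs, j])
    return X, Y
```

In this sketch the rows of `X` act as low-dimensional patient representations, and `X @ Y` supplies values for the entries that were missing from `A`.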
SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors
Existing tensor factorization methods assume that the input tensor follows
some specific distribution (e.g., Poisson, Bernoulli, or Gaussian), and solve
the factorization by minimizing some empirical loss functions defined based on
the corresponding distribution. However, this approach suffers from several drawbacks: 1)
In reality, the underlying distributions are complicated and unknown, making it
infeasible to be approximated by a simple distribution. 2) The correlation
across dimensions of the input tensor is not well utilized, leading to
sub-optimal performance. Although heuristics were proposed to incorporate such
correlation as side information under Gaussian distribution, they cannot
easily be generalized to other distributions. Thus, a more principled way of
utilizing the correlation in tensor factorization models is still an open
challenge. Without assuming any explicit distribution, we formulate the tensor
factorization as an optimal transport problem with Wasserstein distance, which
can handle non-negative inputs.
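To make the optimal transport view concrete, the sketch below computes an entropy-regularized transport cost between two non-negative histograms using standard Sinkhorn iterations. The abstract does not say how SWIFT solves its transport problems, so this is a generic illustration and not the SWIFT algorithm; the function name `sinkhorn_cost`, the regularization strength `eps`, and the cost matrix `C` are assumptions.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.1, n_iters=500):
    """Approximate the optimal transport cost between histograms a and b.

    Assumptions for illustration: a and b are non-negative float arrays with
    equal total mass, C[i, j] is the cost of moving mass from bin i to bin j,
    and eps is large enough that exp(-C / eps) does not underflow to zero.
    """
    K = np.exp(-C / eps)          # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)         # rescale to match the column marginals
        u = a / (K @ v)           # rescale to match the row marginals
    P = u[:, None] * K * v[None, :]   # transport plan
    return float(np.sum(P * C)), P
```

Unlike an entrywise loss, the cost matrix `C` lets the loss reflect how close two bins are, which is one way correlation across a dimension can enter the objective.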
We introduce SWIFT, which minimizes the Wasserstein distance between the
input tensor and its reconstruction. In
particular, we define the N-th order tensor Wasserstein loss for the widely
used tensor CP factorization and derive the optimization algorithm that
minimizes it. By leveraging sparsity structure and different equivalent
formulations for optimizing computational efficiency, SWIFT is as scalable as
other well-known CP algorithms. Using the factor matrices as features, SWIFT
achieves up to 9.65% and 11.31% relative improvement over baselines for
downstream prediction tasks. Under noisy conditions, SWIFT achieves up to
15% and 17% relative improvements over the best competitors for the prediction
tasks.
Comment: Accepted by AAAI-2
Identification of gene pathways implicated in Alzheimer's disease using longitudinal imaging phenotypes with sparse regression
We present a new method for the detection of gene pathways associated with a
multivariate quantitative trait, and use it to identify causal pathways
associated with an imaging endophenotype characteristic of longitudinal
structural change in the brains of patients with Alzheimer's disease (AD). Our
method, known as pathways sparse reduced-rank regression (PsRRR), uses group
lasso penalised regression to jointly model the effects of genome-wide single
nucleotide polymorphisms (SNPs), grouped into functional pathways using prior
knowledge of gene-gene interactions. Pathways are ranked in order of importance
using a resampling strategy that exploits finite sample variability. Our
application study uses whole genome scans and MR images from 464 subjects in
the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. 66,182 SNPs
are mapped to 185 gene pathways from the KEGG pathways database. Voxel-wise
imaging signatures characteristic of AD are obtained by analysing 3D patterns
of structural change at 6, 12 and 24 months relative to baseline. High-ranking,
AD endophenotype-associated pathways in our study include those describing
chemokine, Jak-stat and insulin signalling pathways, and tight junction
interactions. All of these have been previously implicated in AD biology. In a
secondary analysis, we investigate SNPs and genes that may be driving pathway
selection, and identify a number of previously validated AD genes including
CR1, APOE and TOMM40.
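The group lasso penalty mentioned above selects or discards whole groups of coefficients at once. A minimal sketch of its proximal step (block soft-thresholding) follows. It is not the PsRRR solver, which the abstract does not describe; it assumes non-overlapping groups, and the function name `group_soft_threshold` and the threshold `lam` are hypothetical.

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Shrink each coefficient group toward zero by lam in Euclidean norm.

    Assumption for illustration: `groups` is a list of non-overlapping index
    lists (for example, the SNPs assigned to one pathway). A group whose norm
    is at most lam is set to zero entirely.
    """
    out = np.zeros_like(beta, dtype=float)
    for idx in groups:
        block = beta[idx]
        norm = np.linalg.norm(block)
        if norm > lam:
            # Keep the direction of the block, reduce its length by lam.
            out[idx] = (1.0 - lam / norm) * block
    return out
```

Because a weak group is zeroed as a unit, the pattern of surviving groups can be read as a pathway-level selection, which a resampling scheme could then rank.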
Integrative omics data analysis to discover novel signatures in complex diseases
Apart from diseases caused by the defect of a single gene, most diseases are highly complex
and are usually caused by a combination of biological and environmental factors. In the
biological context, cellular processes are often tightly connected across molecular layers of the
central dogma of biology, and examining a single layer is not sufficient to
address disease pathology; conclusions drawn from it can therefore be limited. Combining biological
observations from multiple layers or angles would greatly broaden our perspectives on the
disease in concern and may lead to novel discoveries which would not be possible to deduce
from a single-omics perspective. In this thesis, we focused on the method development for
single-cell transcriptomics to address the prime bias problem introduced by the new droplet-based technologies; integrative omics discovery of genomic signatures specific to different
brain regions in normal individuals; as well as the utilization of multiple omics to identify
potential biomarkers specific to amyotrophic lateral sclerosis (ALS) disease prognosis and
diagnosis.
Research has been revolutionized with the advent of single-cell omics technologies in the past
few decades and new methods and tools have also been developed to accommodate such
scientific accelerations. These innovations however posed new challenges and could
potentially introduce bias and unforeseeable circumstances if left unaddressed. Specifically, to
resolve the prime-based problem introduced by the current popular droplet-based single-cell
sequencing technologies which may lead to biased quantification, in Study I, we presented a novel
transcript quantification tool for droplet-based single-cell RNA-Sequencing (scRNA-Seq)
technologies and benchmarked our tool with other popular transcript and gene quantification
tools. Our tool outperformed currently popular tools in terms of transcript- and gene-level
quantifications.
In Study II, we investigated the association of splicing variants with the genetic patterns from
different regions of the brain in normal individuals to identify quantitative trait loci (QTL)
associated with ratios of isoform expression in genes. We carried out genome-wide association
studies (GWAS) on isoform ratios from 13 brain regions and identified isoform-ratio QTL
(irQTL) specific to each brain region, and their associated traits which could have been missed
by expression QTL derived from gene expressions.
We further looked into the utilization of proteomics and genomics data for ALS disease in
Study III to understand disease pathology from multiple perspectives, and to identify potential
protein biomarkers and protein QTL (pQTL) specific to different stages of the disease and
tissue sites. In terms of proteomics, for each tissue site, we identified potential protein
biomarkers specific to disease prognosis, survival of ALS patients, the functional decline
among ALS patients, and longitudinal changes after disease diagnosis. In terms of integrative
omics, we performed GWAS of protein expressions with genotyping data and identified tissue-site-specific pQTL signatures for ALS patients.
All in all, our studies showed efforts in developing a single-cell transcript quantification tool
to address potential bias problems with improved performance; identifying novel irQTL
signatures specific to various brain regions using an integrative omics approach; and also
discovering potential protein and genetic signatures for different tissue sites and pathological
stages in ALS disease using multiple omics. We hope our work could potentially enhance the
research process in various omics in terms of methods development and the novel signatures
could act as valuable resources for fostering further research ideas and potential experimental
validations.