2,441 research outputs found
Advancing Precision Medicine: Unveiling Disease Trajectories, Decoding Biomarkers, and Tailoring Individual Treatments
Chronic diseases are not only prevalent but also place a considerable strain on the healthcare system, individuals, and communities. Nearly half of all Americans have at least one chronic disease, and the number is still growing. The development of machine learning has opened new directions for chronic disease analysis. Many data scientists have devoted themselves to understanding how a disease progresses over time, which can lead to better patient management, identification of disease stages, and targeted interventions. However, because chronic diseases progress slowly, symptoms are barely noticed until the disease is advanced, which makes early detection difficult. Meanwhile, chronic diseases often have diverse underlying causes and can manifest differently among patients. Besides external factors, the development of chronic disease is also influenced by internal signals: DNA sequence-level differences have been shown to be responsible for a persistent predisposition to chronic diseases. Given these challenges, data must be analyzed at various scales, ranging from single nucleotide polymorphisms (SNPs) to individuals and populations, to better understand disease mechanisms and provide precision medicine. Therefore, this research aimed to develop an automated pipeline that runs from building predictive models and estimating individual treatment effects on the structured data of general electronic health records (EHRs) to identifying genetic variations (e.g., SNPs) associated with diseases, in order to unravel the genetic underpinnings of chronic diseases. First, we used structured EHRs to uncover chronic disease progression patterns and assess the dynamic contribution of clinical features. In this step, we employed causal inference methods (constraint-based and functional causal models) for feature selection and used Markov chains, attention-based long short-term memory (LSTM) networks, and Gaussian processes (GPs).
SHapley Additive exPlanations (SHAP) and local interpretable model-agnostic explanations (LIME) further extended the work to identify important clinical features. Next, we developed a novel counterfactual-based method to predict individual treatment effects (ITE) from observational data. To learn a “balanced” representation in which the treated and control distributions look similar, we disentangled the doctor’s preference from the covariates and rebuilt the representations of the treated and control groups. We used integral probability metrics to measure distances between distributions. The expected ITE estimation error of a representation was the sum of the standard generalization error of that representation and the distance between the treated and control distributions induced by it. Finally, we performed genome-wide association studies (GWAS) based on the stage information extracted from our unsupervised disease progression model to identify biomarkers and explore the genetic correlation between the disease and its phenotypes.
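The abstract above does not say which integral probability metric was used. As a minimal sketch, assuming the maximum mean discrepancy with a linear kernel (one common choice, not confirmed by the source), the distance between treated and control representations reduces to the Euclidean distance between their mean vectors. The function names and the toy data are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd(treated, control):
    """Linear-kernel MMD (an assumed IPM choice): the Euclidean distance
    between the mean representation of the treated group and that of the
    control group. A value of 0 means the two group means coincide."""
    mu_t = mean_vector(treated)
    mu_c = mean_vector(control)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_t, mu_c)))


# Hypothetical two-dimensional representations for illustration only.
treated_repr = [[0.0, 0.0], [2.0, 2.0]]
control_repr = [[1.0, 1.0]]
print(linear_mmd(treated_repr, control_repr))  # group means match, so 0.0
```

A representation for which this distance is small is "balanced" in the sense used above, at least as far as first moments are concerned; richer kernels or the Wasserstein distance compare more than the means.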
Modelling prognostic trajectories in Alzheimer’s disease
Progression to dementia due to Alzheimer’s Disease (AD) is a long and protracted process that involves multiple pathways of disease pathophysiology. Predicting these dynamic changes has major implications for timely and effective clinical management in AD. There are two reasons why at present we lack appropriate tools to make such predictions. First, a key feature of AD is the interactive nature of the relationships between biomarkers, such as accumulation of β-amyloid (a peptide that builds plaques between nerve cells), tau (a protein found in the axons of nerve cells), and widespread neurodegeneration. Current models fail to capture these relationships because they are unable to successfully reduce the high dimensionality of biomarkers while exploiting informative multivariate relationships. Second, current models focus on simply predicting in a binary manner whether an individual will develop dementia due to AD or not, without informing clinicians about their predicted disease trajectory. This can result in administering inefficient treatment plans and hindering appropriate stratification for clinical trials. In this thesis, we overcome these challenges by using applied machine learning to build predictive models of patient disease trajectories in the earliest stages of AD. Specifically, to exploit the multi-dimensionality of biomarker data, we used a novel feature generation methodology, Partial Least Squares regression with recursive feature elimination (PLSr-RFE). This method applies a hybrid feature selection and feature construction method that captures co-morbidities in cognition and pathophysiology, resulting in an index of Alzheimer’s disease atrophy from structural MRI. We validated our choice of biomarker and the efficacy of our methodology by showing that the learnt pattern of grey matter atrophy is highly predictive of tau accumulation in an independent sample.
Next, to go beyond predicting binary outcomes, we used a novel trajectory modelling approach (Generalised Metric Learning Vector Quantization – Scalar projection) that mines multimodal data from large AD research cohorts. Using this approach, we derived individualised prognostic scores of cognitive decline due to AD, revealing interacting cognitive and biological factors that improve prediction accuracy. Next, we extended our machine learning framework to classify and stage early AD individuals based on future pathological tau accumulation. Our results show that the characteristic spreading pattern of tau in early AD can be predicted by baseline biomarkers, particularly when stratifying groups using multimodal data. Further, we showed that our prognostic index predicts individualised rates of future tau accumulation with high accuracy and regional specificity in an independent sample of cognitively unimpaired individuals. Overall, our work used machine learning to combine continuous information from AD biomarkers, predicting pathophysiological changes at different stages in the AD cascade. The approaches presented in this thesis provide an excellent framework to support personalised clinical interventions and guide effective drug discovery trials.
Inferential stability in systems biology
The modern biological sciences are fraught with statistical difficulties. Biomolecular
stochasticity, experimental noise, and the “large p, small n” problem all contribute to
the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful
conclusions from observations. In this thesis, we explore methods for assessing
the effects of data variability upon downstream inference, in an attempt to quantify and
promote the stability of the inferences we make.
We start with a review of existing methods for addressing this problem, focusing upon the
bootstrap and similar methods. The key requirement for all such approaches is a statistical
model that approximates the data generating process.
We move on to consider biomarker discovery problems. We present a novel algorithm for
proposing putative biomarkers on the strength of both their predictive ability and the stability
with which they are selected. In a simulation study, we find our approach to perform
favourably in comparison to strategies that select on the basis of predictive performance
alone.
We then consider the real problem of identifying protein peak biomarkers for HAM/TSP,
an inflammatory condition of the central nervous system caused by HTLV-1 infection.
We apply our algorithm to a set of SELDI mass spectral data, and identify a number of
putative biomarkers. Additional experimental work, together with known results from the
literature, provides corroborating evidence for the validity of these putative biomarkers.
Having focused on static observations, we then make the natural progression to time
course data sets. We propose a (Bayesian) bootstrap approach for such data, and then
apply our method in the context of gene network inference and the estimation of parameters
in ordinary differential equation models. We find that the inferred gene networks
are relatively unstable, and demonstrate the importance of finding distributions of ODE
parameter estimates, rather than single point estimates.
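The idea of scoring a candidate biomarker by how stably it is selected can be illustrated with a plain bootstrap. The sketch below is not the thesis's algorithm: it only resamples the data with replacement, reruns a feature selector on each resample, and reports how often each feature is chosen. The selector `top_mean_difference`, the function names, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def selection_frequencies(samples, labels, select, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each feature index is selected."""
    rng = random.Random(seed)
    n = len(samples)
    counts = Counter()
    for _ in range(n_boot):
        # Draw n observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = select([samples[i] for i in idx], [labels[i] for i in idx])
        counts.update(set(chosen))
    return {feature: c / n_boot for feature, c in counts.items()}


def top_mean_difference(samples, labels, k=1):
    """Hypothetical selector: the k features with the largest absolute
    difference between class means (ties keep the lower index)."""
    n_features = len(samples[0])

    def class_mean(j, cls):
        vals = [s[j] for s, y in zip(samples, labels) if y == cls]
        return sum(vals) / len(vals) if vals else 0.0

    scores = [abs(class_mean(j, 1) - class_mean(j, 0)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 separates the classes, feature 1 carries no signal.
X = [[0.0, 0.0]] * 4 + [[5.0, 0.0]] * 4
y = [0] * 4 + [1] * 4
freqs = selection_frequencies(X, y, top_mean_difference)
print(freqs)
```

A feature whose frequency stays close to 1 across resamples is selected stably; one that is chosen only occasionally is sensitive to data variability, even if it looks predictive on the full data set.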
Discriminating active from latent tuberculosis in patients presenting to community clinics.
BACKGROUND: Because of the high global prevalence of latent TB infection (LTBI), a key challenge in endemic settings is distinguishing patients with active TB from patients with overlapping clinical symptoms without active TB but with co-existing LTBI. Current methods are insufficiently accurate. Plasma proteomic fingerprinting can resolve this difficulty by providing a molecular snapshot defining disease state that can be used to develop point-of-care diagnostics. METHODS: Plasma and clinical data were obtained prospectively from patients attending community TB clinics in Peru and from household contacts. Plasma was subjected to high-throughput proteomic profiling by mass spectrometry. Statistical pattern recognition methods were used to define mass spectral patterns that distinguished patients with active TB from symptomatic controls with or without LTBI. RESULTS: 156 patients with active TB and 110 symptomatic controls (patients with respiratory symptoms without active TB) were investigated. Active TB patients were distinguishable from undifferentiated symptomatic controls (accuracy 87%, sensitivity 84%, specificity 90%), from symptomatic controls with LTBI (accuracy 87%, sensitivity 89%, specificity 82%), and from symptomatic controls without LTBI (accuracy 90%, sensitivity 90%, specificity 92%). CONCLUSIONS: We show that active TB can be distinguished accurately from LTBI in symptomatic clinic attenders using a plasma proteomic fingerprint. Translation of biomarkers derived from this study into a robust and affordable point-of-care format will have significant implications for recognition and control of active TB in high prevalence settings.
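The reported figures for active TB versus undifferentiated symptomatic controls can be cross-checked against each other, because overall accuracy is the case-weighted combination of sensitivity and specificity. The sketch below uses only numbers stated in the abstract (156 patients, 110 controls, sensitivity 84%, specificity 90%); since those percentages are rounded, the check is approximate. The function name is hypothetical.

```python
def overall_accuracy(sensitivity, specificity, n_positive, n_negative):
    """Accuracy implied by sensitivity, specificity, and the group sizes."""
    true_positives = sensitivity * n_positive
    true_negatives = specificity * n_negative
    return (true_positives + true_negatives) / (n_positive + n_negative)


acc = overall_accuracy(0.84, 0.90, 156, 110)
print(f"{acc:.3f}")  # 0.865
```

The implied accuracy of about 86.5% agrees with the reported 87% once rounding of the inputs is taken into account.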
The Scarface Score: Deciphering Response to DNA Damage Agents in High-Grade Serous Ovarian Cancer—A GEICO Study
Keywords: Genomic instability; Machine learning. Genomic instability (GI) is a transversal phenomenon shared by several tumor types that provides both prognostic and predictive information. In the context of high-grade serous ovarian cancer (HGSOC), response to DNA-damaging agents such as platinum-based agents and poly(ADP-ribose) polymerase inhibitors (PARPi) has been closely linked to deficiencies in the DNA repair machinery by homologous recombination repair (HRR) and to GI. In this study, we have developed the Scarface score, an integrative algorithm based on genomic and transcriptomic data obtained from the NGS analysis of a prospective GEICO cohort of 190 formalin-fixed paraffin-embedded (FFPE) tumor samples from patients diagnosed with HGSOC with a median follow-up of 31.03 months (5.87–159.27 months). In the first step, three single-source models were shown to predict response: the SNP-based model (accuracy = 0.8077), which analyzes 8 SNPs distributed along the genome; the GI-based model (accuracy = 0.9038), which interrogates 28 parameters of GI; and the HTG-based model (accuracy = 0.8077), which evaluates the expression of 7 genes related to tumor biology. Then, an ensemble model called the Scarface score was found to predict response to DNA-damaging agents with an accuracy of 0.9615 and a kappa index of 0.9128 (p < 0.0001). The Scarface score brings the routine assessment of GI closer to the clinical setting, enabling its incorporation as a predictive and prognostic tool in the management of HGSOC. This research was partially funded by GVA Grants “Subvencions per a la realització de projectes d’i+d+i desenvolupats per grups d’investigació emergents (GV/2020/158)” and “Ayudas para la contratación de personal investigador en formación de carácter predoctoral” (ACIF/2016/008), and by the “Beca de investigación traslacional Andrés Poveda 2020” from the GEICO group.
This study was awarded the Prize “Antonio Llombart Rodriguez-FINCIVO 2020” from the Royal Academy of Medicine of the Valencian Community
Low-level visual processing and its relation to neurological disease
Retinal neurons extract changes in image intensity across space, time, and wavelength. The retinal signal is transmitted to the early visual cortex, where the processing of low-level visual information occurs. The fundamental nature of these early visual pathways means that they are often compromised by neurological disease. This thesis had two aims. First, it aimed to investigate changes in visual processing in response to Parkinson’s disease (PD) by using electrophysiological recordings from animal models. Second, it aimed to use functional magnetic resonance imaging (fMRI) to investigate how low-level visual processes are represented in healthy human visual cortex, focusing on two pathways often compromised in disease: the magnocellular pathway and the chromatic S-cone pathway. First, we identified a pathological mechanism of excitotoxicity in the visual system of Drosophila PD models. Next, we found that we could apply machine learning classifiers to multivariate visual response profiles recorded from the eye and brain of Drosophila and rodent PD models to accurately classify these animals into their correct class. Using fMRI and psychophysics, we found that measurements of temporal contrast sensitivity differ as a function of visual space, with peripherally tuned voxels in early visual areas showing increased contrast sensitivity at a high temporal frequency. Finally, we used 7T fMRI to investigate systematic differences in achromatic and S-cone population receptive field (pRF) size estimates in the visual cortex of healthy humans. Unfortunately, we could not replicate the fundamental effect of pRF size increasing with eccentricity, indicating complications with our data and stimulus.
Detection of Epigenomic Network Community Oncomarkers
In this paper we propose a network methodology to infer prognostic cancer
biomarkers based on the epigenetic pattern of DNA methylation. Epigenetic
processes such as DNA methylation reflect environmental risk factors, and are
increasingly recognised for their fundamental role in diseases such as cancer.
DNA methylation is a gene-regulatory pattern, and hence provides a means by
which to assess genomic regulatory interactions. Network models are a natural
way to represent and analyse groups of such interactions. The utility of
network models also increases as the quantity of data and number of variables
increase, making them increasingly relevant to large-scale genomic studies. We
propose methodology to infer prognostic genomic networks from a DNA
methylation-based measure of genomic interaction and association. We then show
how to identify prognostic biomarkers from such networks, which we term
'network community oncomarkers'. We illustrate the power of our proposed
methodology in the context of a large publicly available breast cancer dataset.
Dual-intended deep learning model for breast cancer diagnosis in ultrasound imaging
Automated medical data analysis plays a significant role in modern medicine, and in
cancer diagnosis and prognosis in particular, where highly reliable and generalizable systems are needed. In this study, an
automated breast cancer screening method in ultrasound imaging is proposed. A convolutional deep
autoencoder model is presented for simultaneous segmentation and radiomic extraction. The model
segments the breast lesions while concurrently extracting radiomic features. With our deep model,
we perform breast lesion segmentation, which is linked to low-dimensional deep-radiomic extraction
(four features). Similarly, we used high-dimensional conventional imaging throughputs and applied
spectral embedding techniques to reduce their dimensionality from 354 to 12 radiomic features. A total of 780 ultrasound
images (437 benign, 210 malignant, and 133 normal) were used to train and validate the models in
this study. To diagnose malignant lesions, we performed training, hyperparameter tuning, cross-validation, and testing with a random forest model. This resulted in a binary classification accuracy
of 78.5% (65.1–84.1%) for the maximal (full multivariate) cross-validated model for a combination of
radiomic groups.
- …