    Advancing Precision Medicine: Unveiling Disease Trajectories, Decoding Biomarkers, and Tailoring Individual Treatments

    Chronic diseases are not only prevalent but also place a considerable strain on healthcare systems, individuals, and communities. Nearly half of all Americans live with at least one chronic disease, and that share is still growing. Advances in machine learning have opened new directions for chronic disease analysis. Many data scientists have devoted themselves to understanding how a disease progresses over time, which can lead to better patient management, identification of disease stages, and targeted interventions. However, because chronic diseases progress slowly, symptoms often go unnoticed until the disease is advanced, which makes early detection difficult. Moreover, chronic diseases often have diverse underlying causes and can manifest differently among patients. Besides external factors, the development of chronic disease is also influenced by internal signals: DNA sequence-level differences have been shown to confer a lasting predisposition to chronic diseases. Given these challenges, data must be analyzed at various scales, ranging from single nucleotide polymorphisms (SNPs) to individuals and populations, to better understand disease mechanisms and provide precision medicine. Therefore, this research aimed to develop an automated pipeline that runs from building predictive models and estimating individual treatment effects on the structured data of general electronic health records (EHRs) to identifying genetic variations (e.g., SNPs) associated with diseases, in order to unravel the genetic underpinnings of chronic diseases. First, we used structured EHRs to uncover chronic disease progression patterns and assess the dynamic contribution of clinical features. In this step, we employed causal inference methods (constraint-based and functional causal models) for feature selection and utilized Markov chains, attention-based long short-term memory (LSTM) networks, and Gaussian processes (GPs). SHapley Additive exPlanations (SHAP) and local interpretable model-agnostic explanations (LIME) further extended the work to identify important clinical features. Next, we developed a novel counterfactual-based method to predict individual treatment effects (ITEs) from observational data. To learn a “balanced” representation in which the treated and control distributions look similar, we disentangled the doctor’s preference from the covariates and rebuilt the representation of the treated and control groups. We used integral probability metrics to measure distances between distributions. The expected ITE estimation error of a representation was bounded by the sum of the standard generalization error of that representation and the distance between the treated and control distributions it induced. Finally, we performed genome-wide association studies (GWAS) based on the stage information extracted from our unsupervised disease progression model to identify biomarkers and explore the genetic correlation between the disease and its phenotypes.
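
    The balancing idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say which integral probability metric was used, so the example assumes the simplest one, the linear-kernel maximum mean discrepancy, which reduces to the distance between the treated and control group means. The function names and the weight `alpha` are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd(treated: Sequence[Vector], control: Sequence[Vector]) -> float:
    """Linear-kernel MMD: Euclidean distance between the two group means.

    This is an integral probability metric over unit-norm linear functions.
    It only compares means, so it is a coarse stand-in for richer IPMs.
    """
    mu_t = mean_vector(treated)
    mu_c = mean_vector(control)
    return sum((a - b) ** 2 for a, b in zip(mu_t, mu_c)) ** 0.5


def balanced_objective(factual_loss: float, ipm: float, alpha: float) -> float:
    """Factual prediction loss plus an imbalance penalty.

    alpha is an assumed (hypothetical) trade-off weight.
    """
    return factual_loss + alpha * ipm


# Toy learned representations for treated and control patients.
treated_repr = [[0.9, 0.1], [1.1, 0.3]]
control_repr = [[0.2, 0.2], [0.4, 0.0]]
penalty = linear_mmd(treated_repr, control_repr)
objective = balanced_objective(factual_loss=0.5, ipm=penalty, alpha=1.0)
```

    A representation that drives the penalty toward zero makes the two groups look alike on average, which is the sense of “balanced” used above.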

    Inferential stability in systems biology

    The modern biological sciences are fraught with statistical difficulties. Biomolecular stochasticity, experimental noise, and the “large p, small n” problem all contribute to the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful conclusions from observations. In this thesis, we explore methods for assessing the effects of data variability upon downstream inference, in an attempt to quantify and promote the stability of the inferences we make. We start with a review of existing methods for addressing this problem, focusing upon the bootstrap and similar methods. The key requirement for all such approaches is a statistical model that approximates the data generating process. We move on to consider biomarker discovery problems. We present a novel algorithm for proposing putative biomarkers on the strength of both their predictive ability and the stability with which they are selected. In a simulation study, we find our approach to perform favourably in comparison to strategies that select on the basis of predictive performance alone. We then consider the real problem of identifying protein peak biomarkers for HAM/TSP, an inflammatory condition of the central nervous system caused by HTLV-1 infection. We apply our algorithm to a set of SELDI mass spectral data, and identify a number of putative biomarkers. Additional experimental work, together with known results from the literature, provides corroborating evidence for the validity of these putative biomarkers. Having focused on static observations, we then make the natural progression to time course data sets. We propose a (Bayesian) bootstrap approach for such data, and then apply our method in the context of gene network inference and the estimation of parameters in ordinary differential equation models. We find that the inferred gene networks are relatively unstable, and demonstrate the importance of finding distributions of ODE parameter estimates, rather than single point estimates.
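
    The notion of selection stability can be made concrete with a small sketch. It is not the thesis algorithm: it assumes a plain class-stratified bootstrap and a deliberately simple score (the absolute difference in class means), and reports how often each feature lands in the top k across resamples. All function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def mean_diff_scores(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Score each feature by the absolute difference between class means."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def selection_frequencies(X, y, k: int, n_boot: int = 200, seed: int = 0) -> List[float]:
    """Fraction of bootstrap resamples in which each feature is in the top k.

    Resampling is done within each class so both classes are always present.
    """
    rng = random.Random(seed)
    by_class = {c: [i for i, label in enumerate(y) if label == c] for c in (0, 1)}
    counts = Counter()
    for _ in range(n_boot):
        sample = []
        for indices in by_class.values():
            sample.extend(rng.choices(indices, k=len(indices)))
        scores = mean_diff_scores([X[i] for i in sample], [y[i] for i in sample])
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        counts.update(ranked[:k])
    return [counts[j] / n_boot for j in range(len(X[0]))]


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[5.0, 0.1], [5.2, 0.3], [4.9, 0.2], [0.1, 0.2], [0.0, 0.1], [0.2, 0.3]]
y = [1, 1, 1, 0, 0, 0]
freqs = selection_frequencies(X, y, k=1)
```

    A feature that is both highly ranked and selected in nearly every resample is a more credible candidate than one that scores well on a single split.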

    Discriminating active from latent tuberculosis in patients presenting to community clinics.

    BACKGROUND: Because of the high global prevalence of latent TB infection (LTBI), a key challenge in endemic settings is distinguishing patients with active TB from patients with overlapping clinical symptoms without active TB but with co-existing LTBI. Current methods are insufficiently accurate. Plasma proteomic fingerprinting can resolve this difficulty by providing a molecular snapshot defining disease state that can be used to develop point-of-care diagnostics. METHODS: Plasma and clinical data were obtained prospectively from patients attending community TB clinics in Peru and from household contacts. Plasma was subjected to high-throughput proteomic profiling by mass spectrometry. Statistical pattern recognition methods were used to define mass spectral patterns that distinguished patients with active TB from symptomatic controls with or without LTBI. RESULTS: 156 patients with active TB and 110 symptomatic controls (patients with respiratory symptoms without active TB) were investigated. Active TB patients were distinguishable from undifferentiated symptomatic controls (accuracy 87%, sensitivity 84%, specificity 90%), from symptomatic controls with LTBI (accuracy 87%, sensitivity 89%, specificity 82%) and from symptomatic controls without LTBI (accuracy 90%, sensitivity 90%, specificity 92%). CONCLUSIONS: We show that active TB can be distinguished accurately from LTBI in symptomatic clinic attenders using a plasma proteomic fingerprint. Translation of biomarkers derived from this study into a robust and affordable point-of-care format will have significant implications for recognition and control of active TB in high prevalence settings.
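
    The relationship between the reported accuracy, sensitivity and specificity can be sketched as follows. The helper names are hypothetical, and the final calculation assumes that the rounded rates for the undifferentiated comparison apply to all 156 active TB patients and all 110 controls, which the abstract does not state explicitly.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among diseased
        "specificity": tn / (tn + fp),   # true negatives among controls
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


def accuracy_from_rates(sensitivity: float, specificity: float,
                        n_cases: int, n_controls: int) -> float:
    """Accuracy as the prevalence-weighted mean of sensitivity and specificity."""
    return (sensitivity * n_cases + specificity * n_controls) / (n_cases + n_controls)


# Rounded rates from the abstract, under the assumption stated above.
implied_accuracy = accuracy_from_rates(0.84, 0.90, n_cases=156, n_controls=110)
```

    Under that assumption the implied accuracy is about 0.865, in line with the reported 87% once rounding of the input rates is taken into account.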

    The Scarface Score: Deciphering Response to DNA Damage Agents in High-Grade Serous Ovarian Cancer—A GEICO Study

    Keywords: Genomic instability; Machine learning.
    Genomic instability (GI) is a transversal phenomenon shared by several tumor types that provides both prognostic and predictive information. In the context of high-grade serous ovarian cancer (HGSOC), response to DNA-damaging agents such as platinum-based agents and poly(ADP-ribose) polymerase inhibitors (PARPi) has been closely linked to deficiencies in DNA repair by homologous recombination repair (HRR) and to GI. In this study, we have developed the Scarface score, an integrative algorithm based on genomic and transcriptomic data obtained from the NGS analysis of a prospective GEICO cohort of 190 formalin-fixed paraffin-embedded (FFPE) tumor samples from patients diagnosed with HGSOC with a median follow-up of 31.03 months (5.87–159.27 months). In the first step, three single-source models were shown to predict response: the SNP-based model (accuracy = 0.8077), which analyzes 8 SNPs distributed along the genome; the GI-based model (accuracy = 0.9038), which interrogates 28 parameters of GI; and the HTG-based model (accuracy = 0.8077), which evaluates the expression of 7 genes related to tumor biology. Then, an ensemble model called the Scarface score was found to predict response to DNA-damaging agents with an accuracy of 0.9615 and a kappa index of 0.9128 (p < 0.0001). The Scarface score moves toward the routine assessment of GI in the clinical setting, enabling its incorporation as a predictive and prognostic tool in the management of HGSOC. This research was partially funded by GVA Grants “Subvencions per a la realització de projectes d’i+d+i desenvolupats per grups d’investigació emergents (GV/2020/158)” and “Ayudas para la contratación de personal investigador en formación de carácter predoctoral” (ACIF/2016/008) and “Beca de investigación traslacional Andrés Poveda 2020” from the GEICO group. This study was awarded the Prize “Antonio Llombart Rodriguez-FINCIVO 2020” from the Royal Academy of Medicine of the Valencian Community.
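
    How three single-source predictions might be combined and then scored with a kappa index can be sketched as below. The abstract does not describe how the Scarface ensemble combines its inputs, so majority voting is an assumption made purely for illustration. The toy labels are invented and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(*model_predictions: Sequence[int]) -> List[int]:
    """Per-sample majority label across models.

    With an odd number of binary models there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) if both label vectors contain a single class.
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / n ** 2
    return (observed - expected) / (1 - expected)


# Toy responder (1) / non-responder (0) labels, invented for illustration.
y_true = [1, 1, 0, 0]
snp_pred = [1, 0, 0, 0]
gi_pred = [1, 1, 0, 1]
htg_pred = [0, 1, 0, 0]
ensemble_pred = majority_vote(snp_pred, gi_pred, htg_pred)
```

    In this toy case each single model makes one error while the vote recovers every label, which shows why an ensemble can outperform each of its components.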

    Low-level visual processing and its relation to neurological disease

    Retinal neurons extract changes in image intensity across space, time, and wavelength. The retinal signal is transmitted to the early visual cortex, where the processing of low-level visual information occurs. The fundamental nature of these early visual pathways means that they are often compromised by neurological disease. This thesis had two aims. First, it aimed to investigate changes in visual processing in response to Parkinson’s disease (PD) by using electrophysiological recordings from animal models. Second, it aimed to use functional magnetic resonance imaging (fMRI) to investigate how low-level visual processes are represented in healthy human visual cortex, focusing on two pathways often compromised in disease: the magnocellular pathway and the chromatic S-cone pathway. First, we identified a pathological mechanism of excitotoxicity in the visual system of Drosophila PD models. Next, we found that we could apply machine learning classifiers to multivariate visual response profiles recorded from the eye and brain of Drosophila and rodent PD models to accurately assign these animals to their correct class. Using fMRI and psychophysics, we found that measurements of temporal contrast sensitivity differ as a function of visual space, with peripherally tuned voxels in early visual areas showing increased contrast sensitivity at a high temporal frequency. Finally, we used 7T fMRI to investigate systematic differences in achromatic and S-cone population receptive field (pRF) size estimates in the visual cortex of healthy humans. Unfortunately, we could not replicate the fundamental effect of pRF size increasing with eccentricity, indicating complications with our data and stimulus.

    Detection of Epigenomic Network Community Oncomarkers

    In this paper we propose network methodology to infer prognostic cancer biomarkers based on the epigenetic pattern of DNA methylation. Epigenetic processes such as DNA methylation reflect environmental risk factors, and are increasingly recognised for their fundamental role in diseases such as cancer. DNA methylation is a gene-regulatory pattern, and hence provides a means by which to assess genomic regulatory interactions. Network models are a natural way to represent and analyse groups of such interactions. The utility of network models also increases as the quantity of data and number of variables increase, making them increasingly relevant to large-scale genomic studies. We propose methodology to infer prognostic genomic networks from a DNA methylation-based measure of genomic interaction and association. We then show how to identify prognostic biomarkers from such networks, which we term 'network community oncomarkers'. We illustrate the power of our proposed methodology in the context of a large publicly available breast cancer dataset.
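
    The step from an association measure to network communities can be illustrated with a deliberately simplified sketch. It is not the paper's method: it assumes Pearson correlation between per-gene methylation profiles as the association measure, a hard threshold to define edges, and connected components as a crude stand-in for community detection. Gene names and values are invented.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined for constant profiles (zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_network(profiles: Dict[str, Sequence[float]],
                        threshold: float) -> Dict[str, Set[str]]:
    """Connect two genes when |correlation| meets the (assumed) threshold."""
    edges: Dict[str, Set[str]] = {gene: set() for gene in profiles}
    for a, b in combinations(profiles, 2):
        if abs(pearson(profiles[a], profiles[b])) >= threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def connected_components(edges: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group genes into components by depth-first search."""
    seen: Set[str] = set()
    components = []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(edges[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical methylation levels for four genes across four samples.
profiles = {
    "g1": [0.1, 0.2, 0.3, 0.4],
    "g2": [0.2, 0.4, 0.6, 0.8],
    "g3": [0.9, 0.1, 0.8, 0.2],
    "g4": [0.8, 0.2, 0.9, 0.1],
}
communities = connected_components(correlation_network(profiles, threshold=0.9))
```

    Each resulting group could then be tested as a unit for association with outcome, which is the general sense in which a community rather than a single gene serves as the biomarker.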

    Dual-intended deep learning model for breast cancer diagnosis in ultrasound imaging

    Automated medical data analysis has taken on a significant role in modern medicine, including cancer diagnosis and prognosis, where the goal is highly reliable and generalizable systems. In this study, an automated breast cancer screening method in ultrasound imaging is proposed. A convolutional deep autoencoder model is presented for simultaneous segmentation and radiomic extraction. The model segments the breast lesions while concurrently extracting radiomic features. With our deep model, we perform breast lesion segmentation, which is linked to low-dimensional deep-radiomic extraction (four features). Similarly, we used high-dimensional conventional imaging throughputs and applied spectral embedding techniques to reduce their size from 354 to 12 radiomic features. A total of 780 ultrasound images (437 benign, 210 malignant, and 133 normal) were used to train and validate the models in this study. To diagnose malignant lesions, we performed training, hyperparameter tuning, cross-validation, and testing with a random forest model. This resulted in a binary classification accuracy of 78.5% (65.1–84.1%) for the maximal (full multivariate) cross-validated model for a combination of radiomic groups.
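
    With classes as unevenly sized as those reported above, cross-validation folds are usually built so that each fold preserves the class proportions. The sketch below shows one way to do that. The abstract does not state the number of folds or whether stratification was used, so the five-fold stratified split and the function name are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], n_folds: int, seed: int = 0) -> List[List[int]]:
    """Assign sample indices to folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Class counts taken from the abstract; the fold count is an assumption.
labels = ["benign"] * 437 + ["malignant"] * 210 + ["normal"] * 133
folds = stratified_folds(labels, n_folds=5)
malignant_per_fold = [sum(labels[i] == "malignant" for i in fold) for fold in folds]
```

    Each fold then serves once as the held-out set while the model is fitted on the remaining four.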
