The FDA contribution to Health Data Science
This contribution presents examples of Health Data Science in which advanced methods based on Functional Data Analysis bring value to clinical and biological problems.
Network analysis of comorbidity patterns in heart failure patients using administrative data
Background: Congestive Heart Failure (HF) is a widespread chronic disease with a very high incidence among elderly people. The high mortality and readmission rates of HF strongly depend on the complicated morbidity scenario that often characterises it.
Methods: Data were retrieved from the healthcare administrative data warehouse of Lombardy, the most populous region in Italy. Network analysis techniques and community detection algorithms were applied to comorbidities registered in hospital discharge papers of HF patients, in 7 cohorts between 2006 and 2012.
Results: The relevance network indexes applied to the 7 cohorts identified death, hypertension, arrhythmia, renal and pulmonary diseases as the most relevant nodes related to HF, in terms of prevalence and closeness/strength of the relationship. Moreover, 3 clusters of nodes were identified in all the cohorts, namely those related to cancer, lung diseases and heart/circulation-related problems.
Conclusions: Network analysis can be a useful tool in an epidemiological framework when relational data are the object of the investigation, since it makes it possible to visualize and make inference on patterns of association among nodes (here HF comorbidities) by means of both qualitative indexes and clustering techniques.
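As a rough illustration of the kind of network described above, the following minimal sketch builds a weighted co-occurrence network from lists of comorbidities and computes node strength. It is not the authors' pipeline: the function names `cooccurrence_network` and `node_strength` and the toy records are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records):
    """Build an undirected weighted network.

    Each edge weight is the number of records in which both
    comorbidities appear together.
    """
    edges = Counter()
    for comorbidities in records:
        # Sort so that each unordered pair maps to a single key.
        for a, b in combinations(sorted(set(comorbidities)), 2):
            edges[(a, b)] += 1
    return edges


def node_strength(edges):
    """Strength of a node: sum of the weights of its incident edges."""
    strength = Counter()
    for (a, b), weight in edges.items():
        strength[a] += weight
        strength[b] += weight
    return strength


# Toy, made-up discharge records (NOT data from the study).
records = [
    ["hypertension", "arrhythmia"],
    ["hypertension", "renal disease"],
    ["hypertension", "arrhythmia", "renal disease"],
    ["pulmonary disease"],
]
edges = cooccurrence_network(records)
strength = node_strength(edges)
```

Nodes that never co-occur with another comorbidity have no incident edges, so their strength is zero. Community detection would then be run on the resulting weighted graph with a dedicated algorithm.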
Learning Signal Representations for EEG Cross-Subject Channel Selection and Trial Classification
EEG technology finds applications in several domains. Currently, most EEG
systems require subjects to wear several electrodes on the scalp to be
effective. However, using several channels may introduce noisy information and
redundant signals, lengthen preparation, and increase the computational time of
any automated system for EEG decoding. One way to increase the signal-to-noise
ratio and improve classification accuracy is to combine channel selection with
feature extraction, but EEG signals are known to present high inter-subject
variability. In this work we introduce a novel algorithm for
subject-independent channel selection of EEG recordings. Considering
multi-channel trial recordings as statistical units and the EEG decoding task
as the class of reference, the algorithm (i) exploits channel-specific
1D-Convolutional Neural Networks (1D-CNNs) as feature extractors in a
supervised fashion to maximize class separability; (ii) reduces a
high-dimensional multi-channel trial representation into a single trial vector by
concatenating the channels' embeddings and (iii) recovers the complex
inter-channel relationships during channel selection, by exploiting an ensemble
of AutoEncoders (AE) to identify from these vectors the most relevant channels
to perform classification. After training, the algorithm can be exploited by
transferring only the parametrized subgroup of selected channel-specific
1D-CNNs to new signals from new subjects, yielding low-dimensional and highly
informative trial vectors that can be fed to any classifier.
A general framework for penalized mixed-effects multitask learning with applications on DNA methylation surrogate biomarkers creation
Recent evidence highlights the usefulness of DNA methylation (DNAm)
biomarkers as surrogates for exposure to risk factors for noncommunicable
diseases in epidemiological studies and randomized trials. DNAm variability
has been demonstrated to be tightly related to lifestyle behavior and exposure
to environmental risk factors, ultimately providing an unbiased proxy of
an individual's state of health. At present, the creation of DNAm surrogates
relies on univariate penalized regression models, with the elastic-net regularizer
being the gold standard for the task. Nonetheless, more advanced
modeling procedures are required in the presence of multivariate outcomes
with a structured dependence pattern among the study samples. In this
work we propose a general framework for mixed-effects multitask learning
in the presence of high-dimensional predictors to develop a multivariate DNAm
biomarker from a multicenter study. A penalized estimation scheme, based
on an expectation-maximization algorithm, is devised in which any penalty
criterion for fixed-effects models can be conveniently incorporated in the fitting
process. We apply the proposed methodology to create novel DNAm
surrogate biomarkers for multiple correlated risk factors for cardiovascular
diseases and comorbidities. We show that the proposed approach, modeling
multiple outcomes together, outperforms state-of-the-art alternatives both in
predictive power and biomolecular interpretation of the results.
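For reference, the univariate elastic-net estimator that the abstract describes as the current gold standard can be written as follows. This is the standard textbook form, not a formula taken from the paper; the symbols $y$, $X$, $\beta$, $\lambda$ and $\alpha$ are notation assumed here.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right),
\qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression. The proposed framework extends this kind of penalty to multivariate outcomes with random effects.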
Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics
Survival analysis is a fundamental tool in medicine, modeling the time until
an event of interest occurs in a population. However, in real-world
applications, survival data are often incomplete, censored, distributed, and
confidential, especially in healthcare settings where privacy is critical. The
scarcity of data can severely limit the scalability of survival models to
distributed applications that rely on large data pools. Federated learning is a
promising technique that enables machine learning models to be trained on
multiple datasets without compromising user privacy, making it particularly
well-suited for addressing the challenges of survival data and large-scale
survival applications. Despite significant developments in federated learning
for classification and regression, many directions remain unexplored in the
context of survival analysis. In this work, we propose an extension of the
Federated Survival Forest algorithm, called FedSurF++. This federated ensemble
method constructs random survival forests in heterogeneous federations.
Specifically, we investigate several new tree sampling methods from client
forests and compare the results with state-of-the-art survival models based on
neural networks. The key advantage of FedSurF++ is its ability to achieve
comparable performance to existing methods while requiring only a single
communication round to complete. The extensive empirical investigation results
in a significant improvement from the algorithmic and privacy preservation
perspectives, making the original FedSurF algorithm more efficient, robust, and
private. We also present results on two real-world datasets demonstrating the
success of FedSurF++ in real-world healthcare studies. Our results underscore
the potential of FedSurF++ to improve the scalability and effectiveness of
survival analysis in distributed settings while preserving user privacy.
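To make the single-round idea concrete, here is a minimal sketch of a server-side step that assembles a global ensemble by sampling trees from client forests. The sampling rule shown (clients drawn with probability proportional to dataset size, then a tree drawn uniformly from that client's forest) is a hypothetical example and is not claimed to be one of the sampling methods studied in the paper. The function name `sample_trees` and the placeholder trees are assumptions.

```python
import random


def sample_trees(client_forests, client_sizes, n_trees, seed=0):
    """Assemble a global ensemble of n_trees from client forests.

    Assumed rule (illustrative only): pick a client with probability
    proportional to its dataset size, then pick one of its trees
    uniformly at random. Sampling is done with replacement.
    """
    rng = random.Random(seed)
    clients = list(client_forests)
    weights = [client_sizes[c] for c in clients]
    ensemble = []
    for _ in range(n_trees):
        client = rng.choices(clients, weights=weights, k=1)[0]
        ensemble.append(rng.choice(client_forests[client]))
    return ensemble


# Trees are opaque placeholders here; in practice they would be
# fitted survival trees sent once by each client.
client_forests = {"a": ["a_tree_0", "a_tree_1"], "b": ["b_tree_0"]}
client_sizes = {"a": 100, "b": 0}
global_forest = sample_trees(client_forests, client_sizes, n_trees=5)
```

Because clients send their forests once and the server only samples from them, no further communication rounds are needed in this sketch.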
Estimation of Dynamic Origin-Destination Matrices in a Railway Transportation Network integrating Ticket Sales and Passenger Count Data
Accurately estimating Origin-Destination (OD) matrices is a topic of
increasing interest for efficient transportation network management and
sustainable urban planning. Traditionally, travel surveys have supported this
process; however, their availability and comprehensiveness can be limited.
Moreover, the recent COVID-19 pandemic has triggered unprecedented shifts in
mobility patterns, underscoring the urgency of accurate and dynamic mobility
data supporting policies and decisions with data-driven evidence. In this
study, we tackle these challenges by introducing an innovative pipeline for
estimating dynamic OD matrices. The real-world problem motivating this work
concerns the Trenord railway transportation network in Lombardy, Italy. We
apply a novel approach that integrates ticket and subscription sales data with
passenger counts obtained from Automated Passenger Counting (APC) systems,
making use of the Iterative Proportional Fitting (IPF) algorithm. Our work
effectively addresses the complexities posed by incomplete and diverse data
sources, showcasing the adaptability of our pipeline across various
transportation contexts. Ultimately, this research bridges the gap between
available data sources and the escalating need for precise OD matrices. The
proposed pipeline fosters a comprehensive grasp of transportation network
dynamics, providing a valuable tool for transportation operators, policymakers,
and researchers. Indeed, to highlight the potential of dynamic OD matrices,
we showcase some methods to perform anomaly detection of mobility trends in the
network through such matrices and interpret them in light of events that
happened in the last months of 2022.
Comment: Code available. Synthetic data available. Application to train network data. 27 pages, 6 tables, 16 figures
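The Iterative Proportional Fitting step mentioned in the abstract can be sketched as below. This is a generic two-dimensional IPF, not the authors' pipeline: the names `ipf`, `seed`, `row_targets` and `col_targets` are hypothetical, and reading the seed matrix as ticket-based flows and the margins as counted boardings and alightings is an assumption made for illustration.

```python
def ipf(seed, row_targets, col_targets, n_iter=100, tol=1e-9):
    """Rescale a seed matrix so its row and column sums match the targets.

    Rows are origins and columns are destinations. The targets must
    have the same total for the procedure to converge.
    """
    m = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(n_iter):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(m[i])
            if s > 0:
                m[i] = [v * target / s for v in m[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
        # Columns now match; stop once rows match as well.
        if all(abs(sum(m[i]) - t) <= tol for i, t in enumerate(row_targets)):
            break
    return m


# Toy numbers, not real data.
od = ipf([[1.0, 2.0], [3.0, 4.0]], [40.0, 60.0], [30.0, 70.0])
```

The fitted matrix keeps the interaction structure of the seed while reproducing the observed margins, which is why IPF is a natural fit for combining a prior OD estimate with counts.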
Dynamic treatment effect phenotyping through functional survival analysis
In recent years, research interest in personalised treatments has been
growing. However, treatment effect heterogeneity and possibly time-varying
treatment effects are still often overlooked in clinical studies. Statistical
tools are needed for the identification of treatment response patterns, taking
into account that treatment response is not constant over time. We aim to
provide an innovative method to obtain dynamic treatment effect phenotypes on a
time-to-event outcome, conditioned on a set of relevant effect modifiers. The
proposed method does not require the assumption of proportional hazards for the
treatment effect, which is rarely realistic. We propose a spline-based survival
neural network, inspired by the Royston-Parmar survival model, to estimate
time-varying conditional treatment effects. We then exploit the functional
nature of the resulting estimates to apply a functional clustering of the
treatment effect curves in order to identify different patterns of treatment
effects. The application that motivated this work is the discontinuation of
treatment with Mineralocorticoid receptor Antagonists (MRAs) in patients with
heart failure, where there is no clear evidence as to which patients can most
safely discontinue treatment and for which, conversely, discontinuation leads to a
higher risk of adverse events. The data come from an electronic health record
database. A simulation study was performed to assess the performance of the
spline-based neural network and the stability of the treatment response
phenotyping procedure. In light of the results, the suggested approach has the
potential to support personalized medical choices by assessing unique treatment
responses in various medical contexts over a period of time.
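For context, the classical Royston-Parmar model that inspires the network expresses the log cumulative hazard as a spline in log time. The form below is the standard proportional-hazards-scale version, given under assumed notation. The treatment-effect contrast $\tau$ is one possible definition chosen for illustration, not necessarily the one used in the paper.

```latex
\ln H(t \mid x) = s(\ln t;\, \gamma) + x^{\top}\beta ,
\qquad
S(t \mid x) = \exp\{-H(t \mid x)\} ,

\tau(t \mid x) = S(t \mid x,\, a = 1) - S(t \mid x,\, a = 0) .
```

Here $s(\cdot;\gamma)$ is a restricted cubic spline and $a$ denotes treatment. Letting the spline coefficients depend on $x$ and $a$, for example through a neural network, removes the proportional hazards restriction, and the resulting curves $t \mapsto \tau(t \mid x)$ are the functional objects that can be clustered.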
Non-parametric frailty Cox models for hierarchical time-to-event data
We propose a novel model for hierarchical time-to-event data, for example, healthcare data in which patients are grouped by their healthcare provider. The most common model for this kind of data is the Cox proportional hazards model, with frailties that are common to patients in the same group and given a parametric distribution. We relax the parametric frailty assumption in this class of models by using a non-parametric discrete distribution. This improves the flexibility of the model by allowing very general frailty distributions and enables the data to be clustered into groups of healthcare providers with a similar frailty. A tailored Expectation-Maximization algorithm is proposed for estimating the model parameters, methods of model selection are compared, and the code is assessed in simulation studies. This model is particularly useful for administrative data in which there is a limited number of covariates available to explain the heterogeneity associated with the risk of the event. We apply the model to a clinical administrative database recording times to hospital readmission, and related covariates, for patients previously admitted once to hospital for heart failure, and we explore latent clustering structures among healthcare providers.
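A shared-frailty Cox model with a discrete frailty distribution, as described above, can be written in the following generic form. The notation ($u_j$, $w_k$, $\pi_k$, $K$) is assumed here for illustration and may differ from the paper's.

```latex
\lambda_{ij}(t \mid u_j) = \lambda_0(t)\, u_j \exp\!\left(x_{ij}^{\top}\beta\right),
\qquad
P(u_j = w_k) = \pi_k, \quad k = 1, \dots, K, \quad \sum_{k=1}^{K} \pi_k = 1 .
```

Here $i$ indexes patients and $j$ indexes healthcare providers. Because each provider's frailty takes one of $K$ values, providers sharing the same $w_k$ form a latent cluster, and the cluster memberships are the natural latent variables for an Expectation-Maximization algorithm.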
- …