Deep learning (DL) can aid doctors in detecting worsening patient states
early, affording them time to react and prevent adverse outcomes. While DL-based
early warning models usually work well at the hospitals they were trained on,
they tend to be less reliable when applied at new hospitals, which makes it
difficult to deploy them at scale. Using carefully harmonised intensive care
data from four data sources across Europe and the US (totalling 334,812 stays),
we systematically assessed the reliability of DL models for three common
adverse events: death, acute kidney injury (AKI), and sepsis. We tested whether
using more than one data source and/or explicitly optimising for
generalisability during training improves model performance at new hospitals.
We found that models achieved high AUROC for mortality (0.838-0.869), AKI
(0.823-0.866), and sepsis (0.749-0.824) at the training hospital. As expected,
performance dropped at new hospitals, sometimes by as much as 0.200 AUROC. Using
more than one data source for training mitigated the performance drop, with
multi-source models performing roughly on par with the best single-source
model. This suggests that as data from more hospitals become available for
training, model robustness is likely to increase, with the most applicable
data source in the training data providing a lower bound on that robustness.
Dedicated methods promoting generalisability did not noticeably improve
performance in our experiments.
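As a rough illustration of the multi-source training and external-validation setup summarised above, the sketch below pools several training sources, fits a simple classifier, and reports AUROC at a hospital excluded from training. All dataset names, the placeholder data generator, and the classifier are hypothetical stand-ins, not the paper's harmonised ICU data or DL models.

```python
# Minimal sketch (not the paper's pipeline): pooled multi-source training
# with external validation, reporting AUROC at a held-out hospital.
# Dataset names, data generator, and model are illustrative placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_placeholder_dataset(n, shift=0.0):
    """Stand-in for one harmonised ICU dataset (features X, binary labels y)."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 20))
    logits = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
    y = (logits > 0).astype(int)
    return X, y

# Hypothetical training sources; the study harmonised four real ICU databases.
sources = {name: make_placeholder_dataset(2000, shift=s)
           for name, s in [("source_a", 0.0), ("source_b", 0.3), ("source_c", -0.2)]}
external = {"held_out_hospital": make_placeholder_dataset(2000, shift=0.5)}

# Pool all training sources (the "multi-source" setting).
X_train = np.vstack([X for X, _ in sources.values()])
y_train = np.concatenate([y for _, y in sources.values()])

model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
model.fit(X_train, y_train)

# External validation: AUROC at a hospital never seen during training.
for name, (X_ext, y_ext) in external.items():
    auroc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    print(f"AUROC at {name}: {auroc:.3f}")
```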