17 research outputs found
Interpretable Multivariate Time Series Forecasting with Temporal Attention Convolutional Neural Networks
Data in time series format, such as biological signals from medical sensors or machine signals from sensors in industrial environments, are rich sources of information that can provide crucial insights into the present and future condition of a person or machine. The task of predicting future values of a time series was initially approached with simple machine learning methods, and more recently with deep learning. Two models that have shown good performance on this task are the temporal convolutional network and the attention module. However, despite the promising results of deep learning methods, their black-box nature makes them unsuitable for real-world applications where predictions need to be explainable in order to be trusted. In this paper we propose an architecture comprising a temporal convolutional network with an attention mechanism that makes predictions while highlighting the timesteps of the input that were most influential for future outputs. We apply it to two datasets and show that we gain interpretability without degrading accuracy compared to the original temporal convolutional models. We then go one step further and combine our configuration with various machine learning methods on top, creating a pipeline that achieves interpretability both across timesteps and across input features. We use it to forecast a different variable from one of the above datasets and study how the accuracy is affected compared to the original black-box approach.
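The two building blocks named in the abstract can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration of a causal (temporal) convolution followed by an attention step whose weights indicate which timesteps most influence the output; it is not the authors' architecture, and the weighting scheme here is deliberately simplistic.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: the output at time t depends only on
    inputs at times <= t (left zero-padding, as in a TCN layer)."""
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ w for t in range(len(x))])

def temporal_attention(h):
    """Softmax over per-timestep feature scores; the resulting weights
    sum to 1 and can be read as per-timestep influence."""
    scores = np.exp(h - h.max())        # numerically stable softmax
    weights = scores / scores.sum()
    context = weights @ h               # attention-weighted summary
    return context, weights

rng = np.random.default_rng(0)
x = rng.normal(size=32)                 # one univariate input window
w = rng.normal(size=3)                  # convolution filter (random stand-in)
h = causal_conv1d(x, w)                 # temporal features
pred, attn = temporal_attention(h)      # prediction + influence weights
```

Inspecting `attn` after a forward pass is what gives interpretability across timesteps; the paper's pipeline adds methods on top to obtain interpretability across input features as well.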
When we can trust computers (and when we can't)
With the relentless rise of computer power, there is a widespread expectation that computers can solve the most pressing problems of science, and even more besides. We explore the limits of computational modelling and conclude that, in the domains of science and engineering that are relatively simple and firmly grounded in theory, these methods are indeed powerful. Even so, the availability of code, data and documentation, along with a range of techniques for validation, verification and uncertainty quantification, is essential for building trust in computer-generated findings. When it comes to complex systems in domains of science that are less firmly grounded in theory, notably biology and medicine, to say nothing of the social sciences and humanities, computers can create the illusion of objectivity, not least because the rise of big data and machine learning poses new challenges to reproducibility, while lacking true explanatory power. We also discuss important aspects of the natural world that cannot be addressed by digital means. In the long term, renewed emphasis on analogue methods will be necessary to temper the excessive faith currently placed in digital computation. This article is part of the theme issue 'Reliability and reproducibility in computational science: implementing verification, validation and uncertainty quantification in silico'.
Reproducibility in Multiple Instance Learning: A Case For Algorithmic Unit Tests
Multiple Instance Learning (MIL) is a sub-domain of classification problems
with positive and negative labels and a "bag" of inputs, where the label is
positive if and only if a positive element is contained within the bag, and
otherwise is negative. Training in this context requires associating the
bag-wide label to instance-level information, and implicitly contains a causal
assumption and asymmetry to the task (i.e., you can't swap the labels without
changing the semantics). MIL problems occur in healthcare (one malignant cell
indicates cancer), cyber security (one malicious executable makes an infected
computer), and many other tasks. In this work, we examine five of the most
prominent deep-MIL models and find that none of them respects the standard MIL
assumption. They are able to learn anti-correlated instances, i.e., defaulting
to "positive" labels until seeing a negative counter-example, which should not
be possible for a correct MIL model. We suspect that enhancements and other
works derived from these models will share the same issue. In any context in
which these models are being used, this creates the potential for learning
incorrect models, which creates risk of operational failure. We identify and
demonstrate this problem via a proposed "algorithmic unit test", where we
create synthetic datasets that can be solved by a MIL respecting model, and
which clearly reveal learning that violates MIL assumptions. The five evaluated
methods each fail one or more of these tests. This provides a model-agnostic
way to identify violations of modeling assumptions, which we hope will be
useful for future development and evaluation of MIL models.
Comment: To appear in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
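The "algorithmic unit test" idea can be made concrete with a toy construction. The sketch below, a simplified assumption of my own and not the authors' actual datasets, builds synthetic bags whose label is positive iff some instance exceeds a threshold, and shows that a max-pooling predictor, which respects the MIL assumption by design, solves it exactly; a model that learns anti-correlated instances would fail such a test.

```python
import numpy as np

def make_bags(n_bags, bag_size, rng):
    """Synthetic MIL data: an instance is 'positive' iff its value
    exceeds 2; the bag label is 1 iff the bag contains at least one
    positive instance (the standard MIL assumption)."""
    bags = rng.normal(size=(n_bags, bag_size))
    labels = (bags > 2).any(axis=1).astype(int)
    return bags, labels

def mil_max_pool_predict(bag, instance_score):
    """A MIL-respecting predictor: bag score = max instance score, so
    extra negative instances can never flip a bag to positive."""
    return int(max(instance_score(i) for i in bag) > 0)

rng = np.random.default_rng(1)
bags, labels = make_bags(200, 10, rng)
# Oracle instance scorer for this toy data: positive iff value > 2.
preds = [mil_max_pool_predict(b, lambda i: i - 2) for b in bags]
```

Because the test is defined on the data rather than on model internals, it is model-agnostic: any trained deep-MIL model can be scored on such bags, and systematic failures reveal a violated assumption.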
Benchmarking Encoder-Decoder Architectures for Biplanar X-ray to 3D Shape Reconstruction
Various deep learning models have been proposed for 3D bone shape
reconstruction from two orthogonal (biplanar) X-ray images. However, it is
unclear how these models compare against each other since they are evaluated on
different anatomy, cohort and (often privately held) datasets. Moreover, the
impact of the commonly optimized image-based segmentation metrics such as dice
score on the estimation of clinical parameters relevant in 2D-3D bone shape
reconstruction is not well known. To move closer toward clinical translation,
we propose a benchmarking framework that evaluates tasks relevant to real-world
clinical scenarios, including reconstruction of fractured bones, bones with
implants, robustness to population shift, and error in estimating clinical
parameters. Our open-source platform provides reference implementations of 8
models (many of whose implementations were not publicly available), APIs to
easily collect and preprocess 6 public datasets, and the implementation of
automatic clinical parameter and landmark extraction methods. We present an
extensive evaluation of 8 2D-3D models on equal footing using 6 public datasets
comprising images for four different anatomies. Our results show that
attention-based methods that capture global spatial relationships tend to
perform better across all anatomies and datasets; performance on clinically
relevant subgroups may be overestimated without disaggregated reporting; ribs
are substantially more difficult to reconstruct compared to femur, hip and
spine; and the dice score improvement does not always bring a corresponding
improvement in the automatic estimation of clinically relevant parameters.
Comment: accepted to NeurIPS 202
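For readers unfamiliar with the segmentation metric the abstract questions, here is a minimal, self-contained Dice score implementation (an illustrative sketch, not code from the benchmark). The toy masks below also hint at the paper's point: two masks can overlap well by Dice while still disagreeing on a localized region that matters clinically.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap 2|A∩B| / (|A| + |B|) between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom

a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
b = np.array([1, 1, 1, 0, 1, 0, 0, 0])
score = dice_score(a, b)  # 0.75 despite a misplaced foreground voxel
```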
On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices
While learning with limited labelled data can improve performance when the
labels are lacking, it is also sensitive to the effects of uncontrolled
randomness introduced by so-called randomness factors (e.g., varying order of
data). We propose a method to systematically investigate the effects of
randomness factors while taking the interactions between them into
consideration. To measure the true effects of an individual randomness factor,
our method mitigates the effects of other factors and observes how the
performance varies across multiple runs. Applying our method to multiple
randomness factors across in-context learning and fine-tuning approaches on 7
representative text classification tasks and meta-learning on 3 tasks, we show
that: 1) disregarding interactions between randomness factors in existing works
caused inconsistent findings due to incorrect attribution of the effects of
randomness factors, such as disproving the consistent sensitivity of in-context
learning to sample order even with random sample selection; and 2) besides
mutual interactions, the effects of randomness factors, especially sample
order, are also dependent on more systematic choices unexplored in existing
works, such as the number of classes, the number of samples per class, or the choice of prompt format.
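The measurement strategy described above, varying one randomness factor while averaging out ("mitigating") the others, can be sketched as follows. This is a hypothetical illustration with a toy stand-in for training, not the authors' implementation; `effect_of_factor` and `toy_train_eval` are names introduced here.

```python
import random
import statistics

def effect_of_factor(train_eval, factor_seeds, control_seeds):
    """Estimate one randomness factor's true effect: vary its seed,
    average performance over the other factors' seeds to mitigate
    their influence, then report the spread of those averages."""
    means = []
    for f in factor_seeds:
        runs = [train_eval(factor_seed=f, other_seed=c)
                for c in control_seeds]
        means.append(statistics.mean(runs))  # mitigate other factors
    return statistics.stdev(means)           # spread due to the factor

def toy_train_eval(factor_seed, other_seed):
    # Stand-in for a real train/evaluate run; accuracy jitters with
    # both the investigated factor's seed and the remaining seeds.
    rng = random.Random(factor_seed * 1000 + other_seed)
    return 0.80 + 0.05 * rng.random()

effect = effect_of_factor(toy_train_eval,
                          factor_seeds=range(5),
                          control_seeds=range(10))
```

Comparing such per-factor estimates with and without the inner averaging loop is what exposes the misattribution the paper describes: without mitigation, variance from interacting factors gets credited to the wrong one.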