Challenge Results Are Not Reproducible
While clinical trials are the state-of-the-art methods to assess the effect
of new medication in a comparative manner, benchmarking in the field of medical
image analysis is performed by so-called challenges. Recently, comprehensive
analysis of multiple biomedical image analysis challenges revealed large
discrepancies between the impact of challenges and quality control of the
design and reporting standard. This work aims to follow up on these results and
attempts to address the specific question of the reproducibility of the
participants' methods. In an effort to determine whether alternative
interpretations of the method description may change the challenge ranking, we
reproduced the algorithms submitted to the 2019 Robust Medical Image
Segmentation Challenge (ROBUST-MIS). The leaderboard differed substantially
between the original challenge and reimplementation, indicating that challenge
rankings may not be sufficiently reproducible.
Comment: Accepted at BVM 202
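One way to put a number on how far a reimplemented leaderboard departs from the original is a rank correlation such as Kendall's tau. The sketch below is illustrative only: the abstract does not state which comparison the authors used, and the team names and ranks are hypothetical placeholders, not ROBUST-MIS results.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings given as {team: rank} dicts.

    Assumes both rankings cover the same teams and contain no ties.
    Returns 1.0 for identical rankings and -1.0 for fully reversed ones.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same teams")
    teams = sorted(rank_a)
    if len(teams) < 2:
        raise ValueError("need at least two teams to compare rankings")

    concordant = 0
    discordant = 0
    for t1, t2 in combinations(teams, 2):
        # Positive product: the pair is ordered the same way in both rankings.
        sign = (rank_a[t1] - rank_a[t2]) * (rank_b[t1] - rank_b[t2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(teams) * (len(teams) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical leaderboards (placeholder values, not taken from the challenge).
original = {"team_A": 1, "team_B": 2, "team_C": 3, "team_D": 4}
reimplemented = {"team_A": 2, "team_B": 1, "team_C": 4, "team_D": 3}

tau = kendall_tau(original, reimplemented)
```

A value near 1 would indicate that the reimplementation largely preserves the original ordering, while values near 0 or below would point to substantial reshuffling.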
Optical detection of di- and triphosphate anions with mixed monolayer-protected gold nanoparticles containing zinc(II)–dipicolylamine complexes
Gold nanoparticles covered with a mixture of ligands of which one type contains solubilizing triethylene glycol residues and the other peripheral zinc(II)–dipicolylamine (DPA) complexes allowed the optical detection of hydrogenphosphate, diphosphate, and triphosphate anions in water/methanol 1:2 (v/v). These anions caused the bright red solutions of the nanoparticles to change their color because of nanoparticle aggregation followed by precipitation, whereas halides or oxoanions such as sulfate, nitrate, or carbonate produced no effect. The sensitivity of phosphate sensing depended on the nature of the anion, with diphosphate and triphosphate inducing visual changes at significantly lower concentrations than hydrogenphosphate. The sensing sensitivity was also affected by the ratio of the ligands on the nanoparticle surface, decreasing as the number of immobilized zinc(II)–dipicolylamine groups increased. A nanoparticle containing a 9:1 ratio of the solubilizing and the anion-binding ligand showed a color change at diphosphate and triphosphate concentrations as low as 10 μmol/L, for example, and precipitated at slightly higher concentrations. Hydrogenphosphate induced nanoparticle precipitation only at a concentration of ca. 400 μmol/L, at which the precipitates formed in the presence of diphosphates and triphosphates redissolved. A nanoparticle containing fewer binding sites was more sensitive, while increasing the relative number of zinc(II)–dipicolylamine complexes beyond 25% had a negative impact on the limit of detection and the optical response. Transmission electron microscopy provided evidence that the changes of the nanoparticle properties observed in the presence of the phosphates were due to nanoparticle crosslinking, consistent with the preferred binding mode of zinc(II)–dipicolylamine complexes with phosphate anions, which involves binding of the anion between two metal centers.
This work thus provided information on how the behavior of mixed monolayer-protected gold nanoparticles is affected by multivalent interactions, at the same time introducing a method to assess whether certain biologically relevant anions are present in an aqueous solution within a specific concentration range.
Deployment of Image Analysis Algorithms under Prevalence Shifts
Domain gaps are among the most relevant roadblocks in the clinical
translation of machine learning (ML)-based solutions for medical image
analysis. While current research focuses on new training paradigms and network
architectures, little attention is given to the specific effect of prevalence
shifts on an algorithm deployed in practice. Such discrepancies between class
frequencies in the data used for a method's development/validation and that in
its deployment environment(s) are of great importance, for example in the
context of artificial intelligence (AI) democratization, as disease prevalences
may vary widely across time and location. Our contribution is twofold. First,
we empirically demonstrate the potentially severe consequences of missing
prevalence handling by analyzing (i) the extent of miscalibration, (ii) the
deviation of the decision threshold from the optimum, and (iii) the ability of
validation metrics to reflect neural network performance on the deployment
population as a function of the discrepancy between development and deployment
prevalence. Second, we propose a workflow for prevalence-aware image
classification that uses estimated deployment prevalences to adjust a trained
classifier to a new environment, without requiring additional annotated
deployment data. Comprehensive experiments based on a diverse set of 30 medical
classification tasks showcase the benefit of the proposed workflow in
generating better classifier decisions and more reliable performance estimates
compared to current practice.
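To illustrate how an estimated deployment prevalence can be used to adjust a trained classifier without new annotated data, here is a minimal sketch of the textbook prior-shift correction: each predicted class probability is re-weighted by the ratio of deployment to development prevalence and the result is renormalized. This is an assumption about one possible adjustment step, not the authors' actual workflow; the function name and the example numbers are hypothetical, and the correction presumes the classifier's outputs are calibrated on the development data.

```python
def adjust_for_prevalence(probs, dev_prevalence, deploy_prevalence):
    """Re-weight class probabilities for a new class-prevalence setting.

    probs:             predicted class probabilities for one sample
    dev_prevalence:    class frequencies in the development data
    deploy_prevalence: estimated class frequencies at deployment

    Applies p'(c) proportional to p(c) * deploy(c) / dev(c), then renormalizes.
    Assumes only the class priors differ between the two environments.
    """
    if not (len(probs) == len(dev_prevalence) == len(deploy_prevalence)):
        raise ValueError("all inputs must have one entry per class")
    if any(p <= 0 for p in dev_prevalence):
        raise ValueError("development prevalences must be positive")

    reweighted = [
        p * (q / d)
        for p, d, q in zip(probs, dev_prevalence, deploy_prevalence)
    ]
    total = sum(reweighted)
    if total == 0:
        raise ValueError("adjusted probabilities sum to zero")
    return [r / total for r in reweighted]


# Hypothetical example: a balanced development set, but the disease
# (class 1) is rare in the deployment environment.
adjusted = adjust_for_prevalence(
    probs=[0.5, 0.5],
    dev_prevalence=[0.5, 0.5],
    deploy_prevalence=[0.9, 0.1],
)
```

In this example a sample the classifier considers a coin flip under balanced classes is pushed toward the majority class once the lower deployment prevalence is taken into account, which in turn shifts the effective decision threshold.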
Ten years of image analysis and machine learning competitions in dementia
Machine learning methods exploiting multi-parametric biomarkers, especially
based on neuroimaging, have huge potential to improve early diagnosis of
dementia and to predict which individuals are at risk of developing dementia.
To benchmark algorithms in the field of machine learning and neuroimaging in
dementia and assess their potential for use in clinical practice and clinical
trials, seven grand challenges have been organized in the last decade.
The seven grand challenges addressed questions related to screening, clinical
status estimation, prediction and monitoring in (pre-clinical) dementia. There
was little overlap in clinical questions, tasks and performance metrics.
While this helps provide insight into a broad range of questions, it also
limits the validation of results across challenges. The validation process
itself was mostly comparable between challenges, using similar methods for
ensuring objective comparison, uncertainty estimation and statistical testing.
In general, winning algorithms performed rigorous data preprocessing and
combined a wide range of input features.
Despite high state-of-the-art performances, most of the methods evaluated by
the challenges are not clinically used. To increase impact, future challenges
could pay more attention to statistical analysis of which factors relate to
higher performance, to clinical questions beyond Alzheimer's disease, and to
using testing data beyond the Alzheimer's Disease Neuroimaging Initiative.
Grand challenges would be an ideal venue for assessing the generalizability of
algorithm performance to unseen data of other cohorts. Key for increasing
impact in this way are larger testing data sizes, which could be reached by
sharing algorithms rather than data to exploit data that cannot be shared.
Comment: 12 pages, 4 table
BIAS: Transparent reporting of biomedical image analysis challenges
The number of biomedical image analysis challenges organized per year is steadily increasing. These international competitions have the purpose of benchmarking algorithms on common data sets, typically to identify the best method for a given problem. Recent research, however, revealed that common practice related to challenge reporting does not allow for adequate interpretation and reproducibility of results. To address the discrepancy between the impact of challenges and the quality (control), the Biomedical Image Analysis ChallengeS (BIAS) initiative developed a set of recommendations for the reporting of challenges. The BIAS statement aims to improve the transparency of the reporting of a biomedical image analysis challenge regardless of field of application, image modality or task category assessed. This article describes how the BIAS statement was developed and presents a checklist which authors of biomedical image analysis challenges are encouraged to include in their submission when submitting a paper on a challenge for review. The purpose of the checklist is to standardize and facilitate the review process and to raise the interpretability and reproducibility of challenge results by making relevant information explicit.