An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification
Precise and efficient automated identification of Gastrointestinal (GI) tract
diseases can help doctors treat more patients and improve the rate of disease
detection and identification. Currently, automatic analysis of diseases in the
GI tract is a hot topic in both computer science and medical-related journals.
Nevertheless, the evaluation of such an automatic analysis is often incomplete
or simply wrong. Algorithms are often only tested on small and biased datasets,
and cross-dataset evaluations are rarely performed. A clear understanding of
evaluation metrics and of how machine learning models behave across datasets is crucial
to bring research in the field to a new quality level. Towards this goal, we
present comprehensive evaluations of five distinct machine learning models
using Global Features and Deep Neural Networks that can classify 16 different
key types of GI tract conditions, including pathological findings, anatomical
landmarks, polyp removal conditions, and normal findings from images captured
by common GI tract examination instruments. In our evaluation, we introduce
performance hexagons built from six performance metrics (recall, precision,
specificity, accuracy, F1-score, and Matthews Correlation Coefficient) to
demonstrate how to determine the real capabilities of models rather than
evaluating them shallowly. Furthermore, we perform cross-dataset evaluations
using different datasets for training and testing. With these cross-dataset
evaluations, we demonstrate the challenge of actually building a generalizable
model that could be used across different hospitals. Our experiments clearly
show that more sophisticated performance metrics and evaluation methods need to
be applied to get reliable models rather than depending on evaluations of the
splits of the same dataset, i.e., the performance metrics should always be
interpreted together rather than relying on a single metric.

Comment: 30 pages, 12 figures, 8 tables, Accepted for ACM Transactions on
Computing for Healthcare
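
The six metrics named in the abstract can all be derived from the counts of a binary (one-vs-rest) confusion matrix. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `binary_metrics`, the zero-denominator convention, and the example counts are hypothetical and chosen only to illustrate why the metrics should be read together.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute six metrics from one-vs-rest confusion-matrix counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    recall = safe_div(tp, tp + fn)            # sensitivity / true positive rate
    precision = safe_div(tp, tp + fp)         # positive predictive value
    specificity = safe_div(tn, tn + fp)       # true negative rate
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * tp, 2 * tp + fp + fn)   # harmonic mean of precision and recall
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)

    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for a rare class: 10 positives among 1000 images.
metrics = binary_metrics(tp=5, fp=5, tn=985, fn=5)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

With these made-up counts, accuracy and specificity are both about 0.99, while recall, precision, and F1-score are 0.5 and the Matthews Correlation Coefficient is below 0.5. A reader looking only at accuracy would overestimate the classifier, which is the kind of shallow evaluation the abstract warns against.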