A hybrid vision transformer with ensemble CNN framework for cervical cancer diagnosis

Abstract

Cervical cancer is a leading cause of cancer-related deaths among women worldwide, necessitating early and accurate detection methods. This study introduces a hybrid framework that combines Vision Transformers (ViT) with an ensemble of convolutional neural network (CNN) models for cervical cancer classification from Pap smear images. Two prominent datasets, Mendeley LBC and SIPaKMeD, are employed, together encompassing nine distinct categories of cervical cell abnormalities. The proposed approach uses the pre-trained CNN models DenseNet201, Xception, and InceptionResNetV2 to extract high-level features, which are then fused through ensemble learning. The fused features are processed by a ViT-based encoder designed for improved interpretability and accuracy. Experimental results show that the hybrid model achieves an accuracy of 97.26%, a recall of 97.27%, a precision of 97.27%, and an F1-score of 96.70% on the Mendeley LBC dataset. On the SIPaKMeD dataset, it achieves an accuracy of 99.18%, a recall of 99.18%, a precision of 99.15%, and an F1-score of 99.21%. On the combined dataset, the model outperforms the individual pre-trained models with 95.10% accuracy and a 95.01% F1-score. Moreover, the framework is augmented with Explainable AI (XAI) techniques, specifically Grad-CAM, to provide transparent and interpretable diagnostic outcomes, enhancing its utility in clinical settings. This research underscores the potential of hybrid AI frameworks to improve cervical cancer diagnostics by offering accurate, efficient, and interpretable solutions.
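The four reported metrics can all be derived from a multi-class confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the abstract does not state how per-class scores are averaged, so macro averaging is assumed here, and the function name `classification_metrics` and the toy three-class matrix are hypothetical and unrelated to the reported results.

```python
def classification_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    Assumption: macro averaging (unweighted mean over classes).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))

    precisions, recalls, f1_scores = [], [], []
    for k in range(n_classes):
        true_positive = confusion[k][k]
        # Column sum: everything the model labelled as class k.
        predicted_k = sum(confusion[i][k] for i in range(n_classes))
        # Row sum: everything that truly belongs to class k.
        actual_k = sum(confusion[k])

        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1_scores) / n_classes,
    }


# Hypothetical 3-class confusion matrix (illustrative values only).
toy_confusion = [
    [9, 1, 0],
    [1, 8, 1],
    [0, 0, 10],
]
metrics = classification_metrics(toy_confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under a different averaging scheme (for example, weighting each class by its number of samples), precision, recall, and F1 would generally differ from the macro values computed here, while accuracy would be unchanged.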
