Cancer Classification Challenges in High-Dimensional Microarray Data: An In-Depth Exploration of Machine Learning Models

Wafa' Qasim Al-Jamal; Sakinah Ali Pitchay; Farida Hazwani Mohd Ridzuan; Muhammad Harith Noor Azam

journal article

oai:oarep.usim.edu.my:123456789/29472

Publication date: 1 January 2026
Publisher: USIM Press

Indexed by MyCite

Abstract

Microarray gene expression profiling has transformed biomedical research by enabling large-scale, parallel analysis of thousands of genes. Despite its promise, cancer classification using Machine Learning (ML) on microarray data continues to face critical challenges, particularly due to high dimensionality, limited sample sizes, and severe class imbalance. These factors contribute to overfitting, poor generalization, and inflated performance metrics, hindering the clinical translation of models. This Structured Literature Review (SLR) examines ML-based cancer classification studies published between 2015 and 2025. This period was marked by the emergence of deep learning, synthetic data generation, and biologically informed modeling. Using a transparent selection protocol, we synthesize findings from over 20 peer-reviewed studies. The review focuses on three methodological pillars: biologically grounded feature selection, constrained data augmentation, and robust performance evaluation. We identify a growing trend toward hybrid feature selection methods that balance statistical relevance and biological interpretability. However, comparative benchmarking across datasets remains limited. Data augmentation techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and Generative Adversarial Networks (GANs), are increasingly being adopted. However, they often lack biological validation. This raises concerns about the plausibility of synthetic gene profiles. To address this, we recommend integrating pathway-level constraints and gene ontology checks during the augmentation process. Furthermore, we observe that many studies disproportionately emphasize accuracy. This can misrepresent the model's efficacy in imbalanced settings. Metrics such as Matthews Correlation Coefficient (MCC), F1-score, and precision-recall curves offer more reliable insights. These metrics should be standardized across evaluations. External validation using independent datasets is also essential to assess generalizability. In addition, it helps mitigate dataset-specific bias. Based on the findings, we present a conceptual hybrid framework that integrates biologically informed feature selection, biologically constrained data augmentation, and balanced evaluation protocols. This framework is intended to enhance reproducibility, biological fidelity, and translational reliability in machine learning-based cancer diagnostics, thereby contributing to the advancement of precision oncology.
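
The abstract's point that accuracy can mislead under class imbalance can be illustrated with a small sketch. The example below is not taken from the reviewed studies: the 5-versus-95 label split, the two hypothetical classifiers, and all function names are assumptions chosen purely for illustration. It computes accuracy, F1-score, and MCC from a binary confusion matrix, treating MCC and F1 as 0 when their denominators vanish (a common convention).

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0

def mcc(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced cohort: 5 tumour samples (1), 95 normal samples (0).
y_true = [1] * 5 + [0] * 95

# Hypothetical classifier A: always predicts the majority class.
y_pred_majority = [0] * 100

# Hypothetical classifier B: finds 1 of 5 tumours and raises 1 false alarm.
y_pred_weak = [1, 0, 0, 0, 0] + [1] + [0] * 94

for name, y_pred in [("majority", y_pred_majority), ("weak", y_pred_weak)]:
    print(
        name,
        "accuracy=%.3f" % accuracy(y_true, y_pred),
        "f1=%.3f" % f1_score(y_true, y_pred),
        "mcc=%.3f" % mcc(y_true, y_pred),
    )
```

Under these assumptions both classifiers reach an accuracy of 0.95, yet the majority-class predictor scores 0 on F1 and MCC, and the weak classifier stays below 0.3 on both. Real evaluations would normally rely on an established metrics library and held-out or external data rather than hand-rolled functions.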

Last time updated on 28/04/2026

This paper was published in USIM Research Repository System (SRP).