Evaluation of the C4.5 Decision Tree and Random Forest Classification Algorithms in Predicting Diabetes

Abstract

This study treats diabetes prediction as a binary classification task and applies the C4.5 Decision Tree and Random Forest algorithms to the Pima Indians Diabetes dataset. It compares the two algorithms under three reported experimental settings: without data balancing, with data balancing, and with hyperparameter tuning without balancing. The dataset consists of 768 records: 500 non-diabetes cases and 268 diabetes cases. Preprocessing comprised data cleaning, Box-Cox transformation, min-max normalization, feature selection, and a split into 80% training data and 20% test data. Model performance was evaluated using accuracy, precision, recall, and F1-score under 3-fold, 5-fold, and 9-fold cross-validation. Random Forest outperformed the C4.5 Decision Tree in every reported setting. Without data balancing, Random Forest reached a highest accuracy of 77.82%, against 69.65% for C4.5. With data balancing, both models improved: Random Forest achieved the best overall reported accuracy of 84.19%, compared with 75.68% for C4.5. With hyperparameter tuning and no balancing, Random Forest achieved 78.18% and C4.5 achieved 74.18%. These findings indicate that Random Forest is more robust and effective than the C4.5 Decision Tree for diabetes prediction on this dataset, and that, in these experiments, data balancing improved accuracy more than hyperparameter tuning alone.
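The four evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' code. The `ConfusionCounts` class, the `binary_metrics` helper, and the counts in `example` are hypothetical and chosen only for illustration; the class counts 500 and 268 are the ones reported in the abstract.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary task where 'diabetes' is the positive class."""
    tp: int  # diabetes cases predicted as diabetes
    fp: int  # non-diabetes cases predicted as diabetes
    fn: int  # diabetes cases predicted as non-diabetes
    tn: int  # non-diabetes cases predicted as non-diabetes


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(c: ConfusionCounts) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, total),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Class counts reported in the abstract.
N_NEGATIVE, N_POSITIVE = 500, 268

# Reference point: a trivial classifier that always predicts "non-diabetes".
majority_baseline = binary_metrics(
    ConfusionCounts(tp=0, fp=0, fn=N_POSITIVE, tn=N_NEGATIVE)
)

# Hypothetical counts, NOT taken from the paper, to show the arithmetic.
example = binary_metrics(ConfusionCounts(tp=30, fp=10, fn=20, tn=40))

if __name__ == "__main__":
    print("majority baseline:", majority_baseline)
    print("illustrative example:", example)
```

On the full 768 records, always predicting the majority class would already score about 65.1% accuracy (500/768) while having zero recall for diabetes cases. This is one reason accuracy on an imbalanced dataset is usually read alongside precision, recall, and F1-score.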

This paper was published in Jurnal Elektro.
