Application of Principal Component Analysis to advancing digital phenotyping of plant disease in the context of limited memory for training data storage

Abstract

Principal Component Analysis (PCA) is widely used as an efficient dimensionality reduction technique, yet little research has examined how PCA-based compression and reconstruction of image data affects machine learning-based image classification performance and storage space optimization. To address this gap, we compared two Convolutional Neural Network-Random Forest Algorithm (CNN-RF) guava leaf image classification models. One was trained on the number of original guava leaf images that fit in a predefined amount of storage space; the other was trained on the number of PCA compressed/reconstructed guava leaf images that fit in the same amount of storage space. The models were compared on four criteria: Accuracy, F1-Score, Phi Coefficient and the Fowlkes–Mallows index. Our approach achieved a 1:100 image compression ratio (99.00% image compression), considerably better than previously reported results for other algorithms such as arithmetic coding (1:1.50), wavelet transform (90.00% image compression), and a combination of three transform-based techniques, the Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) (1:22.50). From a subjective visual quality perspective, the PCA compressed/reconstructed guava leaf images showed almost no loss of image detail. Finally, the CNN-RF model trained on PCA compressed/reconstructed guava leaf images outperformed the CNN-RF model trained on original guava leaf images, with increases of 0.10% in accuracy, 0.10 in F1-Score, 0.18 in Phi Coefficient and 0.09 in the Fowlkes–Mallows index.
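To make the compression/reconstruction step concrete, the following is a minimal sketch of PCA applied to a single grayscale image with NumPy. It is not the authors' pipeline: the abstract does not state how the images were preprocessed, how many components were kept, or how storage was measured. The function names, the treatment of image rows as samples, and the ratio defined as original pixel count over stored values are all assumptions made for illustration.

```python
import numpy as np


def pca_compress(image, k):
    """Keep the top-k principal components of a 2-D grayscale image.

    Assumption: each image row is one sample and each column one feature.
    """
    image = np.asarray(image, dtype=np.float64)
    mean = image.mean(axis=0)              # per-column mean, shape (width,)
    centered = image - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                    # shape (k, width)
    scores = centered @ components.T       # shape (height, k)
    return scores, components, mean


def pca_reconstruct(scores, components, mean):
    """Rebuild an approximation of the image from its compressed form."""
    return scores @ components + mean


def compression_ratio(image_shape, k):
    """Original pixel count divided by the number of stored values.

    Assumption: storage is counted as scores + components + mean,
    ignoring data types and file-format overhead.
    """
    height, width = image_shape
    stored = height * k + k * width + width
    return (height * width) / stored


# Usage on a synthetic low-rank "image" (hypothetical data, not guava leaves).
rng = np.random.default_rng(0)
demo = rng.random((64, 3)) @ rng.random((3, 48))
scores, components, mean = pca_compress(demo, k=3)
restored = pca_reconstruct(scores, components, mean)
```

A smaller `k` stores fewer values and so raises the ratio, at the cost of a larger reconstruction error; the reconstructed image has the same dimensions as the original.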
