Prokaryote growth temperature prediction with machine learning

Abstract

Archaea and bacteria can be divided into four groups based on their growth temperature adaptation: mesophiles, thermophiles, hyperthermophiles, and psychrophiles. Protein thermostability results from the combined effect of several physical forces, such as van der Waals interactions, chemical polarity, and ionic interactions. The genes responsible for the adaptation have not been identified, so this thesis aims to identify genes linked to temperature adaptation and to predict temperature adaptation from the presence or absence of genes. A dataset of 4361 genes from 711 prokaryotes was analyzed with four machine learning algorithms: neural network, random forest, gradient boosting machine, and logistic regression. Logistic regression was chosen as both the explanatory and the predictive model based on micro-averaged AUC and the principle of Occam’s razor. Logistic regression predicted temperature adaptation with good performance. Machine learning is a powerful predictor of temperature adaptation, and fewer than 200 genes were needed to predict each adaptation. This technique can be used to predict the adaptation of uncultivated prokaryotes. However, the statistical importance of the genes connected to temperature adaptation was not verified, and this thesis did not provide much additional support for previously proposed temperature adaptation linked genes.
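The micro-averaged AUC used for model selection can be illustrated with a minimal sketch. It assumes the common definition in which every (organism, class) pair is pooled into one binary problem before the ROC AUC is computed. The function names, the class order, and the toy scores below are illustrative assumptions, not the code or data of the thesis.

```python
def binary_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic; tied scores get average ranks."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores pairs[i:j].
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Ranks i+1 .. j share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def micro_averaged_auc(y_true, y_score, classes):
    """Pool all (sample, class) decisions into one binary problem, then compute AUC."""
    labels, scores = [], []
    for true_class, row in zip(y_true, y_score):
        for k, cls in enumerate(classes):
            labels.append(1 if true_class == cls else 0)
            scores.append(row[k])
    return binary_auc(labels, scores)


# Hypothetical toy example: two organisms, four adaptation classes.
classes = ["psychrophile", "mesophile", "thermophile", "hyperthermophile"]
y_true = ["mesophile", "thermophile"]
y_score = [
    [0.1, 0.7, 0.1, 0.1],  # predicted class probabilities for organism 1
    [0.1, 0.1, 0.6, 0.2],  # predicted class probabilities for organism 2
]
print(micro_averaged_auc(y_true, y_score, classes))
```

Because the pooling weights every organism equally, the score is dominated by the most frequent classes, which is worth keeping in mind when the adaptation groups are unevenly represented.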
