Molecular Descriptor Subset Selection in Theoretical Peptide Quantitative Structure–Retention Relationship Model Development Using Nature-Inspired Optimization Algorithms

J. Jay Liu (1517116); Katarzyna Macur (1517119); Petar Žuvela (1517113); Tomasz Bączek (1426912)

Abstract

In this work, the performance of five nature-inspired optimization algorithms, namely genetic algorithm (GA), particle swarm optimization (PSO), artificial bee colony (ABC), firefly algorithm (FA), and flower pollination algorithm (FPA), was compared for molecular descriptor selection in the development of quantitative structure–retention relationship (QSRR) models for 83 peptides originating from eight model proteins. A matrix of 423 descriptors was used as input. QSRR models based on the selected descriptors were built using partial least squares (PLS), and the root mean square error of prediction (RMSEP) served as the fitness function for descriptor selection. Three performance criteria (prediction accuracy, computational cost, and the number of selected descriptors) were used to evaluate the developed QSRR models. The results show that all five variable selection methods outperform interval PLS (iPLS), sparse PLS (sPLS), and the full PLS model, with GA performing best: it had the lowest computational cost and higher accuracy (RMSEP of 5.534%) with a smaller number of variables (nine descriptors). The GA-QSRR model was first validated through Y-randomization. It was then successfully validated on an external test set of 102 peptides originating from Bacillus subtilis proteomes (RMSEP of 22.030%). Its applicability domain was defined, from which it was evident that the developed GA-QSRR model exhibited strong robustness. All sources of the model’s error were identified, allowing further application of the developed methodology in proteomics.
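The selection scheme summarized in the abstract (a binary descriptor mask evolved by a GA, scored by the RMSEP of a PLS model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`pls1_fit`, `rmsep`, `ga_select`), the GA settings (binary tournament, uniform crossover, bit-flip mutation, elitism), the number of PLS components, and the synthetic data are all assumptions made here. RMSEP is computed in the units of the response rather than as a percentage, since the abstract does not define the normalization.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Fit single-response PLS by NIPALS; returns (beta, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(min(n_comp, X.shape[1])):
        w = Xk.T @ yk                      # weight vector: covariance direction
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        c = (yk @ t) / tt                  # y loading
        Xk = Xk - np.outer(t, p)           # deflate X and y
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)  # regression coefficients
    return beta, x_mean, y_mean

def rmsep(mask, X_tr, y_tr, X_va, y_va, n_comp=3):
    """Fitness: RMSE on held-out data for the descriptors selected by mask."""
    if not mask.any():
        return np.inf                      # an empty subset is never acceptable
    beta, xm, ym = pls1_fit(X_tr[:, mask], y_tr, n_comp)
    pred = (X_va[:, mask] - xm) @ beta + ym
    return float(np.sqrt(np.mean((y_va - pred) ** 2)))

def ga_select(X_tr, y_tr, X_va, y_va, pop_size=30, n_gen=40, p_mut=0.02, seed=0):
    """Evolve boolean descriptor masks that minimize RMSEP (assumed GA settings)."""
    rng = np.random.default_rng(seed)
    n_feat = X_tr.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5
    pop[0] = True                          # include the full model as a baseline
    fit = lambda m: rmsep(m, X_tr, y_tr, X_va, y_va)
    scores = np.array([fit(m) for m in pop])

    def tournament():
        i, j = rng.integers(pop_size, size=2)
        return pop[i] if scores[i] <= scores[j] else pop[j]

    for _ in range(n_gen):
        children = [pop[scores.argmin()].copy()]          # elitism
        while len(children) < pop_size:
            a, b = tournament(), tournament()
            child = np.where(rng.random(n_feat) < 0.5, a, b)  # uniform crossover
            child = child ^ (rng.random(n_feat) < p_mut)      # bit-flip mutation
            children.append(child)
        pop = np.array(children)
        scores = np.array([fit(m) for m in pop])
    best = scores.argmin()
    return pop[best], scores[best]

# Synthetic stand-in data (NOT the peptide data set from the paper).
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 20))
y = 3.0 * X[:, 1] - 2.0 * X[:, 7] + 0.5 * X[:, 12] + 0.1 * rng.normal(size=80)
best_mask, best_score = ga_select(X[:60], y[:60], X[60:], y[60:])
print("selected descriptors:", np.flatnonzero(best_mask))
print("RMSEP of selected subset:", best_score)
```

Because the initial population contains the all-descriptors mask and the best individual is carried over unchanged each generation, the returned subset is never worse on the validation split than the full PLS model. In practice the fitness would more likely be a cross-validated or repeated-split RMSEP, so that the selection does not overfit a single validation set.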

Available Versions

FigShare

oai:figshare.com:article/21250...

Last time updated on 12/02/2018