Quantitative Structure–Retention Relationship
Models To Support Nontarget High-Resolution Mass Spectrometric Screening
of Emerging Contaminants in Environmental Samples
Over the past decade, the application
of liquid chromatography–high-resolution
mass spectrometry (LC-HRMS) has grown extensively
due to its ability to analyze a wide range of suspected and unknown
compounds in environmental samples. However, various criteria, such
as mass accuracy and isotopic pattern of the precursor ion, MS/MS
spectra evaluation, and retention time plausibility, should be met
to reach a certain identification confidence. In this context, a comprehensive
workflow based on computational tools was developed to understand
the retention time behavior of a large number of compounds belonging
to emerging contaminants. Two extensive data sets were built for two
chromatographic systems, one for positive and one for negative electrospray
ionization mode, containing retention time data for
528 and 298 compounds, respectively, to expand the applicability domain
of the developed models. Then, the data sets were split into training
and test sets, employing <i>k</i>-nearest neighborhood clustering,
to build the models and validate their internal and external predictive
ability. The best subset of molecular descriptors was selected using
genetic algorithms. Multiple linear regression, artificial neural
networks, and support vector machines were used to correlate the selected
descriptors with the experimental retention times. Several validation
techniques were used, including Golbraikh–Tropsha acceptable
model criteria, Euclidean-based applicability domain, modified correlation
coefficient (<i>r</i><sub>m</sub><sup>2</sup>), and concordance correlation coefficient
values, to measure the accuracy and precision of the models. The best
linear and nonlinear models for each data set were derived and used
to predict the retention time of suspect compounds of a wide-scope
survey, as the evaluation data set. For efficient outlier detection
and interpretation of the origin of the prediction error, a novel
procedure and tool were developed and applied, enabling us to determine
whether a suspect compound was within the applicability domain.
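
Two of the validation ideas named above, the concordance correlation coefficient and a Euclidean-based applicability domain, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the centroid-distance definition of the domain, and the cutoff multiplier `k` are assumptions made here for illustration, and the descriptors are assumed to be already scaled to comparable ranges.

```python
import numpy as np


def concordance_ccc(rt_observed, rt_predicted):
    """Lin's concordance correlation coefficient between observed and predicted values."""
    obs = np.asarray(rt_observed, dtype=float)
    pred = np.asarray(rt_predicted, dtype=float)
    mean_obs, mean_pred = obs.mean(), pred.mean()
    # Population (1/n) covariance and variances, as in Lin's definition.
    covariance = np.mean((obs - mean_obs) * (pred - mean_pred))
    denominator = obs.var() + pred.var() + (mean_obs - mean_pred) ** 2
    return 2.0 * covariance / denominator


def in_euclidean_domain(train_descriptors, query_descriptors, k=3.0):
    """Flag query compounds whose distance to the training centroid is within a cutoff.

    Assumption (not from the source): the cutoff is the mean training distance
    plus k standard deviations of the training distances.
    """
    train = np.asarray(train_descriptors, dtype=float)
    query = np.atleast_2d(np.asarray(query_descriptors, dtype=float))
    centroid = train.mean(axis=0)
    train_dist = np.linalg.norm(train - centroid, axis=1)
    cutoff = train_dist.mean() + k * train_dist.std()
    query_dist = np.linalg.norm(query - centroid, axis=1)
    return query_dist <= cutoff
```

In such a sketch, a suspect compound flagged as outside the domain would have its predicted retention time treated as unreliable rather than used as evidence against the candidate structure.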