SANDPUMA: ensemble predictions of nonribosomal peptide chemistry reveal biosynthetic diversity across Actinobacteria
Nonribosomally synthesized peptides (NRPs) are natural products with widespread applications in medicine and biotechnology. Many algorithms have been developed to predict the substrate specificities of nonribosomal peptide synthetase adenylation (A) domains from DNA sequences, which enables prioritization, dereplication, and integration with other data types in discovery efforts. However, insufficient training data and a lack of clarity regarding prediction quality have impeded optimal use. Here, we introduce prediCAT, a new phylogenetics-inspired algorithm, which quantitatively estimates the degree of predictability of each A-domain. We then systematically benchmarked all algorithms on a newly gathered, independent test set of 434 A-domain sequences, showing that active-site-motif-based algorithms outperform whole-domain-based methods. Subsequently, we developed SANDPUMA, a powerful ensemble algorithm, based on newly trained versions of all high-performing algorithms, which significantly outperforms individual methods. Finally, we deployed SANDPUMA in a systematic investigation of 7635 Actinobacteria genomes, suggesting that NRP chemical diversity is much higher than previously estimated. SANDPUMA has been integrated into the widely used antiSMASH biosynthetic gene cluster analysis pipeline and is also available as an open-source, standalone tool.
The Evaluation of Tools Used to Predict the Impact of Missense Variants Is Hindered by Two Types of Circularity
Retention Time Prediction Improves Identification in Non-Targeted Lipidomics Approaches
Identification of lipids in nontargeted lipidomics based on liquid chromatography coupled to mass spectrometry (LC-MS) is still a major issue. While both accurate mass and fragment spectra contain valuable information, retention time (tR) information can be used to augment these data. We present a retention time model based on machine learning approaches which enables an improved assignment of lipid structures and automated annotation of lipidomics data. In contrast to common approaches, we used a complex mixture of 201 lipids originating from fat tissue instead of a standard mixture to train a support vector regression (SVR) model including molecular structural features. The cross-validated model achieves a correlation coefficient between predicted and experimental test sample retention times of r = 0.989. Combining our retention time model with identification via accurate mass search (AMS) of lipids against the comprehensive LIPID MAPS database, retention time filtering can significantly reduce the rate of false positives in complex data sets like adipose tissue extracts. In our case, filtering with retention time information removed more than half of the potential identifications, while retaining 95% of the correct identifications. Combination of high-precision retention time prediction and accurate mass can thus significantly narrow down the number of hypotheses to be assessed for lipid identification in complex lipid patterns like tissue profiles.
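The retention time filtering step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` class, the `filter_by_retention_time` function, the example values, and the fixed tolerance window are all hypothetical. In the paper the predicted retention times come from the trained SVR model; here they are simply hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical accurate-mass search hit with a model-predicted retention time."""
    name: str
    predicted_rt: float  # minutes; assumed to come from some retention time model


def filter_by_retention_time(candidates, observed_rt, tolerance):
    """Keep candidates whose predicted RT lies within +/- tolerance of the observed RT."""
    return [c for c in candidates if abs(c.predicted_rt - observed_rt) <= tolerance]


# Illustrative values only: three candidates that share the same accurate mass.
candidates = [
    Candidate("lipid_A", 12.1),
    Candidate("lipid_B", 15.8),
    Candidate("lipid_C", 12.6),
]

kept = filter_by_retention_time(candidates, observed_rt=12.3, tolerance=0.5)
print([c.name for c in kept])  # ['lipid_A', 'lipid_C']
```

A fixed window is the simplest choice for illustration. In practice the tolerance would presumably be derived from the prediction error of the model, trading removed false positives against lost correct identifications.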
LFQProfiler and RNPxl: Open-Source Tools for Label-Free Quantification and Protein–RNA Cross-Linking Integrated into Proteome Discoverer
Modern mass spectrometry setups used in today’s proteomics studies generate vast amounts of raw data, calling for highly efficient data processing and analysis tools. Software for analyzing these data is either monolithic (easy to use, but sometimes too rigid) or workflow-driven (easy to customize, but sometimes complex). Thermo Proteome Discoverer (PD) is a powerful software package for workflow-driven data analysis in proteomics which, in our eyes, achieves a good trade-off between flexibility and usability. Here, we present two open-source plugins for PD providing additional functionality: LFQProfiler for label-free quantification of peptides and proteins, and RNPxl for UV-induced peptide–RNA cross-linking data analysis. LFQProfiler interacts with existing PD nodes for peptide identification and validation and takes care of the entire quantitative part of the workflow. We show that it performs at least on par with other state-of-the-art software solutions for label-free quantification in a recently published benchmark (Ramus, C.; J. Proteomics 2016, 132, 51–62). The second workflow, RNPxl, represents the first software solution to date for identification of peptide–RNA cross-links including automatic localization of the cross-links at amino acid resolution and localization scoring. It comes with a customized integrated cross-link fragment spectrum viewer for convenient manual inspection and validation of the results.