7 research outputs found

    Protein abundances can distinguish between naturally-occurring and laboratory strains of <i>Yersinia pestis</i>, the causative agent of plague

    <div><p>The rapid pace of bacterial evolution enables organisms to adapt to the laboratory environment with repeated passage and thus diverge from naturally-occurring environmental (“wild”) strains. Distinguishing wild and laboratory strains is clearly important for biodefense and bioforensics; however, DNA sequence data alone has thus far not provided a clear signature, perhaps due to lack of understanding of how diverse genome changes lead to convergent phenotypes, difficulty in detecting certain types of mutations, or perhaps because some adaptive modifications are epigenetic. Monitoring protein abundance, a molecular measure of phenotype, can overcome some of these difficulties. We have assembled a collection of <i>Yersinia pestis</i> proteomics datasets from our own published and unpublished work, and from a proteomics data archive, and demonstrated that protein abundance data can clearly distinguish laboratory-adapted from wild samples. We developed a lasso logistic regression classifier that uses binary (presence/absence) or quantitative protein abundance measures to predict whether a sample is laboratory-adapted or wild; it proved to be ~98% accurate, as judged by replicated 10-fold cross-validation. Protein features selected by the classifier accord well with our previous study of laboratory adaptation in <i>Y. pestis</i>. The input data was derived from a variety of unrelated experiments and contained significant confounding variables. We show that the classifier is robust with respect to these variables. The methodology is able to discover signatures for laboratory facility and culture medium that are largely independent of the signature of laboratory adaptation. Going beyond our previous laboratory evolution study, this work suggests that proteomic differences between laboratory-adapted and wild <i>Y. pestis</i> are general, potentially pointing to a process that could apply to other species as well. Additionally, we show that proteomics datasets (even archived data collected for different purposes) contain the information necessary to distinguish wild and laboratory samples. This work has clear applications in biomarker detection as well as biodefense.</p></div>
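
The classifier described in the abstract (lasso logistic regression scored by replicated 10-fold cross-validation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the function names are hypothetical, and the penalty, step size, and iteration count are arbitrary choices rather than the authors' settings. The fit uses plain proximal gradient descent, which is one common way to solve an L1-penalized logistic regression and not necessarily the solver used in the study.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(value, amount):
    """Proximal step for the L1 penalty: shrink toward zero, clipping at zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_lasso_logistic(X, y, penalty=0.05, step=0.1, iterations=200):
    """L1-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    weights, bias = [0.0] * p, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, row))) - label
            grad_b += error / n
            for j, v in enumerate(row):
                grad_w[j] += error * v / n
        bias -= step * grad_b  # the intercept is not penalized
        weights = [soft_threshold(w - step * g, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return weights, bias

def predict(weights, bias, row):
    """Return 1 (laboratory-adapted) if the linear score is positive, else 0 (wild)."""
    return 1 if bias + sum(w * v for w, v in zip(weights, row)) > 0 else 0

def cross_validated_accuracy(X, y, folds=10, seed=0):
    """Fraction of samples classified correctly when each is held out exactly once."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    correct = 0
    for k in range(folds):
        test = order[k::folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        weights, bias = fit_lasso_logistic([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(weights, bias, X[i]) == y[i] for i in test)
    return correct / len(X)

# Synthetic stand-in data (an assumption): 60 samples, 8 "protein" features,
# of which only feature 0 differs between the two classes.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in y]
for row, label in zip(X, y):
    row[0] += 3.0 if label == 1 else -3.0

# "Replicated" cross-validation: repeat the 10-fold split with different seeds.
accuracies = [cross_validated_accuracy(X, y, folds=10, seed=s) for s in range(3)]
mean_accuracy = sum(accuracies) / len(accuracies)

# Features with non-zero weight are the ones the L1 penalty "selected".
weights, bias = fit_lasso_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(mean_accuracy, selected)
```

The L1 penalty drives the weights of uninformative features toward exactly zero, which is why a lasso fit doubles as a feature selector.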

    More protein features than those reported in Table 2 can accurately classify laboratory vs. wild samples.

    <p>The lasso logistic regression classifier (LRC) was constructed in iterations, with the input data for each iteration consisting of all protein features not selected by the LRC in any previous iteration. The plots show the classifier accuracy on the vertical axis against the number of iterations on the horizontal axis. The plotted symbol at each point is the number of features selected in that iteration. <b>A</b>, LRCs using quantitative protein abundance data; <b>B</b>, LRCs using presence/absence data. Note that the accuracy value in the limit of large numbers of iterations is equal to the proportion of laboratory samples in the data, and represents the limit where the features used contain no information useful for classification.</p>
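
The iterate-and-exclude procedure in this caption can be sketched as a simple loop. This is a hedged illustration only: `fit_stand_in` is a hypothetical placeholder that picks the two remaining features with the largest class-mean gap and scores a nearest-centroid rule on held-out samples. It is not the lasso LRC, and the data are synthetic. What the sketch is meant to show is the bookkeeping: each round sees only the features that no earlier round selected.

```python
import random

def fit_stand_in(X, y, available, top_k=2):
    """Stand-in for one classifier fit (NOT the lasso): choose the top_k
    available features by class-mean gap on even-indexed samples, then score
    a nearest-centroid rule on the odd-indexed samples."""
    train = range(0, len(X), 2)
    test = range(1, len(X), 2)

    def class_mean(j, label):
        values = [X[i][j] for i in train if y[i] == label]
        return sum(values) / len(values)

    means = {j: (class_mean(j, 0), class_mean(j, 1)) for j in available}
    ranked = sorted(available, key=lambda j: abs(means[j][1] - means[j][0]), reverse=True)
    chosen = ranked[:top_k]

    def predict(row):
        d0 = sum((row[j] - means[j][0]) ** 2 for j in chosen)
        d1 = sum((row[j] - means[j][1]) ** 2 for j in chosen)
        return 1 if d1 < d0 else 0

    accuracy = sum(predict(X[i]) == y[i] for i in test) / len(test)
    return chosen, accuracy

def iterate_with_exclusion(X, y, fit=fit_stand_in):
    """Refit repeatedly, each time withholding every feature chosen earlier."""
    available = list(range(len(X[0])))
    history = []  # one (number selected, selected features, accuracy) per iteration
    while available:
        chosen, accuracy = fit(X, y, available)
        if not chosen:
            break
        history.append((len(chosen), sorted(chosen), accuracy))
        available = [j for j in available if j not in chosen]
    return history

# Synthetic data (an assumption): 160 samples, 6 features with class gaps
# 6, 4, 2, 0, 0, 0, so the signal runs out after a couple of iterations.
rng = random.Random(7)
gaps = [6.0, 4.0, 2.0, 0.0, 0.0, 0.0]
y = [0] * 80 + [1] * 80
X = [[rng.gauss(0, 1) + (g if label == 1 else 0.0) for g in gaps] for label in y]

history = iterate_with_exclusion(X, y)
for iteration, (count, features, accuracy) in enumerate(history, start=1):
    print(iteration, count, features, round(accuracy, 3))
```

On this balanced synthetic set the no-information limit mentioned in the caption would correspond to an accuracy near 0.5, the proportion of either class.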

    Illustration of the permutation test of the final LRC generated using 10-fold cross-validation and relative abundance features.

    <p>The red histogram represents the distribution of the accuracy generated from 10,000 permutations of the laboratory-adapted/wild labels. This histogram represents the null distribution, i.e., the distribution expected if no information relevant to distinguishing laboratory and wild samples were present in the data. The cross-validation estimate of the accuracy of the final LRC, 99.5%, is illustrated by the dashed blue line. The distance of the blue line from the null distribution clearly indicates that the observed accuracy of the LRC did not occur by chance, supporting the conclusion that the data for laboratory-adapted and wild samples are truly different. Results for the other three LRCs (2-fold with relative abundance, 10-fold with presence/absence, and 2-fold with presence/absence) were identical to this one.</p>
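
The label-permutation test described in this caption can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, `centroid_accuracy` is a hypothetical stand-in score (a nearest-centroid rule with a fixed train/test split, not the cross-validated LRC), and 1,000 permutations are used instead of 10,000 to keep the example fast. The p-value uses the common add-one convention so that it can never be exactly zero.

```python
import random

def centroid_accuracy(X, y):
    """Stand-in score: fit class centroids on even-indexed samples and report
    nearest-centroid accuracy on odd-indexed samples."""
    train, test = range(0, len(X), 2), range(1, len(X), 2)
    centroids = {}
    for label in set(y[i] for i in train):
        rows = [X[i] for i in train if y[i] == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def nearest(row):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])))

    return sum(nearest(X[i]) == y[i] for i in test) / len(test)

def permutation_test(X, y, score, n_permutations=1000, seed=0):
    """Compare the observed score with scores obtained after shuffling labels."""
    rng = random.Random(seed)
    observed = score(X, y)
    shuffled = list(y)
    null_scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between features and labels
        null_scores.append(score(X, shuffled))
    at_least_as_good = sum(s >= observed for s in null_scores)
    p_value = (1 + at_least_as_good) / (1 + n_permutations)
    return observed, null_scores, p_value

# Synthetic data (an assumption): 40 samples, 3 features, feature 0 informative.
rng = random.Random(3)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in y]
for row, label in zip(X, y):
    if label == 1:
        row[0] += 5.0

observed, null_scores, p_value = permutation_test(X, y, centroid_accuracy)
print(observed, p_value)
```

A histogram of `null_scores` plays the role of the red histogram in the figure, and `observed` plays the role of the dashed blue line.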

    Output of the LRC to distinguish wild from laboratory-adapted strains using relative protein abundance data.

    <p>Each symbol represents the prediction of the LRC for an independent culture. Triangles represent cultures of wild strains. Circles represent laboratory-adapted strains. The horizontal axis value is the predicted probability that a culture is laboratory adapted and is non-linear; points are separated vertically in a random fashion to improve the visualization. See <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0183478#sec003" target="_blank">Methods</a> for an explanation of τ. A. Colors represent wild versus laboratory-adapted. B. Colors represent the facility of preparation and analysis. C. Colors represent the laboratory medium in which the cultures were grown prior to analysis.</p>

    Signatures for Mass Spectrometry Data Quality

    Ensuring data quality and proper instrument functionality is a prerequisite for scientific investigation. Manual quality assurance is time-consuming and subjective. Metrics for describing liquid chromatography mass spectrometry (LC–MS) data have been developed; however, the wide variety of LC–MS instruments and configurations precludes applying a simple cutoff. Using 1150 manually classified quality control (QC) data sets, we trained logistic regression classification models to predict whether a data set is in or out of control. Model parameters were optimized by minimizing a loss function that accounts for the trade-off between false positive and false negative errors. The classifier models detected bad data sets with high sensitivity while maintaining high specificity. Moreover, the composite classifier was dramatically more specific than single metrics. Finally, we evaluated the classifier on a separate validation set, where it performed comparably to the results for the testing/training data sets. We present the methods and software used to create the classifier so that other groups can build a classifier for their own QC regimen, which varies widely from lab to lab. In total, this manuscript presents 3400 LC–MS data sets for the same QC sample (whole cell lysate of <i>Shewanella oneidensis</i>), deposited to the ProteomeXchange with identifiers PXD000320–PXD000324.
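
The idea of a loss that trades off false positives against false negatives can be illustrated with a small sketch. The abstract does not give the form of the loss, so the cost-weighted error count below, the cost values, the function names, and the toy probabilities are all assumptions made for illustration. For simplicity the sketch tunes only the decision threshold applied to already-predicted probabilities, whereas the abstract refers to optimizing model parameters more generally.

```python
def weighted_error_cost(probabilities, labels, threshold, fn_cost=5.0, fp_cost=1.0):
    """Cost of classifying at a given threshold.
    labels: 1 = out of control (bad data set), 0 = in control (good data set).
    A false negative is a bad data set that is not flagged; a false positive
    is a good data set that is flagged."""
    false_negatives = sum(1 for p, t in zip(probabilities, labels)
                          if t == 1 and p < threshold)
    false_positives = sum(1 for p, t in zip(probabilities, labels)
                          if t == 0 and p >= threshold)
    return fn_cost * false_negatives + fp_cost * false_positives

def best_threshold(probabilities, labels, candidates, fn_cost=5.0, fp_cost=1.0):
    """Return the candidate threshold with the lowest weighted cost."""
    return min(candidates,
               key=lambda th: weighted_error_cost(probabilities, labels, th,
                                                  fn_cost, fp_cost))

# Toy predicted probabilities of "out of control" and manual labels (assumed).
probabilities = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90, 0.95]
labels        = [0,    0,    0,    1,    0,    0,    1,    1,    1,    1]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Penalizing missed bad data sets five times more than false alarms
# pushes the threshold down, so more data sets are flagged.
strict = best_threshold(probabilities, labels, candidates, fn_cost=5.0, fp_cost=1.0)

# With equal costs the threshold moves up.
balanced = best_threshold(probabilities, labels, candidates, fn_cost=1.0, fp_cost=1.0)

print(strict, balanced)  # 0.3 0.6
```

Changing the cost ratio is how a lab could express its own tolerance for false alarms versus missed failures.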