    Protein inference based on peptides identified from tandem mass spectra

    Protein inference is a critical computational step in the study of proteomics. It lays the foundation for further structural and functional analysis of proteins, based on which new medicine or technology can be developed. Today, mass spectrometry (MS) is the technique of choice for large-scale inference of proteins in proteomics. In MS-based protein inference, three levels of data are generated: (1) tandem mass spectra (MS/MS); (2) peptide sequences and their scores or probabilities; and (3) protein sequences and their scores or probabilities. Accordingly, the protein inference problem can be divided into three computational phases: (1) process MS/MS to improve the quality of the data and facilitate subsequent peptide identification; (2) postprocess peptide identification results from existing algorithms which match MS/MS to peptides; and (3) infer proteins by assembling identified peptides. Addressing these computational problems constitutes the main content of this thesis. The processing of MS/MS data mainly includes denoising, quality assessment, and charge state determination. Here, we discuss the determination of charge states from MS/MS data using low-resolution collision-induced dissociation. Such spectra with multiple charges are usually searched multiple times by assuming each possible charge state. This strategy not only increases the overall database search time but also yields more false positives. Hence, it is advantageous to determine the charge states of such spectra before the database search. A new approach is proposed to determine the charge states of low-resolution MS/MS. Four novel and discriminant features are adopted to describe each MS/MS and are used in a Gaussian mixture model to distinguish doubly and triply charged peptides. The results have shown that this method can assign charge states to low-resolution MS/MS more accurately than existing methods. Many search engines are available for peptide identification. However, there is usually a high false positive rate (FPR) in the results. This can bring many false identifications to protein inference. As a result, it is necessary to postprocess peptide identification results. The most commonly used method is performing statistical analysis, which not only makes it possible to compare and combine the results from different search engines, but also facilitates subsequent protein inference. We proposed a new method to estimate the accuracy of peptide identification with logistic regression (LR) and exemplify it based on Sequest scores. Each peptide is characterized with the regularized Sequest scores ΔCn∗ and Xcorr∗. The score regularization is formulated as an optimization problem by applying two assumptions: the smoothing consistency between sibling peptides and the fitting consistency between original scores and new scores. The results have shown that the proposed method can robustly assign accurate probabilities to peptides and has a very high discrimination power, higher than that of PeptideProphet, to distinguish correctly and incorrectly identified peptides. Given identified peptides and their probabilities, protein inference is conducted by assembling these peptides. Existing methods to address this MS-based protein inference problem can be classified into two groups: two-stage methods and unified frameworks that both identify peptides and infer proteins. In two-stage methods, protein inference is based on, but also separated from, peptide identification. In a unified framework, by contrast, protein inference and peptide identification are integrated. In this study, we proposed a unified framework for protein inference, and developed an iterative method accordingly to infer proteins based on Sequest peptide identification. The statistical analysis of peptide identification is performed with the LR previously introduced. Protein inference and peptide identification are iterated in one framework by adding a feedback from protein inference to peptide identification. The feedback information is a list of high-confidence proteins, which is used to update the adjacency matrix between peptides. The adjacency matrix is used in the regularization of peptide scores. The results have shown that the proposed method can infer more true positive proteins, while outputting fewer false positive proteins than ProteinProphet at the same FPR. The coverage of inferred proteins is also significantly increased due to the selection of multiple peptides for each MS/MS spectrum and the improvement of their scores by the feedback from the inferred proteins.
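
The logistic-regression step described in this abstract can be illustrated with a minimal sketch. It assumes each peptide is described by two regularized scores (standing in for ΔCn∗ and Xcorr∗) and a correct/incorrect label. The training values, learning rate, and function names are hypothetical and are not taken from the thesis.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit [bias, w1, w2, ...] by batch gradient ascent on the log-likelihood."""
    n_feat = len(features[0])
    w = [0.0] * (n_feat + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for x, y in zip(features, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = y - p
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi + lr * g / len(features) for wi, g in zip(w, grad)]
    return w


def peptide_probability(w, x):
    """Probability that a peptide identification is correct under the fitted model."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Hypothetical (delta_cn_star, xcorr_star) pairs; 1 = correct, 0 = incorrect.
train_x = [(0.05, 1.2), (0.08, 1.5), (0.10, 1.9), (0.30, 3.1), (0.35, 3.6), (0.40, 4.2)]
train_y = [0, 0, 0, 1, 1, 1]

weights = fit_logistic(train_x, train_y)
p_low = peptide_probability(weights, (0.06, 1.3))
p_high = peptide_probability(weights, (0.38, 4.0))
```

On this toy data the low-scoring peptide receives a probability below 0.5 and the high-scoring one a probability above 0.5. A real application would fit the model on labelled search results.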

    DART-ID increases single-cell proteome coverage.

    Analysis by liquid chromatography and tandem mass spectrometry (LC-MS/MS) can identify and quantify thousands of proteins in microgram-level samples, such as those comprised of thousands of cells. This process, however, remains challenging for smaller samples, such as the proteomes of single mammalian cells, because reduced protein levels reduce the number of confidently sequenced peptides. To alleviate this reduction, we developed Data-driven Alignment of Retention Times for IDentification (DART-ID). DART-ID implements principled Bayesian frameworks for global retention time (RT) alignment and for incorporating RT estimates towards improved confidence estimates of peptide-spectrum-matches. When applied to bulk or to single-cell samples, DART-ID increased the number of data points by 30-50% at 1% FDR, and thus decreased missing data. Benchmarks indicate excellent quantification of peptides upgraded by DART-ID and support their utility for quantitative analysis, such as identifying cell types and cell-type specific proteins. The additional datapoints provided by DART-ID boost the statistical power and double the number of proteins identified as differentially abundant in monocytes and T-cells. DART-ID can be applied to diverse experimental designs and is freely available at http://dart-id.slavovlab.net
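
The idea of folding retention-time evidence into a peptide-spectrum-match confidence can be sketched with Bayes' rule. This is not DART-ID's actual implementation. It assumes a normal RT density around the aligned prediction for correct matches and a broad normal density for incorrect ones, and all parameter values and function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def update_pep(pep, rt_observed, rt_predicted, sd_correct, rt_null_mean, rt_null_sd):
    """Update a posterior error probability (PEP) with retention-time evidence.

    pep is the prior probability that the match is incorrect. The RT likelihood
    is a narrow normal around the predicted RT if the match is correct, and a
    broad normal over the run if it is incorrect (both are assumptions).
    """
    like_correct = normal_pdf(rt_observed, rt_predicted, sd_correct)
    like_incorrect = normal_pdf(rt_observed, rt_null_mean, rt_null_sd)
    numerator = pep * like_incorrect
    return numerator / (numerator + (1.0 - pep) * like_correct)


# Hypothetical PSM with prior PEP 0.05 and predicted RT of 30.0 minutes.
pep_near = update_pep(0.05, 30.2, 30.0, 0.5, 40.0, 15.0)  # observed RT agrees
pep_far = update_pep(0.05, 35.0, 30.0, 0.5, 40.0, 15.0)   # observed RT disagrees
```

An observed RT close to the prediction lowers the error probability, while a distant one raises it. This is how RT information can push borderline matches across a fixed FDR threshold.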

    Proteomics as a quality control tool of pharmaceutical probiotic bacterial lysate products

    Probiotic bacteria have a wide range of applications in veterinary and human therapeutics. Inactivated probiotics are complex samples and quality control (QC) should measure as many molecular features as possible. Capillary electrophoresis coupled to mass spectrometry (CE/MS) has been used as a multidimensional and high throughput method for the identification and validation of biomarkers of disease in complex biological samples such as biofluids. In this study we evaluate the suitability of CE/MS to measure the consistency of different lots of the probiotic formulation Pro-Symbioflor, which is a bacterial lysate of heat-inactivated Escherichia coli and Enterococcus faecalis. Over 5000 peptides were detected by CE/MS in 5 different lots of the bacterial lysate and in a sample of culture medium. 71 to 75% of the total peptide content was identical in all lots. This percentage increased to 87–89% when allowing the absence of a peptide in one of the 5 samples. These results, based on over 2000 peptides, suggest high similarity of the 5 different lots. Sequence analysis identified peptides of both E. coli and E. faecalis and peptides originating from the culture medium, thus confirming the presence of the strains in the formulation. Ontology analysis suggested that the majority of the peptides identified for E. coli originated from the cell membrane or the fimbrium, while peptides identified for E. faecalis were enriched for peptides originating from the cytoplasm. The bacterial lysate peptides as a whole are recognised as highly conserved molecular patterns by the innate immune system as microbe-associated molecular patterns (MAMPs). Sequence analysis also identified the presence of soybean, yeast and casein protein fragments that are part of the formulation of the culture medium. In conclusion, CE/MS seems an appropriate QC tool to analyze complex biological products such as inactivated probiotic formulations and allows determining the similarity between lots.
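
The lot-consistency figures above rest on counting peptides shared across lots, with or without tolerating one absence. A minimal sketch of that counting is given below. The choice of denominator (the union of peptides across lots) is an assumption, and the peptide identifiers are made up.

```python
def shared_fraction(lots, max_missing=0):
    """Fraction of all observed peptides present in at least len(lots) - max_missing lots."""
    all_peptides = set().union(*lots)
    min_present = len(lots) - max_missing
    shared = [p for p in all_peptides if sum(p in lot for lot in lots) >= min_present]
    return len(shared) / len(all_peptides)


# Hypothetical peptide identifiers detected in three lots.
lots = [{"A", "B", "C", "D"}, {"A", "B", "C"}, {"A", "B", "D"}]

strict = shared_fraction(lots)                  # present in every lot
relaxed = shared_fraction(lots, max_missing=1)  # absent from at most one lot
```

Here `strict` is 0.5 and `relaxed` is 1.0. This mirrors how the reported percentage rises once a peptide is allowed to be missing from one sample.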

    Programmed cell death 6 interacting protein (PDCD6IP) and Rabenosyn-5 (ZFYVE20) are potential urinary biomarkers for upper gastrointestinal cancer

    PURPOSE: Cancer of the upper digestive tract (uGI) is a major contributor to cancer-related death worldwide. Due to a rise in occurrence, together with poor survival rates and a lack of diagnostic or prognostic clinical assays, there is a clear need to establish molecular biomarkers. EXPERIMENTAL DESIGN: Initial assessment was performed on urine samples from 60 control and 60 uGI cancer patients using MS to establish a peak pattern or fingerprint model, which was validated by a further set of 59 samples. RESULTS: We detected 86 cluster peaks by MS above frequency and detection thresholds. Statistical testing and model building resulted in a peak profiling model of five relevant peaks with 88% overall sensitivity and 91% specificity, and overall correctness of 90%. High-resolution MS of 40 samples in the 2-10 kDa range resulted in 646 identified proteins, and pattern matching identified four of the five model peaks within significant parameters, namely programmed cell death 6 interacting protein (PDCD6IP/Alix/AIP1), Rabenosyn-5 (ZFYVE20), protein S100A8, and protein S100A9, of which the first two were validated by Western blotting. CONCLUSIONS AND CLINICAL RELEVANCE: We demonstrate that MS analysis of human urine can identify lead biomarker candidates in uGI cancers, which makes this technique potentially useful in defining and consolidating biomarker patterns for uGI cancer screening.
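
The relationship between the reported sensitivity, specificity, and overall correctness can be checked with a short calculation. The sketch assumes the figures refer to equally sized case and control groups (60 each, as in the initial assessment). The abstract does not state which sample set the overall figures come from, so this is only a plausibility check.

```python
def accuracy_from_rates(sensitivity, specificity, n_cases, n_controls):
    """Overall correctness as the class-size-weighted mean of sensitivity and specificity."""
    true_positives = sensitivity * n_cases
    true_negatives = specificity * n_controls
    return (true_positives + true_negatives) / (n_cases + n_controls)


# Assumed group sizes: 60 cancer patients and 60 controls.
accuracy = accuracy_from_rates(0.88, 0.91, n_cases=60, n_controls=60)
```

With balanced groups the result is about 0.895, which is consistent with the reported overall correctness of 90% after rounding.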

    Proteomic profile of KSR1-regulated signalling in response to genotoxic agents in breast cancer

    Kinase suppressor of Ras 1 (KSR1) has been implicated in tumorigenesis in multiple cancers, including skin, pancreatic and lung carcinomas. However, our recent study revealed a role of KSR1 as a tumour suppressor in breast cancer, the expression of which is potentially correlated with chemotherapy response. Here, we aimed to further elucidate the KSR1-regulated signalling in response to genotoxic agents in breast cancer. Stable isotope labelling by amino acids in cell culture (SILAC) coupled to high-resolution mass spectrometry (MS) was implemented to globally characterise cellular protein levels induced by KSR1 in the presence of doxorubicin or etoposide. The acquired proteomic signature was compared and GO-STRING analysis was subsequently performed to illustrate the activated functional signalling networks. Furthermore, the clinical associations of KSR1 with identified targets and their relevance in chemotherapy response were examined in breast cancer patients. We reveal a comprehensive repertoire of thousands of proteins identified in each dataset and compare the unique proteomic profiles as well as functional connections modulated by KSR1 after doxorubicin (Doxo-KSR1) or etoposide (Etop-KSR1) stimulus. From the up-regulated top hits, several proteins, including STAT1, ISG15 and TAP1, are also found to be positively associated with KSR1 expression in patient samples. Moreover, high KSR1 expression, as well as high abundance of these proteins, is correlated with better survival in breast cancer patients who underwent chemotherapy. In aggregate, our data exemplify a broad functional network conferred by KSR1 with genotoxic agents and highlight its implication in predicting chemotherapy response in breast cancer.

    Prediction of impending type 1 diabetes through automated dual-label measurement of proinsulin:C-peptide ratio

    Background: The hyperglycemic clamp test, the gold standard of beta cell function, predicts impending type 1 diabetes in islet autoantibody-positive individuals, but the latter may benefit from less invasive function tests such as the proinsulin:C-peptide ratio (PI:C). The present study aims to optimize precision of PI:C measurements by automating a dual-label trefoil-type time-resolved fluorescence immunoassay (TT-TRFIA), and to compare its diagnostic performance for predicting type 1 diabetes with that of clamp-derived C-peptide release. Methods: Between-day imprecision (n = 20) and split-sample analysis (n = 95) were used to compare TT-TRFIA (Auto Delfia, Perkin-Elmer) with separate methods for proinsulin (in-house TRFIA) and C-peptide (Elecsys, Roche). High-risk multiple autoantibody-positive first-degree relatives (n = 49; age 5-39) were tested for fasting PI:C, HOMA2-IR and hyperglycemic clamp and followed for 20-57 months (interquartile range). Results: TT-TRFIA values for proinsulin, C-peptide and PI:C correlated significantly (r² = 0.96-0.99; P<0.001) with results obtained with separate methods. TT-TRFIA achieved better between-day %CV for PI:C at three different levels (4.5-7.1 vs 6.7-9.5 for separate methods). In high-risk relatives fasting PI:C was significantly and inversely correlated (r_s = -0.596; P<0.001) with first-phase C-peptide release during clamp (also with second phase release, only available for age 12-39 years; n = 31), but only after normalization for HOMA2-IR. In ROC and Cox regression analysis, HOMA2-IR-corrected PI:C predicted 2-year progression to diabetes as well as clamp-derived C-peptide release did. Conclusions: The reproducibility of PI:C benefits from the automated simultaneous determination of both hormones. HOMA2-IR-corrected PI:C may serve as a minimally invasive alternative to the more tedious hyperglycemic clamp test.
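
The between-day imprecision quoted above is a percent coefficient of variation (%CV), that is, the sample standard deviation of repeated measurements divided by their mean. A minimal sketch follows. The replicate values are made up and do not come from the study.

```python
import statistics


def percent_cv(values):
    """Percent coefficient of variation: 100 * sample SD / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Hypothetical PI:C results for one control level measured on three days.
replicates = [9.0, 10.0, 11.0]
cv = percent_cv(replicates)
```

For these toy replicates the sample standard deviation is 1.0 and the mean is 10.0, so `cv` is 10.0.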

    Score regularization for peptide identification

    Background: Peptide identification from tandem mass spectrometry (MS/MS) data is one of the most important problems in computational proteomics. This technique relies heavily on the accurate assessment of the quality of peptide-spectrum matches (PSMs). However, current MS technology and PSM scoring algorithms are far from perfect, leading to the generation of incorrect peptide-spectrum pairs. Thus, it is critical to develop new post-processing techniques that can distinguish true identifications from false identifications effectively. Results: In this paper, we present a consistency-based PSM re-ranking method to improve the initial identification results. This method uses one additional assumption that two peptides belonging to the same protein should be correlated to each other. We formulate an optimization problem that embraces two objectives through regularization: the smoothing consistency among scores of correlated peptides and the fitting consistency between new scores and initial scores. This optimization problem can be solved analytically. The experimental study on several real MS/MS data sets shows that this re-ranking method improves the identification performance. Conclusions: The score regularization method can be used as a general post-processing step for improving peptide identifications. Source codes and data sets are available at: http://bioinformatics.ust.hk/SRPI.rar.
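
The two objectives named in this abstract can be illustrated with a generic graph-regularization sketch. It assumes the objective sum_i (f_i - y_i)^2 + (lam / 2) * sum_ij W_ij (f_i - f_j)^2, where y holds the initial scores and W is a symmetric, non-negative adjacency matrix linking peptides of the same protein. This exact form, the weight `lam`, and the function name are assumptions, not the paper's formulation. The minimizer satisfies (I + lam * L) f = y with L = D - W, and the code solves that system by Jacobi iteration instead of using the analytic inverse.

```python
def regularize_scores(initial, weights, lam=1.0, iterations=200):
    """Solve (I + lam * (D - W)) f = y by Jacobi iteration.

    initial: list of initial scores y.
    weights: symmetric non-negative adjacency matrix W as nested lists.
    """
    n = len(initial)
    degree = [sum(row) for row in weights]
    scores = list(initial)
    for _ in range(iterations):
        # Each score is pulled toward its initial value (fitting term)
        # and toward its neighbours' current scores (smoothing term).
        scores = [
            (initial[i] + lam * sum(weights[i][j] * scores[j] for j in range(n)))
            / (1.0 + lam * degree[i])
            for i in range(n)
        ]
    return scores


# Hypothetical example: peptides 0 and 1 share a protein; peptide 2 has no sibling.
y = [3.0, 1.0, 1.0]
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
f = regularize_scores(y, W, lam=1.0)
```

In this toy case the weak score of peptide 1 is raised toward its strong sibling (from 1 to 5/3) and the sibling is lowered slightly (from 3 to 7/3), while the isolated peptide 2 keeps its initial score.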

    Latent protein trees

    Unbiased, label-free proteomics is becoming a powerful technique for measuring protein expression in almost any biological sample. The output of these measurements after preprocessing is a collection of features and their associated intensities for each sample. Subsets of features within the data are from the same peptide, subsets of peptides are from the same protein, and subsets of proteins are in the same biological pathways; therefore, there is the potential for very complex and informative correlational structure inherent in these data. Recent attempts to utilize these data often focus on the identification of single features that are associated with a particular phenotype that is relevant to the experiment. However, to date, there have been no published approaches that directly model what we know to be multiple different levels of correlation structure. Here we present a hierarchical Bayesian model which is specifically designed to model such correlation structure in unbiased, label-free proteomics. This model utilizes partial identification information from peptide sequencing and database lookup as well as the observed correlation in the data to appropriately compress features into latent proteins and to estimate their correlation structure. We demonstrate the effectiveness of the model using artificial/benchmark data and in the context of a series of proteomics measurements of blood plasma from a collection of volunteers who were infected with two different strains of viral influenza. Comment: Published at http://dx.doi.org/10.1214/13-AOAS639 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).