10 research outputs found

    Features-Based Deisotoping Method for Tandem Mass Spectra

    For high-resolution tandem mass spectra, the determination of monoisotopic masses of fragment ions plays a key role in the subsequent peptide and protein identification. In this paper, we present a new algorithm for deisotoping bottom-up spectra. Isotopic-cluster graphs are constructed to describe the relationships between all possible isotopic clusters. Based on the relationships in the isotopic-cluster graphs, each possible isotopic cluster is assessed with a score function, which is built by combining non-intensity and intensity features of fragment ions. The non-intensity features are used to prevent low-intensity fragment ions from being removed. Dynamic programming is adopted to find the highest-scoring path, which contains the most reliable isotopic clusters. The experimental results show that the average Mascot scores and F-scores of peptides identified from spectra processed by our deisotoping method are greater than those obtained with the YADA and MS-Deconv software.
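
To make the idea of a candidate isotopic cluster concrete, the following is a minimal sketch and not the paper's method: it replaces the isotopic-cluster graph, score function, and dynamic programming with a simple greedy search that groups peaks whose m/z spacing matches the 13C/12C mass difference divided by an assumed charge. The names `find_clusters` and `neutral_mass`, the tolerance, and the charge range are illustrative assumptions.

```python
# Greedy candidate-cluster search (illustrative only; the paper scores
# clusters on a graph and selects them with dynamic programming).

ISOTOPE_SPACING = 1.003355  # approximate 13C - 12C mass difference, in Da
PROTON_MASS = 1.007276      # approximate proton mass, in Da


def find_clusters(mzs, max_charge=3, tol=0.01, min_peaks=2):
    """Return (monoisotopic m/z, charge, peak count) for each greedy cluster."""
    mzs = sorted(mzs)
    used = set()
    clusters = []
    for i, start in enumerate(mzs):
        if i in used:
            continue
        best = None
        # Try higher charges first so that ties favor the higher charge.
        for z in range(max_charge, 0, -1):
            members = [i]
            expected = start + ISOTOPE_SPACING / z
            for j in range(i + 1, len(mzs)):
                if j in used:
                    continue
                if abs(mzs[j] - expected) <= tol:
                    members.append(j)
                    expected = mzs[j] + ISOTOPE_SPACING / z
                elif mzs[j] > expected + tol:
                    break  # peaks are sorted, so no later peak can match
            if len(members) >= min_peaks and (
                best is None or len(members) > len(best[1])
            ):
                best = (z, members)
        if best is not None:
            z, members = best
            used.update(members)
            clusters.append((start, z, len(members)))
    return clusters


def neutral_mass(mz, z):
    """Neutral monoisotopic mass of a protonated ion with charge z."""
    return (mz - PROTON_MASS) * z
```

A greedy pass like this cannot resolve overlapping clusters, which is one reason a scored graph search is attractive.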

    Isotopic envelope identification by analysis of the spatial distribution of components in MALDI-MSI data

    Mass spectrometry, which provides information about the structure of proteins, is one of the significant steps in the process leading to protein identification. Removing isotope peaks from the mass spectrum, a process called deisotoping, is vital. Different deisotoping algorithms exist, but each has limitations and is dedicated to a particular mass spectrometry method. Data from experiments performed with the MALDI-ToF technique are characterized by high dimensionality. This paper presents a method for identifying isotope envelopes in MALDI-ToF molecular imaging data based on the Mamdani-Assilan fuzzy system and spatial maps of the molecular distribution of peaks included in the isotopic envelope. Several image texture measures were used to evaluate the spatial molecular distribution maps. The algorithm was tested on eight datasets obtained from MALDI-ToF experiments on samples from the National Institute of Oncology in Gliwice, taken from patients with cancer of the head and neck region. The data were subjected to pre-processing and feature extraction. The results were collected and compared with three existing deisotoping algorithms. The analysis showed that the proposed method, which takes an approach oriented toward studying peak pairs, enables the detection of overlapping envelopes. Moreover, the proposed algorithm enables the analysis of large data sets.

    De Novo Sequencing of Peptides from High-Resolution Bottom-Up Tandem Mass Spectra using Top-Down Intended Methods

    Although high-resolution mass spectrometers are becoming accessible to more and more laboratories, tandem (MS/MS) mass spectra are still often collected at low resolution. Even when spectra are acquired at high resolution, the software tools used to process them tend not to benefit fully from it: the ability to specify a relative mass tolerance often remains the only feature the respective algorithms take advantage of. We argue that high-resolution MS/MS spectra are analyzed more efficiently with methods that account more explicitly for the precision level. We support this claim by demonstrating that a de novo sequencing framework originally developed for (high-resolution) top-down MS/MS data is perfectly suitable for processing high-resolution bottom-up datasets, even though a top-down-like deconvolution performed as the first step will leave at most a few peaks in many spectra.

    Filtering Methods for Mass Spectrometry-based Peptide Identification Processes

    Tandem mass spectrometry (MS/MS) is a powerful tool for identifying peptide sequences. In a typical experiment, incorrect peptide identifications may result from noise in the MS/MS spectra and from the low quality of the spectra. Filtering methods are widely used to remove the noise and improve the quality of the spectra before the subsequent spectra identification process. However, existing filtering methods often use features and empirically assigned weights. These weights may not reflect the fact that the contribution of each feature may vary from dataset to dataset. Therefore, filtering methods that can adapt to different datasets have the potential to improve peptide identification results. This thesis proposes two adaptive filtering methods, denoising and quality assessment, both of which improve the efficiency and effectiveness of peptide identification. First, the denoising approach employs an adaptive method for picking signal peaks that is more suitable for the datasets of interest. When the approach is applied to two tandem mass spectra datasets, about 66% of peaks (likely noise peaks) can be removed. The number of peptides subsequently identified on those datasets increased by 14% and 23%, respectively, compared to previous work (Ding et al., 2009a). Second, the quality assessment method estimates the probabilities of spectra being high quality based on quality assessments of the individual features. The probabilities are estimated by solving a constrained optimization problem. Experimental results on two datasets illustrate that searching only the high-quality tandem spectra determined using this method saves about 56% and 62% of database searching time and loses 9% of high-quality spectra. Finally, the thesis suggests future research directions, including feature selection and clustering of peptides.

    ANALYSIS AND SIMULATION OF TANDEM MASS SPECTROMETRY DATA

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes), and their interactions (e.g. proteins binding together to form complex molecular machines) to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundance for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations on present algorithms, simulating mass spectroscopy instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment. 
Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum-matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community.
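
As a rough illustration of what a theoretical isotope distribution looks like, the sketch below models only carbon, treating each carbon atom as independently 13C with a natural abundance of about 1.07%, which gives a binomial distribution over the number of heavy atoms. This is a deliberate simplification and not the equation derived in the dissertation, which is not given in the abstract; the function name and the `max_peaks` parameter are hypothetical.

```python
from math import comb

C13_ABUNDANCE = 0.0107  # approximate natural abundance of carbon-13


def carbon_isotope_distribution(n_carbons, max_peaks=5):
    """Probability of k carbon-13 atoms, for k = 0 .. max_peaks - 1.

    Carbon-only binomial model; other elements (H, N, O, S) are ignored.
    """
    p = C13_ABUNDANCE
    n_peaks = min(max_peaks, n_carbons + 1)
    return [
        comb(n_carbons, k) * p**k * (1 - p) ** (n_carbons - k)
        for k in range(n_peaks)
    ]
```

Under this model the monoisotopic peak is the tallest for small fragments, but with around 100 carbon atoms the first isotope peak already exceeds it, which is why envelope shape carries information about fragment size.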

    Protein inference based on peptides identified from tandem mass spectra

    Protein inference is a critical computational step in the study of proteomics. It lays the foundation for further structural and functional analysis of proteins, based on which new medicine or technology can be developed. Today, mass spectrometry (MS) is the technique of choice for large-scale inference of proteins in proteomics. In MS-based protein inference, three levels of data are generated: (1) tandem mass spectra (MS/MS); (2) peptide sequences and their scores or probabilities; and (3) protein sequences and their scores or probabilities. Accordingly, the protein inference problem can be divided into three computational phases: (1) process MS/MS to improve the quality of the data and facilitate subsequent peptide identification; (2) postprocess peptide identification results from existing algorithms which match MS/MS to peptides; and (3) infer proteins by assembling identified peptides. Addressing these computational problems constitutes the main content of this thesis. The processing of MS/MS data mainly includes denoising, quality assessment, and charge state determination. Here, we discuss the determination of charge states from MS/MS data acquired using low-resolution collision-induced dissociation. Such spectra with multiple charges are usually searched multiple times, once for each possible charge state. Not only does this strategy increase the overall database search time, but it also yields more false positives. Hence, it is advantageous to determine the charge states of such spectra before the database search. A new approach is proposed to determine the charge states of low-resolution MS/MS. Four novel and discriminant features are adopted to describe each MS/MS and are used in a Gaussian mixture model to distinguish doubly and triply charged peptides. The results show that this method can assign charge states to low-resolution MS/MS more accurately than existing methods. Many search engines are available for peptide identification. 
However, there is usually a high false positive rate (FPR) in the results. This can bring many false identifications to protein inference. As a result, it is necessary to postprocess peptide identification results. The most commonly used method is statistical analysis, which not only makes it possible to compare and combine the results from different search engines, but also facilitates subsequent protein inference. We proposed a new method to estimate the accuracy of peptide identification with logistic regression (LR) and exemplify it based on Sequest scores. Each peptide is characterized with the regularized Sequest scores ΔCn∗ and Xcorr∗. The score regularization is formulated as an optimization problem by applying two assumptions: the smoothing consistency between sibling peptides and the fitting consistency between original scores and new scores. The results show that the proposed method can robustly assign accurate probabilities to peptides and has a very high discrimination power, higher than that of PeptideProphet, to distinguish correctly and incorrectly identified peptides. Given identified peptides and their probabilities, protein inference is conducted by assembling these peptides. Existing methods to address this MS-based protein inference problem can be classified into two groups: two-stage methods and unified frameworks that both identify peptides and infer proteins. In two-stage methods, protein inference is based on, but also separated from, peptide identification. In a unified framework, by contrast, protein inference and peptide identification are integrated. In this study, we proposed a unified framework for protein inference, and developed an iterative method accordingly to infer proteins based on Sequest peptide identification. The statistical analysis of peptide identification is performed with the LR previously introduced. 
Protein inference and peptide identification are iterated in one framework by adding feedback from protein inference to peptide identification. The feedback information is a list of high-confidence proteins, which is used to update the adjacency matrix between peptides. The adjacency matrix is used in the regularization of peptide scores. The results show that the proposed method can infer more true positive proteins, while outputting fewer false positive proteins, than ProteinProphet at the same FPR. The coverage of inferred proteins is also significantly increased due to the selection of multiple peptides for each MS/MS spectrum and the improvement of their scores by the feedback from the inferred proteins.

    Algorithms for integrated analysis of glycomics and glycoproteomics by LC-MS/MS

    The glycoproteome is an intricate and diverse component of a cell, and it plays a key role in the definition of the interface between that cell and the rest of its world. Methods for studying the glycoproteome have been developed for released glycan glycomics and site-localized bottom-up glycoproteomics using liquid chromatography-coupled mass spectrometry and tandem mass spectrometry (LC-MS/MS), which is itself a complex problem. Algorithms for interpreting these data are necessary to be able to extract biologically meaningful information in a high throughput, automated context. Several existing solutions have been proposed but may be found lacking for larger glycopeptides, for complex samples, different experimental conditions, different instrument vendors, or even because they simply ignore fundamentals of glycobiology. I present a series of open algorithms that approach the problem from an instrument vendor neutral, cross-platform fashion to address these challenges, and integrate key concepts from the underlying biochemical context into the interpretation process. In this work, I created a suite of deisotoping and charge state deconvolution algorithms for processing raw mass spectra at an LC scale from a variety of instrument types. These tools performed better than previously published algorithms by enforcing the underlying chemical model more strictly, while maintaining a higher degree of signal fidelity. From this summarized, vendor-normalized data, I composed a set of algorithms for interpreting glycan profiling experiments that can be used to quantify glycan expression. From this I constructed a graphical method to model the active biosynthetic pathways of the sample glycome and dig deeper into those signals than would be possible from the raw data alone. 
Lastly, I created a glycopeptide database search engine from these components that is capable of identifying the widest array of glycosylation types available. I also demonstrate a learning algorithm that can be used to tune the model to better understand the process of glycopeptide fragmentation under specific experimental conditions, outperforming a simpler model by between 10% and 15%. This approach can be further augmented with sample-wide or site-specific glycome models to increase depth-of-coverage for glycoforms consistent with prior beliefs.

    Identifying protein complexes and disease genes from biomolecular networks

    With advances in high-throughput measurement techniques, large-scale biological data, such as protein-protein interaction (PPI) data, gene expression data, gene-disease association data, cellular pathway data, and so on, have been and will continue to be produced. These data contain insightful information for understanding the mechanisms of biological systems and have proven useful for developing new methods in disease diagnosis, disease treatment and drug design. This study focuses on two main research topics: (1) identifying protein complexes and (2) identifying disease genes from biomolecular networks. Firstly, protein complexes are groups of proteins that interact with each other at the same time and place within living cells. They are molecular entities that carry out cellular processes. The identification of protein complexes plays a primary role in understanding the organization of proteins and the mechanisms of biological systems. Many previous algorithms are designed based on the assumption that protein complexes are densely connected sub-graphs in PPI networks. In this research, a dense sub-graph detection algorithm is first developed following this assumption by using clique seeds and graph entropy. Although the proposed algorithm generates a large number of reasonable predictions and its f-score is better than that of many previous algorithms, it still cannot identify many known protein complexes. After that, we analyze characteristics of known yeast protein complexes and find that not all of the complexes exhibit dense structures in PPI networks. Many of them have a star-like structure, which is a very special case of the core-attachment structure and cannot be identified by many previous core-attachment-structure-based algorithms. To increase the prediction accuracy of protein complex identification, a multiple-topological-structure-based algorithm is proposed to identify protein complexes from PPI networks. 
Four single-topological-structure-based algorithms are first employed to detect raw predictions with clique, dense, core-attachment and star-like structures, respectively. A merging and trimming step is then adopted to generate final predictions based on topological information or GO annotations of predictions. A comprehensive review about the identification of protein complexes from static PPI networks to dynamic PPI networks is also given in this study. Secondly, genetic diseases often involve the dysfunction of multiple genes. Various types of evidence have shown that similar disease genes tend to lie close to one another in various biomolecular networks. The identification of disease genes via multiple data integration is indispensable towards the understanding of the genetic mechanisms of many genetic diseases. However, the number of known disease genes related to similar genetic diseases is often small. It is not easy to capture the intricate gene-disease associations from such a small number of known samples. Moreover, different kinds of biological data are heterogeneous and no widely acceptable criterion is available to standardize them to the same scale. In this study, a flexible and reliable multiple data integration algorithm is first proposed to identify disease genes based on the theory of Markov random fields (MRF) and the method of Bayesian analysis. A novel global-characteristic-based parameter estimation method and an improved Gibbs sampling strategy are introduced, such that the proposed algorithm has the capability to tune parameters of different data sources automatically. However, the Markovianity characteristic of the proposed algorithm means it only considers information of direct neighbors to formulate the relationship among genes, ignoring the contribution of indirect neighbors in biomolecular networks. 
To overcome this drawback, a kernel-based MRF algorithm is further proposed to take advantage of the global characteristics of biological data via graph kernels. The kernel-based MRF algorithm generates better predictions than many previous disease gene identification algorithms in terms of the area under the receiver operating characteristic curve (AUC score). However, it is very time-consuming, since the Gibbs sampling process of the algorithm has to maintain a long Markov chain for every single gene. Finally, to reduce the computational time of the MRF-based algorithm, a fast and high-performance logistic-regression-based algorithm is developed for identifying disease genes from biomolecular networks. Numerical experiments show that the proposed algorithm outperforms many existing methods in terms of the AUC score and running time. To summarize, this study has developed several computational algorithms for identifying protein complexes and disease genes from biomolecular networks. The proposed algorithms perform better than many existing algorithms in the literature.
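
The observation that star-like complexes escape dense sub-graph detection can be checked with the standard edge-density measure, 2E / (n(n - 1)), for an undirected sub-graph with n nodes and E internal edges. The sketch below is only an illustration of that measure; it is not the thesis's clique-seed and graph-entropy algorithm, and the function name and input format are assumptions.

```python
def subgraph_density(edges, nodes):
    """Edge density of the sub-graph induced by `nodes` in an undirected graph.

    `edges` is an iterable of (u, v) pairs; self-loops and duplicates are ignored.
    """
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = {
        frozenset((u, v))
        for u, v in edges
        if u != v and u in nodes and v in nodes
    }
    return 2 * len(internal) / (n * (n - 1))
```

A triangle has density 1.0, whereas a hub with three attachments has density 0.5, and the value keeps falling as a star grows, so a density threshold tuned for cliques would tend to miss it.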

    MSQBAT - A Software Suite for LC-MS Protein Quantification

    Assessing the relative changes in protein abundance is essential for a proper understanding of the various processes underlying disease progression and development. Nowadays, mass spectrometry-based proteomics allows for the identification of several thousand proteins in a single analysis. Unfortunately, mass spectrometry is inherently not quantitative, which is why additional techniques for protein quantification have to be developed. To measure quantitative changes in protein abundance, biological samples either need to be labeled using stable isotopes or protein abundances have to be computed using so-called label-free techniques. Label-based quantification approaches are costly and the number of samples that can be quantified against each other is limited. Furthermore, depending on the sample, the introduction of the labels can be elaborate. Label-free quantification is not confronted with these limitations; in principle, an unlimited number of samples can be quantified without the introduction of isotopes. Yet these advantages have their price: the development of label-free quantification algorithms is not trivial and requires profound knowledge of both bioinformatics and mass spectrometry. In particular, the design of systems flexible enough to quantify data deriving from different mass spectrometric systems and proteomic workflows requires additional experience and time. In order to quantify data acquired by LC-MALDI-MS, a novel software suite termed MSQBAT was developed and evaluated. MSQBAT is a platform-independent software suite for MS1-based, label-free protein quantification. In contrast to other software solutions, MSQBAT is highly flexible and suited for the quantification of mass spectrometric data from various instrumental setups and proteomic workflows, such as (Ge)LC-MALDI-MS and (Ge)LC-ESI-MS. Quantification capabilities were evaluated using spike-in experiments analyzed with different proteomic workflows and instruments. 
Human proteins were spiked in variable concentrations into a complex E. coli background proteome and processed using both an LC-MS and a GeLC-MS approach. Samples were chromatographically separated on a nanoACQUITY UPLC system using a 120-minute gradient and subsequently analyzed by an AB SCIEX TOF/TOF 5800 system and an AB SCIEX QTRAP 6500 system. Furthermore, a publicly available quantification benchmark data set was used to evaluate LC-ESI-MS quantification capabilities. The results show that MSQBAT can be applied to quantify data deriving from both LC-/GeLC-MALDI-MS and LC-/GeLC-ESI-MS workflows with high accuracy. This software suite therefore has a range of application exceeding that of all currently available solutions.