
    An unsupervised machine learning method for assessing quality of tandem mass spectra

    Abstract
    Background: In a single proteomic project, tandem mass spectrometers can produce hundreds of millions of tandem mass spectra. However, the majority of these spectra are of poor quality, and searching them for peptides wastes time. Quality assessment before database searching is therefore very useful in the pipeline of protein identification via tandem mass spectra, both to reduce searching time and to decrease false identifications. Most existing methods for quality assessment are supervised machine learning methods based on a number of features that describe the quality of tandem mass spectra. These methods require training datasets in which the quality of every spectrum is known, which are usually unavailable for new datasets.
    Results: This study proposes an unsupervised machine learning method for quality assessment of tandem mass spectra that requires no training dataset. The proposed method estimates the conditional probabilities of spectra being high quality from the quality assessments based on individual features. The probabilities are estimated by solving a constrained optimization problem; an efficient algorithm is developed to solve it and is proved to be convergent. Experimental results on two datasets illustrate that searching only the tandem spectra judged high quality by the proposed method saves about 56% and 62% of database searching time while losing only a small number of high-quality spectra.
    Conclusions: The results indicate that the proposed method performs well for the quality assessment of tandem mass spectra and that the way the conditional probabilities are estimated is effective.
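The abstract does not give the paper's constrained-optimization formulation, but the general idea of inferring P(high quality | per-feature assessments) without labels can be illustrated with a standard latent-class EM model. The sketch below is an assumption of this kind, not the paper's actual algorithm: each feature-based assessor casts a binary vote per spectrum, votes are assumed conditionally independent given a hidden high/low-quality class, and EM recovers the posterior probability that each spectrum is high quality.

```python
import random

def em_quality_posteriors(votes, n_iter=50):
    """Latent-class EM: estimate P(high quality | per-feature votes).

    votes: list of rows of 0/1 votes, one row per spectrum, one column per
    feature-based assessor. Conditional independence of votes given the
    latent class is an illustrative assumption, not the paper's model.
    """
    n, m = len(votes), len(votes[0])
    pi = 0.5              # prior P(spectrum is high quality)
    a = [0.8] * m         # P(vote=1 | high); init > b fixes the label order
    b = [0.2] * m         # P(vote=1 | low)
    post = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each spectrum is high quality
        for i, row in enumerate(votes):
            lh, ll = pi, 1.0 - pi
            for j, v in enumerate(row):
                lh *= a[j] if v else (1.0 - a[j])
                ll *= b[j] if v else (1.0 - b[j])
            post[i] = lh / (lh + ll)
        # M-step: re-estimate the prior and per-feature vote rates
        s = sum(post)
        pi = s / n
        for j in range(m):
            num_h = sum(post[i] for i in range(n) if votes[i][j])
            num_l = sum(1 - post[i] for i in range(n) if votes[i][j])
            a[j] = min(max(num_h / s, 1e-6), 1 - 1e-6)
            b[j] = min(max(num_l / (n - s), 1e-6), 1 - 1e-6)
    return post

# Synthetic check: 100 high-quality spectra (features vote 1 with prob 0.9)
# and 100 low-quality spectra (prob 0.1), five feature-based assessors.
random.seed(0)
votes, truth = [], []
for i in range(200):
    hi = i < 100
    truth.append(hi)
    p = 0.9 if hi else 0.1
    votes.append([1 if random.random() < p else 0 for _ in range(5)])
posteriors = em_quality_posteriors(votes)
```

Spectra with posteriors below a chosen threshold would be filtered out before the database search, which is how the time savings reported in the abstract arise.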

    Computational Analysis of Mass Spectrometric Data for Whole Organism Proteomic Studies

    In recent decades, great breakthroughs have been achieved in the study of genomes, supplying us with vast knowledge of genes and a large number of sequenced organisms. With the availability of genome information, new systematic studies have arisen. One of the most prominent areas is proteomics, a discipline devoted to the study of an organism's expressed protein content. Proteomics studies are concerned with a wide range of problems; major focuses include the study of protein expression patterns, the detection of protein-protein interactions, protein quantitation, protein localization analysis, and the characterization of post-translational modifications. The emergence of proteomics shows great promise for furthering our understanding of cellular processes and the mechanisms of life. One of the main techniques used for high-throughput proteomic studies is mass spectrometry. Capable of detecting the masses of biological compounds in complex mixtures, it is currently one of the most powerful methods for protein characterization. New horizons are opening with new developments in mass spectrometry instrumentation, which can now be applied to a variety of proteomic problems. One of the most popular applications of proteomics involves whole-organism high-throughput experiments. However, as new instrumentation is developed, followed by the design of new experiments, we find ourselves needing new computational algorithms to interpret the results. As the thresholds of current technology are probed, new algorithmic designs are beginning to emerge to meet the challenges of mass spectrometry data evaluation and interpretation. This dissertation is devoted to the computational analysis of mass spectrometric data, combining different topics and techniques to improve our understanding of biological processes through high-throughput whole-organism proteomic studies. It consists of the development of new algorithms to improve the data interpretation of current tools, the introduction of a new algorithmic approach for post-translational modification detection, and the characterization of a set of computational simulations for biological agent detection in a complex organism background. These studies are designed to further our ability to understand the results of high-throughput mass spectrometric experiments and their impact on the field of proteomics.

    Characterization of proteoforms with unknown post-translational modifications using the MIScore

    Various proteoforms may be generated from a single gene due to primary structure alterations (PSAs) such as genetic variations, alternative splicing, and post-translational modifications (PTMs). Top-down mass spectrometry is capable of analyzing intact proteins and identifying patterns of multiple PSAs, making it the method of choice for studying complex proteoforms. In top-down proteomics, proteoform identification is often performed by searching tandem mass spectra against a protein sequence database that contains only one reference protein sequence for each gene or transcript variant in a proteome. Because of the incompleteness of the protein database, an identified proteoform may contain unknown PSAs compared with the reference sequence. Proteoform characterization is the task of identifying and localizing PSAs in a proteoform. Although many software tools have been proposed for proteoform identification by top-down mass spectrometry, the characterization of proteoforms in identified proteoform–spectrum matches still relies mainly on manual annotation. We propose to use the Modification Identification Score (MIScore), which is based on Bayesian models, to automatically identify and localize PTMs in proteoforms. Experiments showed that the MIScore is accurate in identifying and localizing one or two modifications.
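The abstract does not specify the MIScore's Bayesian model, but the core task (scoring alternative placements of a modification against observed fragment masses) can be sketched generically. The toy below is an illustrative assumption, not the actual MIScore: for each candidate site it places the modification, generates theoretical b-ion masses, counts matches against observed peaks within a tolerance, and turns a simple binomial likelihood into a posterior over sites under a uniform prior. The residue and modification masses in the example are standard monoisotopic values.

```python
PROTON = 1.00728  # approximate proton mass used for singly charged b-ions

def b_ion_masses(residue_masses, mod_mass, site):
    """Theoretical b1..b(n-1) ion masses with the modification at `site`."""
    masses, total = [], PROTON
    for i, m in enumerate(residue_masses[:-1]):
        total += m + (mod_mass if i == site else 0.0)
        masses.append(total)
    return masses

def localization_posteriors(residue_masses, mod_mass, sites, peaks,
                            tol=0.02, p_match=0.7):
    """Posterior over candidate modification sites (toy Bayesian score).

    Each placement's likelihood is binomial: matched fragments occur with
    probability p_match, unmatched with 1 - p_match. Uniform site prior.
    """
    likes = []
    for s in sites:
        theo = b_ion_masses(residue_masses, mod_mass, s)
        k = sum(1 for t in theo if any(abs(t - p) <= tol for p in peaks))
        likes.append(p_match ** k * (1 - p_match) ** (len(theo) - k))
    z = sum(likes)
    return [l / z for l in likes]

# Peptide G-A-S-V with a phosphorylation (79.96633 Da) truly on the serine
# (index 2); "observed" peaks are generated from the true placement.
residues = [57.02146, 71.03711, 87.03203, 99.06841]  # G, A, S, V
phospho = 79.96633
peaks = b_ion_masses(residues, phospho, 2)
post = localization_posteriors(residues, phospho, [0, 1, 2, 3], peaks)
```

A score of this kind lets the characterization step report not just the best site but how confidently the fragment evidence discriminates it from the alternatives, which is the role the abstract assigns to the MIScore.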

    Development and Integration of Informatic Tools for Qualitative and Quantitative Characterization of Proteomic Datasets Generated by Tandem Mass Spectrometry

    Shotgun proteomic experiments provide qualitative and quantitative analytical information from biological samples ranging in complexity from simple bacterial isolates to higher eukaryotes such as plants and humans, and even to communities of microbial organisms. Improvements to instrument performance, sample preparation, and informatic tools are increasing the scope and volume of data that can be analyzed by mass spectrometry (MS). To accommodate these advances, it is becoming increasingly essential to choose and/or create tools that not only scale well but also make more informed decisions using additional features within the data. Incorporating novel and existing tools into a scalable, modular workflow not only provides more accurate, contextualized perspectives of processed data, but also generates detailed, standardized outputs that can be used for future studies dedicated to mining general analytical or biological features, anomalies, and trends. This research developed cyber-infrastructure that allows a user to seamlessly run multiple analyses, store the results, and share processed data with other users. The work represented in this dissertation demonstrates the successful implementation of an enhanced bioinformatics workflow designed to analyze raw data generated directly by MS instruments and to create fully annotated reports of qualitative and quantitative protein information for large-scale proteomics experiments. Answering such questions requires several points of engagement between informatics and an analytical understanding of the underlying biochemistry of the system under observation. Deriving meaningful information from analytical data can be achieved by linking together the answers to more focused, logistical questions. This study focuses on the following aspects of proteomics experiments: spectrum-to-peptide matching, peptide-to-protein mapping, and protein quantification and differential expression. The interaction and usability of these analyses and other existing tools are also described. By constructing a workflow that allows high-throughput processing of massive datasets, data collected within the past decade can be standardized and updated with the most recent analyses.
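The abstract names protein quantification as one workflow stage without specifying the metric used. One common label-free approach in shotgun proteomics, shown here purely as an illustrative example rather than the dissertation's actual pipeline, is the Normalized Spectral Abundance Factor (NSAF): each protein's spectral count is divided by its length and then normalized across all proteins in the sample.

```python
def nsaf(spectral_counts, lengths):
    """Normalized Spectral Abundance Factor per protein.

    spectral_counts: {protein_id: number of identified spectra}
    lengths:         {protein_id: protein length in residues}
    Returns NSAF values that sum to 1 across the sample, so values are
    comparable between runs of different depths.
    """
    # Length-normalized spectral abundance factor for each protein
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

# A short protein with 10 spectra vs. a protein four times longer with
# twice the spectra: the short protein gets the higher abundance estimate.
values = nsaf({"protA": 10, "protB": 20}, {"protA": 100, "protB": 400})
```

Differential expression would then compare such normalized values across conditions, typically with replicate-aware statistics rather than raw counts.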