3,421 research outputs found
Non-metric similarity search of tandem mass spectra including posttranslational modifications
In biological applications, tandem mass spectrometry is a widely used method for determining protein and peptide sequences from an “in vitro” sample. The sequences are not determined directly; they must be interpreted from the mass spectra, which are the output of the mass spectrometer. This work focuses on a similarity-search approach to mass spectra interpretation, in which the parameterized Hausdorff distance (dHP) is used as the similarity measure. To provide an efficient similarity search under dHP, metric access methods and the TriGen algorithm (which controls the metricity of dHP) are employed. Moreover, the search model based on dHP supports posttranslational modifications (PTMs) in the query mass spectra, which is typically a problem when an indexing approach is used. Our approach can be used as a coarse filter by any other database approach to mass spectra interpretation.
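To make the similarity measure above concrete, here is a minimal sketch of the classic (unparameterized) Hausdorff distance between two peak lists. The abstract does not define dHP, so this is not the authors' measure; the function names and the example m/z values are hypothetical.

```python
from typing import Sequence


def directed_hausdorff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest distance from any peak in `a` to its nearest peak in `b`."""
    return max(min(abs(x - y) for y in b) for x in a)


def hausdorff(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic symmetric Hausdorff distance between two peak lists.

    Assumption: spectra are reduced to lists of m/z values. The paper's
    parameterized variant (dHP) is NOT reproduced here.
    """
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical m/z values, for illustration only.
query = [114.1, 227.2, 342.2]
candidate = [114.0, 227.2, 342.2, 455.3]
print(hausdorff(query, candidate))
```

A single unmatched peak dominates the classic distance, which suggests why a parameterized variant is attractive when the query spectrum carries modifications.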
Parallel algorithms for real-time peptide-spectrum matching
Tandem mass spectrometry is a powerful experimental tool used in molecular biology to determine the composition of protein mixtures. It has become a standard technique for protein identification. Due to the rapid development of mass spectrometry technology, the instrument can now produce a large number of mass spectra which are used for peptide identification. The increasing data size demands efficient software tools to perform peptide identification.
In a tandem mass experiment, peptide ion selection algorithms generally select only the most abundant peptide ions for further fragmentation. Because of this, the low-abundance proteins in a sample rarely get identified. To address this problem, researchers developed the notion of a 'dynamic exclusion list', which maintains a list of newly selected peptide ions and ensures that these peptide ions do not get selected again for a certain time. In this way, other peptide ions get more opportunity to be selected and identified, allowing for identification of peptides of lower abundance.
However, a better method is to also incorporate the identification results into the 'dynamic exclusion list' approach. To do this, a real-time peptide identification algorithm is required.
In this thesis, we introduce methods to improve the speed of peptide identification so that the 'dynamic exclusion list' approach can use the peptide identification results without affecting the throughput of the instrument. Our work is based on RT-PSM, a real-time program for peptide-spectrum matching with statistical significance. We profile the speed of RT-PSM and find that the peptide-spectrum scoring module is the most time-consuming portion.
Guided by the profiling results, we introduce methods to parallelize the peptide-spectrum scoring algorithm. In this thesis, we propose two parallel algorithms using different technologies. The first performs parallel peptide-spectrum matching using SIMD instructions. We implemented and tested this algorithm on the Intel SSE architecture. The test results show that an 18-fold speedup on the entire process is obtained. The second parallel algorithm is developed using NVIDIA CUDA technology. We describe two CUDA kernels based on different algorithms and compare their performance. The more efficient algorithm is integrated into RT-PSM. The time measurements show that a 190-fold speedup on the scoring module and a 26-fold speedup on the entire process are achieved. We profile the CUDA version again to show that the scoring module has been optimized to the point where it is no longer the most time-consuming module in the CUDA version of RT-PSM.
In addition, we evaluate the feasibility of creating a metric index to reduce the number of candidate peptides. We describe evaluation methods and show that general indexing methods are not likely to be feasible for RT-PSM.
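The dynamic exclusion list described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: the class name, the m/z tolerance, the time window, and the top-N selection rule are all assumptions.

```python
class DynamicExclusionList:
    """Remembers recently selected precursor m/z values for a fixed time window."""

    def __init__(self, exclusion_seconds, mz_tolerance):
        self.exclusion_seconds = exclusion_seconds  # assumed window length
        self.mz_tolerance = mz_tolerance            # assumed matching tolerance
        self._entries = []                          # list of (mz, time_added)

    def _expire(self, now):
        # Drop entries whose exclusion window has elapsed.
        self._entries = [
            (mz, t) for mz, t in self._entries
            if now - t < self.exclusion_seconds
        ]

    def is_excluded(self, mz, now):
        self._expire(now)
        return any(abs(mz - e) <= self.mz_tolerance for e, _ in self._entries)

    def add(self, mz, now):
        self._entries.append((mz, now))


def select_precursors(peaks, exclusion, now, top_n):
    """Pick up to `top_n` most intense peaks that are not currently excluded.

    `peaks` is a list of (mz, intensity) pairs from one survey scan.
    """
    selected = []
    for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if len(selected) == top_n:
            break
        if not exclusion.is_excluded(mz, now):
            exclusion.add(mz, now)
            selected.append(mz)
    return selected
```

With repeated scans over the same peaks, the most intense ion is picked first and then skipped until its window expires, so weaker ions get a turn. Feeding identification results back in, as the abstract proposes, would change which ions are added to the list; that step is not shown here.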
Mass Spectra Prediction with Structural Motif-based Graph Neural Networks
Mass spectra, which are agglomerations of ionized fragments from targeted
molecules, play a crucial role across various fields for the identification of
molecular structures. A prevalent analysis method involves spectral library
searches, where unknown spectra are cross-referenced with a database. The
effectiveness of such search-based approaches, however, is restricted by the
scope of the existing mass spectra database, underscoring the need to expand
the database via mass spectra prediction. In this research, we propose the
Motif-based Mass Spectrum Prediction Network (MoMS-Net), a system that predicts
mass spectra using the information derived from structural motifs and the
implementation of Graph Neural Networks (GNNs). We have tested our model across
diverse mass spectra and have observed its superiority over other existing
models. MoMS-Net considers substructure at the graph level, which facilitates
the incorporation of long-range dependencies while using less memory compared
to the graph transformer model. Comment: 19 pages, 3 figures
Harvest: an open-source tool for the validation and improvement of peptide identification metrics and fragmentation exploration
Background: Protein identification using mass spectrometry is an important tool in many areas of the life sciences, and in proteomics research in particular. Increasing the number of proteins correctly identified is dependent on the ability to include new knowledge about the mass spectrometry fragmentation process into computational algorithms designed to separate true matches of peptides to unidentified mass spectra from spurious matches. This discrimination is achieved by computing a function of the various features of the potential match between the observed and theoretical spectra to give a numerical approximation of their similarity. It is these underlying "metrics" that determine the ability of a protein identification package to maximise correct identifications while limiting false discovery rates. There is currently no software available specifically for the simple implementation and analysis of arbitrary novel metrics for peptide matching and for the exploration of fragmentation patterns for a given dataset.
Results: We present Harvest: an open source software tool for analysing fragmentation patterns and assessing the power of a new piece of information about the MS/MS fragmentation process to more clearly differentiate between correct and random peptide assignments. We demonstrate this functionality using data metrics derived from the properties of individual datasets in a peptide identification context. Using Harvest, we demonstrate how the development of such metrics may improve correct peptide assignment confidence in the context of a high-throughput proteomics experiment and characterise properties of peptide fragmentation.
Conclusions: Harvest provides a simple framework in C++ for analysing and prototyping metrics for peptide matching, the core of the protein identification problem. It is not a protein identification package and answers a different research question to packages such as Sequest, Mascot, X!Tandem, and other protein identification packages. It does not aim to maximise the number of assigned peptides from a set of unknown spectra, but instead provides a method by which researchers can explore fragmentation properties and assess the power of novel metrics for peptide matching in the context of a given experiment. Metrics developed using Harvest may then become candidates for later integration into protein identification packages.
A nonparametric model for quality control of database search results in shotgun proteomics
Background: Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, it remains unclear how to combine different database search scores to improve the sensitivity of randomized database methods.
Results: In this paper, a multivariate nonlinear discriminant function (DF) based on the multivariate nonparametric density estimation technique was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets from different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on 3 complex sample datasets and 3 control datasets.
Conclusion: Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.
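As background for the randomized-database methods mentioned in this abstract, the sketch below shows one common way to estimate an FPR from decoy hits at a score threshold. It is not the paper's nonparametric discriminant function. It assumes a decoy database of the same size as the target database, so that decoy hits above the threshold approximate the number of false target hits; the function name and scores are hypothetical.

```python
def estimate_fpr(target_scores, decoy_scores, threshold):
    """Estimate the false positive rate among target hits above `threshold`.

    Assumption: target and decoy databases are equally large, so the count
    of decoy hits passing the threshold approximates the count of false
    target hits passing it.
    """
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing can be a false positive
    return decoys / targets


# Hypothetical single-score example: 4 target hits and 1 decoy hit pass.
fpr = estimate_fpr([5.0, 4.0, 3.0, 2.0, 1.0], [2.5, 1.0, 0.5], threshold=2.0)
print(fpr)
```

This estimate uses a single score. The abstract's point is that combining several search scores in one discriminant function can improve sensitivity at the same FPR.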
HypoRiPPAtlas as an Atlas of hypothetical natural products for mass spectrometry database search
Recent analyses of public microbial genomes have found over a million biosynthetic gene clusters, the natural products of the majority of which remain
unknown. Additionally, GNPS harbors billions of mass spectra of natural products without known structures and biosynthetic genes. We bridge the gap
between large-scale genome mining and mass spectral datasets for natural
product discovery by developing HypoRiPPAtlas, an Atlas of hypothetical
natural product structures, which is ready-to-use for in silico database search
of tandem mass spectra. HypoRiPPAtlas is constructed by mining genomes
using seq2ripp, a machine-learning tool for the prediction of ribosomally
synthesized and post-translationally modified peptides (RiPPs). In HypoRiPPAtlas, we identify RiPPs in microbes and plants. HypoRiPPAtlas could be
extended to other natural product classes in the future by implementing
corresponding biosynthetic logic. This study paves the way for large-scale
explorations of biosynthetic pathways and chemical structures of microbial
and plant RiPP classes.
Score regularization for peptide identification
Background: Peptide identification from tandem mass spectrometry (MS/MS) data is one of the most important problems in computational proteomics. This technique relies heavily on the accurate assessment of the quality of peptide-spectrum matches (PSMs). However, current MS technology and PSM scoring algorithms are far from perfect, leading to the generation of incorrect peptide-spectrum pairs. Thus, it is critical to develop new post-processing techniques that can effectively distinguish true identifications from false identifications.
Results: In this paper, we present a consistency-based PSM re-ranking method to improve the initial identification results. This method uses one additional assumption: two peptides belonging to the same protein should be correlated with each other. We formulate an optimization problem that embraces two objectives through regularization: the smoothing consistency among scores of correlated peptides and the fitting consistency between new scores and initial scores. This optimization problem can be solved analytically. The experimental study on several real MS/MS data sets shows that this re-ranking method improves the identification performance.
Conclusions: The score regularization method can be used as a general post-processing step for improving peptide identifications. Source code and data sets are available at: http://bioinformatics.ust.hk/SRPI.rar.
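The two objectives named in this abstract match the shape of a standard graph-regularization problem, which has a closed-form solution. The derivation below is a generic sketch under assumed notation, not the paper's exact formulation: $y$ holds the initial scores, $f$ the new scores, $w_{ij} = w_{ji} \ge 0$ is an assumed correlation weight between peptides $i$ and $j$, and $\mu > 0$ is an assumed trade-off parameter.

```latex
% Smoothing term (correlated peptides get similar scores)
% plus fitting term (new scores stay close to initial scores):
\min_{f}\; \frac{1}{2}\sum_{i,j} w_{ij}\,(f_i - f_j)^2 \;+\; \mu \sum_{i} (f_i - y_i)^2
\;=\; \min_{f}\; f^{\top} L f + \mu\,\lVert f - y\rVert^2,
\qquad L = D - W,\quad D_{ii} = \sum_{j} w_{ij}.

% Setting the gradient to zero:
2Lf + 2\mu\,(f - y) = 0
\;\Longrightarrow\; (L + \mu I)\,f = \mu\,y
\;\Longrightarrow\; f^{\star} = \mu\,(L + \mu I)^{-1}\,y.
```

Because $L$ is positive semidefinite and $\mu > 0$, the matrix $L + \mu I$ is invertible, so the minimizer is unique. A larger $\mu$ keeps $f^{\star}$ closer to the initial scores.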
Algorithms for Glycan Structure Identification with Tandem Mass Spectrometry
Glycosylation is a frequently observed post-translational modification (PTM) of proteins. It has been estimated that over half of eukaryotic proteins in nature are glycoproteins. Glycoprotein analysis plays a vital role in drug preparation. Thus, characterization of glycans that are linked to proteins has become necessary in glycoproteomics. Mass spectrometry has become an effective analytical technique for glycoproteomics analysis because of its high throughput and sensitivity. The large amount of spectral data collected in a mass spectrometry experiment makes manual interpretation impossible and requires effective computational approaches for automated analysis. Different algorithmic solutions have been proposed to address the challenges in glycoproteomics analysis based on mass spectrometry. However, new algorithms that can identify intact glycopeptides are still needed to improve result accuracy.
In this research, a glycan is represented as a rooted unordered labelled tree, and we focus on developing effective algorithms to determine glycan structures from tandem mass spectra. Interpreting the tandem mass spectra of glycopeptides with a de novo sequencing method is essential to identifying novel glycan structures. Thus, we mathematically formulate the glycan de novo sequencing problem and propose a heuristic algorithm for glycan de novo sequencing from HCD tandem mass spectra of glycopeptides.
Characterizing glycans from MS/MS with a de novo sequencing method requires high-quality mass spectra for accurate results. The database search method can usually obtain more reliable results because it is assisted by glycan structural information. Thus, we propose a de novo sequencing assisted database search method, GlycoNovoDB, for mass spectra interpretation.
Protein Sequences Identification using NM-tree
We have generalized a method for tandem mass spectra interpretation based on the parameterized Hausdorff distance dHP. Instead of just peptides (short pieces of proteins), in this paper we describe the interpretation of whole protein sequences. For this purpose, we employ the recently introduced NM-tree to index the database of hypothetical mass spectra for exact or fast approximate search. The NM-tree combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. A scheme for protein sequence identification using the NM-tree is proposed.
Molecular and Microbial Microenvironments in Chronically Diseased Lungs Associated with Cystic Fibrosis.
To visualize the personalized distributions of pathogens and chemical environments, including microbial metabolites, pharmaceuticals, and their metabolic products, within and between human lungs afflicted with cystic fibrosis (CF), we generated three-dimensional (3D) microbiome and metabolome maps of six explanted lungs from three cystic fibrosis patients. These 3D spatial maps revealed that the chemical environments differ between patients and within the lungs of each patient. Although the microbial ecosystems of the patients were defined by the dominant pathogen, their chemical diversity was not. Additionally, the chemical diversity between locales in the lungs of the same individual sometimes exceeded interindividual variation. Thus, the chemistry and microbiome of the explanted lungs appear to be not only personalized but also regiospecific. Previously undescribed analogs of microbial quinolones and antibiotic metabolites were also detected. Furthermore, mapping the chemical and microbial distributions allowed visualization of microbial community interactions, such as increased production of quorum sensing quinolones in locations where Pseudomonas was in contact with Staphylococcus and Granulicatella, consistent with in vitro observations of bacteria isolated from these patients. Visualization of microbe-metabolite associations within a host organ in early-stage CF disease in animal models will help elucidate the complex interplay between the presence of a given microbial structure, antibiotics, metabolism of antibiotics, microbial virulence factors, and host responses.
IMPORTANCE: Microbial infections are now recognized to be polymicrobial and personalized in nature. Comprehensive analysis and understanding of the factors underlying the polymicrobial and personalized nature of infections remain limited, especially in the context of the host. By visualizing microbiomes and metabolomes of diseased human lungs, we reveal how different the chemical environments are between hosts that are dominated by the same pathogen and how community interactions shape the chemical environment or vice versa. We highlight that three-dimensional organ mapping methods represent hypothesis-building tools that allow us to design mechanistic studies aimed at addressing microbial responses to other microbes, the host, and pharmaceutical drugs.