5 research outputs found
Systematic Evaluation of Protein Sequence Filtering Algorithms for Proteoform Identification Using Top-Down Mass Spectrometry
Complex proteoforms contain various primary structural alterations resulting from variations in genes, RNA, and proteins. Top-down mass spectrometry is commonly used for analyzing complex proteoforms because it provides whole sequence information of the proteoforms. Proteoform identification by top-down mass spectral database search is a challenging computational problem because the types and/or locations of some alterations in target proteoforms are in general unknown. Although spectral alignment and mass graph alignment algorithms have been proposed for identifying proteoforms with unknown alterations, they are extremely slow when aligning millions of spectra against tens of thousands of protein sequences in high-throughput proteome-level analyses. Many software tools in this area combine efficient protein sequence filtering algorithms and spectral alignment algorithms to speed up database search. As a result, the performance of these tools relies heavily on the sensitivity and efficiency of their filtering algorithms. Here, we propose two efficient approximate spectrum-based filtering algorithms for proteoform identification. We evaluated the performance of the proposed algorithms and four existing ones on simulated and real top-down mass spectrometry data sets. Experiments showed that the proposed algorithms outperformed the existing ones for complex proteoform identification. In addition, combining the proposed filtering algorithms and mass graph alignment algorithms identified many proteoforms missed by ProSightPC in proteome-level proteoform analyses.
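To make the idea of sequence filtering concrete, the following is a minimal sketch of one generic filter, not the algorithms proposed in the paper. It assumes the query spectrum has already been reduced to a list of prefix-fragment masses, and it ranks each protein by the largest number of query masses that a single unknown mass shift can explain. The function names (`shift_score`, `filter_candidates`), the bin width, and the toy database are all hypothetical choices for illustration.

```python
from collections import Counter

# Monoisotopic residue masses in Da for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259,
}


def prefix_masses(seq):
    """Cumulative residue masses of every prefix of the sequence."""
    total, masses = 0.0, []
    for aa in seq:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def shift_score(query_masses, seq, bin_width=0.1):
    """Largest number of (query, prefix) pairs sharing one binned mass shift.

    Illustrative scoring only: a real filter would also handle suffix
    fragments, several shifts, and error tolerances at bin boundaries.
    """
    protein_masses = prefix_masses(seq)
    counts = Counter()
    for q in query_masses:
        for p in protein_masses:
            counts[round((q - p) / bin_width)] += 1
    return max(counts.values()) if counts else 0


def filter_candidates(query_masses, database, top_k=1):
    """Return the names of the top_k proteins ranked by shift_score."""
    ranked = sorted(
        database,
        key=lambda name: shift_score(query_masses, database[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy example: a proteoform of P1 carrying one unknown +42.0106 Da shift.
database = {"P1": "GASPVT", "P2": "LNDKE", "P3": "TTTT"}
query = [m + 42.0106 for m in prefix_masses(database["P1"])]
candidates = filter_candidates(query, database, top_k=1)
```

Only the proteins that survive such a filter would then be passed to a slower spectral alignment or mass graph alignment step.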
Gapped Spectral Dictionaries and Their Applications for Database Searches of Tandem Mass Spectra
Generating all plausible de novo interpretations of a peptide tandem mass (MS/MS) spectrum (Spectral Dictionary) and quickly matching them against the database represent a recently emerged alternative approach to peptide identification. However, the sizes of the Spectral Dictionaries grow quickly with peptide length, making their generation impractical for long peptides. We introduce Gapped Spectral Dictionaries (all plausible de novo interpretations with gaps), which can be easily generated for any peptide length, thus addressing the limitation of the Spectral Dictionary approach. We show that Gapped Spectral Dictionaries are small, which opens the possibility of using them to speed up MS/MS searches. Our MS-GappedDictionary algorithm (based on Gapped Spectral Dictionaries) enables proteogenomics applications (such as searches in the six-frame translation of the human genome) that are prohibitively time-consuming with existing approaches. MS-GappedDictionary generates gapped peptides that occupy a niche between accurate but short peptide sequence tags and long but inaccurate full-length peptide reconstructions. We show that, contrary to conventional wisdom, some high-quality spectra do not have good peptide sequence tags, and we introduce gapped tags that have advantages over conventional peptide sequence tags in MS/MS database searches.
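The six-frame translation mentioned above can be sketched as follows. This is a generic illustration of how a nucleotide sequence expands into six candidate protein sequences, using the standard genetic code; it is not taken from MS-GappedDictionary, and the function names are hypothetical. It assumes the input contains only the letters A, C, G, and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

Because every stretch of genome yields six translations, the search space is far larger than an annotated protein database, which is why fast matching matters for this application.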
Complex Proteoform Identification Using Top-Down Mass Spectrometry
Indiana University-Purdue University Indianapolis (IUPUI). Proteoforms are distinct protein molecule forms created by variations in genes, gene expression, and other biological processes. Many proteoforms contain multiple primary structural alterations, including amino acid substitutions, terminal truncations, and post-translational modifications. These primary structural alterations play a crucial role in determining protein functions: proteoforms from the same protein with different alterations may exhibit different functional behaviors. Because top-down mass spectrometry directly analyzes intact proteoforms and provides complete sequence information of proteoforms, it has become the method of choice for the identification of complex proteoforms. Although instruments and experimental protocols for top-down mass spectrometry have been advancing rapidly in the past several years, many computational problems in this area remain unsolved, and the development of software tools for analyzing such data is still at a very early stage. In this dissertation, we propose several novel algorithms for challenging computational problems in proteoform identification by top-down mass spectrometry. First, we present two approximate spectrum-based protein sequence filtering algorithms that quickly find a small number of candidate proteins from a large proteome database for a query mass spectrum. Second, we describe mass graph-based alignment algorithms that efficiently identify proteoforms with variable post-translational modifications and/or terminal truncations. Third, we propose a Markov chain Monte Carlo method for estimating the statistical significance of identified proteoform spectrum matches. They are the first efficient algorithms that take into account three types of alterations: variable post-translational modifications, unexpected alterations, and terminal truncations in proteoform identification. As a result, they are more sensitive and powerful than other existing methods that consider only one or two of the three types of alterations. All the proposed algorithms have been incorporated into TopMG, a complete software pipeline for complex proteoform identification. Experimental results showed that TopMG yields significantly more identifications than other existing methods in proteome-level top-down mass spectrometry studies. TopMG will facilitate the applications of top-down mass spectrometry in many areas, such as the identification and quantification of clinically relevant proteoforms and the discovery of new proteoform biomarkers. 2019-06-2
Bioinformatics methods for annotating genomes using proteomic data
In recent years the number of genome sequencing projects has been increasing exponentially, leaving genome annotation dependent primarily upon automated tools. Recently, proteogenomics studies have attempted to bridge the gap between genomics and proteomics by actively using proteomic data during the annotation stage. This project attempts to address some limitations in current bioinformatics approaches, such as the identification of N-terminal peptides and of peptides spanning across exons (so-called intron-spanning peptides, ISPs). Additionally, it presents approaches for determining the quality of gene models. The results provide insights on the N-terminus of proteins (identification strategies, modifications), quality assessment of available gene annotation, and performance of gene finders. A new method has also been developed for the identification of ISPs and, although this technique remains challenging, it provides a framework in which future developments can be made.
Developing a bioinformatics framework for proteogenomics
In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Through advances in proteomics a new field has emerged to help improve genome annotation, termed proteogenomics, which uses peptide mass spectrometry data, enabling the discovery of novel protein coding genes, as well as the refinement and validation of known and putative protein-coding genes.
The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although this method provides insights into the gene content of genomes it is unable to distinguish protein-coding genes from putative non-coding RNA genes. This problem is further confounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence.
This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications and how best to control for false discoveries.
This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied within four case studies: a prokaryote bacterium; a monocotyledonous wheat plant; a dicotyledonous grape plant; and human. The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (nitrogen-fixing bacteria); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements for the studies conducted in this thesis and proteogenomics as a whole have been identified and discussed.
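The abstract above names control of false discoveries as a key challenge but does not say which method the thesis uses. As an illustration only, the sketch below shows the widely used target-decoy estimate, which is an assumption here and not a description of the thesis methodology. Each peptide-spectrum match is a hypothetical `(score, is_decoy)` pair, and the function returns the lowest score cutoff at which the estimated false discovery rate (decoys divided by targets) stays within a chosen limit. Tied scores are not treated specially in this sketch.

```python
def fdr_threshold(matches, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR is at most max_fdr.

    matches: iterable of (score, is_decoy) pairs, higher score is better.
    The FDR at a cutoff is estimated as decoys / targets among all matches
    scoring at or above it. Returns None if no cutoff qualifies.
    """
    ranked = sorted(matches, key=lambda match: match[0], reverse=True)
    targets = decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best_cutoff = score
    return best_cutoff


# Toy example with hypothetical scores: True marks a decoy match.
toy_matches = [
    (10, False), (9, False), (8, False),
    (7, True), (6, False), (5, True),
]
cutoff = fdr_threshold(toy_matches, max_fdr=0.3)
```

In the toy data, accepting everything scoring 6 or higher gives one decoy against four targets, an estimated rate of 0.25. The inflated search spaces mentioned in the abstract make such estimates harder, because more putative sequences raise the chance of high-scoring random matches.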