Indiana University-Purdue University Indianapolis (IUPUI)

Proteoforms are distinct protein molecule forms created by variations in genes, gene
expression, and other biological processes. Many proteoforms contain multiple primary
structural alterations, including amino acid substitutions, terminal truncations, and post-translational
modifications. These primary structural alterations play a crucial role in
determining protein functions: proteoforms from the same protein with different alterations
may exhibit different functional behaviors. Because top-down mass spectrometry directly
analyzes intact proteoforms and provides complete sequence information of proteoforms, it
has become the method of choice for the identification of complex proteoforms. Although
instruments and experimental protocols for top-down mass spectrometry have been advancing
rapidly in the past several years, many computational problems in this area remain
unsolved, and the development of software tools for analyzing such data is still at a very
early stage. In this dissertation, we propose several novel algorithms for challenging computational
problems in proteoform identification by top-down mass spectrometry. First, we
present two approximate spectrum-based protein sequence filtering algorithms that quickly
find a small number of candidate proteins from a large proteome database for a query mass
spectrum. Second, we describe mass graph-based alignment algorithms that efficiently identify
proteoforms with variable post-translational modifications and/or terminal truncations.
Third, we propose a Markov chain Monte Carlo method for estimating the statistical
significance of identified proteoform spectrum matches. These are the first efficient algorithms
that take into account three types of alterations: variable post-translational modifications,
unexpected alterations, and terminal truncations in proteoform identification. As a result,
they are more sensitive and powerful than other existing methods that consider only one
or two of the three types of alterations. All the proposed algorithms have been incorporated
into TopMG, a complete software pipeline for complex proteoform identification.
Experimental results showed that TopMG yields significantly more identifications
than other existing methods in proteome-level top-down mass spectrometry studies.
TopMG will facilitate the application of top-down mass spectrometry in many areas, such
as the identification and quantification of clinically relevant proteoforms and the discovery
of new proteoform biomarkers.

2019-06-2