Arithmetic Combinations of Submodular and Supermodular Optimization and Submodular Generalized Matching for Peptide Identification in Tandem Mass Spectrometry

Abstract

Thesis (Ph.D.)--University of Washington, 2023

Submodular functions have recently shown utility for a number of machine learning applications such as information gathering, document summarization, image segmentation, and string alignment, since they are natural for modeling concepts such as diversity, information, and representativeness. Submodular optimization problems are widely studied under different scenarios, such as submodular minimization without constraints or submodular maximization under a cardinality constraint. However, in real-world applications, the objective function is usually not a simple submodular (or supermodular) function but is naturally written as an arithmetic combination of submodular and/or supermodular functions. In the first part of this thesis, we study the properties of a broad range of arithmetic combinations of submodular and supermodular functions and how to optimize them. These include sums f_1+g_1, ratios f_1/f_2, f_1/g_1, g_1/f_1, g_1/g_2, products f_1*f_2, and p-norms f_1^p+f_2^p, where f and g denote submodular and supermodular functions, respectively. We study the novel optimization problems on these non-submodular functions and propose algorithms that achieve tight approximation guarantees under various constraints. This greatly expands the study of submodular and non-submodular optimization and draws a rather complete picture of optimizing combinations of submodular and supermodular functions.

In the second part of this thesis, we focus on a biological application: identifying spectra produced by a shotgun proteomics mass spectrometry experiment using submodular generalized matchings. Identification is commonly performed by searching the observed spectra against a peptide database. The heart of this search procedure is a score function that evaluates the quality of a hypothesized match between an observed spectrum and a theoretical spectrum corresponding to a particular peptide sequence. Accordingly, the success of a spectrum analysis pipeline depends critically upon this peptide-spectrum score function. We develop peptide-spectrum score functions that compute the maximum value of a submodular function under m matroid constraints. We call this procedure a submodular generalized matching (SGM) since it generalizes bipartite matching. We use a greedy algorithm for the maximization, which yields a solution whose objective value is guaranteed to be at least 1/(1+m) of the true optimum. The advantage of the SGM framework is that known long-range properties of experimental spectra can be modeled by designing suitable submodular functions and matroid constraints. Experiments on four data sets from various organisms and mass spectrometry platforms show that the SGM approach leads to significantly improved performance compared to several state-of-the-art methods.
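To make the greedy step concrete, the following is a minimal sketch of greedy maximization of a submodular function under m matroid constraints, not the thesis's actual score function. It assumes the objective is monotone and nonnegative, and that each matroid is supplied as an independence oracle. The toy peak/ion names, the edge weights, the square-root objective, and all function names are hypothetical and chosen only for illustration. With m = 2 partition matroids (each peak and each ion used at most once), the feasible sets are exactly the bipartite matchings.

```python
import math

def greedy_matroid_max(ground, f, oracles):
    """Greedily build a set that is independent in every matroid.

    ground  : iterable of sortable elements
    f       : set function, assumed monotone, nonnegative, and submodular
    oracles : one independence test per matroid (set -> bool)
    """
    selected = set()
    remaining = set(ground)
    while remaining:
        base = f(selected)
        best, best_gain = None, 0.0
        for e in sorted(remaining):  # sorted only for deterministic ties
            candidate = selected | {e}
            if not all(independent(candidate) for independent in oracles):
                continue  # adding e would violate some matroid
            gain = f(candidate) - base
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:  # no feasible element improves the objective
            break
        selected.add(best)
        remaining.discard(best)
    return selected

def at_most_one_per(key_index):
    """Partition matroid: no two chosen edges share the endpoint at key_index."""
    def independent(edges):
        keys = [e[key_index] for e in edges]
        return len(keys) == len(set(keys))
    return independent

# Hypothetical toy instance: edges are (observed peak, theoretical ion) pairs.
WEIGHTS = {
    ("p1", "a"): 4.0,
    ("p1", "b"): 3.0,
    ("p2", "a"): 3.0,
    ("p2", "b"): 1.0,
}

def score(edges):
    # Concave function of a nonnegative modular sum: monotone submodular.
    return math.sqrt(sum(WEIGHTS[e] for e in edges))

oracles = [at_most_one_per(0), at_most_one_per(1)]  # m = 2 matroids
matching = greedy_matroid_max(WEIGHTS, score, oracles)
optimum = {("p1", "b"), ("p2", "a")}  # best matching, found by inspection
print(sorted(matching), score(matching), score(optimum))
```

On this instance the greedy choice is not optimal: it commits to the heaviest edge first and is then forced into a weaker second edge. Its score still sits well above the 1/(1+m) = 1/3 fraction of the optimum that the guarantee promises.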
