Biomedical Data Analysis with Prior Knowledge : Modeling and Learning
Modern research in biology and medicine is experiencing a data explosion in quantity and particularly in complexity. Efficient and accurate processing of these datasets demands state-of-the-art computational methods such as probabilistic graphical models, graph-based image analysis and many inference/optimization algorithms. However, the underlying complexity of biomedical experiments rules out direct out-of-the-box application of these methods and requires novel formulations and enhancements to make them amenable to specific problems. This thesis explores novel approaches for incorporating prior knowledge into the data analysis workflow that lead to quantitative and meaningful interpretation of the datasets and also allow for sufficient user involvement. As discussed in Chapter 1, depending on the complexity of the prior knowledge, these approaches can be categorized as constrained modeling and learning. The first part of the thesis focuses on constrained modeling, where the prior is normally represented explicitly as additional potential terms in the problem formulation. These terms prevent or discourage the downstream optimization of the formulation from yielding solutions that contradict the prior knowledge. In Chapter 2, we present a robust method for estimating and tracking deuterium incorporation in time-resolved hydrogen exchange (HX) mass spectrometry (MS) experiments with priors such as sparsity and sequential ordering. In Chapter 3, we show how to extend a classic Markov random field (MRF) model with a shape prior for cell nucleus segmentation. The second part of the thesis explores learning, which addresses problems where the prior varies between datasets or is too difficult to express explicitly. In this case, the prior is first abstracted as a parametric model and then its optimum parametrization is estimated from a training set using machine learning techniques.
In Chapter 4, we extend the popular Rand Index in a cost-sensitive fashion, where the problem-specific costs can be learned from manual scorings. This set of approaches becomes more interesting when the input/output is structured, such as matrices or graphs. In Chapter 5, we present structured learning for cell tracking, a novel approach that learns optimum parameters automatically from a training set and allows for the use of a richer set of features, which in turn affords improved tracking performance. Finally, conclusions and outlook are provided in Chapter 6.
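To illustrate the idea behind a cost-sensitive Rand Index, the following is a minimal sketch and not the formulation used in the thesis. It computes the standard Rand Index from pair counts and then a weighted pairwise error in which the two kinds of disagreement carry separate costs. The function names and the parameters `split_cost` and `merge_cost` are assumptions made for this example; in the thesis such costs are learned from manual scorings, whereas here they are simply passed in.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count how every pair of items is treated by two labelings."""
    both_same = both_diff = split = merge = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_p = pred[i] == pred[j]
        if same_t and same_p:
            both_same += 1
        elif not same_t and not same_p:
            both_diff += 1
        elif same_t:
            split += 1  # together in truth, separated in prediction
        else:
            merge += 1  # separated in truth, together in prediction
    return both_same, both_diff, split, merge


def rand_index(truth, pred):
    """Standard Rand Index: fraction of pairs on which both labelings agree."""
    a, b, s, m = pair_counts(truth, pred)
    return (a + b) / (a + b + s + m)


def weighted_rand_error(truth, pred, split_cost=1.0, merge_cost=1.0):
    """Hypothetical cost-sensitive variant: weight the two error types separately."""
    a, b, s, m = pair_counts(truth, pred)
    return (split_cost * s + merge_cost * m) / (a + b + s + m)


truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]  # the second ground-truth segment is split in two
print(rand_index(truth, pred))
print(weighted_rand_error(truth, pred, split_cost=2.0))
```

With both costs set to 1, the weighted error equals one minus the Rand Index. Both functions assume at least two items, since otherwise there are no pairs to compare.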
Arithmetic Combinations of Submodular and Supermodular Optimization and Submodular Generalized Matching for Peptide Identification in Tandem Mass Spectrometry
Thesis (Ph.D.)--University of Washington, 2023
Submodular functions have recently shown utility for a number of machine learning applications such as information gathering, document summarization, image segmentation, and string alignment, since they are natural for modeling concepts such as diversity, information, and representativeness. Submodular optimization problems are widely studied under different scenarios, such as submodular minimization without constraints or submodular maximization under a cardinality constraint. However, in real-world applications, the objective function is usually not a simple submodular (or supermodular) function but is naturally written as an arithmetic combination of submodular and/or supermodular functions. In the first part of this thesis, we study the properties of a wide range of arithmetic combinations of submodular and supermodular functions and how to optimize them. These include sums f_1+g_1, quotients f_1/f_2, f_1/g_1, g_1/f_1, g_1/g_2, products f_1*f_2, and p-norms f_1^p+f_2^p, where f and g denote submodular and supermodular functions, respectively. We study the novel optimization problems on these non-submodular functions and propose algorithms that achieve tight approximation guarantees under various constraints. This greatly expands the study of submodular and non-submodular optimization and draws a rather complete picture of optimizing combinations of submodular and supermodular functions. In the second part of this thesis, we focus on a biological application: the identification of spectra produced by a shotgun proteomics mass spectrometry experiment, using submodular generalized matchings. This identification is commonly performed by searching the observed spectra against a peptide database. The heart of this search procedure is a score function that evaluates the quality of a hypothesized match between an observed spectrum and a theoretical spectrum corresponding to a particular peptide sequence.
Accordingly, the success of a spectrum analysis pipeline depends critically upon this peptide-spectrum score function. We develop peptide-spectrum score functions that compute the maximum value of a submodular function under m matroid constraints. We call this procedure a submodular generalized matching (SGM) since it generalizes bipartite matching. We use a greedy algorithm for the maximization, which yields a solution whose objective value is guaranteed to be at least 1/(1+m) of the true optimum. The advantage of the SGM framework is that known long-range properties of experimental spectra can be modeled by designing suitable submodular functions and matroid constraints. Experiments on four data sets from various organisms and mass spectrometry platforms show that the SGM approach leads to significantly improved performance compared to several state-of-the-art methods.
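The greedy maximization under matroid constraints described above can be sketched on a toy case; this is not the thesis's score function. The example assumes m = 2 partition matroids that together encode a bipartite matching (each left node and each right node used at most once) and a made-up monotone submodular objective, the square root of the total edge weight. All names (`greedy_max`, `is_matching`, `weights`) and all numbers are hypothetical.

```python
import math


def greedy_max(ground, f, independent):
    """Repeatedly add the feasible element with the largest positive marginal gain."""
    chosen = set()
    remaining = set(ground)
    while remaining:
        base = f(chosen)
        best, best_gain = None, 0.0
        for e in sorted(remaining):  # sorted for deterministic tie-breaking
            candidate = chosen | {e}
            if not independent(candidate):
                continue  # adding e would violate a matroid constraint
            gain = f(candidate) - base
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break  # no feasible element improves the objective
        chosen.add(best)
        remaining.discard(best)
    return chosen


def is_matching(edges):
    """Intersection of two partition matroids: every endpoint is used at most once."""
    left = [i for i, _ in edges]
    right = [j for _, j in edges]
    return len(set(left)) == len(left) and len(set(right)) == len(right)


# Hypothetical edge weights between two left nodes and two right nodes.
weights = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 2.0}


def objective(edges):
    """Concave function of a modular sum, which is monotone submodular."""
    return math.sqrt(sum(weights[e] for e in edges))


matching = greedy_max(weights.keys(), objective, is_matching)
print(sorted(matching), objective(matching))
```

With two matroid constraints, the 1/(1+m) bound stated in the abstract guarantees at least one third of the optimal value. On this small instance the greedy solution happens to coincide with the optimum.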