258 research outputs found

    Analysis and Simulation of Tandem Mass Spectrometry Data

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected that more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes) and their interactions (e.g., proteins binding together to form complex molecular machines), to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundance for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations of present algorithms, simulating mass spectrometry instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment.
Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community.
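    The chimeric-spectrum deconvolution described above can be sketched as a small non-negative least squares problem. The two isotope templates and the 70/30 mixture below are hypothetical numbers chosen for illustration, not real peptide distributions:

```python
import numpy as np
from scipy.optimize import nnls

# Columns: hypothetical theoretical isotope distributions of two co-fragmented
# peptides over the same m/z bins (each column sums to 1).
A = np.array([
    [0.60, 0.10],
    [0.25, 0.45],
    [0.10, 0.30],
    [0.05, 0.15],
])

# Observed chimeric intensities: an unknown non-negative mix of the templates.
b = 0.7 * A[:, 0] + 0.3 * A[:, 1]

# Non-negative least squares recovers each peptide's contribution.
coeffs, residual = nnls(A, b)   # coeffs -> [0.7, 0.3], residual ~ 0
```

    In the dissertation the model is built from derived (or approximated) isotope distributions of candidate fragments; here the templates are simply given.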

    masstodon: A Tool for Assigning Peaks and Modeling Electron Transfer Reactions in Top-Down Mass Spectrometry

    Top-down mass spectrometry methods are becoming increasingly popular in the effort to describe the proteome. They rely on the fragmentation of intact protein ions inside the mass spectrometer. Among the existing fragmentation methods, electron transfer dissociation (ETD) is known for its precision and wide coverage of different cleavage sites. However, several side reactions can occur under ETD conditions, including nondissociative electron transfer and proton transfer reaction. Evaluating their extent can provide more insight into reaction kinetics as well as instrument operation. Furthermore, preferential formation of certain reaction products can reveal important structural information. To the best of our knowledge, there are currently no tools capable of tracing and analyzing the products of these reactions in a systematic way. In this Article, we present in detail masstodon: a computer program for assigning peaks and interpreting mass spectra. Besides being a general-purpose tool, masstodon also offers the possibility to trace the products of reactions occurring under ETD conditions and provides insights into the parameters driving them. It is available free of charge under the GNU AGPL V3 public license.

    Topics in learning sparse and low-rank models of non-negative data

    Advances in information and measurement technology have led to a surge in prevalence of high-dimensional data. Sparse and low-rank modeling can both be seen as techniques of dimensionality reduction, which is essential for obtaining compact and interpretable representations of such data. In this thesis, we investigate aspects of sparse and low-rank modeling in conjunction with non-negative data or non-negativity constraints. The first part is devoted to the problem of learning sparse non-negative representations, with a focus on how non-negativity can be taken advantage of. We work out a detailed analysis of non-negative least squares regression, showing that under certain conditions sparsity-promoting regularization, the approach advocated paradigmatically over the past years, is not required. Our results have implications for problems in signal processing such as compressed sensing and spike train deconvolution. In the second part, we consider the problem of factorizing a given matrix into two factors of low rank, out of which one is binary. We devise a provably correct algorithm computing such factorization whose running time is exponential only in the rank of the factorization, but linear in the dimensions of the input matrix. Our approach is extended to noisy settings and applied to an unmixing problem in DNA methylation array analysis. On the theoretical side, we relate the uniqueness of the factorization to Littlewood-Offord theory in combinatorics.
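    The first part's central claim, that plain non-negative least squares can recover sparse signals without an explicit sparsity-promoting penalty, can be illustrated on a random underdetermined instance. The dimensions and the ground-truth support below are arbitrary choices for this sketch:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Underdetermined system: 25 measurements, 50 unknowns, non-negative design.
A = rng.uniform(0.0, 1.0, size=(25, 50))
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [2.0, 1.0, 0.5]   # sparse, non-negative ground truth
b = A @ x_true                          # noiseless measurements

# Plain NNLS: no l1 (sparsity-promoting) penalty anywhere.
x_hat, residual = nnls(A, b)
# For non-negative designs like this one, x_hat typically coincides with the
# sparse x_true, exhibiting the self-regularizing effect analyzed in the thesis.
```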


    Algorithms for integrated analysis of glycomics and glycoproteomics by LC-MS/MS

    The glycoproteome is an intricate and diverse component of a cell, and it plays a key role in the definition of the interface between that cell and the rest of its world. Methods for studying the glycoproteome have been developed for released glycan glycomics and site-localized bottom-up glycoproteomics using liquid chromatography-coupled mass spectrometry and tandem mass spectrometry (LC-MS/MS), which is itself a complex problem. Algorithms for interpreting these data are necessary to be able to extract biologically meaningful information in a high-throughput, automated context. Several existing solutions have been proposed but may be found lacking for larger glycopeptides, complex samples, different experimental conditions, or different instrument vendors, or simply because they ignore fundamentals of glycobiology. I present a series of open algorithms that approach the problem in an instrument-vendor-neutral, cross-platform fashion to address these challenges, and integrate key concepts from the underlying biochemical context into the interpretation process. In this work, I created a suite of deisotoping and charge-state deconvolution algorithms for processing raw mass spectra at an LC scale from a variety of instrument types. These tools performed better than previously published algorithms by enforcing the underlying chemical model more strictly, while maintaining a higher degree of signal fidelity. From this summarized, vendor-normalized data, I composed a set of algorithms for interpreting glycan profiling experiments that can be used to quantify glycan expression. From this I constructed a graphical method to model the active biosynthetic pathways of the sample glycome and dig deeper into those signals than would be possible from the raw data alone.
Lastly, I created a glycopeptide database search engine from these components that is capable of identifying the widest array of glycosylation types available, and I demonstrate a learning algorithm that tunes the model of glycopeptide fragmentation to specific experimental conditions, outperforming a simpler model by 10-15%. This approach can be further augmented with sample-wide or site-specific glycome models to increase depth of coverage for glycoforms consistent with prior beliefs.
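    The charge-state deconvolution step mentioned above rests on a simple mass relation: collapsing an ion observed at a given m/z and charge onto a neutral mass scale. A minimal generic sketch follows; the example mass is made up, and only the proton mass is a standard constant:

```python
PROTON = 1.007276  # proton mass in Da

def neutral_mass(mz: float, z: int) -> float:
    """Neutral mass of a positive-mode ion observed at m/z `mz` with charge z
    (the ion carries z extra protons)."""
    return z * mz - z * PROTON

def expected_mz(mass: float, z: int) -> float:
    """Inverse mapping: where a neutral mass should appear at charge z."""
    return (mass + z * PROTON) / z

# A hypothetical glycopeptide of neutral mass 3000 Da:
# expected_mz(3000.0, 2) -> 1501.007276
# expected_mz(3000.0, 3) -> 1001.007276
```

    Real deisotoping additionally groups isotope peaks per charge state before this collapse; that logic is omitted here.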

    Peak picking and map alignment

    We study two fundamental processing steps in mass spectrometric data analysis from a theoretical and practical point of view. For the detection and extraction of mass spectral peaks we developed an efficient peak picking algorithm that is independent of the underlying machine or ionization method, and is able to resolve highly convoluted and asymmetric signals. The method uses the multiscale nature of spectrometric data by first detecting the mass peaks in the wavelet-transformed signal before a given asymmetric peak function is fitted to the raw data. In two optional stages, highly overlapping peaks can be separated or all peak parameters can be further improved using techniques from nonlinear optimization. In contrast to currently established techniques, our algorithm is able to separate overlapping peaks of multiply charged peptides in LC-ESI-MS data of low resolution. Furthermore, applied to high-quality MALDI-TOF spectra it yields a high degree of accuracy and precision and compares very favorably with the algorithms supplied by the vendor of the mass spectrometers. On the high-resolution MALDI spectra as well as on the low-resolution LC-MS data set, our algorithm achieves a fast runtime of only a few seconds. Another important processing step that can be found in every typical protocol for label-free quantification is the combination of results from multiple LC-MS experiments to improve confidence in the obtained measurements or to compare results from different samples. To do so, a multiple alignment of the LC-MS maps needs to be estimated. The alignment has to correct for variations in mass and elution time which are present in all mass spectrometry experiments. For the first time, we formally define the multiple LC-MS raw and feature map alignment problem using our own distance function for LC-MS maps. Furthermore, we present a solution to this problem. Our novel algorithm aligns LC-MS samples and matches corresponding ion species across samples.
In a first step, it uses an adapted pose clustering approach to efficiently superimpose raw maps as well as feature maps. This is done in a star-wise manner, where the elements of all maps are transformed onto the coordinate system of a reference map. To detect and combine corresponding features in multiple feature maps into a so-called consensus map, we developed an additional step based on techniques from computational geometry. We show that our alignment approach is fast and reliable as compared to five other alignment approaches. Furthermore, we prove its robustness in the presence of noise and its ability to accurately align samples with only a few common ion species.
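    The core idea of the peak picker, detecting mass peaks in the wavelet-transformed signal before any model fitting, can be sketched with SciPy's off-the-shelf continuous-wavelet-transform peak finder on a synthetic spectrum. This shows only the detection stage; the thesis method additionally fits asymmetric peak functions and separates overlaps:

```python
import numpy as np
from scipy.signal import find_peaks_cwt

# Synthetic "spectrum": two Gaussian peaks of width sigma = 8 bins.
x = np.arange(400)
signal = (np.exp(-0.5 * ((x - 100) / 8.0) ** 2)
          + 0.6 * np.exp(-0.5 * ((x - 300) / 8.0) ** 2))

# Ridge lines persisting across a range of wavelet scales mark peak positions.
peak_idx = find_peaks_cwt(signal, widths=np.arange(4, 20))
```

    Matching the `widths` range to the expected peak widths is the key tuning knob of this approach.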

    Statistical Methods in Metabolomics

    Metabolomics lies at the fulcrum of the systems biology ‘omics’. Metabolic profiling offers researchers new insight into genetic and environmental interactions, responses to pathophysiological stimuli and novel biomarker discovery. Metabolomics lacks the simplicity of a single data capturing technique; instead, increasingly sophisticated multivariate statistical techniques are required to tease out useful metabolic features from various complex datasets. In this work, two major metabolomics methods are examined: Nuclear Magnetic Resonance (NMR) Spectroscopy and Liquid Chromatography-Mass Spectrometry (LC-MS). MetAssimulo, a 1H-NMR metabolic-profile simulator, was developed in part by this author and is described in Chapter 2. Peak positional variation is a phenomenon occurring in NMR spectra that complicates metabolomic analysis, so Chapter 3 focuses on modelling the effect of pH on peak position. Analysis of LC-MS data is somewhat more complex given its 2-D structure, so I review existing pre-processing and feature detection techniques in Chapter 4 and then attempt to tackle the issue from a Bayesian viewpoint. A Bayesian Partition Model is developed to distinguish chromatographic peaks representing useful features from chemical and instrumental interference and noise. Another of the LC-MS pre-processing problems, data binning, is also explored as part of H-MS: a pre-processing algorithm incorporating wavelet smoothing and novel Gaussian and Exponentially Modified Gaussian peak detection. The performance of H-MS is compared with two existing pre-processing packages: apLC-MS and XCMS.
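    The Exponentially Modified Gaussian used for peak detection in H-MS is a Gaussian convolved with an exponential decay, a standard model for tailed chromatographic peaks. A sketch of its density with arbitrary illustrative parameters (the thesis's actual fitting procedure is not reproduced here):

```python
import numpy as np
from scipy.special import erfc

def emg_pdf(t, mu, sigma, lam):
    """Exponentially modified Gaussian: Gaussian(mu, sigma) convolved with an
    exponential of rate lam; the exponential tail skews the peak rightward."""
    arg = (mu + lam * sigma**2 - t) / (np.sqrt(2.0) * sigma)
    return (0.5 * lam
            * np.exp(0.5 * lam * (2.0 * mu + lam * sigma**2 - 2.0 * t))
            * erfc(arg))

dt = 0.01
t = np.arange(0.0, 60.0, dt)
pdf = emg_pdf(t, mu=10.0, sigma=1.0, lam=0.5)
area = pdf.sum() * dt        # ~1: a valid probability density
mean = (t * pdf).sum() * dt  # ~mu + 1/lam = 12: tailing shifts the centroid
```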

    Automated Analysis of Biomedical Data from Low to High Resolution

    Recent developments of experimental techniques and instrumentation allow life scientists to acquire enormous volumes of data at unprecedented resolution. While this new data brings much deeper insight into cellular processes, it renders manual analysis infeasible and calls for the development of new, automated analysis procedures. This thesis describes how methods of pattern recognition can be used to automate three popular data analysis protocols: Chapter 1 proposes a method to automatically locate bimodal isotope distribution patterns in Hydrogen Deuterium Exchange Mass Spectrometry experiments. The method is based on L1-regularized linear regression and allows for easy quantitative analysis of co-populations with different exchange behavior. The sensitivity of the method is tested on a set of manually identified peptides, while its applicability to exploratory data analysis is validated by targeted follow-up peptide identification. Chapter 2 develops a technique to automate peptide quantification for mass spectrometry experiments, based on 16O/18O labeling of peptides. Two different spectrum segmentation algorithms are proposed: one based on image processing and applicable to low resolution data and one exploiting the sparsity of high resolution data. The quantification accuracy is validated on calibration datasets, produced by mixing a set of proteins in pre-defined ratios. Chapter 3 provides a method for automated detection and segmentation of synapses in electron microscopy images of neural tissue. For images acquired by scanning electron microscopy with nearly isotropic resolution, the algorithm is based on geometric features computed in 3D pixel neighborhoods. For transmission electron microscopy images with poor z-resolution, the algorithm uses additional regularization by performing several rounds of pixel classification with features computed on the probability maps of the previous classification round. 
The validation is performed by comparing the set of synapses detected by the algorithm against a gold-standard detection by human experts. For data with nearly isotropic resolution, the algorithm's performance is comparable to that of the human experts.
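    Chapter 1's use of L1-regularized linear regression to flag bimodal isotope patterns can be sketched with a tiny iterative soft-thresholding (ISTA) lasso solver. The isotope envelope, the shift dictionary, and the 0.6/0.4 co-population weights below are all hypothetical; this is not the thesis implementation:

```python
import numpy as np

# Hypothetical isotope envelope; columns of A are the envelope at 4 mass shifts.
env = np.array([0.5, 0.3, 0.15, 0.05])
A = np.zeros((8, 4))
for j in range(4):
    A[j:j + len(env), j] = env

# A bimodal pattern: two exchange co-populations, at shifts 0 and 3.
b = 0.6 * A[:, 0] + 0.4 * A[:, 3]

def ista_lasso(A, b, alpha, iters=3000):
    """Minimize 0.5*||Ax-b||^2 + alpha*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - A.T @ (A @ x - b) / L        # gradient step on the quadratic part
        x = np.sign(g) * np.maximum(np.abs(g) - alpha / L, 0.0)  # shrink to zero
    return x

x = ista_lasso(A, b, alpha=1e-3)
# Two active coefficients (near 0.6 and 0.4) flag a bimodal distribution.
```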

    Integrating glycomics, proteomics and glycoproteomics to understand the structural basis for influenza a virus evolution and glycan mediated immune interactions

    Glycosylation modulates the range and specificity of interactions among glycoproteins and their binding partners. This is important in influenza A virus (IAV) biology because binding of host immune molecules depends on glycosylation of viral surface proteins such as hemagglutinin (HA). Circulating viruses mutate rapidly in response to pressure from the host immune system. As proteins mutate, the virus glycosylation patterns change. The consequence is that viruses evolve to evade host immune responses, which renders vaccines ineffective. Glycan biosynthesis is a non-template driven process, governed by stoichiometric and steric relationships between the enzymatic machinery for glycosylation and the protein being glycosylated. Consequently, protein glycosylation is heterogeneous, thereby making structural analysis and elucidation of precise biological functions extremely challenging. The lack of structural information has been a limiting factor in understanding the exact mechanisms of glycan-mediated interactions of the IAV with host immune-lectins. Genetic sequencing methods allow prediction of glycosylation sites along the protein backbone but are unable to provide exact phenotypic information regarding site occupancy. Crystallography methods are also unable to determine the glycan structures beyond the core residues due to the flexible nature of carbohydrates. This dissertation centers on the development of chromatography and mass spectrometry methods for characterization of site-specific glycosylation in complex glycoproteins and application of these methods to IAV glycomics and glycoproteomics. We combined the site-specific glycosylation information generated using mass spectrometry with information from biochemical assays and structural modeling studies to identify key glycosylation sites mediating interactions of HA with immune lectin surfactant protein-D (SP-D). 
We also identified the structural features that control glycan processing at these sites, particularly those involving glycan maturation from high-mannose to complex-type, which, in turn, regulate interactions with SP-D. The work presented in this dissertation contributes significantly to the improvement of analytical and bioinformatics methods in glycan and glycoprotein analysis using mass spectrometry and greatly advances the understanding of the structural features regulating glycan microheterogeneity on HA and its interactions with host immune lectins.

    Signal and image processing methods for imaging mass spectrometry data

    Imaging mass spectrometry (IMS) has evolved as an analytical tool for many biomedical applications. This thesis focuses on algorithms for the analysis of IMS data produced by a matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) mass spectrometer. IMS provides mass spectra acquired at a grid of spatial points that can be represented as hyperspectral data or a so-called datacube. Analysis of this large and complex data requires efficient computational methods for matrix factorization and for spatial segmentation. In this thesis, state-of-the-art processing methods are reviewed and compared, and improved versions are proposed. Mathematical models for peak shapes are reviewed and evaluated. A simulation model for MALDI-TOF is studied, expanded and developed into a simulator for 2D or 3D MALDI-TOF-IMS data. The simulation approach paves the way to statistical evaluation of algorithms for analysis of IMS data by providing a gold-standard dataset. [...]
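    Matrix factorization of an IMS datacube can be sketched with the classic Lee-Seung multiplicative updates for non-negative matrix factorization, a baseline of the kind such theses review rather than a specific method from this one. The toy cube below is flattened to a pixels-by-m/z matrix built from two latent components:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy datacube flattened to (pixels x m/z bins), generated from 2 spatial
# abundance maps (columns of W_true) and 2 spectral signatures (rows of H_true).
W_true = rng.uniform(size=(20, 2))
H_true = rng.uniform(size=(2, 30))
V = W_true @ H_true

# Lee-Seung multiplicative updates for V ~ W @ H with W, H >= 0.
rank = 2
W = rng.uniform(size=(20, rank))
H = rng.uniform(size=(rank, 30))
err0 = np.linalg.norm(V - W @ H)
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update spectral signatures
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update abundance maps
err = np.linalg.norm(V - W @ H)             # Frobenius error, non-increasing
```

    The multiplicative form preserves non-negativity automatically, which is why it is a common starting point for hyperspectral unmixing.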