6,247 research outputs found

    Multiplierz: An Extensible API Based Desktop Environment for Proteomics Data Analysis

    BACKGROUND. Efficient analysis of results from mass spectrometry-based proteomics experiments requires access to disparate data types, including native mass spectrometry files, output from algorithms that assign peptide sequence to MS/MS spectra, and annotation for proteins and pathways from various database sources. Moreover, proteomics technologies and experimental methods are not yet standardized; hence a high degree of flexibility is necessary for efficient support of high- and low-throughput data analytic tasks. Development of a desktop environment that is sufficiently robust for deployment in data analytic pipelines, and simultaneously supports customization for programmers and non-programmers alike, has proven to be a significant challenge. RESULTS. We describe multiplierz, a flexible and open-source desktop environment for comprehensive proteomics data analysis. We use this framework to expose a prototype version of our recently proposed common API (mzAPI) designed for direct access to proprietary mass spectrometry files. In addition to routine data analytic tasks, multiplierz supports generation of information-rich, portable spreadsheet-based reports. Moreover, multiplierz is designed around a "zero infrastructure" philosophy, meaning that it can be deployed by end users with little or no system administration support. Finally, access to multiplierz functionality is provided via high-level Python scripts, resulting in a fully extensible data analytic environment for rapid development of custom algorithms and deployment of high-throughput data pipelines. CONCLUSION. Collectively, mzAPI and multiplierz facilitate a wide range of data analysis tasks, spanning technology development to biological annotation, for mass spectrometry-based proteomics research. Funding: Dana-Farber Cancer Institute; National Human Genome Research Institute (P50HG004233); National Science Foundation Integrative Graduate Education and Research Traineeship grant (DGE-0654108).

    GAGrank: Software for Glycosaminoglycan Sequence Ranking using a Bipartite Graph Model

    The sulfated glycosaminoglycans (GAGs) are long, linear polysaccharide chains that are typically found as the glycan portion of proteoglycans. These GAGs are characterized by repeating disaccharide units with variable sulfation and acetylation patterns along the chain. GAG length and modification patterns have profound impacts on growth factor signaling mechanisms central to numerous physiological processes. Electron-activated dissociation tandem mass spectrometry is a very effective technique for assigning the structures of GAG saccharides; however, manual interpretation of the resulting complex tandem mass spectra is a difficult and time-consuming process that drives the development of computational methods for accurate and efficient sequencing. We have recently published GAGfinder, the first peak picking and elemental composition assignment algorithm specifically designed for GAG tandem mass spectra. Here, we present GAGrank, a novel network-based method for determining GAG structure using information extracted from tandem mass spectra using GAGfinder. GAGrank is based on Google's PageRank algorithm for ranking websites for search engine output. In particular, it is an implementation of BiRank, an extension of PageRank for bipartite networks. In our implementation, the two partitions comprise every possible sequence for a given GAG composition and the tandem MS fragments found using GAGfinder. Sequences are given a higher ranking if they link to many important fragments. Using the simulated annealing probabilistic optimization technique, we optimized GAGrank's parameters on ten training sequences. We then validated GAGrank's performance on three validation sequences. We also demonstrated GAGrank's ability to sequence isomeric mixtures using two mixtures at five different ratios.
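The bipartite ranking idea in this abstract can be illustrated with a minimal sketch of a BiRank-style iteration. This is not GAGrank's code: the function name `birank`, the uniform priors, the damping values of 0.85, and the toy edge matrix are all assumptions made for illustration, whereas the abstract states that GAGrank's actual parameters were tuned by simulated annealing.

```python
from math import sqrt


def birank(weights, alpha=0.85, beta=0.85, max_iter=200, tol=1e-12):
    """Rank both sides of a bipartite graph with a BiRank-style iteration.

    weights[i][j] is the non-negative edge weight between candidate
    sequence i and observed fragment j; 0 means "no edge".
    Returns (sequence_scores, fragment_scores).
    """
    n_seq, n_frag = len(weights), len(weights[0])
    seq_deg = [sum(row) for row in weights]
    frag_deg = [sum(weights[i][j] for i in range(n_seq)) for j in range(n_frag)]

    # Symmetric normalization: S[i][j] = w_ij / sqrt(deg_i * deg_j).
    norm = [
        [w / sqrt(seq_deg[i] * frag_deg[j]) if w else 0.0
         for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]

    # Uniform priors (an assumption made for this sketch).
    seq_prior = [1.0 / n_seq] * n_seq
    frag_prior = [1.0 / n_frag] * n_frag
    seq_score, frag_score = seq_prior[:], frag_prior[:]

    for _ in range(max_iter):
        # Fragments collect score from the sequences that explain them.
        new_frag = [
            alpha * sum(norm[i][j] * seq_score[i] for i in range(n_seq))
            + (1 - alpha) * frag_prior[j]
            for j in range(n_frag)
        ]
        # Sequences collect score from the fragments they link to.
        new_seq = [
            beta * sum(norm[i][j] * new_frag[j] for j in range(n_frag))
            + (1 - beta) * seq_prior[i]
            for i in range(n_seq)
        ]
        delta = max(abs(a - b) for a, b in zip(new_seq, seq_score))
        seq_score, frag_score = new_seq, new_frag
        if delta < tol:
            break
    return seq_score, frag_score


# Toy data: 3 hypothetical candidate sequences, 4 hypothetical fragments.
toy_weights = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
seq_scores, frag_scores = birank(toy_weights)
best = max(range(len(seq_scores)), key=seq_scores.__getitem__)
print("best candidate index:", best)
```

In this toy graph the first candidate links to three of the four fragments, so it receives the highest score, which mirrors the statement that sequences rank higher when they link to many important fragments.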

    Methods in automated glycosaminoglycan tandem mass spectra analysis

    Glycosylation is the process by which a glycan is enzymatically attached to a protein, and is one of the most common post-translational modifications in nature. One class of glycans is the glycosaminoglycans (GAGs), which are long, linear polysaccharides that are variably sulfated and make up the glycan portion of proteoglycans (PGs). PGs are located on the cellular surface and in the extracellular matrix (ECM), making them important molecules for cell signaling and ligand binding. The GAG sulfation sequence is a determining factor for the signaling capacity of binding complexes, so accurate determination of the sequence is critical. Historically, GAG sequencing using tandem mass spectrometry (MS2) has been a difficult, manual process; however, with the advent of faster computational techniques and higher-resolution MS2, high-throughput GAG sequencing is within reach. Two steps in the pipeline of biomolecule sequencing using MS2 are discovery and interpretation of spectral peaks. The discovery step traditionally is performed using methods that rely on the concept of averagine, or the average molecular building block for the analyte in question. These methods were developed for protein sequencing, but perform considerably worse on GAG sequences, due to the non-uniform distribution of sulfur atoms along the chain and the relatively high isotope abundance of 34S. The interpretation step traditionally is performed manually, which takes time and introduces potential user error. To combat these problems, I developed GAGfinder, the first GAG-specific MS2 peak finding and annotation software. GAGfinder is described in detail in chapter two. Another step in MS2 sequencing is the determination of the sequence using the found MS2 fragments. For a given GAG composition, there are many possible sequences, and peak finding algorithms such as GAGfinder return a list of the peaks in the MS2 mass spectrum. 
The many-to-many relationship between sequences and fragments can be represented using a bipartite network, and node-ranking techniques can be employed to generate likelihood scores for possible sequences. I developed a bipartite network-based sequencing tool, GAGrank, based on a bipartite network extension of Google’s PageRank algorithm for ranking websites. GAGrank is described in detail in chapter three.

    QALM - a tool for automating quantitative analysis of LC-MS-MS/MS data

    The goal of bioinformatics is to support science and research in the field of biology through the application of information technology. Proteomics is a field within biology that deals with the study of proteins. This paper describes QALM, an application developed to automate and simplify a specific type of proteomics analysis. QALM is first and foremost a proof of concept through which certain options for implementing such automation have been explored. Although a functional and usable application has been created, this should primarily be considered a stepping stone for similar applications in the future. Currently QALM is a desktop tool for importing and exporting data, integrating and communicating with external systems for the analysis of such data, and finally generating reports to present the results. It currently runs only under the Linux operating system, but it should be possible to change this fairly easily. Master in Informatics (MAMN-INFINF39)

    Molecular Formula Identification using High Resolution Mass Spectrometry: Algorithms and Applications in Metabolomics and Proteomics

    We investigate several theoretical and practical aspects of identifying the molecular formula of biomolecules using high-resolution mass spectrometry. Owing to recent advances in instrumentation, mass spectrometry (MS) has become one of the key technologies for the analysis of biomolecules in proteomics and metabolomics. It measures the masses of the molecules in a sample with high accuracy and is well suited to high-throughput data acquisition. One of the core tasks in MS-based proteomics and metabolomics is the identification of the molecules in the sample. In metabolomics, metabolites are subjected to structure elucidation, beginning with the molecular formula of a molecule, i.e., the number of atoms of each element. This is the decisive step in identifying an unknown metabolite, because the determined formula reduces the number of possible molecular structures to a much smaller set that can be analyzed further with methods of automated structure elucidation. After preprocessing, the output of a mass spectrometer is a list of peaks corresponding to the molecular masses and their intensities, i.e., the number of molecules with a given mass. In principle, the molecular formulas of small molecules can be identified from accurate masses alone. However, it has been found that, because of the large number of chemically legitimate formulas in the upper mass range, excellent mass accuracy alone is not sufficient for identification. High-resolution MS allows molecular masses and intensities to be determined with outstanding accuracy. In this thesis we develop several algorithms and applications that use this information to identify the molecular formulas of biomolecules.
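The mass-based identification step described in this abstract can be sketched as a brute-force search for formulas whose monoisotopic mass lies within a tolerance of a measured mass. This is a simplified illustration, not one of the thesis's algorithms: the names `candidate_formulas` and `to_string`, the restriction to C, H, N and O, the 5 ppm default, and the absence of any chemical plausibility filter are all assumptions.

```python
# Monoisotopic masses in daltons.
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def candidate_formulas(target_mass, ppm=5.0):
    """Return (C, H, N, O) atom counts whose mass is within `ppm` of target."""
    tol = target_mass * ppm * 1e-6
    upper = target_mass + tol
    hits = []
    for c in range(int(upper // MASS["C"]) + 1):
        left_c = upper - c * MASS["C"]
        for n in range(int(left_c // MASS["N"]) + 1):
            left_n = left_c - n * MASS["N"]
            for o in range(int(left_n // MASS["O"]) + 1):
                heavy = c * MASS["C"] + n * MASS["N"] + o * MASS["O"]
                # At most one hydrogen count can fit, because the tolerance
                # is far smaller than the mass of one hydrogen atom here.
                h = max(0, round((target_mass - heavy) / MASS["H"]))
                if abs(heavy + h * MASS["H"] - target_mass) <= tol:
                    hits.append((c, h, n, o))
    return hits


def to_string(counts):
    """Format (C, H, N, O) counts as a formula string, skipping zero counts."""
    return "".join(f"{el}{k}" for el, k in zip("CHNO", counts) if k)


# Glucose, C6H12O6, has a monoisotopic mass of about 180.06339 Da.
for counts in candidate_formulas(180.06339):
    print(to_string(counts))
```

Calling the function with larger target masses returns more candidates at the same tolerance, which is one way to see why the abstract says that mass accuracy alone does not suffice in the upper mass range.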

    Data-independent acquisition mass spectrometry for human gut microbiota metaproteome analysis

    Human digestive tract microbiota is a diverse community of microorganisms with complex interactions between the microbes and the human host. Observing the functions carried out by microbes is essential for understanding the role of gut microbiota in human health and its associations with disease. New methods and tools are needed for acquiring functional information from complex microbial samples. Metagenomic approaches focus on taxonomy or gene-based functional potential but lack power in the discovery of the actual functions carried out by the microbes. Metaproteomic methods are required to uncover the functions. The current high-throughput metaproteomics methods are based on mass spectrometry, which is capable of identifying and quantifying ionized protein fragments, called peptides. Proteins can be inferred from the peptides, and the functions associated with protein expression can be determined by using protein databases. Currently the most widely used data-dependent acquisition (DDA) method records only the most intense ions in a semi-stochastic manner, which reduces reproducibility and produces incomplete records, impairing quantification. Alternative data-independent acquisition (DIA) systematically records all ions and has been proposed as a replacement for DDA. However, recording all ions produces highly convoluted spectra from multiple peptides and, for this reason, it has not been known if and how DIA can be applied to metaproteomics, where the number of different peptides is high. This thesis work introduced the DIA method for metaproteomic data analysis. The method was shown to achieve high reproducibility, enabling the use of only a single analysis per sample while DDA requires multiple. An easy-to-use open-source software package, DIAtools, was developed for the analysis.
Finally, the DIA analysis method was applied to study human gut microbiota and carbohydrate-active enzymes expressed in members of gut microbiota.
Finnish abstract (translated): Analysis of the human gut microbiota using the DIA mass spectrometry method. The human gut microbiota is a community of many microorganisms that interacts with the human body. Understanding the functions of gut microbes is central to understanding their role in human health and disease. New research methods are needed to determine microbial functionality from complex samples containing many microbes. Commonly used metagenomic methods focus on taxonomy or on functions predicted from genes, but metaproteomics is needed to establish what the microbes actually do. Mass spectrometry can be used for metaproteomic analysis; it can identify and quantify ionized protein fragments, called peptides. Proteins can be inferred from the peptides, and the functions associated with the proteins can then be determined using protein databases. The currently used DDA method identifies only the most abundant ions, which limits its usefulness. Its selection of ions to measure is to some extent random, which reduces the reproducibility of results. The alternative DIA method systematically analyzes all ions and has been proposed as a replacement for DDA. DIA produces overlapping peptide spectra, so it has not previously been known whether it is suitable for, or how it could be applied to, metaproteomics, where the number of different peptides is large. This study presents suitable ways of using DIA in the analysis of metaproteomic data. The work shows that DIA metaproteomics yields reliably reproducible results. With DIA it suffices to analyze a sample only once, whereas DDA requires several analysis runs. The study developed the open-source software DIAtools, which implements the developed methods for analyzing DIA data. Finally, DIA analysis was applied to study the microbes of the digestive tract and the CAZy enzymes they produce.

    ANALYSIS AND SIMULATION OF TANDEM MASS SPECTROMETRY DATA

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes), and their interactions (e.g. proteins binding together to form complex molecular machines) to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundance for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations on present algorithms, simulating mass spectroscopy instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment. 
Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum-matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community. Doctor of Philosophy
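The non-negative least squares idea mentioned in this abstract can be sketched as follows. The abstract does not describe the dissertation's model in detail, so this is only an illustration under assumptions: the solver is a plain projected-gradient method rather than whatever the author used, the name `nnls_projected_gradient` is hypothetical, and the two isotope patterns are made-up numbers.

```python
def nnls_projected_gradient(templates, observed, iterations=500):
    """Minimize ||A x - b||^2 subject to x >= 0.

    templates[k] is the isotope pattern of component k (column k of A),
    observed is the mixed signal b over the same m/z bins.
    """
    n_comp, n_bins = len(templates), len(observed)
    # Step size 1/L, where L is the squared Frobenius norm of A, an upper
    # bound on the largest eigenvalue of A^T A.
    step = 1.0 / sum(v * v for t in templates for v in t)
    x = [0.0] * n_comp
    for _ in range(iterations):
        fitted = [sum(x[k] * templates[k][i] for k in range(n_comp))
                  for i in range(n_bins)]
        residual = [f - o for f, o in zip(fitted, observed)]
        grad = [sum(templates[k][i] * residual[i] for i in range(n_bins))
                for k in range(n_comp)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xk - step * g) for xk, g in zip(x, grad)]
    return x


# Hypothetical isotope patterns of two co-fragmented peptides over 4 bins.
peptide_a = [0.6, 0.3, 0.1, 0.0]
peptide_b = [0.0, 0.5, 0.35, 0.15]
# Simulated chimeric signal: 2 parts peptide A plus 1 part peptide B.
mixed = [2.0 * a + 1.0 * b for a, b in zip(peptide_a, peptide_b)]

abundances = nnls_projected_gradient([peptide_a, peptide_b], mixed)
print([round(v, 4) for v in abundances])
```

The non-negativity constraint matters because abundances cannot be negative; an unconstrained least squares fit may assign negative weight to a pattern that is absent from a noisy spectrum.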