
    Computational Methods for Mass Spectrometry-based Study of Protein-RNA or Protein-DNA Complexes and Quantitative Metaproteomics

    In the last decade, the use of high-throughput methods has become increasingly popular in various fields of the life sciences. Today, a wide range of technologies exists that allows detailed quantitative insights into biological systems to be gathered. With improved instrumentation and technological advances, a massive growth in data volume from these techniques has been observed. Bioinformatics copes with these large volumes of data by providing computational methods that process raw data to extract biological knowledge. Computational mass spectrometry is a research field in bioinformatics that collects and analyzes data from mass-spectrometric high-throughput experiments. In this thesis, we present two new methods as well as a new data format for computational mass spectrometry. The first method addresses a problem from the field of structural biology: determining spatial interactions between proteins and nucleic acids. For this purpose, we develop experimental protocols, programs, and analysis workflows that allow the identification of UV-induced cross-links in (ribo-)nucleoprotein complexes from mass spectrometry data. An outstanding feature of our method is its ability to exactly localize the amino acids and (ribo-)nucleotides in contact with each other. Applied to data from yeast and human, we identify new interaction partners at, to date, unmatched resolution. The second method applies to metaproteomic studies of complex communities of microorganisms. In vast numbers, bacteria, simple fungi, and plants populate the most varied habitats. They engage in numerous symbiotic or parasitic relationships that serve predominantly for the uptake of nutrients. Organisms differ in their biochemical repertoire, allowing them to decompose a wide range of substrates. Remarkably, this enables functional groups of soil bacteria to nourish themselves even on environmental toxins.
We present a method from the field of metaproteomics that identifies the organisms involved in substrate degradation and groups them according to their function in the degradation process. To this end, we use substrates labeled with stable isotopes, which are metabolized by the organisms. The isotope abundance in proteins serves as an indicator of substrate conversion. This abundance is automatically determined by our novel computational method and assigned to the individual organisms. The automation of this process reduces the manual work from several months to a few minutes and thus enables large-scale studies. The third part of this work contributes to better communication and processing of results from metabolomics and proteomics studies. We present mzTab, a tabular, standardized, human-readable, and machine-processable data format that complements existing data formats. We provide software components for processing the format and demonstrate how it can be integrated into complex proteomic and metabolomic workflows. The recent acceptance of mzTab by the largest proteomics data repositories represents a significant success. We also see widespread adoption by academic software developers and first support by a commercial software vendor. Our novel format facilitates meta-analyses and makes research results from the fields of proteomics and metabolomics available to scientists from other research areas.
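To give a sense of mzTab's tabular structure, the following is a minimal, illustrative reader for two of its section types: metadata (MTD) lines and the protein table (PRH header row, PRT data rows). The full specification defines further sections (peptides, PSMs, small molecules) that follow the same header/row pattern; this sketch is not the reference implementation.

```python
def parse_mztab(lines):
    """Minimal, illustrative mzTab reader: collects metadata (MTD)
    key/value pairs and the protein table (PRH header row, PRT data
    rows). Lines are tab-separated; the first field names the section."""
    metadata, header, proteins = {}, None, []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix == "PRH":
            header = fields[1:]  # column names for subsequent PRT rows
        elif prefix == "PRT" and header:
            proteins.append(dict(zip(header, fields[1:])))
    return metadata, proteins
```

Because every row carries its section prefix, different sections can be freely interleaved in one file and still be parsed line by line, which is what makes the format easy to both read by eye and process by machine.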

    LFQ-Based Peptide and Protein Intensity Differential Expression Analysis

    Testing for significant differences in quantities at the protein level is a common goal of many LFQ-based mass spectrometry proteomics experiments. Starting from a table of protein and/or peptide quantities produced by a given proteomics quantification software, many tools and R packages exist to perform the final tasks of imputation, summarization, normalization, and statistical testing. To evaluate how these packages and the settings of their substeps affect the final list of significant proteins, we studied several packages on three public data sets with known expected protein fold changes. We found that the results between packages, and even across different parameters of the same package, can vary significantly. In addition to usability aspects and feature/compatibility lists of different packages, this paper highlights sensitivity and specificity trade-offs that come with specific packages and settings.
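The substeps named above can be sketched as a minimal pipeline. The shift/scale imputation parameters and the use of a median shift for normalization are illustrative choices, not the defaults of any of the packages studied; summarization (e.g., taking the median of a protein's peptide values) is analogous and omitted for brevity.

```python
import math
import random

def impute_missing(values, shift=1.8, scale=0.3):
    """Impute missing log2 intensities (None) by drawing from a
    down-shifted normal distribution (illustrative parameters)."""
    observed = [v for v in values if v is not None]
    mu = sum(observed) / len(observed)
    sd = math.sqrt(sum((v - mu) ** 2 for v in observed) / (len(observed) - 1))
    rng = random.Random(0)  # fixed seed so imputation is reproducible
    return [v if v is not None else rng.gauss(mu - shift * sd, scale * sd)
            for v in values]

def median_normalize(sample_columns):
    """Subtract each sample's median so all samples are centered alike."""
    normalized = []
    for column in sample_columns:
        median = sorted(column)[len(column) // 2]
        normalized.append([v - median for v in column])
    return normalized

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with unequal variances."""
    ma = sum(group_a) / len(group_a)
    mb = sum(group_b) / len(group_b)
    va = sum((x - ma) ** 2 for x in group_a) / (len(group_a) - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (len(group_b) - 1)
    return (ma - mb) / math.sqrt(va / len(group_a) + vb / len(group_b))
```

Each of these steps has several defensible variants (k-nearest-neighbor imputation, quantile normalization, moderated tests), which is exactly why the final significant-protein lists diverge between packages.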

    Tissue-based absolute quantification using large-scale TMT and LFQ experiments

    Relative and absolute intensity-based protein quantification across cell lines, tissue atlases and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity and correlation with RNA expression. Most studies provide MS1 feature-based label-free quantification (LFQ) datasets; however, a growing number of isobaric tandem mass tag (TMT) datasets remain unexplored. Here, we compare traditional intensity-based absolute quantification (iBAQ) proteome abundance ranking to an analogous method using reporter ion proteome abundance ranking, with data from an experiment in which LFQ and TMT were measured on the same samples. This new TMT method substitutes reporter ion intensities for MS1 feature intensities in the iBAQ framework. Additionally, we compared LFQ-iBAQ values to TMT-iBAQ values from two independent large-scale tissue atlas datasets (one LFQ and one TMT) using robust bottom-up proteomic identification, normalisation and quantitation workflows.
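The iBAQ value divides a protein's summed intensity (MS1 feature intensities in LFQ, or reporter ion intensities in the TMT analogue described above) by its number of theoretically observable peptides. A minimal sketch follows; the tryptic-digest rule and the 7-30 residue observability window are common illustrative assumptions, not this study's exact settings.

```python
def tryptic_peptides(sequence):
    """In-silico tryptic digest: cleave after K or R, but not before P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR" and (i + 1 == len(sequence) or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides

def ibaq(summed_intensity, sequence, min_len=7, max_len=30):
    """iBAQ: summed intensity divided by the number of theoretically
    observable peptides (here: tryptic peptides of 7-30 residues, an
    illustrative observability filter)."""
    observable = [p for p in tryptic_peptides(sequence)
                  if min_len <= len(p) <= max_len]
    return summed_intensity / len(observable) if observable else 0.0
```

Dividing by the observable peptide count corrects for protein length, which is what lets iBAQ values be ranked as approximate absolute abundances across proteins.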

    Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides

    We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. The workflow also includes exonic out-of-frame translation of otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling, including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type-specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to >5% of the total number of peptides identified.
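A three-frame translation of the forward strand, as used to turn non-canonical transcripts into searchable protein sequences, can be sketched as follows. This uses the standard genetic code; emitting stop codons as '*' and unknown codons as 'X' are illustrative conventions, not pypgatk's exact behavior.

```python
BASES = "TCAG"
# Standard genetic code, indexed as 16*i + 4*j + k over base order TCAG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
         for i, b1 in enumerate(BASES)
         for j, b2 in enumerate(BASES)
         for k, b3 in enumerate(BASES)}

def three_frame_translate(transcript):
    """Translate a transcript in the three forward reading frames,
    returning one amino acid string per frame ('X' for unknown codons)."""
    frames = []
    for offset in range(3):
        protein = []
        for i in range(offset, len(transcript) - 2, 3):
            protein.append(CODON.get(transcript[i:i + 3].upper(), "X"))
        frames.append("".join(protein))
    return frames
```

In a database-generation setting, each frame's translation would then typically be split at stop codons and the resulting open reading frames above a minimum length added to the search database.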

    Differential Enzymatic ¹⁶O/¹⁸O Labeling for the Detection of Cross-Linked Nucleic Acid-Protein Heteroconjugates

    Cross-linking of nucleic acids to proteins in combination with mass spectrometry permits the precise identification of interacting residues in nucleic acid-protein complexes. However, the mass spectrometric identification and characterization of cross-linked nucleic acid-protein heteroconjugates within a complex sample is challenging. Here we establish a novel enzymatic differential ¹⁶O/¹⁸O labeling approach, which uniquely labels heteroconjugates. We have developed an automated data analysis workflow based on OpenMS for the identification of differentially isotopically labeled heteroconjugates against a complex background. We validated our method using synthetic model DNA oligonucleotide-peptide heteroconjugates, which were subjected to the labeling reaction and analyzed by high-resolution FTICR mass spectrometry.
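Because differential ¹⁸O labeling shifts each labeled species by a multiple of the ¹⁸O/¹⁶O mass difference (about 2.0042 Da), candidate heteroconjugates can be screened as pairs of precursor masses separated by that shift. The sketch below illustrates the pairing idea only; the label count of two and the 10 ppm tolerance are assumptions, not the published workflow's settings.

```python
O18_SHIFT = 2.004246  # approximate 18O minus 16O mass difference in Da

def find_labeled_pairs(masses, n_labels=2, tol_ppm=10.0):
    """Return (light, heavy) precursor mass pairs separated by n_labels
    18O/16O exchanges, the signature of a differentially labeled species.
    n_labels and tol_ppm are illustrative assumptions."""
    expected = n_labels * O18_SHIFT
    pairs = []
    for i, light in enumerate(masses):
        for heavy in masses[i + 1:]:
            delta = abs(heavy - light)
            # compare against the expected shift within a ppm window
            if abs(delta - expected) <= tol_ppm * 1e-6 * max(light, heavy):
                pairs.append((light, heavy))
    return pairs
```

In practice such pairing would be applied per charge state and retention-time window before the paired spectra are handed to identification; unpaired masses can then be discarded as background.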

    Ten Simple Rules for Taking Advantage of Git and GitHub

    Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use.
    Biotechnology and Biological Sciences Research Council
    This is the final version of the article. It first appeared from Public Library of Science via https://doi.org/10.1371/journal.pcbi.1004947

    BioContainers: An open-source and community-driven framework for software standardization

    Motivation: BioContainers (biocontainers.pro) is an open-source and community-driven framework which provides platform-independent executable environments for bioinformatics software. BioContainers allows labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software, and combine tools into powerful analysis pipelines. BioContainers is based on the popular open-source Docker and rkt frameworks, which allow software to be installed and executed in an isolated and controlled environment. It also provides infrastructure and basic guidelines to create, manage, and distribute bioinformatics containers, with a special focus on omics technologies. These containers can be integrated into more comprehensive bioinformatics pipelines and different architectures (local desktop, cloud environments, or HPC clusters). Availability and implementation: The software is freely available at github.com/BioContainers/.