2,749 research outputs found

    Homology-Based Functional Proteomics By Mass Spectrometry and Advanced Informatic Methods

    Functional characterization of biochemically-isolated proteins is a central task in the biochemical and genetic description of the biology of cells and tissues. Protein identification by mass spectrometry consists of associating an isolated protein with a specific gene or protein sequence in silico, thus inferring its specific biochemical function based upon previous characterizations of that protein or a similar protein having that sequence identity. By performing this analysis on a large scale in conjunction with biochemical experiments, novel biological knowledge can be developed. The study presented here focuses on mass spectrometry-based proteomics of organisms with unsequenced genomes and corresponding developments in biological sequence database searching with mass spectrometry data. Conventional methods to identify proteins by mass spectrometry analysis have employed proteolytic digestion, fragmentation of resultant peptides, and the correlation of acquired tandem mass spectra with database sequences, relying upon exact matching algorithms; i.e., the analyzed peptide had to previously exist in a database in silico to be identified. One existing sequence-similarity protein identification method was applied (MS BLAST, Shevchenko 2001) and one alternative novel method was developed (MultiTag), for searching protein and EST databases, to enable the recognition of proteins that are generally unrecognizable by conventional software but share significant sequence similarity with database entries (~60-90%). These techniques and available database sequences enabled the characterization of the Xenopus laevis microtubule-associated proteome and the Dunaliella salina soluble salt-induced proteome, both organisms with unsequenced genomes and minimal database sequence resources. These sequence-similarity methods extended protein identification capabilities by more than two-fold compared to conventional methods, making existing methods virtually superfluous.
The proteomics of Dunaliella salina demonstrated the utility of MS BLAST as an indispensable method for characterization of proteins in organisms with unsequenced genomes, and produced insight into Dunaliella's inherent resilience to high salinity. The Xenopus study was the first proteomics project to simultaneously use all three central methods of representation for peptide tandem mass spectra for protein identification: sequence tags, amino acid sequences, and mass lists; and it is the largest proteomics study in Xenopus laevis yet completed, which indicated a potential relationship between the mitotic spindle of dividing cells and the protein synthesis machinery. At the beginning of these experiments, the identification of proteins was conceptualized as using "conventional" versus "sequence-similarity" techniques, but through the course of experiments, a conceptual shift in understanding occurred along with the techniques developed and employed to encompass variations in mass spectrometry instrumentation, alternative mass spectrum representation forms, and the complexities of database resources, producing a more systematic description and utilization of available resources for the characterization of proteomes by mass spectrometry and advanced informatic approaches. The experiments demonstrated that proteomics technologies are only as powerful in the field of biology as the biochemical experiments are precise and meaningful.

    Protein De novo Sequencing

    In the proteomic mass spectrometry field, peptide and protein identification can be classified into two categories: database search that relies on existing peptide and protein databases and de novo sequencing with no prior knowledge. There are many unknown protein sequences in nature, especially those proteins that play a vital role in drug development pipelines, such as monoclonal antibodies and venoms. To sequence these unknown proteins, de novo sequencing is a necessity. There have been standard algorithms for de novo sequencing a short peptide from its tandem mass spectrum (MS/MS). However, the de novo sequencing of a whole protein is still in its infancy. The most promising method is to digest the protein into overlapping short peptides with different enzymes. After each peptide is de novo sequenced with MS/MS, these overlapping peptides are then assembled together either manually or with a computer algorithm. Such an automated assembly algorithm becomes the main purpose of this thesis. Compared to the DNA sequence assembly counterpart, the main challenges are the high error rates and the short sequence length of each de novo peptide. To meet these challenges, novel scoring methods and algorithms are proposed and a software program is developed. The program is tested on a standard data set and demonstrates superior performance when compared to the state-of-the-art.
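The assembly step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes error-free peptides and exact suffix/prefix overlaps, whereas the abstract stresses that real de novo peptides are short and error-prone and need dedicated scoring. The function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical, chosen only for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the longest such overlap is shorter than `min_len`)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(peptides, min_len: int = 3):
    """Greedily merge the pair of sequences with the longest exact overlap
    until no pair overlaps by at least `min_len` residues.

    Assumption: peptides are error-free. Real de novo peptides contain
    errors, so a practical assembler would score approximate overlaps.
    """
    # Remove exact duplicates while keeping the input order.
    contigs = list(dict.fromkeys(peptides))
    # Drop peptides that are fully contained in another peptide.
    contigs = [p for p in contigs
               if not any(p != q and p in q for q in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Three overlapping peptides, as might come from digests with different enzymes.
peptides = ["MKTAYIAK", "YIAKQRQISF", "QISFVKSHFSRQ"]
print(greedy_assemble(peptides))
```

A greedy merge keeps the sketch short; it also shows why short peptides are hard to assemble, since a three- or four-residue overlap can easily occur by chance between unrelated regions of a protein.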

    SPIDER: Reconstructive Protein Homology Search with De Novo Sequencing Tags

    In the field of proteomic mass spectrometry, proteins can be sequenced by two independent yet complementary algorithms: de novo sequencing which uses no prior knowledge and database search which relies upon existing protein databases. In the case where an organism’s protein database is not available, the software Spider was developed in order to search sequence tags produced by de novo sequencing against a database from a related organism while accounting for both errors in the sequence tags and mutations. This thesis further develops Spider by using the concept of reconstruction in order to predict the real sequence by considering both the sequence tags and their matched homologous peptides. The significant value of these reconstructed sequences is demonstrated. Additionally, the runtime is greatly reduced and separated into independent caching and matching steps. This new approach allows for the development of an efficient algorithm for search. In addition, the algorithm’s output can be used for new applications. This is illustrated by a contribution to a complete protein sequencing application

    Developing a bioinformatics framework for proteogenomics

    In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Through advances in proteomics a new field has emerged to help improve genome annotation, termed proteogenomics, which uses peptide mass spectrometry data, enabling the discovery of novel protein coding genes, as well as the refinement and validation of known and putative protein-coding genes. The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although this method provides insights into the gene content of genomes it is unable to distinguish protein-coding genes from putative non-coding RNA genes. This problem is further confounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence. This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications and how best to control for false discoveries. This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied within four case studies: a prokaryote bacterium; a monocotyledonous wheat plant; a dicotyledonous grape plant; and human. 
The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (nitrogen-fixing bacteria); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements for the studies conducted in this thesis and proteogenomics as a whole have been identified and discussed

    A perspective toward mass spectrometry-based de novo sequencing of endogenous antibodies

    A key step in therapeutic and endogenous humoral antibody characterization is identifying the amino acid sequence. So far, this task has been mainly tackled through sequencing of B-cell receptor (BCR) repertoires at the nucleotide level. Mass spectrometry (MS) has emerged as an alternative tool for obtaining sequence information directly at the - most relevant - protein level. Although several MS methods are now well established, analysis of recombinant and endogenous antibodies comes with a specific set of challenges, requiring approaches beyond the conventional proteomics workflows. Here, we review the challenges in MS-based sequencing of both recombinant as well as endogenous humoral antibodies and outline state-of-the-art methods attempting to overcome these obstacles. We highlight recent examples and discuss remaining challenges. We foresee a great future for these approaches making de novo antibody sequencing and discovery by MS-based techniques feasible, even for complex clinical samples from endogenous sources such as serum and other liquid biopsies

    Top-down analysis of protein samples by de novo sequencing techniques

    Motivation: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need for efficient methods for analyzing this kind of data. Results: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns. Availability and Implementation: Freely available on the web at http://bioinf.spbau.ru/en/twister

    Improved Algorithms for Discovery of New Genes in Bacterial Genomes

    In this dissertation, we describe a new approach for gene finding that can utilize proteomics information in addition to DNA and RNA to identify new genes in prokaryote genomes. Proteomics processing pipelines require identification of small pieces of proteins called peptides. Peptide identification is a very error-prone process and we have developed a new algorithm for validating peptide identifications using a distance-based outlier detection method. We demonstrate that our method identifies more peptides than other popular methods using standard mixtures of known proteins. In addition, our algorithm provides a much more accurate estimate of the false discovery rate than other methods. Once peptides have been identified and validated, we use a second algorithm, proteogenomic mapping (PGM), to map these peptides to the genome to find the genetic signals that allow us to identify potential novel protein coding genes called expressed Protein Sequence Tags (ePSTs). We then collect and combine evidence for the ePSTs we generated, and evaluate the likelihood that each ePST represents a true new protein coding gene using supervised machine learning techniques. Finally, we have developed new approaches to Bayesian learning that allow us to model the knowledge domain from sparse biological datasets. We have developed two new bootstrap approaches that utilize resampling to build networks with the most robust features that reoccur in many networks. These bootstrap methods yield improved prediction accuracy. We have also developed an unsupervised Bayesian network structure learning method that can be used when training data is not available or when labels may not be reliable.
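The idea of mapping identified peptides back to a genome can be sketched with a plain six-frame translation search. This is a generic illustration, not the dissertation's PGM algorithm: it assumes the standard genetic code, exact peptide matches, and an uppercase ACGT genome, and the function names (`reverse_complement`, `translate`, `map_peptide`) are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def map_peptide(genome: str, peptide: str):
    """Find exact matches of `peptide` in all six reading frames.

    Returns a list of (strand, frame, start, end) tuples, where start/end
    are 0-based, end-exclusive coordinates on the forward strand.
    """
    hits = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand coordinates to forward-strand ones.
                    start, end = n - end, n - start
                hits.append((strand, frame, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy genome: the peptide MKTAY is encoded starting at position 2.
genome = "CCATGAAAACCGCGTATTAAG"
print(map_peptide(genome, "MKTAY"))
```

Each hit gives a genomic interval that could serve as the seed for a candidate coding region; deciding whether such a region is a real gene is the part the abstract assigns to supervised machine learning.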

    Optimized GeLC-MS/MS for Bottom-Up Proteomics

    Despite tremendous advances in mass spectrometry instrumentation and mass spectrometry-based methodologies, global protein profiling of organellar, cellular, tissue and body fluid proteomes in different organisms remains a challenging task due to the complexity of the samples and the wide dynamic range of protein concentrations. In addition, the large amounts of data produced make the results difficult to exploit. To overcome these issues, further advances in sample preparation, mass spectrometry instrumentation as well as data processing and data analysis are required. The presented study focuses first on improving the proteolytic digestion of proteins in the in-gel based proteomic approach (Gel-LCMS). To this end, commonly used bovine trypsin (BT) was modified with oligosaccharides in order to overcome its main disadvantages, such as weak thermostability and fast autolysis at basic pH. Glycosylated trypsin derivatives maintained their cleavage specificity and showed better thermostability, autolysis resistance and less autolytic background than unmodified BT. In line with the “accelerated digestion protocol” (ADP) previously established in our laboratory, the modified enzymes were tested in in-gel digestion of proteins. Kinetics of in-gel digestion was studied by MALDI TOF mass spectrometry using 18O-labeled peptides as internal standards as well as by a label-free quantification approach, which utilizes intensities of peptide ions detected by nanoLC-MS/MS. In the performed kinetic study the effect of temperature, enzyme concentration and digestion time on the yield of digestion products was characterized. The obtained results showed that in-gel digestion of proteins by glycosylated trypsin conjugates was less efficient compared to the conventional digestion (CD) and achieved at most 50 to 70% of CD yield, suggesting that the attached sugar molecules limit free diffusion of the modified trypsins into the polyacrylamide gel pores.
Nevertheless, these thermostable and autolysis-resistant enzymes can be regarded as promising candidates for gel-free shotgun approaches. To address the reliability issue of proteomic data I further focused on protein identifications with borderline statistical confidence produced by database searching. These hits are typically produced by matching a few marginal quality MS/MS spectra to database peptide sequences and represent a significant bottleneck in proteomics. A method was developed for rapid validation of borderline hits, which takes advantage of the independent interpretation of the acquired tandem mass spectra by the de novo sequencing software PepNovo followed by mass-spectrometry driven BLAST (MS BLAST) sequence similarity searching that utilizes all partially accurate, degenerate and redundant proposed peptide sequences. It was demonstrated that a combination of MASCOT software, the de novo sequencing software PepNovo and MS BLAST, bundled by a simple scripted interface, enabled rapid and efficient validation of a large number of borderline hits, produced by matching of one or two MS/MS spectra with marginal statistical significance.

    Bioinformatics of Phosphoproteomics
