45 research outputs found

    Machine learning applications in proteomics research: How the past can boost the future

    Machine learning is a subdiscipline of artificial intelligence that focuses on algorithms that allow computers to learn to solve a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, provided that enough data are available to train and subsequently evaluate an algorithm. Since MS-based proteomics has no shortage of complex problems, and since public data are becoming available in ever-growing amounts, machine learning is fast becoming a very popular tool in the field. We therefore present an overview of the different applications of machine learning in proteomics that together cover nearly the entire wet- and dry-lab workflow, and that address key bottlenecks in experiment planning and design, as well as in data processing and analysis.

    Computational methods and tools for protein phosphorylation analysis

    Signaling pathways represent a central regulatory mechanism of biological systems, and a key event in their correct functioning is the reversible phosphorylation of proteins. Protein phosphorylation affects at least one-third of all proteins and is the most widely studied posttranslational modification. Phosphorylation analysis is still generally perceived as difficult or cumbersome and is not readily attempted by many, despite the high value of such information. Specifically, determining the exact location of a phosphorylation site is currently considered a major hurdle, so reliable approaches are necessary for the detection and localization of protein phosphorylation. The goal of this PhD thesis was to develop computational methods and tools for mass spectrometry-based protein phosphorylation analysis, particularly the validation of phosphorylation sites. In the first two studies, we developed methods for improved identification of phosphorylation sites in MALDI-MS. In the first study this was achieved through the automatic combination of spectra from multiple matrices, while in the second study an optimized protocol for sample loading and washing conditions was suggested. In the third study, we proposed and evaluated the hypothesis that in ESI-MS, tandem CID and HCD spectra of phosphopeptides can be accurately predicted and used in spectral library searching. This novel strategy for phosphosite validation and identification offered accuracy that outperformed other popular existing methods and proved applicable to complex biological samples. Finally, we significantly improved the performance of our command-line prototype tool and added a graphical user interface, options for customizable simulation parameters, and filtering of selected spectra, peptides or proteins. The new software, SimPhospho, is open-source and can be easily integrated into a phosphoproteomics data analysis workflow.
Together, these bioinformatics methods and tools enable confident phosphosite assignment and improve reliable phosphoproteome identification and reporting.

    Mining Deeper into the Proteome: Computational Strategies for Improving Depth and Breadth of Coverage in High-Throughput Protein Identification Studies.

    The proteomics field is driven by the need to develop increasingly high-throughput methods for the identification and characterization of proteins. The overall goal of this research is to improve the success rate of modern high-throughput proteomics studies. The focus is on developing computational strategies for increasing the number of identifications as well as improving the ability to distinguish new forms of proteins and peptides. Several studies are presented, addressing different points in the proteomics analysis pipeline. At the most fundamental data analysis level, methods for using modern machine learning algorithms to improve the ability to distinguish correct from incorrect peptide identifications are presented. These techniques have the potential to minimize the need for manual curation of results, providing a significant increase in throughput in addition to increased identification confidence. Non-standard types of mass spectrometry data are being generated in specific contexts. Specifically, phosphoproteomics often involves the generation of MS3 spectra. These spectra alleviate problems associated with MS2 fragmentation of phosphopeptides, but utilizing the additional information contained in these spectra requires novel informatics. Several strategies for accommodating this additional information are presented. A statistical model is developed for translating the information contained in the coupling of consecutive MS2 and MS3 spectra into a more accurate peptide identification probability score. Also, methods for combining MS2 and MS3 data are explored. A newer mass spectrometry methodology useful for phosphoproteomics has recently been introduced as well, termed multistage activation (MSA). A comparative study of this and other methods is presented aimed at determining an optimal method for generating phosphopeptide identifications, focusing not only on data analysis techniques, but also on the mass spectrometry methodologies themselves. 
A dataset is presented from a differential study of a human cell line infected with the dengue virus. The study explores the complementarity of different fractionation methods in generating more unique protein identifications. A discussion of a statistical mixture model that utilizes relative quantification information to classify identified peptides into two categories based on their membrane topology is given in the final chapter. Finally, a comment on utilizing pI information to enrich for phosphopeptides is provided.
Ph.D., Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/58496/1/pulintz_1.pd

    Evaluation of the relevance and impact of kinase dysfunction in neurological disorders through proteomics and phosphoproteomics bioinformatics

    Phosphorylation is an important post-translational modification that is involved in various biological processes and its dysregulation has in particular been linked to diseases of the central nervous system including neurological disorders. The present thesis characterizes alterations in the phosphoproteome and protein abundance associated with schizophrenia and Parkinson's disease, with the goal of uncovering the underlying disease mechanisms. To support this goal, I eventually created an automated analysis pipeline in R to streamline the analysis process of proteomics and phosphoproteomics data. Mass spectrometry (MS) technology is utilized to generate proteomics and phosphoproteomics data. Study I of the thesis demonstrates an automated R pipeline, PhosPiR, created to perform multi-level functional analyses of MS data after the identification and quantification of the raw spectral data. The pipeline does not require coding knowledge to run. It supports 18 different organisms, and provides analyses of MS intensity data from preprocessing, normalization and imputation, through to figure overviews, statistical analysis, enrichment analysis, PTM-SEA, kinase prediction and activity analysis, network analysis, hub analysis, annotation mining, and homolog alignment. The LRRK2-G2019S mutation, a frequent genetic cause of late onset Parkinson's disease, was investigated in Study II and III. One study investigated the mechanism of LRRK2-G2019S function in brain, and the other identified proteins with significantly altered overall translation patterns in sporadic and LRRK2-G2019S patient samples. Specifically, study II identified that LRRK2 is localized to the small 40S ribosomal subunit and that LRRK2 activity suppresses RNA translation, as validated in cell and animal models of Parkinson's disease and in patient cells. 
Study III utilized bio-orthogonal non-canonical amino acid tagging to label newly translated proteins and mass spectrometry analysis to identify which proteins were affected by repressed translation in patient samples. The analysis revealed 33 and 30 nascent proteins with reduced synthesis in sporadic and LRRK2-G2019S Parkinson's cases, respectively. The biological process "cytosolic signal recognition particle (SRP)-dependent co-translational protein targeting to membrane" was significantly affected in both sporadic and LRRK2-G2019S Parkinson's, while "Tubulin/FtsZ C-terminal domain superfamily network" was significantly enriched only in LRRK2-G2019S Parkinson's cases. The findings were validated by targeted proteomics and immunoblotting. Study IV investigated the role of JNK1 in schizophrenia. Wild-type and Jnk1-/- mice were used to analyze the phosphorylation profile by LC-MS/MS. A total of 126 schizophrenia-associated proteins overlapped with the significantly differentially phosphorylated proteins in Jnk1-/- mouse brain. The NMDAR trafficking pathway was found to be highly enriched, and surface staining of NMDAR subunits in neurons showed that surface expression of both subunits was significantly decreased in Jnk1-/- neurons. Further behavioral tests conducted with MK801 treatment associated the Jnk1-/- molecular and behavioral phenotype with schizophrenia and neuropsychiatric disease.

    Integrative Analysis Frameworks for Improved Peptide and Protein Identifications from Tandem Mass Spectrometry Data.

    Tandem mass spectrometry (MS/MS) followed by database search is the method of choice for high throughput protein identification in modern proteomic studies. Database searching methods employ spectral matching algorithms and statistical models to identify and quantify proteins in a sample. The major focus of these statistical methods is to assign probability scores to the identifications to distinguish between high confidence, reliable identifications that may be accepted (typically corresponding to a false discovery rate, FDR, of 1% or 5%) and lower confidence, spurious identifications that are rejected. These identification probabilities are determined, in general, considering only evidence from the MS/MS data. However, considering the wealth of external (orthogonal) data available for most biological systems, integrating such orthogonal information into proteomics analysis pipelines can be a promising approach to improve the sensitivity of these analysis pipelines and rescue true positive identifications that were rejected for want of sufficient evidence supporting their presence. In this dissertation, approaches based on naive bayes rescoring, search space restriction, and a hybrid approach that combines both are described for integrating orthogonal information in proteomic analysis pipelines. These methods have been applied for integrating transcript abundance data from RNA-seq and identification frequency data from the Global Proteome Machine database, GPMDB (one of the largest repositories of proteomic experiment results), into analysis pipelines, improving the number of peptide and protein identifications from MS/MS data. Further, estimation of false discovery rates in very large proteomic datasets was also investigated. 
In very large datasets, usually resulting from integrating data from multiple experiments, some assumptions used in typical target-decoy based FDR estimation in smaller datasets no longer hold true, resulting in artificially inflated error rates. Alternative approaches that would allow accurate FDR estimation in these large-scale datasets have been described and benchmarked.
Ph.D., Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116717/1/avinashs_1.pd
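For readers unfamiliar with the target-decoy approach mentioned in the abstract above, the following is a minimal sketch of the conventional estimate (the one the dissertation says breaks down in very large datasets). It is not the dissertation's method. The function names, the `(score, is_decoy)` input layout, and the simple `decoys / targets` estimator are assumptions made for illustration.

```python
from typing import List, Tuple


def target_decoy_qvalues(
    psms: List[Tuple[float, bool]],
) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) for each PSM, best score first.

    psms: (score, is_decoy) pairs; a higher score means a better match.
    FDR at each rank is estimated as decoys / targets (illustrative estimator).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate when accepting everything down to each rank.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any lower-scoring rank.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


def count_accepted_targets(psms: List[Tuple[float, bool]], threshold: float) -> int:
    """Count target PSMs whose q-value is at or below the threshold."""
    return sum(
        1 for _, is_decoy, q in target_decoy_qvalues(psms)
        if not is_decoy and q <= threshold
    )


example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
results = target_decoy_qvalues(example)
```

The monotone q-value step is what allows a single threshold (for example 1% or 5%) to be applied to the ranked list.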

    Methodenentwicklung in der Qualitativen, Quantitativen und Computergestützten Proteomforschung (Method Development in Qualitative, Quantitative and Computational Proteomics Research)

    Protein phosphorylation is an important posttranslational modification that plays a regulatory role in numerous biological processes. The simultaneous identification, localization, and quantification of phosphorylated proteins is vital for understanding this dynamic control mechanism. The application of isobaric labeling strategies, e.g., iTRAQ, for quantitative phosphopeptide analysis requires (i) optimal peptide fragmentation conditions, (ii) sophisticated computational proteomics algorithms to identify (phosphorylated) iTRAQ-labeled peptides, and (iii) the ability to use more than one peptide-spectrum match and phosphopeptide sequence to guarantee accurate phosphorylation-specific quantification. These three demands were combined into a platform for the relative quantification of iTRAQ-4plex labeled (phospho)proteins on an LTQ Orbitrap Velos.

    Mass spectrometry based proteomics : data analysis and applications

    Mass spectrometry (MS) based proteomics has become a widely used high-throughput method to investigate protein expression and functional regulation. From being able to study only dozens of proteins, state-of-the-art MS proteomic techniques are now able to identify and quantify ten thousand proteins. Nevertheless, MS proteomics still faces problems in investigating protein variants derived from alternative splicing, detecting peptides from novel coding sequences, identifying peptide variants arising from genetic changes, and statistically analyzing quantitative proteome data. The work presented in this thesis starts from these problems and contributes solutions to them. In standard shotgun proteomics studies, protein identifications are inferred from a list of identified peptides using Occam's razor, which outputs a minimal list of proteins sufficient to explain the peptide evidence. This protein inference process creates a potential problem in protein-level quantification: quantitative signals from different splice variants are mixed if the inferred proteins do not correctly represent the peptide populations. Paper I presents a tool to investigate splice variants using MS proteomics data. By clustering the quantitative patterns of peptides and showing their transcript positions, it is able to reveal splice-variant-specific peptides with different quantitative signals. The tool was applied to a comprehensive proteomics dataset of A431 cells treated with Gefitinib (an EGFR inhibitor). For certain genes, we observed that splice-variant-centric quantification differs from traditional protein-centric or gene-centric quantification, suggesting differentially regulated splice variants after Gefitinib treatment. Previously, MS proteomics has been used to refine genome annotation. However, the applications were limited to validating and confirming predicted gene models.
In Paper II, we demonstrate an integrative genome annotation workflow that combines MS proteomics data and RNA sequencing to perform evidence-based whole-genome annotation of a newly sequenced commensal yeast. The workflow showed higher accuracy of protein-coding gene annotation than the conventional approach of using only RNA-sequencing data. The study exemplifies that proteomics data used in combination with RNA-seq data can produce a more accurate and complete whole-genome annotation. Paper III shows an integrative proteogenomics analysis workflow. Compared to standard proteomics, which analyzes known proteins in a reference database, proteogenomics aims to discover peptides from novel coding sequences and disease-relevant mutations. Identifying novel coding sequences in well-annotated genomes, such as the human genome, is particularly challenging for several reasons. First, protein-coding sequences make up only 2%-3% of the human genome. There are approximately one million peptides from known coding genes, and the novel peptides from undiscovered coding loci constitute a minor part of the total peptide population. That means the vast majority of experimental spectra are produced from known peptides. Identification of peptides with MS proteomics relies on correct matching of experimental spectra to in silico generated spectra of the peptides in the search space. Detecting novel peptides requires correct spectral matching for both known and novel peptides, and the process inevitably produces false positives. Previously, conservative criteria and manual curation have been applied to ensure the quality of findings. Paper III presents a workflow that improves the reliability of proteogenomics findings through automated, extensive data curation and evidence searching in orthogonal data.
In an analysis of proteomics data from a cancer cell line and five normal human tissues, the workflow successfully detected novel peptides from unknown coding regions and peptide variants from non-synonymous single nucleotide polymorphisms (nsSNPs) and mutations, with multiple sources of evidence provided. Moreover, our quantitative MS data indicated that certain pseudogenes and lncRNAs were expressed and translated in a tissue-specific manner. Paper IV addresses the statistical analysis of quantitative proteomics. Currently, there is no consensus on which statistical methods to use for analyzing labelled and label-free proteomics data. One of the main reasons is the lack of a statistical tool with high performance, ease of use, and broad applicability to various proteomics datasets. The presented statistical method, DEqMS, is a robust and universal tool to assess differential protein expression in quantitative MS proteomics. DEqMS takes into account the dependence of the variance on the number of peptides/PSMs used for protein quantification in the statistical significance test. Compared to existing methods on several benchmarking datasets, DEqMS demonstrated both high statistical accuracy and general applicability. In summary, the work included in this thesis contributes improved data interpretation and applications of MS proteomics data in the analysis of splice variants, genome annotation, proteogenomics studies and statistical analysis of protein expression changes. Development of these methods facilitates a wide range of applications of MS proteomics data in systems biology research.
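The Occam's razor protein inference described in the abstract above can be illustrated with a small sketch. This is not the thesis's tool: it is a greedy approximation to the minimal protein list (exact minimum set cover is computationally hard), and the function name, the protein labels, and the input layout are hypothetical.

```python
from typing import Dict, List, Set


def greedy_parsimony(protein_to_peptides: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick proteins until every identified peptide is explained.

    protein_to_peptides maps a protein ID to the identified peptides it could
    have produced. Ties are broken alphabetically so the result is
    deterministic. This is an approximation, not a guaranteed minimum.
    """
    uncovered: Set[str] = set()
    for peptides in protein_to_peptides.values():
        uncovered |= peptides

    chosen: List[str] = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical example: P2's peptides are a subset of P1's, so P2 is dropped.
example = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB", "pepC"},
    "P3": {"pepD"},
}
inferred = greedy_parsimony(example)
```

In this example, if P1 and P2 were splice variants of the same gene, peptides shared between them would be attributed to P1 only, which is the kind of signal mixing the abstract describes for protein-level quantification.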

    Computational approaches in high-throughput proteomics data analysis

    Proteins are key components in biological systems as they mediate the signaling responsible for information processing in a cell and organism. In biomedical research, one goal is to elucidate the mechanisms of cellular signal transduction pathways to identify possible defects that cause disease. Advancements in technologies such as mass spectrometry and flow cytometry enable the measurement of multiple proteins from a system. Proteomics, or the large-scale study of proteins of a system, thus plays an important role in biomedical research. The analysis of all high-throughput proteomics data requires the use of advanced computational methods. Thus, the combination of bioinformatics and proteomics has become an important part of research on signal transduction pathways. The main objective of this study was to develop and apply computational methods for the preprocessing, analysis and interpretation of high-throughput proteomics data. The methods focused on data from tandem mass spectrometry and single-cell flow cytometry, and on the integration of proteomics data with gene expression microarray data and information from various biological databases. Overall, the methods developed and applied in this study have led to new ways of managing and preprocessing proteomics data. Additionally, the available tools have successfully been used to help interpret biomedical data and to facilitate analyses that would have been cumbersome without computational methods.
[Translated from the Finnish abstract:] Proteins play an important role in biological systems because they coordinate various processes of cells and organisms. One goal of biomedical research is to shed light on cellular signaling pathways and on the changes in their function that occur in various diseases, so that such changes could be corrected. Proteomics is the large-scale study of proteins from a cell, tissue or organism. Proteomics methods such as mass spectrometry and flow cytometry are central methods of biomedical research that can measure several proteins from a sample simultaneously. Today's advanced proteomics measurement technologies produce large datasets and require computational methods for their analysis. Bioinformatics methods have therefore become an important part of proteomics analysis and signaling pathway research. The main objective of this study was to develop and apply efficient computational methods for the preprocessing, analysis and interpretation of large-scale proteomics data. In this study, a preprocessing method for mass spectrometry data and an automated analysis method for flow cytometry data were developed. Protein-level information was combined with measurements of gene transcription levels and with existing information extracted from biological databases. The thesis shows that computational methods play a central role in the management, preprocessing and analysis of proteomics data. The analysis methods developed in the study considerably advance the broader use and understanding of biomedical data.