70 research outputs found

    Cancer proteogenomics : connecting genotype to molecular phenotype

    The central dogma of molecular biology describes the one-way road from DNA to RNA and finally to protein. Yet, how this flow of information encoded in DNA as genes (genotype) is regulated in order to produce the observable traits of an individual (phenotype) remains unanswered. Recent advances in high-throughput data, i.e., ‘omics’, have allowed the quantification of DNA, RNA and protein levels, leading to integrative analyses that essentially probe the central dogma along all of its constituent molecules. Evidence from these analyses suggests that mRNA abundances are at best a moderate proxy for proteins, which are the main functional units of cells and thus closer to the phenotype. Cancer proteogenomic studies consider the ensemble of proteins, the so-called proteome, as the readout of the functional molecular phenotype and investigate how it is influenced by upstream events, for example DNA copy number alterations. In typical proteogenomic studies, however, the identified proteome is a simplification of its actual composition, because the methods disregard events such as splicing, proteolytic cleavage and post-translational modifications that generate unique protein species – proteoforms. The scope of this thesis is to study the proteome diversity in terms of: a) the complex genetic background of three tumor types, i.e. breast cancer, childhood acute lymphoblastic leukemia and lung cancer, and b) the proteoform composition, describing a computational method for detecting protein species based on their distinct quantitative profiles. In Paper I, we present a proteogenomic landscape of 45 breast cancer samples representative of the five PAM50 intrinsic subtypes. We studied the effect of copy number alterations (CNA) on mRNA and protein levels, overlaying a public dataset of drug-perturbed protein degradation. 
In Paper II, we describe a proteogenomic analysis of 27 B-cell precursor acute lymphoblastic leukemia clinical samples that compares high hyperdiploid versus ETV6/RUNX1-positive cases. We examined the impact of the amplified chromosomes on mRNA and protein abundance, specifically the linear trend between the amplification level and the dosage effect. Moreover, we investigated mRNA-protein quantitative discrepancies with regard to post-transcriptional and post-translational effects such as mRNA/protein stability and miRNA targeting. In Paper III, we describe a proteogenomic cohort of 141 non-small cell lung cancer clinical samples. We used clustering methods to identify six distinct proteome-based subtypes. We integrated the protein abundances in pathways using protein-protein correlation networks, bioinformatically deconvoluted the immune composition and characterized the neoantigen burden. In Paper IV, we developed a pipeline for proteoform detection from bottom-up mass spectrometry-based proteomics. Using an in-depth proteomics dataset of 18 cancer cell lines, we identified proteoforms related to splice variant peptides supported by RNA-seq data. This thesis adds to the previous literature on proteogenomic studies by analyzing the tumor proteome and its regulation along the flow of the central dogma of molecular biology. It is anticipated that some of these findings will lead to novel insights about tumor biology and set the stage for clinical applications to improve current cancer patient care.

    Developing a bioinformatics framework for proteogenomics

    In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Through advances in proteomics a new field has emerged to help improve genome annotation, termed proteogenomics, which uses peptide mass spectrometry data, enabling the discovery of novel protein coding genes, as well as the refinement and validation of known and putative protein-coding genes. The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although this method provides insights into the gene content of genomes it is unable to distinguish protein-coding genes from putative non-coding RNA genes. This problem is further confounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence. This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications and how best to control for false discoveries. This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied within four case studies: a prokaryote bacterium; a monocotyledonous wheat plant; a dicotyledonous grape plant; and human. 
The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (nitrogen-fixing bacteria); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements to the studies conducted in this thesis, and to proteogenomics as a whole, are identified and discussed.

    Improving anti-cancer therapies through a better identification and characterization of non-canonical MHC-I associated peptides

    Increasing evidence of non-canonical protein translation has sparked interest in identifying and characterizing non-canonical proteins for use in immunotherapy. In addition, recent studies on the repertoire of major histocompatibility complex class I (MHC-I) associated peptides (MAPs or immunopeptidome) have suggested that MAPs derived from these translations are potential targets for cancer immunotherapy. Therefore, the aim of this study was to assess the impact of these MAPs in cancer by developing methods to facilitate their identification and their validation as potential targets for immunotherapy. To facilitate the identification of non-canonical proteins, we developed Ribo-db, a proteogenomic approach that combines RNA sequencing, ribosome profiling and mass spectrometry. This approach enables the generation of specific databases aimed at including protein diversity. The use of Ribo-db to analyze diffuse large B-cell lymphoma (DLBCL) samples revealed that approximately 10% of MAPs were derived from non-canonical proteins. These proteins had distinct properties compared to canonical proteins: they had shorter lengths and lower stability, but greater efficiency in generating MAPs. Importantly, we found limited overlap between the non-canonical proteins detected in the immunopeptidome and those detected in the whole proteome, suggesting the existence of two distinct non-canonical protein repertoires. Knowing that non-canonical MAPs can be effective targets for cancer immunotherapy, we developed BamQuery, a tool to assess their expression in tissues to determine whether they can be used in a vaccine. BamQuery aims to predict the probability of MHC-I presentation of each peptide in different tissues based on its RNA expression. Using BamQuery, we found that previously identified tumor antigens (TA) would be highly expressed in healthy tissues, making them poor candidates for immunotherapy. 
In addition, we identified high-potential immunotherapeutic targets in DLBCL that were derived from non-canonical translations. These targets showed promise, as they were poorly expressed in normal tissues but highly expressed and shared in tumor samples. Thus, BamQuery proved to be a useful tool for identifying and prioritizing potential immunotherapeutic targets. Overall, our research indicated that non-canonical regions of the genome increase the diversity of MAPs that can be recognized by T cells. Furthermore, the expression of MAPs in tissues can be used as a predictor of their presentation by MHC-I to identify reliable targets for immunotherapy, for which BamQuery is an effective tool.

    Context-based analysis of mass spectrometry proteomics data


    Identification, organisation and visualisation of complete proteomes in UniProt throughout all taxonomic ranks: archaea, bacteria, eukaryote and virus

    Users of uniprot.org want to be able to query, retrieve and download proteome sets for an organism of their choice. They expect the data to be easily accessible, complete and up to date based on currently available knowledge. UniProt release 2012_01 (25th Jan 2012) contains the proteomes of 2,923 organisms: 50% are bacteria, 38% viruses, 8% eukaryota and 4% archaea. Note that the term 'organism' is used in a broad sense to include subspecies, strains and isolates. Each completely sequenced organism is processed as an independent organism, hence the availability of 38 strain-specific proteomes of Escherichia coli for download. There is a project within UniProt dedicated to the mammoth task of maintaining the “Proteomes database”. This active resource is essential for UniProt to continually provide high-quality proteome sets to its users. Accurate identification and incorporation of new, publicly available proteomes, as well as the maintenance of existing proteomes, permits sustained growth of the proteomes project. This is a huge, complicated and vital task accomplished by the activities of both curators and programmers. This thesis explains the data input and output of the proteomes database: the flow of genome project data from the nucleotide database into the proteomes database, then, for each genome, how a proteome is identified, augmented and made visible to uniprot.org users. Along this journey of discovery many issues arose: puzzles concerning data gathering, data integrity and data visualisation. All were resolved, and the outcome is a well-documented, actively maintained database that strives to provide optimal proteome information to its users.

    Development and Application of Next-Generation Sequencing Methods to Profile Cellular Translational Dynamics

    The transmission of genetic information from the transcription of DNA to RNA and the subsequent translation of RNA into protein is often abstracted into a linear process. However, as methods and technologies to measure the genomic, transcriptomic, and proteomic content of cells have advanced, so too has our understanding that the transmission of genetic information does not always flow in a lossless manner. For instance, changes observed in messenger RNA (mRNA) abundance are not always retained at the proteomic level. Indeed, a diverse array of mechanisms have been identified that exert regulatory control over this transmission of information. Next-generation short read sequencing has driven many of these insights and provided increasingly nuanced understanding of these regulatory mechanisms. However, the continued development and application of sequencing methodologies and analytics are required to properly contextualize many of these insights on a more global scale. Ribosome profiling is one such recent advancement which enriches for ribosome-protected fragments of mRNA; sequencing and analysis of these ribosome-protected mRNA fragments enables profiling of the translational content of a sample. The aim of this dissertation is to address the need for the development and application of statistical and analytical algorithms to profile the regulatory factors that contribute to the translational dynamics in cells. In the first chapter, I survey the development and application of next-generation sequencing methods for the profiling and computational analysis of translation and translational dynamics. In the second chapter of this thesis, I present SPECtre, a software package that identifies regions of active translation through measurement of the translational engagement of ribosomes over a transcript. 
SPECtre achieves high sensitivity and specificity in its classification of regions undergoing translation by leveraging the codon-dependent elongation of peptides; this tri-nucleotide periodicity is evident in the alignment of ribosome profiling sequence reads to a reference transcriptome. SPECtre classifies actively translated transcripts according to their coherence in read coverage over a region to an optimal tri-nucleotide signal. In the third chapter, I describe the application of SPECtre to identify the translation of upstream-initiated open-reading frames that may regulate differentiation in a neuron-like cell model. uORFs are transcripts that result from the initiation of translation from AUG, and under certain biological constraints, from non-AUG sequences localized in the 5’ untranslated regions of annotated protein-coding genes. Subsets of these uORFs have been implicated in the regulation of their downstream protein-coding genes in yeast, mice and humans. In this chapter, I provide further evidence for this regulation as well as the spatial context for the functional consequences of uORF translation on downstream protein-coding genes in a neuron-like cell line model of differentiation. Finally, in the fourth chapter, I outline a strategy using our coherence-based translational scoring algorithm to profile ribosomal engagement over chimeric gene fusion breakpoints in prostate cancer. Here, known breakpoints from current annotation databases are integrated with novel junctions nominated by existing whole genome and transcriptomic gene fusion detection algorithms, and the translational profile over these chimeric junctions using SPECtre is measured. This provides an additional layer of translational evidence to known and novel gene fusion breakpoints in prostate cancer. 
Ongoing development of a database and visualization platform based on these results will enable integrative insights into the transcriptional and translational topology of these breakpoints.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/144106/1/stonyc_1.pd