Investigating normal human gene expression in tissues with high-throughput transcriptomic and proteomic data.

Abstract

With the improvement of high-throughput technologies over the last decade, several studies exploring normal gene expression in human tissues have been published. Many studies examine the transcriptome with RNA sequencing (RNA-Seq), and others probe the proteome with label-free bottom-up Mass Spectrometry (MS). Because sampling undiseased tissues is difficult, the community often refers to expression atlases, which collate these studies, to support or validate new findings. Although many tissues overlap between the studies, few atlases attempt to integrate all the data. In this thesis, I investigate the consistency of gene expression across human tissues and studies, using transcriptomic data captured with RNA-Seq and proteomic data generated with label-free bottom-up MS. After describing the transcriptomic and proteomic data and their state-of-the-art processing (Chapter 2), I review several identified sources of bias and my approaches to limit their effects (Chapter 3). The integration of the various transcriptomic datasets (Chapter 4) shows that the biological signal dominates the technical noise for RNA-Seq data: tissue samples correlate more strongly with samples of the same tissue from other studies than with samples of different tissues from the same study. Globally, genes show similar expression profiles across studies for a given set of tissues. All gene categories are involved, including the tissue-specific genes and the ubiquitously expressed ones. After briefly discussing comparisons of proteomic data, I introduce a new proteomic quantification method, PPKM (Chapter 5). The PPKM method allows me to quantify about twice as many proteins as usual methods.
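The comparison behind the Chapter 4 result (same tissue across studies versus different tissues within a study) can be illustrated with a minimal sketch. It is not the thesis pipeline: the study and tissue names, the five-gene expression values, and the choice of Pearson correlation on log2-transformed values are all assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def log_expr(values):
    """log2(x + 1) transform, a common way to compress expression ranges."""
    return [math.log2(v + 1.0) for v in values]


# Hypothetical toy expression values for five genes (same gene order in
# every sample). These numbers are invented and are not real data.
expression = {
    ("study_A", "liver"): [120.0, 3.0, 45.0, 0.5, 80.0],
    ("study_A", "heart"): [2.0, 150.0, 40.0, 60.0, 1.0],
    ("study_B", "liver"): [95.0, 5.0, 50.0, 1.0, 110.0],
    ("study_B", "heart"): [4.0, 130.0, 35.0, 75.0, 0.5],
}


def mean_correlation(pairs):
    """Average correlation over a list of (sample, sample) key pairs."""
    values = [
        pearson(log_expr(expression[a]), log_expr(expression[b]))
        for a, b in pairs
    ]
    return sum(values) / len(values)


# Same tissue, different studies.
same_tissue_between_studies = mean_correlation([
    (("study_A", "liver"), ("study_B", "liver")),
    (("study_A", "heart"), ("study_B", "heart")),
])

# Different tissues, same study.
different_tissue_within_study = mean_correlation([
    (("study_A", "liver"), ("study_A", "heart")),
    (("study_B", "liver"), ("study_B", "heart")),
])

# True when the tissue identity explains more than the study of origin.
biological_signal_dominates = (
    same_tissue_between_studies > different_tissue_within_study
)
print(biological_signal_dominates)
```

On real data the same comparison would run over thousands of genes and every tissue shared between studies. A rank-based correlation such as Spearman's could be substituted if expression scales differ between studies.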
A limited number of previous studies combining high-throughput transcriptomics and proteomics have reported varying levels of correlation between protein and mRNA expression. I show that, for most tissues, the correlation is significantly better than expected by chance, even though the samples were sourced independently and therefore have different biological and technical backgrounds. Many genes share similar patterns of expression between the two biological layers; for example, genes whose protein is detected in a single tissue are more likely to have an mRNA that is specific to the same tissue. Additionally, three groups of genes present functional enrichments of biological processes. Genes with highly correlated protein and mRNA expression across tissues are enriched in catabolic processes. Genes with the most anticorrelated expression are enriched for ribosomes and ncRNA regulation. Genes with a protein detected in a single tissue are enriched in signalling processes. Overall, this thesis describes a global picture of the current consolidated knowledge we can extract from the joint study of public transcriptomic and proteomic data. Beyond confirming or improving observations reported in the literature, this work provides new insights into the ubiquitous and tissue-specific genes. To the best of my knowledge, this work has also established the most extensive list of genes with robust transcriptomic and proteomic expression across tissues and studies. Furthermore, it shows that joint study approaches can help the development of new methods, such as the new PPKM proteomic quantification method. Finally, the distinct functional enrichment profiles highlighted for groups of genes across tissues and studies lay a framework for further research.

EMBL International PhD Programm
