Evolutionary analyses of orphan genes in mouse lineages in the context of de novo gene birth

Abstract

Gene birth is the process through which new genes appear. For a long time it was argued that the natural way of generating new genes was from copies of existing genes, and the possibility of de novo gene emergence was neglected. However, recent evidence has forced a reconsideration of the old models, and de novo gene birth has gained recognition as a widespread phenomenon. De novo gene birth is the process by which a non-genic sequence gains gene-like features through a few mutations. The following work is a compilation of analyses that seek to highlight the importance and prevalence of de novo gene birth in genomes, suggesting that this process is present at all times and becomes especially relevant upon ecological shifts. In the first chapter, I showed through phylostratigraphic analyses that new genes are substantially simpler than older ones, a trend that was consistent across several features and organisms and that suggests a frequent emergence of new genes through non-duplicative processes. In addition, I detected a strong association between gene birth and high transcriptional activity and chromosomal proximity. As part of this work, I was also able to use phylostratigraphy to evaluate a different model of gene birth, the overprinting of alternative reading frames. In the following chapters of this dissertation, I made use of high-throughput sequencing of transcriptomes and genomes to ask questions about the origin and change of genes at closer divergence times than ever before, ranging from nearly 3000 years to 10 million years of divergence. I was able to detect the theoretically predicted effects of short time scale comparisons on the rate of protein evolution. I also contribute evidence that genes of different ages show different selective constraints even after only a few thousand years of divergence. Finally, in the last part of this thesis I evaluated the role of transcription in gene birth dynamics.
Transcription seems to be a predominant feature of genomes, as most of the genome showed some level of transcription. In terms of de novo gene birth, I was able to identify 663 candidate loci from presence and absence of transcription. Analyses of these candidate loci indicated that gains are rather stable, meaning that subsequent losses were rarely found. In agreement with previous studies, I confirmed the role of testis as a driver of new genes. These results indicate that transcription is not a limiting factor in the emergence of new genes, and that our knowledge about the key regulatory elements of transcription and their turnover is still too limited to explain why new genes seem to arise at a higher rate than they decay.

Contents

Summary of the thesis
Zusammenfassung der Dissertation
Acknowledgements
General introduction
  A brief historic perspective on the concepts of gene birth
    Gene duplication is the main source of new genes
    Orphan genes and the genomics era
    Phylostratigraphy and the continuous emergence of new genes
    Not all genes come from other genes
  Considering gene birth from molecular and evolutionary perspectives
  Overprinting: true innovation from existing genes
  The life cycle of genes
  Overview
Chapter 1: Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution
  Introduction
  Results
    Phylostratigraphy of mouse genes
    Genomic features across ages
    Chromosomal distribution
    Association with transcriptionally active sites
    Testis expressed genes
    Alternative reading frames
  Discussion
    De novo evolution versus duplication-divergence
    Regulatory evolution
    Overprinting
  Conclusion
  Methods
    Phylostratigraphy
    Gene structure analyses
    Transcription associated regions
    Expression data for testis
    Secondary reading frames
  Acknowledgements
Chapter 2: Sequencing of genomes and transcriptomes of closely related mouse species
  Introduction
    Using wild mice to understand gene birth at the transcriptome level
    Phylogeographic distribution of the samples
  Methods
    Biological material
    Transcriptome sequencing
    Genome sequencing
    Raw data processing
    Transcriptome read mapping, annotation and quantification
    Genome read mapping
  Available resources
Chapter 3: Differential selective constraints across phylogenetic ages and their impact on the turnover of protein-coding genes
  Introduction
  Methods
    Transcriptome assembly
    Generation of ortholog pairs and rate analyses
    Overlapping genes
    Reading frame polymorphism detection and annotation
    Statistical analyses
  Results
    Rate differences between genes of different ages
    Overlapping genes are an unlikely source of bias
    Impact of reading frame polymorphisms across phylogenetic time
  Discussion
  Acknowledgements
Chapter 4: A transcriptomics approach to the gain and loss of de novo genes in mouse lineages
  Introduction
    How is a gene made?
    The early phase of new gene emergence
    Pervasive transcription and junk-DNA as raw material for new genes
  Methods
    Transcriptome presence/absence matrix and mapping of gains and losses
  Results
    How much of the mouse genome has evidence of transcription?
    Genome-wide transcription: gain and loss dynamics
    Phylogenetic patterns in genome-wide transcription
    How much of the genome is transcribed in a lineage specific way?
    Identification of cases of de novo transcripts
    Quantification of gain rates for curated genes
    What are the dynamics of transcription loss in known genes?
    Where are new genes expressed?
  Discussion
    Pervasive transcription can provide material for new genes
    Asymmetry in gains and losses of transcription
    From transcribed protogenes to de novo genes
    Differences in expression levels
    Testis as a niche for new genes
  Conclusion
Concluding remarks
Perspectives
References
Chapter contributions
Appendices
  Appendix A. Phylostratigraphic maps
  Appendix B. Curation data from orphan genes
  Appendix C. Functional annotation clusters based on known genes with loss of expression
  Appendix D. Transcriptome information and statistics
Curriculum Vitae
Affidavit
