
    Emergence, Retention and Selection: A Trilogy of Origination for Functional <i>De Novo</i> Proteins from Ancestral LncRNAs in Primates

    <div><p>While some human-specific protein-coding genes have been proposed to originate from ancestral lncRNAs, the transition process remains poorly understood. Here we identified 64 hominoid-specific <i>de novo</i> genes and report a mechanism for the origination of functional <i>de novo</i> proteins from ancestral lncRNAs with precise splicing structures and specific tissue expression profiles. Whole-genome sequencing of dozens of rhesus macaque animals revealed that these lncRNAs are generally not more selectively constrained than other lncRNA loci. The existence of these newly originated <i>de novo</i> proteins is also not unexpected under neutral expectation, as they generally have a longer theoretical lifespan than their current age, because their GC-rich sequences enable stable ORFs with a lower chance of nonsense mutations. Interestingly, although the emergence and retention of these <i>de novo</i> genes are likely driven by neutral forces, a population genetics study of 67 human individuals and 82 macaque animals revealed signatures of purifying selection on these genes specifically in the human population, indicating that a proportion of these newly originated proteins are already functional in humans. We thus propose a mechanism for the creation of functional <i>de novo</i> proteins from ancestral lncRNAs during primate evolution, which may contribute to human-specific genetic novelties by taking advantage of existing genomic contexts.</p></div>

    <i>De novo</i> protein-coding genes originating from lncRNAs.

    <p>(<b>A</b>) Computational pipeline for <i>ab initio</i> identification and meta-analysis of <i>de novo</i> genes in the hominoid lineage. (<b>B</b>) Number of <i>de novo</i> genes on the phylogenetic tree, with the branch length proportional to the divergence time. (<b>C</b>) Stacked histogram showing the percentage of <i>de novo</i> gene orthologs that also show expression in chimpanzee or rhesus macaque. (<b>D</b>) Boxplot showing relative expression levels of the transcripts and their nearby regions corresponding to <i>de novo</i> genes (orthologs) in human (chimpanzee or macaque). The nearby regions are defined as upstream and downstream regions with equal length to the corresponding genes. For each region, the relative expression was calculated by dividing the expression level of this region by the sum of the expression levels of the genic region and the nearby regions. (<b>E</b>) Percentage of splicing junctions with supporting RNA-Seq reads in human, chimpanzee and rhesus macaque. (<b>F</b>) For each pair of tissues, <i>Spearman</i> correlation coefficients were computed separately, and the extent of tissue-specific differences in <i>de novo</i> gene expression is shown (based on the color scale). Dotted lines highlight parallel comparisons between two different species.</p>
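The normalization described for panel D can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `relative_expression` and the convention of returning zeros when nothing is expressed are assumptions made for this example.

```python
def relative_expression(upstream, genic, downstream):
    """Return each region's share of the summed expression.

    The three arguments are expression levels (in any consistent unit)
    for the upstream flank, the genic region and the downstream flank,
    where both flanks have the same length as the gene.
    """
    total = upstream + genic + downstream
    if total == 0:
        # Assumption for this sketch: report zeros when nothing is expressed.
        return (0.0, 0.0, 0.0)
    return (upstream / total, genic / total, downstream / total)


# Hypothetical expression values for one locus.
up, gene, down = relative_expression(1.0, 8.0, 1.0)
print(up, gene, down)
```

Because the three shares sum to 1, a locus whose transcript is sharply delimited from its surroundings shows a genic share close to 1 and flank shares close to 0.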

    Profiling of polymorphisms in human and rhesus macaque.

    <p><b>(A)</b> Comparison of human polymorphism sites profiled in this study with those in the 1000 Genomes Project. (<b>B</b>) The sequencing coverages of whole genome sequencing from one macaque animal and for the targeted re-sequencing of 82 macaque animals are summarized in green barplot and heatmaps inside the <i>Circos</i> map, respectively. The depths of the sequencing coverage are proportional to the color depth. Black rectangles outside the colored chromosome block represent the genomic locations of macaque orthologous regions of human <i>de novo</i> genes. The bottom panel illustrates the sequencing details of one region of interest. (<b>C</b>) Cumulative frequency of mean sequencing coverage on different genic regions of <i>de novo</i> genes is shown. Intergenic regions: 1-kb regions upstream and downstream of the gene. (<b>D, E</b>) Venn diagrams showing the distributions of macaque polymorphism sites identified by whole-genome sequencing and targeted re-sequencing, in terms of polymorphism sites (<b>D</b>) and genotypes (<b>E</b>).</p>

    Evidence of purifying selection on the human <i>de novo</i> genes.

    <p><b>(A)</b> Comparison of π in different genomic regions. The values were normalized to that of intronic regions. <b>(B)</b> The ratios of π for non-synonymous sites to synonymous sites for <i>de novo</i> genes or orthologs in rhesus macaque are summarized in boxplots. (<b>C-F</b>) Derived allele spectra for <i>de novo</i> genes (<b>C</b>), protein-coding genes (<b>E</b>) and lncRNAs (<b>F</b>) in human, as well as for the macaque regions orthologous to the human <i>de novo</i> genes (<b>D</b>) are shown. The standard deviations estimated by 1,000 bootstrap replicates are indicated by the error bars.</p>
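As a rough illustration of the quantities in this caption, the sketch below computes nucleotide diversity (π) per region from allele counts, normalizes it to the intronic value, and estimates a standard deviation from bootstrap replicates. It uses the standard per-site estimator 2p(1-p)·n/(n-1). The function names, the choice to resample sites, and all input values are assumptions made for this example; the caption does not state which units were resampled.

```python
import random
import statistics


def site_pi(allele_count, n_chrom):
    """Per-site nucleotide diversity from an allele count among n_chrom chromosomes."""
    p = allele_count / n_chrom
    return 2 * p * (1 - p) * n_chrom / (n_chrom - 1)


def region_pi(allele_counts, n_chrom, region_length):
    """Average per-site diversity over a region (monomorphic sites contribute 0)."""
    return sum(site_pi(c, n_chrom) for c in allele_counts) / region_length


def normalize_to_intron(pi_by_region):
    """Express each region's pi relative to the intronic pi."""
    intron = pi_by_region["intron"]
    return {name: value / intron for name, value in pi_by_region.items()}


def bootstrap_sd(values, stat, n_boot=1000, seed=0):
    """Standard deviation of `stat` over bootstrap resamples of `values`.

    Assumption: individual observations (e.g. sites) are the resampling unit.
    """
    rng = random.Random(seed)
    reps = [stat(rng.choices(values, k=len(values))) for _ in range(n_boot)]
    return statistics.stdev(reps)


# Hypothetical allele counts at polymorphic sites, 134 chromosomes (67 diploid individuals).
pi_values = {
    "CDS": region_pi([3, 10], 134, 1000),
    "intron": region_pi([5, 20, 40], 134, 1000),
}
print(normalize_to_intron(pi_values))
```

Normalizing to introns puts regions with different local mutation rates on a common scale, so a normalized value below 1 indicates less diversity than in the intronic background.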

    Emergence of human <i>de novo</i> proteins from GC-rich lncRNA precursors.

    <p><b>(A)</b> GC contents for randomly selected intergenic regions, all lncRNAs and lncRNA precursors in rhesus macaque are summarized in boxplots. (<b>B</b>) GC contents of different genomic regions are shown for <i>de novo</i> genes in human, as well as the orthologous non-coding regions in chimpanzee and rhesus macaque. For lncRNA precursors, the <i>pseudo</i>-CDS and <i>pseudo</i>-UTR regions were defined according to the orthologous relationship with the corresponding CDS and UTR regions of human <i>de novo</i> proteins. (<b>C</b>) GC contents for CDS regions of RefSeq proteins and <i>de novo</i> proteins in human are summarized in boxplots. A: all <i>de novo</i> genes, Y: younger <i>de novo</i> genes, O: older <i>de novo</i> genes. <b>(D)</b> Boxplot showing the distribution of fragile codon composition of <i>de novo</i> genes and RefSeq proteins in human. <b>(E)</b> Boxplot showing the distribution of half-life of <i>de novo</i> genes and RefSeq proteins in human. (<b>F</b>) Dot plot showing the survival probability of the <i>de novo</i> ORFs. The probability of 0.05 is marked by a red dashed line.</p>
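The two sequence properties summarized in panels A to D can be sketched as below. The caption does not define "fragile codon"; this example assumes the common usage, a sense codon that a single nucleotide substitution can turn into a stop codon. The function names and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def is_fragile(codon):
    """True if one substitution can convert this sense codon into a stop codon.

    Assumed definition; stop codons themselves are not counted as fragile.
    """
    codon = codon.upper()
    if codon in STOP_CODONS:
        return False
    for i in range(3):
        for base in "ACGT":
            if base != codon[i] and codon[:i] + base + codon[i + 1:] in STOP_CODONS:
                return True
    return False


def fragile_fraction(cds):
    """Fraction of complete codons in a coding sequence that are fragile."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    return sum(is_fragile(c) for c in codons) / len(codons)


# Hypothetical three-codon sequence: ATG, TGG, GCC.
print(gc_content("ATGTGGGCC"), fragile_fraction("ATGTGGGCC"))
```

All three stop codons are AT-rich, so GC-rich codons tend to lie more than one substitution away from a stop, which is consistent with the abstract's point that GC-rich ORFs have a lower chance of nonsense mutations.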

    <i>De novo</i> proteins originate from lncRNA precursors irrespective of their functional status at the RNA level.

    <p><b>(A)</b> Flow chart showing the computational pipeline of lncRNAome identification in rhesus macaque. (<b>B, C</b>) For lncRNAs identified in this study, the distribution of distances between the 5’ end of lncRNAs and the nearest annotated transcript start site (TSS) (<b>B</b>) or CpG island (<b>C</b>) is shown. The numbers of TSS and CpG islands within 1 kb of the transcripts are shown in the inserted histograms. Annotated genes and randomly selected intergenic sites are also shown as positive and negative controls, respectively. (<b>D</b>) On the basis of population genetics data in rhesus macaque, the distribution of π for synonymous sites (<b><i>Syn Sites</i></b>), non-synonymous sites (<b><i>Nonsyn Sites</i></b>), all lncRNAs, lncRNA precursors and non-coding genes (<b><i>Functional</i></b>) are summarized in boxplots. <i>NS</i>: not significant, **<i>p</i>-value &lt;0.01.</p>