Evolutionary trajectories of new duplicated and putative de novo genes
The formation of new genes during evolution is an important motor of functional innovation, but the rate at which new genes originate and the likelihood that they persist over longer evolutionary periods are still poorly understood questions. Two important mechanisms by which new genes arise are gene duplication and de novo formation from a previously noncoding sequence. Does the mechanism of formation influence the evolutionary trajectories of the genes? Proteins arisen by gene duplication retain the sequence and structural properties of the parental protein, and thus they may be relatively stable. In contrast, de novo originated proteins are often species specific and thought to be more evolutionarily labile. Despite these differences, here we show that both types of genes share a number of similarities, including low sequence constraints in their initial evolutionary phases, high turnover rates at the species level, and comparable persistence rates in deeper branches, in both yeast and flies. In addition, we show that putative de novo proteins have an excess of substitutions between charged amino acids compared with the neutral expectation, which is reflected in the rapid loss of their initial highly basic character. The study supports high evolutionary dynamics of different kinds of new genes at the species level, in sharp contrast with the stability observed at later stages. We acknowledge funding from Ministerio de Ciencia e Innovación Agencia Estatal de Investigación grant PGC2018-094091-B-I00 (cofunded by Fondo Europeo de Desarrollo Regional), as well as grants PID2021-122726NB-I00 and PID2021-122830OB-C43 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF: A way of making Europe”, by the “European Union”. We also acknowledge funding from Generalitat de Catalunya, grant 2021SGR00042. The work was also funded by the European Union (ERC, NovoGenePop, project number 101052538).
Zinc-finger domains in metazoans: evolution gone wild
We acknowledge funding from the Ministerio de Economía e Innovación (Spanish Government) co-funded by FEDER (BFU2015-65235-P), and from the Agència de Gestió d'Ajuts Universitaris i de Recerca Generalitat de Catalunya (AGAUR) (2014SGR1121)
Conserved regions in long non-coding RNAs contain abundant translation and protein–RNA interaction signatures
The mammalian transcriptome includes thousands of transcripts that do not correspond to annotated protein-coding genes and that are known as long non-coding RNAs (lncRNAs). A handful of lncRNAs have well-characterized regulatory functions but the biological significance of the majority of them is not well understood. LncRNAs that are conserved between mice and humans are likely to be enriched in functional sequences. Here, we investigate the presence of different types of ribosome profiling signatures in lncRNAs and how they relate to sequence conservation. We find that lncRNA-conserved regions contain three times more ORFs with translation evidence than non-conserved ones, and identify nine cases that display significant sequence constraints at the amino acid sequence level. The study also reveals that conserved regions in intergenic lncRNAs are significantly enriched in protein–RNA interaction signatures when compared to non-conserved ones; this includes sites in well-characterized lncRNAs, such as Cyrano, Malat1, Neat1 and Meg3, as well as in tens of lncRNAs of unknown function. This work illustrates how the analysis of ribosome profiling data coupled with evolutionary analysis provides new opportunities to explore the lncRNA functional landscape
Uncovering adaptive evolution in the human lineage
Background: The recent increase in human polymorphism data, together with the availability of genome sequences from several primate species, provides an unprecedented opportunity to investigate how natural selection has shaped human evolution. Results: We compared human branch-specific substitutions with variation data in the current human population to measure the impact of adaptive evolution on human protein coding genes. The use of single nucleotide polymorphisms (SNPs) with high derived allele frequencies (DAFs) minimized the influence of segregating slightly deleterious mutations and improved the estimation of the number of adaptive sites. Using DAF ≥ 60% we showed that the proportion of adaptive substitutions is 0.2% in the complete gene set. However, the percentage rose to 40% when we focused on genes that are specifically accelerated in the human branch with respect to the chimpanzee branch, or on genes that show signatures of adaptive selection at the codon level by the maximum likelihood based branch-site test. In general, neural genes are enriched in positive selection signatures. Genes with multiple lines of evidence of positive selection include taxilin beta, which is involved in motor nerve regeneration, and syntabulin, which is required for the formation of new presynaptic boutons. Conclusions: We combined several methods to detect adaptive evolution in human coding sequences at a genome-wide level. The use of variation data, in addition to sequence divergence information, uncovered previously undetected positive selection signatures in neural genes. This work was financially supported by the Ministerio de Economía y Competitividad from the Spanish Government (Plan Nacional project BFU2012-36820), and Institució Catalana de Recerca i Estudis Avançats (ICREA) from Generalitat de Catalunya
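The proportion of adaptive substitutions referred to in the abstract above is often estimated with a McDonald-Kreitman-type calculation. The following is only a minimal sketch under that assumption: the abstract does not say which estimator was used, the formula shown is the commonly cited alpha = 1 - (Ds * Pn) / (Dn * Ps), and the `Snp` record, function names and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical SNP record: derived allele frequency and coding effect."""
    daf: float
    nonsynonymous: bool


def count_polymorphisms(snps, min_daf=0.6):
    """Count nonsynonymous (Pn) and synonymous (Ps) SNPs with DAF >= min_daf.

    The 0.6 default mirrors the DAF >= 60% cutoff quoted in the abstract.
    """
    kept = [s for s in snps if s.daf >= min_daf]
    pn = sum(1 for s in kept if s.nonsynonymous)
    ps = sum(1 for s in kept if not s.nonsynonymous)
    return pn, ps


def adaptive_proportion(dn, ds, pn, ps):
    """McDonald-Kreitman-type estimate: alpha = 1 - (Ds * Pn) / (Dn * Ps).

    dn, ds: nonsynonymous and synonymous substitutions (divergence).
    pn, ps: nonsynonymous and synonymous polymorphisms.
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero")
    return 1.0 - (ds * pn) / (dn * ps)


# Illustrative, made-up counts (not taken from the study).
alpha = adaptive_proportion(dn=100, ds=200, pn=30, ps=100)
```

Filtering on high DAF before counting removes low-frequency variants, which is where slightly deleterious mutations tend to concentrate.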
Comparative analysis of amino acid repeats in rodents and humans
Amino acid tandem repeats, also called homopolymeric tracts, are extremely abundant in eukaryotic proteins. To gain insight into the genome-wide evolution of these regions in mammals, we analyzed the repeat content in a large data set of rat-mouse-human orthologs. Our results show that human proteins contain more amino acid repeats than rodent proteins and that trinucleotide repeats are also more abundant in human coding sequences. Using the human species as an outgroup, we were able to address differences in repeat loss and repeat gain in the rat and mouse lineages. In this data set, mouse proteins contain substantially more repeats than rat proteins, which can be at least partly attributed to a higher repeat loss in the rat lineage. The data are consistent with a role for trinucleotide slippage in the generation of novel amino acid repeats. We confirm the previously observed functional bias of proteins with repeats, with overrepresentation of transcription factors and DNA-binding proteins. We show that genes encoding amino acid repeats tend to have an unusually high GC content, and that differences in coding GC content among orthologs are directly related to the presence/absence of repeats. We propose that the different GC content isochore structure in rodents and humans may result in an increased amino acid repeat prevalence in the human lineage
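To make the notion of a homopolymeric tract concrete, here is a minimal sketch of how such repeats could be located in a protein sequence. The minimum run length of five residues and the function name are assumptions for illustration; the abstract does not state the detection criteria used in the study.

```python
import re


def find_homopolymeric_tracts(sequence, min_len=5):
    """Return (residue, start, length) for each run of one repeated residue.

    A run is reported only if it has at least min_len identical residues
    (an assumed threshold). Start positions are 0-based.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # (.) captures one residue; \1{n,} requires n or more further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [
        (m.group(1), m.start(), m.end() - m.start())
        for m in pattern.finditer(sequence)
    ]


# Toy sequence with a six-residue Q tract and a five-residue A tract.
tracts = find_homopolymeric_tracts("MAQQQQQQLPAAAAAK")
```

Comparing the tracts found at aligned positions of orthologous proteins would then indicate which repeats are shared and which are lineage specific.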
pTINCR microprotein promotes epithelial differentiation and suppresses tumor growth through CDC42 SUMOylation and activation
The human transcriptome contains thousands of small open reading frames (sORFs) that encode microproteins whose functions remain largely unexplored. Here, we show that TINCR lncRNA encodes pTINCR, an evolutionarily conserved ubiquitin-like protein (UBL) expressed in many epithelia and upregulated upon differentiation and under cellular stress. By gain- and loss-of-function studies, we demonstrate that pTINCR is a key inducer of epithelial differentiation in vitro and in vivo. Interestingly, low expression of TINCR associates with worse prognosis in several epithelial cancers, and pTINCR overexpression reduces malignancy in patient-derived xenografts. At the molecular level, pTINCR binds to SUMO through its SUMO interacting motif (SIM) and to CDC42, a Rho-GTPase critical for actin cytoskeleton remodeling and epithelial differentiation. Moreover, pTINCR increases CDC42 SUMOylation and promotes its activation, triggering a pro-differentiation cascade. Our findings suggest that the microproteome is a source of new regulators of cell identity relevant for cancer. The authors thank VHIO Proteomics, Molecular Oncology and Genomics Core Facilities for technical assistance. We are grateful to Manuel Serrano for providing several reagents, advice and critical discussion on the manuscript. We also thank Alonso García and Raquel Pérez for their help in processing and analyzing digital images, Gemma Serra and Sandra Peiró for their assistance with subcellular fractionation and immunoprecipitation experiments, Sara Arce and Joaquín Mateo for providing several reagents during the development of critical experiments of this manuscript, and Juan Angel Recio for his help with the cSCC cohort. We are immensely grateful to all the members of the Abad lab for generating the know-how for the identification of novel sORFs, for the critical reading of the manuscript and in general for their constant support to this project.
Work in the Abad lab is supported by VHIO, Fero Foundation, La Caixa Foundation, Asociación Española Contra el Cancer (AECC), La Mutua Foundation and by grants from the Spanish Ministry of Science and Innovation (SAF2015-69413-R; RTI2018-102046-B-I00). M.A. was recipient of a Ramón y Cajal contract from the Spanish Ministry of Science and Innovation (RYC-2013-14747). O.B. is recipient of a FPI-AGAUR fellowship from Generalitat de Catalunya. We also acknowledge funding from grant PGC2018-094091-B-I00 from the Spanish Government
Positional bias of general and tissue-specific regulatory motifs in mouse gene promoters
Background: The arrangement of regulatory motifs in gene promoters, or promoter architecture, is the result of mutation and selection processes that have operated over many millions of years. In mammals, tissue-specific transcriptional regulation is related to the presence of specific protein-interacting DNA motifs in gene promoters. However, little is known about the relative location and spacing of these motifs. To fill this gap, we have performed a systematic search for motifs that show significant bias at specific promoter locations in a large collection of housekeeping and tissue-specific genes. Results: We observe that promoters driving housekeeping gene expression are enriched in particular motifs with strong positional bias, such as YY1, which are of little relevance in promoters driving tissue-specific expression. We also identify a large number of motifs that show positional bias in genes expressed in a highly tissue-specific manner. They include well-known tissue-specific motifs, such as HNF1 and HNF4 motifs in liver, kidney and small intestine, or RFX motifs in testis, as well as many potentially novel regulatory motifs. Based on this analysis, we provide predictions for 559 tissue-specific motifs in mouse gene promoters. Conclusion: The study shows that motif positional bias is an important feature of mammalian proximal promoters and that it affects both general and tissue-specific motifs. Motif positional constraints define very distinct promoter architectures depending on breadth of expression and type of tissue. We received financial support from Fundación Banco Bilbao Vizcaya Argentaria (FBBVA), Plan Nacional de I+D Ministerio de Educación y Ciencia (BFU2006-07120), Instituto Nacional de Bioinformática (INB), European Commission Infobiomed NoE and Fundació ICREA
Emergence of novel domains in proteins
Proteins are composed of a combination of discrete, well-defined, sequence domains, associated with specific functions that have arisen at different times during evolutionary history. The emergence of novel domains is related to protein functional diversification and adaptation. But currently little is known about how novel domains arise and how they subsequently evolve. To gain insights into the impact of recently emerged domains in protein evolution we have identified all human young protein domains that have emerged in approximately the past 550 million years. We have classified them into vertebrate-specific and mammalian-specific groups, and compared them to older domains. We have found 426 different annotated young domains, totalling 995 domain occurrences, which represent about 12.3% of all human domains. We have observed that 61.3% of them arose in newly formed genes, while the remaining 38.7% are found combined with older domains, and have very likely emerged in the context of a previously existing protein. Young domains are preferentially located at the N-terminus of the protein, indicating that, at least in vertebrates, novel functional sequences often emerge there. Furthermore, young domains show significantly higher non-synonymous to synonymous substitution rates than older domains using human and mouse orthologous sequence comparisons. This is also true when we compare young and old domains located in the same protein, suggesting that recently arisen domains tend to evolve in a less constrained manner than older domains. We conclude that proteins tend to gain domains over time, becoming progressively longer. 
We show that many proteins are made of domains of different age, and that the fastest evolving parts correspond to the domains that have been acquired more recently. We received financial support from Ministerio de Educación (FPU to M.T.-R.), Ministerio de Innovación y Tecnología grant BIO2009-08160, Ministerio de Economía y Competitividad grant BFU2012-36820, and Institució Catalana de Recerca i Estudis Avançats (ICREA contract to M.M.A.)
Dissecting the role of low-complexity regions in the evolution of vertebrate proteins
Low-complexity regions (LCRs) in proteins are tracts that are highly enriched in one or a few amino acids. Given their high abundance, and their capacity to expand in relatively short periods of time through replication slippage, they can greatly contribute to increase protein sequence space and generate novel protein functions. However, little is known about the global impact of LCRs on protein evolution. We have traced back the evolutionary history of 2,802 LCRs from a large set of homologous protein families from H. sapiens, M. musculus, G. gallus, D. rerio and C. intestinalis. Transcription factors and other regulatory functions are overrepresented in proteins containing LCRs. We have found that the gain of novel LCRs is frequently associated with repeat expansion whereas the loss of LCRs is more often due to accumulation of amino acid substitutions as opposed to deletions. This dichotomy results in net protein sequence gain over time. We have detected a significant increase in the rate of accumulation of novel LCRs in the ancestral Amniota and mammalian branches, and a reduction in the chicken branch. Alanine and/or glycine-rich LCRs are overrepresented in recently emerged LCR sets from all branches, suggesting that their expansion is better tolerated than for other LCR types. LCRs enriched in positively charged amino acids show the contrary pattern, indicating an important effect of purifying selection in their maintenance. We have performed the first large-scale study on the evolutionary dynamics of LCRs in protein families. The study has shown that the composition of an LCR is an important determinant of its evolutionary pattern. We received financial support from Fundación Javier Lamas (PhD fellowship to N.R-T.), Ministerio de Innovación y Tecnología at the Spanish Government (BIO2009-08160) and Institució Catalana de Recerca i Estudis Avançats (ICREA to M.M.A.)