1,100 research outputs found

    The genomic features that affect the lengths of 5’ untranslated regions in multicellular eukaryotes

    Background: The lengths of 5’UTRs of multicellular eukaryotes have been suggested to be subject to stochastic changes, with upstream start codons (uAUGs) as the major constraint suppressing 5’UTR elongation. However, this stochastic model cannot fully explain the variations in 5’UTR length. We hypothesize that selection pressure on a combination of genomic features is also important for 5’UTR evolution. Neglecting these features may have limited the explanatory power of the stochastic model. Furthermore, different selective constraints between vertebrates and invertebrates may lead to differences in the determinants of 5’UTR length, which have not been systematically analyzed. Methods: Here we use a multiple linear regression model to delineate the correlation between 5’UTR length and the combination of a series of genomic features (G+C content, observed-to-expected (OE) ratios of uAUGs, upstream stop codons (uSTOPs), methylation-related CG/UG dinucleotides, and mRNA-destabilizing UU/UA dinucleotides) in six vertebrates (human, mouse, rat, chicken, African clawed frog, and zebrafish) and four invertebrates (fruit fly, mosquito, sea squirt, and nematode). The relative contributions of each feature to the variation of 5’UTR length were also evaluated. Results: We found that 14% to 33% of the 5’UTR length variation can be explained by a linear combination of the analyzed genomic features. The most important genomic features are the OE ratios of uSTOPs and G+C content. The surprisingly large weightings of uSTOPs highlight the importance of selection on upstream open reading frames (which include both uAUGs and uSTOPs), rather than on uAUGs per se. Furthermore, G+C content is the most important determinant for most invertebrates, but for vertebrates its effect is second to that of uSTOPs. We also found that shorter 5’UTRs are affected more by the stochastic process, whereas longer 5’UTRs are affected more by selection pressure on genomic features. Conclusions: Our results suggest that upstream open reading frames may be the real target of selection, rather than uAUGs. We also show that the selective constraints on genomic features of 5’UTRs differ between vertebrates and invertebrates, and between longer and shorter 5’UTRs. A more comprehensive model that takes these findings into consideration is needed to better explain 5’UTR length evolution.
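As a rough illustration of the observed-to-expected (OE) ratio used as a feature in the abstract above, the sketch below counts a motif in a sequence and divides by the count expected from single-nucleotide frequencies. The abstract does not state the paper's formula, so the expectation model, the function name `oe_ratio`, and the toy sequence are assumptions made purely for illustration.

```python
from collections import Counter


def oe_ratio(seq: str, motif: str) -> float:
    """Observed-to-expected ratio of `motif` in `seq`.

    Assumed definition (not taken from the paper): the expected count is
    the number of windows times the product of the single-nucleotide
    frequencies of the motif's bases.
    """
    # Treat DNA and RNA spellings alike by mapping T to U.
    seq = seq.upper().replace("T", "U")
    motif = motif.upper().replace("T", "U")
    k = len(motif)
    windows = len(seq) - k + 1
    if windows <= 0:
        return float("nan")

    # Observed: overlapping occurrences of the motif.
    observed = sum(1 for i in range(windows) if seq[i:i + k] == motif)

    # Expected: windows * product of base frequencies in this sequence.
    freqs = Counter(seq)
    expected = float(windows)
    for base in motif:
        expected *= freqs[base] / len(seq)

    return observed / expected if expected > 0 else float("nan")


# Toy 5'UTR-like sequence (hypothetical, not real data).
ratio = oe_ratio("AUGAUG", "AUG")
```

A ratio above 1 means the motif occurs more often than base composition alone would predict, and a ratio below 1 means it is depleted. Ratios of this kind, computed per 5’UTR, could then serve as predictors in a multiple linear regression on 5’UTR length.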

    Quantitative principles of cis-translational control by general mRNA sequence features in eukaryotes.

    Background: General translational cis-elements are present in the mRNAs of all genes and affect the recruitment, assembly, and progress of preinitiation complexes and the ribosome under many physiological states. These elements include mRNA folding, upstream open reading frames, specific nucleotides flanking the initiating AUG codon, protein coding sequence length, and codon usage. The quantitative contributions of these sequence features and how and why they coordinate to control translation rates are not well understood. Results: Here, we show that these sequence features specify 42-81% of the variance in translation rates in Saccharomyces cerevisiae, Schizosaccharomyces pombe, Arabidopsis thaliana, Mus musculus, and Homo sapiens. We establish that control by RNA secondary structure is chiefly mediated by highly folded 25-60 nucleotide segments within mRNA 5' regions, that changes in tri-nucleotide frequencies between highly and poorly translated 5' regions are correlated between all species, and that control by distinct biochemical processes is extensively correlated, as is regulation by a single process acting in different parts of the same mRNA. Conclusions: Our work shows that general features control a much larger fraction of the variance in translation rates than previously realized. We provide a more detailed and accurate understanding of the aspects of RNA structure that direct translation in diverse eukaryotes. In addition, we note that the strongly correlated regulation between and within cis-control features will cause more even densities of translational complexes along each mRNA and therefore more efficient use of the translation machinery by the cell.

    In plants, expression breadth and expression level distinctly and non-linearly correlate with gene structure

    Background: Compactness of highly/broadly expressed genes in humans has been explained as selection for efficiency, regional mutation biases or genomic design. However, highly expressed genes in flowering plants were shown to be less compact than lowly expressed ones. On the other hand, opposite findings have also been documented: pollen-expressed Arabidopsis genes tend to contain shorter introns, and highly expressed moss genes are compact. This issue is important because it provides a chance to compare the selectionist and neutralist views of genome evolution. Furthermore, it also helps to understand the fates of introns from the angle of gene expression. Results: In this study, I used expression data covering more tissues and employed new analytical methods to reexamine the correlations between gene expression and gene structure for two flowering plants, Arabidopsis thaliana and Oryza sativa. It is shown that different aspects of expression pattern correlate with different parts of gene sequences in distinct ways. In detail, expression level is significantly negatively correlated with gene size, especially the size of non-coding regions, whereas expression breadth correlates with non-coding structural parameters positively and with coding region parameters negatively. Furthermore, the relationships between expression level and structural parameters seem to be non-linear, with the extremes of structural parameters possibly scaling as power-law or logarithmic functions of expression level. Conclusion: In plants, highly expressed genes are compact, especially in the non-coding regions. Broadly expressed genes tend to contain longer non-coding sequences, which may be necessary for complex regulation. In combination with previous studies about other plants and about animals, some common scenarios about the correlation between gene expression and gene structure begin to emerge. Based on the functional relationships between extreme values of structural characteristics and expression level, an effort was made to evaluate the relative effectiveness of the energy-cost hypothesis and the time-cost hypothesis. Reviewers: This article was reviewed by Dr. I. King Jordan, Dr. Liran Carmel (nominated by Dr. Eugene V. Koonin) and Dr. Fyodor A. Kondrashov.

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Transcriptomic Study of Gracilaria changii (Gracilariales, Rhodophyta) by Expressed Sequence Tags and cDNA Microarray Approach

    Gracilaria is one of the most extensively harvested seaweeds throughout the world due to its economic importance as an agarophyte for global agar production. In this study, Gracilaria changii, an indigenous seaweed species in Malaysia known for its high-quality agar, was chosen for transcription profiling. A total of 990 expressed sequence tags (ESTs) consisting of 766 tentative unique genes (TUGs) have been generated. These TUGs comprise 643 TUSs (tentative unique singletons) and 123 TUCs (tentative unique contigs). The putative identity of the TUGs was determined using the Basic Local Alignment Search Tool X (BLASTX) algorithm, and the TUGs were classified according to the functional groups in the Kyoto Encyclopedia of Genes and Genomes (KEGG). The results showed that 198 TUGs (25.85%) have significant matches to annotated proteins and 81 TUGs (10.57%) have significant matches to unknown proteins in the non-redundant protein database in GenBank, whereas the remaining 487 TUGs (63.58%) had non-significant matches or no matches to sequences from other organisms. Similar to animals and plants, G. changii showed a preference for purine residues at the -3 position and guanine at the +4 position of the translational initiation signal. On the other hand, the 3' untranslated region of G. changii was found to have a relatively less stringent polyadenylation process compared with plants and animals in producing mature mRNA. A cDNA microarray consisting of approximately 3,000 cDNA probes was constructed and used for the hybridization of cDNAs synthesized from G. changii samples cultured under conditions with and without light to understand the genetic acclimation of G. changii to light deprivation. The results suggested that genes related to photosynthesis, oxidative stress and sulfate metabolism were down-regulated during light deprivation. The cDNA microarray data were further verified by using real-time PCR. 
The results of the real-time PCR analysis of four genes encoding light-harvesting complex I polypeptide (DV962275.1), low molecular mass early light-inducible protein (DV964113.1), 14-3-3 protein (DV965610.1) and sonic hedgehog protein precursor (DV967367.1) supported the expression patterns demonstrated using cDNA microarray
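The counts and percentages quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives figures already stated in the abstract; the variable names and category labels are hypothetical.

```python
# Counts quoted in the abstract.
singletons = 643   # tentative unique singletons (TUSs)
contigs = 123      # tentative unique contigs (TUCs)
total_tugs = singletons + contigs

# BLASTX match categories; the labels are illustrative only.
categories = {
    "annotated_protein": 198,
    "unknown_protein": 81,
    "no_significant_match": 487,
}

# The three categories should account for every TUG.
categories_cover_all = sum(categories.values()) == total_tugs

# Percentage of TUGs in each category, rounded to two decimals.
percentages = {
    name: round(100 * count / total_tugs, 2)
    for name, count in categories.items()
}
```

Under these figures the singletons and contigs sum to the reported 766 TUGs, and the three match categories partition that total.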

    Genome-wide analysis of alternative splicing in


    Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth

    Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary), which has convergently evolved a few times in Eukarya only: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. Thus, I investigated the correlation between genome and organismal complexities across 461 eukaryotes under a phylogenetically controlled framework. To that end, I introduce the first formal definitions and criteria to distinguish ‘unicellularity’, ‘simple’ (SM) and ‘complex’ multicellularity. Rather than using the limited available estimations of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycle and body plan development from literature. Then, I investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats and intergenic regions) describing the coding and ncDNA complexities of the 461 genomes. To that end, I developed ‘GenomeContent’, a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R-scripts coupled to parallel computing were created to calculate >260,000 phylogenetic controlled pairwise correlations. As previously reported, both repetitive and non-repetitive DNA are found to be scaling strongly and positively with genome size across most eukaryotic lineages. Contrasting previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not or only very weakly associated with changes in genome size. 
Our evolutionary correlations are robust to: different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome size estimates, and randomly reduced datasets. Then, I investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic Principal Component Analyses. Our results endorse a genetic distinction between SM and CM in Archaeplastida and Metazoa, but not so clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron-richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators rather than of novel protein regulators can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel ‘ceRNA-motif pipeline’ for the prediction of “competing endogenous” ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNAs motifs: MIM166, MIM171 and MIM159/319, which were found to be conserved across land plants and be potentially involved in diverse developmental processes and stress responses. Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.

    Contents: Cover page Abstract Acknowledgements Index 1. The structure of this thesis 1.1. Structure of this PhD dissertation 1.2. Publications of this PhD dissertation 1.3. Computational infrastructure and resources 1.4. Disclosure of financial support and information use 1.5. Acknowledgements 1.6. Author contributions and use of impersonal and personal pronouns 2. Biological background 2.1. The complexity of the eukaryotic genome 2.2. The problem of counting and defining “genes” in eukaryotes 2.3. The “function” concept for genes and “dark matter” 2.4. Increases of organismal complexity on Earth through multicellularity 2.5. Multicellularity is a “fitness transition” in individuality 2.6. The complexity of cell differentiation in multicellularity 3. Technical background 3.1. The Phylogenetic Comparative Method (PCM) 3.2. RNA secondary structure prediction 3.3. Some standards for genome and gene annotation 4. What is in a eukaryotic genome? GenomeContent provides a good answer 4.1. Background 4.2. Motivation: an interoperable tool for data retrieval of gene annotations 4.3. Methods 4.4. Results 4.5. Discussion 5. The evolutionary correlation between genome size and ncDNA 5.1. Background 5.2. Motivation: estimating the relationship between genome size and ncDNA 5.3. Methods 5.4. Results 5.5. Discussion 6. The relationship between non-coding DNA and Complex Multicellularity 6.1. Background 6.2. Motivation: How to define and measure complex multicellularity across eukaryotes? 6.3. Methods 6.4. Results 6.5. Discussion 7. The ceRNA motif pipeline: regulation of microRNAs by target mimics 7.1. Background 7.2. A revisited protocol for the computational analysis of Target Mimics 7.3. Motivation: a novel pipeline for ceRNA motif discovery 7.4. Methods 7.5. Results 7.6. Discussion 8. Conclusions and outlook 8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses 8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya 8.3. “Complex multicellularity” is a major evolutionary transition 8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth 9. Supplementary Data Bibliography Curriculum Scientiae Selbständigkeitserklärung (declaration of authorship)

    Investigation of the length distributions of coding and noncoding sequences in relation to gene architecture, function, and expression

    The last 20 years have seen the birth of bioinformatics, defined as the combination of mathematics, biology, and computational approaches. This discipline has led to the era of ontology, extensive databases (including sequences, structures, expression profiles, and genomes), and database cross-referencing (Ouzounis, 2012). Before this discipline, scientists referenced atlas books, such as Margaret Dayhoff’s protein sequence collection (Strasser, 2010), which required long hours of letter counting. Through the development of sequencing technology over the past forty years, a tremendous amount of genomic sequencing data has been collected. As the volume of such data surges, so do the challenges of data organisation, accessibility and interpretation, with interpretation being the most challenging (Ouzounis, 2012)

    A Study of Selection on Microsatellites in the Helianthus annuus Transcriptome

    The ability of populations to continually respond to directional selection even after many generations, instead of reaching response plateaus, suggests the presence of mechanisms for rapidly generating novel adaptive variation within organismal genomes. The contributions of cis regulation are now being widely studied. This study details the contributions of one such mechanism capable of generating adaptive genetic variation: transcribed microsatellite mutation. Microsatellites are abundant in eukaryotic genomes and exhibit one of the highest known mutation rates, and their mutations involve indels that are reversible. These features make them excellent candidates for generating variation in populations. This study explores the functional roles of transcribed microsatellites in Helianthus annuus (common sunflower). More specifically, I explored the role of microsatellites as agents of rapid change that act as “tuning knobs” of phenotypic variation by influencing gene expression in a stepwise manner through expansions and contractions of the microsatellite tract. A bioinformatic study suggests that selection has favored expansion and maintenance of transcriptomic microsatellites. This inference is based on the non-random distribution of microsatellites, the prevalence of motifs associated with gene regulation in untranslated regions, and the enrichment of microsatellites in Gene Ontologies representing plant response to stress and stimulus. A population genetics study provides support for selection on these transcribed microsatellites when compared to anonymous microsatellites that were assumed to evolve neutrally. The natural populations utilized in this study show greater similarity in allele frequencies, mean length, and variance in lengths at the transcribed microsatellites relative to that observed at anonymous microsatellite loci. This finding is indicative of balancing selection, and provides evidence that allele lengths are under selection. 
This finding provides support for the tuning knob hypothesis. The findings of a functional genomic study with regard to the tuning knob hypothesis are ambiguous. No correlation between allele lengths and gene expression was detected at any of three loci investigated. However, the loci utilized exhibited narrow ranges in length. The tuning knob hypothesis implies that similar allele lengths are likely to exhibit similar gene expression levels. Hence, variation in the populations studied may be tracking the optimal gene expression levels