    A framework for automated enrichment of functionally significant inverted repeats in whole genomes

    Background: RNA transcripts from genomic sequences showing dyad symmetry typically adopt hairpin-like, cloverleaf, or similar structures that act as recognition sites for proteins. Such structures are often the precursors of non-coding RNA (ncRNA) sequences such as microRNA (miRNA) and small interfering RNA (siRNA), which have recently been credited with more functional significance than in the past. Genomic DNA contains hundreds of thousands of such inverted repeats (IRs) with varying degrees of symmetry. By collecting statistically significant information from a known set of ncRNA, however, we can sort these IRs into those that are likely to be functional. Results: A novel method was developed to scan genomic DNA for partially symmetric inverted repeats, and the resulting set was further refined to match miRNA precursors (pre-miRNA) with respect to their density of symmetry, statistical probability of the symmetry, length of stems in the predicted hairpin secondary structure, and GC content of the stems. This method was applied to the Arabidopsis thaliana genome and validated against the set of 190 known Arabidopsis pre-miRNA in the miRBase database. A preliminary scan for IRs identified 186 of the known pre-miRNA but produced 714,700 pre-miRNA candidates. This large set was refined to 483,908 candidates with 183 pre-miRNA identified, and further still to 165,371 candidates with 171 pre-miRNA identified (i.e. with 90% of the known pre-miRNA retained). Conclusions: At 165,371, the set of candidates for potentially functional miRNA is still too large to warrant wet-lab analyses, such as northern blotting, on all of them. Hence additional filters are needed to further reduce the number of candidates while still retaining most of the known miRNA. These include detection of promoters and terminators, homology analyses, location of the candidate relative to coding regions, and better secondary structure prediction algorithms. The software is designed so that such additional filters can be added with minimal experience in Perl.
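    The scan-then-filter idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' Perl software: it only detects perfectly symmetric stems (the paper handles partial symmetry), and the function names, the stem and loop lengths, and the GC thresholds are all assumptions chosen for illustration.

```python
# Hypothetical sketch: find perfect inverted repeats (stem, loop, reverse-complement stem)
# and filter the hits by the GC content of the stem. All parameters are illustrative.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def find_inverted_repeats(genome, stem_len=6, min_loop=3, max_loop=20):
    """Return (start, stem, loop_len) for every perfect inverted repeat."""
    genome = genome.upper()
    n = len(genome)
    hits = []
    for i in range(n - 2 * stem_len - min_loop + 1):
        left = genome[i:i + stem_len]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop  # start of the right-hand stem
            if j + stem_len > n:
                break
            if genome[j:j + stem_len] == target:
                hits.append((i, left, loop))
    return hits


def filter_by_gc(hits, min_gc=0.3, max_gc=0.7):
    """Keep hits whose stem GC content lies within [min_gc, max_gc]."""
    return [h for h in hits if min_gc <= gc_content(h[1]) <= max_gc]


def retained_fraction(found, known):
    """Share of known precursors still present after filtering."""
    return found / known


toy_genome = "AAGATTACTTTTGTAATCAA"  # GATTAC + 4-base loop + GTAATC
candidates = filter_by_gc(find_inverted_repeats(toy_genome))
```

    Applied to the figures in the abstract, `retained_fraction(171, 190)` gives the quoted 90% retention. Further filters (stem length, density of symmetry, and so on) would follow the same pattern as `filter_by_gc`.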

    Computational methods for the discovery and analysis of genes and other functional DNA sequences

    The need for automating genome analysis is a result of the tremendous amount of genomic data. As of today, a high-throughput DNA sequencing machine can run millions of sequencing reactions in parallel, and it is becoming faster and cheaper to sequence the entire genome of an organism. Public databases containing genomic data are growing exponentially, hence the rise in demand for intuitive automated methods of DNA analysis and subsequent gene identification. However, the complexity of gene organization makes automation a challenging task, and smart algorithm design and parallelization are necessary to perform accurate analyses in reasonable amounts of time. This work describes two such automated methods for the identification of novel genes within given DNA sequences. The first method uses negative selection patterns as an evolutionary rationale for the identification of additional members of a gene family. As input it requires a known protein-coding gene in that family. The second method is a massively parallel data mining algorithm that searches a whole genome for inverted repeats (palindromic sequences) and identifies potential precursors of non-coding RNA genes. Both methods were validated successfully on the fully sequenced and well-studied plant species Arabidopsis thaliana. (Abstract, page iv)

    Genome bioinformatics of tomato and potato

    In the past two decades genome sequencing has developed from a laborious and costly technology employed by large international consortia to a widely used, automated and affordable tool used worldwide by many individual research groups. Genome sequences of many food animals and crop plants have been deciphered and are being exploited for fundamental research and applied to improve their breeding programs. The developments in sequencing technologies have also impacted the associated bioinformatics strategies and tools, both those that are required for data processing, management, and quality control, and those used for interpretation of the data. This thesis focuses on the application of genome sequencing, assembly and annotation to two members of the Solanaceae family, tomato and potato. Potato is the economically most important species within the Solanaceae, and its tubers contribute to dietary intake of starch, protein, antioxidants, and vitamins. Tomato fruits are the second most consumed vegetable after potato, and are a globally important dietary source of lycopene, beta-carotene, vitamin C, and fiber. The chapters in this thesis document the generation, exploitation and interpretation of genomic sequence resources for these two species and shed light on the contents, structure and evolution of their genomes. Chapter 1 introduces the concepts of genome sequencing, assembly and annotation, and explains the novel genome sequencing technologies that have been developed in the past decade. These so-called Next Generation Sequencing platforms display considerable variation in chemistry and workflow, and as a consequence the throughput and data quality differ by orders of magnitude between the platforms. The currently available sequencing platforms produce a vast variety of read lengths and facilitate the generation of paired sequences with an approximately fixed distance between them. 
The choice of sequencing chemistry and platform combined with the type of sequencing template demands specifically adapted bioinformatics for data processing and interpretation. Irrespective of the sequencing and assembly strategy that is chosen, the resulting genome sequence, often represented by a collection of long linear strings of nucleotides, is of limited interest by itself. Interpretation of the genome can only be achieved through sequence annotation, that is, identification and classification of all functional elements in a genome sequence. Once these elements have been annotated, sequence alignments between multiple genomes of related accessions or species can be utilized to reveal the genetic variation on both the nucleotide and the structural level that underlies the difference between these species or accessions. Chapter 2 describes BlastIf, a novel software tool that exploits sequence similarity searches with BLAST to provide a straightforward annotation of long nucleotide sequences. Generally, two problems are associated with the alignment of a long nucleotide sequence to a database of short gene or protein sequences: (i) the large number of similar hits that can be generated due to database redundancy; and (ii) the relationships implied between aligned segments within a hit that in fact correspond to distinct elements on the sequence such as genes. BlastIf generates a comprehensible BLAST output for long nucleotide sequences by reducing the number of similar hits while revealing most of the variation present between hits. It is a valuable tool for molecular biologists who wish to get a quick overview of the genetic elements present in a newly sequenced segment of DNA, prior to more elaborate efforts of gene structure prediction and annotation. In Chapter 3 a first genome-wide comparison between the emerging genomic sequence resources of tomato and potato is presented. 
Large collections of BAC end sequences from both species were annotated through repeat searches, transcript alignments and protein domain identification. In-depth comparisons of the annotated sequences revealed remarkable differences in both gene and repeat content between these closely related genomes. The tomato genome was found to be more repetitive than the potato genome, and substantial differences in the distribution of Gypsy and Copia retrotransposable elements as well as microsatellites were observed between the two genomes. A higher gene content was identified in the potato sequences, and in particular several large gene families including cytochrome P450 mono-oxygenases and serine-threonine protein kinases were significantly overrepresented in potato compared to tomato. Moreover, the cytochrome P450 gene family was found to be expanded in both tomato and potato when compared to Arabidopsis thaliana, suggesting an expanded network of secondary metabolic pathways in the Solanaceae. Together these findings present a first glimpse into the evolution of Solanaceous genomes, both within the family and relative to other plant species. Chapter 4 explores the physical and genetic organization of tomato chromosome 6 through integration of BAC sequence analysis, High Information Content Fingerprinting, genetic analysis, and BAC-FISH mapping data. A collection of BACs spanning substantial parts of the short and long arm euchromatin and several dispersed regions of the pericentromeric heterochromatin were sequenced and assembled into several tiling paths spanning approximately 11 Mb. Overall, the cytogenetic order of BACs was in agreement with the order of BACs anchored to the Tomato EXPEN 2000 genetic map, although a few striking discrepancies were observed. The integration of BAC-FISH, sequence and genetic mapping data furthermore provided a clear picture of the borders between eu- and heterochromatin on chromosome 6. 
Annotation of the BAC sequences revealed that, although the majority of protein-coding genes were located in the euchromatin, the highly repetitive pericentromeric heterochromatin displayed an unexpectedly high gene content. Moreover, the short arm euchromatin was relatively rich in repeats, but the ratio of Gypsy and Copia retrotransposons across the different domains of the chromosome clearly distinguished euchromatin from heterochromatin. The ongoing whole-genome sequencing effort will reveal if these properties are unique for tomato chromosome 6, or a more general property of the tomato genome. Chapter 5 presents the potato genome, the first genome sequence of an Asterid. To overcome the problems associated with genome assembly due to the high level of heterozygosity that is observed in commercial tetraploid potato varieties, a homozygous doubled-monoploid potato clone was exploited to sequence and assemble 86% of the 844 Mb genome. This potato reference genome sequence was complemented with re-sequencing of a heterozygous diploid clone, revealing the form and extent of sequence polymorphism both between different genotypes and within a single heterozygous genotype. Gene presence/absence variants and other potentially deleterious mutations were found to occur frequently in potato and are a likely cause of inbreeding depression. Annotation of the genome was supported by deep transcriptome sequencing of both the doubled-monoploid and the heterozygous potato, resulting in the prediction of more than 39,000 protein-coding genes. Transcriptome analysis provided evidence for the contribution of gene family expansion, tissue-specific expression, and recruitment of genes to new pathways to the evolution of tuber development. The sequence of the potato genome has provided new insights into Eudicot genome evolution and has provided a solid basis for the elucidation of the evolution of tuberisation. 
Many traits of interest to plant breeders are quantitative in nature and the potato sequence will simplify both their characterization and deployment to generate novel cultivars. The outstanding challenges in plant genome sequencing are addressed in Chapter 6. The high concentration of repetitive elements and the heterozygosity and polyploidy of many interesting crop plant species currently pose a barrier for the efficient reconstruction of their genome sequences. Nonetheless, the completion of a large number of new genome sequences in recent years and the ongoing advances in sequencing technology provide many exciting opportunities for plant breeding and genome research. Current sequencing platforms are being continuously updated and improved, and novel technologies are being developed and implemented in third-generation sequencing platforms that sequence individual molecules without need for amplification. While these technologies create exciting opportunities for new sequencing applications, they also require robust software tools to process the data produced through them efficiently. The ever-increasing amount of available genome sequences creates the need for an intuitive platform for the automated and reproducible interrogation of these data in order to formulate new biologically relevant questions on datasets spanning hundreds or thousands of genome sequences.

    Structural variation affecting genetic adaptation to the environment in salmonids

    Structural variations (SVs), e.g. deletions, insertions, inversions and duplications of sequences, are a major source of genomic variation affecting more base pairs in the genome than single nucleotide polymorphisms (SNPs). Despite their increasingly recognised importance in adaptive evolution and species diversification, SVs are vastly understudied in most species. Long-read sequencing, together with recently developed bioinformatic tools, has provided step-change improvements in the precision and recall of SV detection and allows us to increase the number of detected SVs manyfold across the species range. In addition, long reads represent a major shift in our ability to build continuous genome assemblies as fundamental resources for most genome-wide studies. The work in this thesis utilises long-read data to generate multiple genome sequences for the two salmonid species Atlantic salmon (Salmo salar) and lake whitefish (Coregonus clupeaformis). We present the first pan-genome for Atlantic salmon, comprising 11 long-read-based assemblies across the species range. Among these, the highest quality genome has 2.55 Gbp assembled into chromosome sequences, 259 Mbp more sequence than in the previous Atlantic salmon reference genome. The genome has a highly improved continuity with contig N50 increasing from 58 kbp to 28.06 Mbp (484-fold). The detection of SVs in these 11 individuals revealed 1,061,452 SVs, with an average of ~77.4 Mbp of sequence differing per sample. The Atlantic salmon has adapted to different river environments across a large geographical distribution. To investigate genomic variation underlying these adaptations, we associated SVs and environmental data in a dataset of 366 short-read samples genotyped using genome graph analyses. These analyses highlighted multiple SVs contributing to environmental adaptations, including an 18 kbp deletion encompassing a polymorphic segmental duplication of three genes associated with annual precipitation. 
Next, we use the Atlantic salmon pan-genome to study the emergence of supergenes. Because supergenes can be maintained over millions of years by balancing selection and typically exhibit strong recombination suppression, their underlying functional variants and how they are formed are largely unknown. Inversions are a type of rearrangement commonly associated with supergenes, and by directly comparing multiple highly continuous genome assemblies we were able to detect a number of large inversions in Atlantic salmon. A 3 Mb inversion, estimated to be ~15,000 years old and segregating in North American populations, displayed supergene signatures with adaptive variation captured within the standard arrangement of the inversion, as well as other adaptive variation accumulating after the inversion occurred. Characterization of other inversions with matched repeat structures at the breakpoints did not show any supergene signatures, suggesting that shared breakpoint repeats may obstruct supergene formation. Lastly, we created long-read-based genome assemblies for sympatric species pairs (Dwarf and Normal) belonging to lake whitefish (Coregonus clupeaformis). The species pairs offer a suitable model system for studying genomic patterns of differentiation and in particular the role of SVs in speciation. By combining long reads, direct assembly, and short-read methods we detect 89,909 high-confidence SVs in the species pair across two lakes, covering five times more sequence in the genome compared to SNPs. In the study, we highlight shared outliers of differentiation between the lakes, indicating that they contribute to speciation. Interestingly, we find that more than 70% of SVs differentiating between the Normal and Dwarf species pairs of lake whitefish overlap transposable elements. 
This work demonstrates that SVs may play an important role for the differentiation and speciation of sympatric species pairs in lake whitefish.
    Structural variants (SVs), for example deletions, insertions, inversions and duplications of sequence, are an important source of genomic variation that collectively affects more base pairs in the genome than point mutations (SNPs). Despite growing recognition that SVs play an important role in genetic adaptation to different environments and in speciation, this type of variation has been little studied in many species. New DNA sequencing technology with longer read lengths (long-read sequencing), together with the development of new bioinformatic tools, has led to drastic improvements in the detection of SVs. Long-read sequencing also makes it possible to build more complete and contiguous genome sequences than before. In this thesis we use long-read data to build several high-quality genome sequences for two salmonid species: Atlantic salmon (Salmo salar) and a North American whitefish, lake whitefish (Coregonus clupeaformis). Here we report the first pan-genome for Atlantic salmon. It consists of 11 assemblies based on long-read sequencing of individuals from four different phylogeographic groups of wild salmon. The highest-quality assembly includes 2.55 Gbp of sequence in chromosomes, 259 Mbp more than the previous Atlantic salmon reference genome. In addition, the contiguity of the sequence, measured as contig N50, increased from 58 kbp to 28.06 Mbp (484 times higher). We found 1,061,452 SVs across the 11 individuals, with an average sequence difference of ~77.4 Mbp per sample. Over time, the Atlantic salmon has adapted to the environments of different rivers. To study the genetic variation underlying this adaptation, we associated SVs with environmental variables in a dataset of 366 short-read sequenced samples using a genome graph. These analyses revealed several SVs that contribute to environmental adaptation, including an 18 kbp deletion that contains three genes associated with the amount of precipitation in the area. We then used the Atlantic salmon pan-genome to study the formation of supergenes. Supergenes are clusters of genetic variation in linkage disequilibrium that can arise, for example, through large inversions. Here we used the 11 genome assemblies to identify and characterise a number of large inversions in Atlantic salmon. One 3 Mbp inversion, estimated to be ~15,000 years old, showed signatures of evolving as a supergene. For the other inversions, which were flanked by repeated DNA, we found no characteristic features of supergenes, suggesting that the repetitive DNA prevents supergene formation. Finally, we built genome sequences for different forms ('Normal' and 'Dwarf') of lake whitefish (Coregonus clupeaformis) that live in the same lakes in North America. These genome sequences enable studies of the genomic mechanisms behind speciation in this salmonid. By combining long-read data, direct comparison of assemblies, and short-read data, we found 89,909 SVs that distinguished the two forms of lake whitefish in two lakes. The SVs cover more than five times as many base pairs in the genome as SNPs. In the study we found several SVs with divergent frequencies (outliers) between the two forms of lake whitefish, indicating that these SVs contribute to speciation. We also found that 70% of the SVs overlapped a form of repeated DNA called transposable elements. This work underlines that SVs may play an important role in speciation in lake whitefish.
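    The contig N50 statistic quoted in this abstract can be made concrete with a short Python sketch. It uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly); the function names are hypothetical, and the thesis may have used a different tool or convention.

```python
# Hypothetical sketch of the common contig N50 definition; not taken from the thesis.
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the assembly
            return length
    return 0


def fold_change(new, old):
    """Ratio between two N50 values given in the same unit."""
    return new / old


toy_n50 = n50([2, 3, 4, 5, 6])  # total 20; 6 + 5 = 11 covers half, so N50 is 5
reported_fold = round(fold_change(28.06e6, 58e3))  # 58 kbp to 28.06 Mbp
```

    With the values reported above, `fold_change(28.06e6, 58e3)` is about 483.8, which rounds to the 484-fold improvement stated in the abstract.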

    Genome Evolution in the Salicaceae: Genetic Novelty, Horizontal Gene Transfer, and Comparative Genomics

    Genome evolution is a powerful force which shapes genomes over time through processes like mutation, horizontal transfer, and sexual reproduction. Although questions which aim to explore genome evolution are broad, they are all understood through the discovery and comparison of genetic variation. For example, genetic diversity may explain differences in phenotypes and the etiology of disease, and it is essential for phylogenomic analysis. Recently, the democratization of next-generation and third-generation DNA sequencing technologies has allowed genomics to produce large amounts of sequence data. This has facilitated the capture of genetic variation at species and population scales. Populus and Salix are members of the Salicaceae family and are ecologically and economically important woody plants. Currently, there are multiple high-quality reference genomes available for these two genera. Two important sources of genome evolution that will be explored here are genetic novelty in the form of new genes and horizontal gene transfer from the organelle genomes. In the context of genome evolution, both processes have been shown to contribute to beneficial phenotypes as well as disease. The primary contributions of this dissertation research are to identify and assign putative functions to orphan and de novo genes in P. trichocarpa, identify and compare horizontal transfer from the organelle genomes to the nuclear genomes of P. trichocarpa and P. deltoides, and generate new organelle genome resources for 6 different Salix species.

    Detailed Analysis of a Contiguous 22-Mb Region of the Maize Genome

    Most of our understanding of plant genome structure and evolution has come from the careful annotation of small (e.g., 100 kb) sequenced genomic regions or from automated annotation of complete genome sequences. Here, we sequenced and carefully annotated a contiguous 22 Mb region of maize chromosome 4 using an improved pseudomolecule for annotation. The sequence segment was comprehensively ordered, oriented, and confirmed using the maize optical map. Nearly 84% of the sequence is composed of transposable elements (TEs) that are mostly nested within each other, of which most families are low-copy. We identified 544 gene models using multiple levels of evidence, as well as five miRNA genes. Gene fragments, many captured by TEs, are prevalent within this region. Elimination of gene redundancy from a tetraploid maize ancestor that originated a few million years ago is responsible in this region for most disruptions of synteny with sorghum and rice. Consistent with other sub-genomic analyses in maize, small RNA mapping showed that many small RNAs match TEs and that most TEs match small RNAs. These results, performed on ∼1% of the maize genome, demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets. Such improvements, along with those of gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses

    Genome architecture and diverged selection shaping pattern of genomic differentiation in wild barley

    Divergent selection of populations in contrasting environments leads to functional genomic divergence. However, the genomic architecture underlying heterogeneous genomic differentiation remains poorly understood. Here, we de novo assembled two high-quality wild barley (Hordeum spontaneum K. Koch) genomes and examined genomic differentiation and gene expression patterns under abiotic stress in two populations. These two populations had a shared ancestry and originated in close geographic proximity but experienced different selective pressures due to their contrasting micro-environments. We identified structural variants that may have played significant roles in affecting genes potentially associated with well-differentiated phenotypes such as flowering time and drought response between two wild barley genomes. Among them, a 29-bp insertion into the promoter region formed a cis-regulatory element in the HvWRKY45 gene, which may contribute to enhanced tolerance to drought. A single SNP mutation in the promoter region may influence HvCO5 expression and be putatively linked to local flowering time adaptation. We also revealed significant genomic differentiation between the two populations with ongoing gene flow. Our results indicate that SNPs and small SVs link to genetic differentiation at the gene level through local adaptation and are maintained through divergent selection. In contrast, large chromosome inversions may have shaped the heterogeneous pattern of genomic differentiation along the chromosomes by suppressing chromosome recombination and gene flow. Our research offers novel insights into the genomic basis underlying local adaptation and provides valuable resources for the genetic improvement of cultivated barley