44 research outputs found

    Understanding the functionality of transcript diversity

    Recent years have seen a huge increase in the amount of genomic DNA being sequenced from a wide variety of organisms, giving us an unprecedented insight into the molecular diversity seen in nature. As a result a host of methods have been developed, both experimental and computational, to understand the functional significance of such diversity and how it relates to organismal and environmental complexity. In this thesis I use comparative approaches to explore two areas of molecular biology where there is evidence for large amounts of transcript diversity. Firstly, I explore the unprecedented view of microbial sequence diversity offered by metagenomic sequencing projects, using sequence similarity and adapted genomic context methods to quantify the amount of functional novelty in these samples. Secondly, I look at the transcript diversity generated by alternative splicing. I develop methods to detect and visualise alternative splicing events and apply these to the detection of conserved alternative splicing events

    Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?

    Background: Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation.
    Results: We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at the 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at the 3'-end of a gene or a point mutation at the stop codon (68), iv) redundant gene predictions (4), v) 5'- and 3'-end extension, which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless, we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124).
    Conclusion: Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.

    Protein coding potential of retroviruses and other transposable elements in vertebrate genomes

    We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. Usually RETRA genes are masked in vertebrate genomes prior to the application of automated gene prediction pipelines under the assumption that they provide no selective advantage to the host. Yet, we show that about 1000 genes in four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs, which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions

    Identification and Analysis of Genes and Pseudogenes within Duplicated Regions in the Human and Mouse Genomes

    The identification and classification of genes and pseudogenes in duplicated regions still constitutes a challenge for standard automated genome annotation procedures. Using an integrated homology and orthology analysis independent of current gene annotation, we have identified 9,484 and 9,017 gene duplicates in human and mouse, respectively. On the basis of the integrity of their coding regions, we have classified them into functional and inactive duplicates, allowing us to define the first consistent and comprehensive collection of 1,811 human and 1,581 mouse unprocessed pseudogenes. Furthermore, of the total of 14,172 human and mouse duplicates predicted to be functional genes, as many as 420 are not included in current reference gene databases and therefore correspond to likely novel mammalian genes. Some of these correspond to partial duplicates with less than half of the length of the original source genes, yet they are conserved and syntenic among different mammalian lineages. The genes and unprocessed pseudogenes obtained here will enable further studies on the mechanisms involved in gene duplication as well as of the fate of duplicated genes

    Beyond Nanopore Sequencing in Space: Identifying the Unknown

    Astronaut Kate Rubins sequenced DNA on the International Space Station (ISS) for the first time in August 2016 (Figure 1A). A 2D sequencing library containing an equal mixture of lambda bacteriophage, Escherichia coli, and Mus musculus was prepared on the ground with a SQK_MAP006 kit and sent to the ISS frozen and loaded into R7.3 flow cells. After a total of 9 on-orbit sequencing runs over 6 months, it was determined that there was no decrease in sequencing performance on-orbit compared to ground controls (1). A total of ~280,000 and ~130,000 reads generated on-orbit and on the ground, respectively, identified 90% of reads that were attributed to 30% lambda bacteriophage, 30% Escherichia coli, and 30% M. musculus (Figure 1B). Extensive bioinformatics analysis determined comparable 2D and 1D read accuracies between flight and ground runs (Figure 1C), and data collected from the ISS were able to construct directed assemblies of E. coli and lambda genomes at 100% and M. musculus mitochondrial genome at 96.7%. These findings validate sequencing as a viable option for potential on-orbit applications such as environmental microbial monitoring and disease diagnosis. Current microbial monitoring of the ISS applies culture-based techniques that provide colony forming unit (CFU) data for air, water, and surface samples. The identity of the cultured microorganisms is unknown until sample return and ground-based analysis, a process that can take up to 60 days. For sequencing to benefit ISS applications, spaceflight-compatible sample preparation techniques are required. Subsequent to the testing of the MinION on-orbit, a sample-to-sequence method was developed using miniPCR and basic pipetting, which was only recently proven to be effective in microgravity. The work presented here details the in-flight sample preparation process and the first application of DNA sequencing on the ISS to identify unknown ISS-derived microorganisms

    Ancestral Polymorphisms Shape the Adaptive Radiation of Metrosideros across the Hawaiian Islands

    Some of the most spectacular adaptive radiations begin with founder populations on remote islands. How genetically limited founder populations give rise to the striking phenotypic and ecological diversity characteristic of adaptive radiations is a paradox of evolutionary biology. We conducted an evolutionary genomics analysis of genus Metrosideros, a landscape-dominant, incipient adaptive radiation of woody plants that spans a striking range of phenotypes and environments across the Hawaiian Islands. Using nanopore sequencing, we created a chromosome-level genome assembly for Metrosideros polymorpha var. incana and analyzed whole-genome sequences of 131 individuals from 11 taxa sampled across the islands. Demographic modeling and population genomics analyses suggested that Hawaiian Metrosideros originated from a single colonization event and subsequently spread across the archipelago following the formation of new islands. The evolutionary history of Hawaiian Metrosideros shows evidence of extensive reticulation associated with significant sharing of ancestral variation between taxa and secondarily with admixture. Taking advantage of the highly contiguous genome assembly, we investigated the genomic architecture underlying the adaptive radiation and discovered that divergent selection drove the formation of differentiation outliers in paired taxa representing early stages of speciation/divergence. Analysis of the evolutionary origins of the outlier single nucleotide polymorphisms (SNPs) showed enrichment for ancestral variations under divergent selection. Our findings suggest that Hawaiian Metrosideros possesses an unexpectedly rich pool of ancestral genetic variation, and the reassortment of these variations has fueled the island adaptive radiation

    A network of conserved co-occurring motifs for the regulation of alternative splicing

    Cis-acting short sequence motifs play important roles in alternative splicing. It is now possible to identify such sequence motifs as conserved sequence patterns in genome sequence alignments. Here, we report the systematic search for motifs in the neighboring introns of alternatively spliced exons by using comparative analysis of mammalian genome alignments. We identified 11 conserved sequence motifs that might be involved in the regulation of alternative splicing. These motifs are not only significantly overrepresented near alternatively spliced exons, but they also co-occur with each other, thus, forming a network of cis-elements, likely to be the basis for context-dependent regulation. Based on this finding, we applied the motif co-occurrence to predict alternatively skipped exons. We verified exon skipping in 29 cases out of 118 predictions (25%) by EST and mRNA sequences in the databases. For the predictions not verified by the database sequences, we confirmed exon skipping in 10 additional cases by using both RT–PCR experiments and the publicly available RNA-Seq data. These results indicate that even more alternative splicing events will be found with the progress of large-scale and high-throughput analyses for various tissue samples and developmental stages

    Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

    No full text
    Funder: NCI U24CA211006
    Abstract: The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts

    Predicting biological networks from genomic data

    Continuing improvements in DNA sequencing technologies are providing us with vast amounts of genomic data from an ever-widening range of organisms. The resulting challenge for bioinformatics is to interpret this deluge of data and place it back into its biological context. Biological networks provide a conceptual framework with which we can describe part of this context, namely the different interactions that occur between the molecular components of a cell. Here, we review the computational methods available to predict biological networks from genomic sequence data and discuss how they relate to high-throughput experimental methods