43 research outputs found

    Molecular biology based strategies to aid assembly in de novo genome projects

    This thesis presents and critically assesses work undertaken and published between 2009 and 2018. It evaluates the benefits, limitations and impact of novel approaches to next generation sequencing library construction for de novo genome projects developed by the author. Since the first fully sequenced genome was published in 1978, DNA sequencing technology has advanced rapidly and costs reduced significantly. Next generation sequencers capable of sequencing millions of DNA molecules in parallel revolutionised the genomics industry. Today, if the right strategies are adopted, prokaryotic genomes can be fully sequenced in a matter of hours for a few hundred pounds and a high degree of contiguity achieved in even the most challenging eukaryotic genomes within a few weeks for tens of thousands of pounds. Chapter 2 describes the design and application of a bespoke, high throughput bacterial artificial chromosome sequencing pipeline designed to sequence complex eukaryotic genomes harbouring a wide variety of repeat structures. Chapter 3 focuses on novel approaches to optimise insert size in amplification-free, paired-end library construction and Chapter 4 discusses innovative solutions to construct large insert, highly complex long mate pair libraries which have much tighter insert size distributions than previously published methods. Chapter 5 demonstrates the application of the methods discussed in earlier chapters in wheat de novo genome projects, highlighting the benefits the author’s approaches bring to sequencing a complex polyploid plant genome. The presented methods establish new ways of thinking about next generation sequencing library construction, pushing the boundaries of complexity and maximising spatial information. Keywords: Genome assembly, next generation sequencing, DNA, de novo, amplification-free paired-end libraries, long mate pair libraries, bacterial artificial chromosomes

    A critical comparison of technologies for a plant genome sequencing project

    BACKGROUND: A high-quality genome sequence of any model organism is an essential starting point for genetic and other studies. Older clone-based methods are slow and expensive, whereas faster, cheaper short-read-only assemblies can be incomplete and highly fragmented, which minimizes their usefulness. The last few years have seen the introduction of many new technologies for genome assembly. These new technologies and associated new algorithms are typically benchmarked on microbial genomes or, if they scale appropriately, on larger (e.g., human) genomes. However, plant genomes can be much more repetitive and larger than the human genome, and plant biochemistry often makes obtaining high-quality DNA that is free from contaminants difficult. Reflecting their challenging nature, we observe that plant genome assembly statistics are typically poorer than for vertebrates. RESULTS: Here, we compare Illumina short read, Pacific Biosciences long read, 10x Genomics linked reads, Dovetail Hi-C, and BioNano Genomics optical maps, singly and combined, in producing high-quality long-range genome assemblies of the potato species Solanum verrucosum. We benchmark the assemblies for completeness and accuracy, as well as DNA, compute requirements, and sequencing costs. CONCLUSIONS: The field of genome sequencing and assembly is reaching maturity, and the differences we observe between assemblies are surprisingly small. We expect that our results will be helpful to other genome projects, and that these datasets will be used in benchmarking by assembly algorithm developers.

    Semi‐quantitative characterisation of mixed pollen samples using MinION sequencing and Reverse Metagenomics (RevMet)

    1. The ability to identify and quantify the constituent plant species that make up a mixed‐species sample of pollen has important applications in ecology, conservation, and agriculture. Recently, metabarcoding protocols have been developed for pollen that can identify constituent plant species, but there are strong reasons to doubt that metabarcoding can accurately quantify their relative abundances. A PCR‐free, shotgun metagenomics approach has greater potential for accurately quantifying species relative abundances, but applying metagenomics to eukaryotes is challenging due to low numbers of reference genomes. 2. We have developed a pipeline, RevMet (Reverse Metagenomics), that allows reliable and semi‐quantitative characterization of the species composition of mixed‐species eukaryote samples, such as bee‐collected pollen, without requiring reference genomes. Instead, reference species are represented only by ‘genome skims’: low‐cost, low‐coverage, short‐read sequence datasets. The skims are mapped to individual long reads sequenced from mixed‐species samples using the MinION, a portable nanopore sequencing device, and each long read is uniquely assigned to a plant species. 3. We genome‐skimmed 49 wild UK plant species, validated our pipeline with mock DNA mixtures of known composition, and then applied RevMet to pollen loads collected from wild bees. We demonstrate that RevMet can identify plant species present in mixed‐species samples at proportions of DNA ≥ 1%, with few false positives and false negatives, and reliably differentiate species represented by high versus low amounts of DNA in a sample. 4. RevMet could readily be adapted to generate semi‐quantitative datasets for a wide range of mixed eukaryote samples. Our per‐sample costs were £90 per genome skim and £60 per pollen sample, and new versions of sequencers available now will further reduce these costs.

    Metagenomic analysis of planktonic riverine microbial consortia using nanopore sequencing reveals insight into river microbe taxonomy and function

    Background: Riverine ecosystems are biogeochemical powerhouses driven largely by microbial communities that inhabit water columns and sediments. Because rivers are used extensively for anthropogenic purposes (drinking water, recreation, agriculture, and industry), it is essential to understand how these activities affect the composition of river microbial consortia. Recent studies have shown that river metagenomes vary considerably, suggesting that microbial community data should be included in broad-scale river ecosystem models. But such ecogenomic studies have not been applied on a broad “aquascape” scale, and few if any have applied the newest nanopore technology. Results: We investigated the metagenomes of 11 rivers across 3 continents using MinION nanopore sequencing, a portable platform that could be useful for future global river monitoring. Up to 10 Gb of data per run were generated with average read lengths of 3.4 kb. Diversity and diagnosis of river function potential was accomplished with 0.5–1.0 × 10^6 long reads. Our observations for 7 of the 11 rivers conformed to other river-omic findings, and we exposed previously unrecognized microbial biodiversity in the other 4 rivers. Conclusions: Deeper understanding that emerged is that river microbial consortia and the ecological functions they fulfil did not align with geographic location but instead implicated ecological responses of microbes to urban and other anthropogenic effects, and that changes in taxa manifested over a very short geographic space.

    The WiggleZ Dark Energy Survey: the transition to large-scale cosmic homogeneity

    We have made the largest-volume measurement to date of the transition to large-scale homogeneity in the distribution of galaxies. We use the WiggleZ survey, a spectroscopic survey of over 200,000 blue galaxies in a cosmic volume of ~1 (Gpc/h)^3. A new method of defining the 'homogeneity scale' is presented, which is more robust than methods previously used in the literature, and which can be easily compared between different surveys. Due to the large cosmic depth of WiggleZ (up to z=1) we are able to make the first measurement of the transition to homogeneity over a range of cosmic epochs. The mean number of galaxies N(<r) in spheres of comoving radius r is proportional to r^3 within 1%, or equivalently the fractal dimension of the sample is within 1% of D_2=3, at radii larger than 71 ± 8 Mpc/h at z~0.2, 70 ± 5 Mpc/h at z~0.4, 81 ± 5 Mpc/h at z~0.6, and 75 ± 4 Mpc/h at z~0.8. We demonstrate the robustness of our results against selection function effects, using a LCDM N-body simulation and a suite of inhomogeneous fractal distributions. The results are in excellent agreement with both the LCDM N-body simulation and an analytical LCDM prediction. We can exclude a fractal distribution with fractal dimension below D_2=2.97 on scales from ~80 Mpc/h up to the largest scales probed by our measurement, ~300 Mpc/h, at 99.99% confidence. Comment: 21 pages, 16 figures, accepted for publication in MNRA

    How to sequence 10,000 bacterial genomes and retain your sanity: an accessible, efficient and global approach

    Non-typhoidal Salmonella (NTS) are typically associated with enterocolitis and linked to the industrialisation of food production. In recent years, NTS has been associated with invasive disease (iNTS disease) causing an estimated 77,000 deaths each year worldwide; 80% of mortality occurs in sub-Saharan Africa. New clades of S. Typhimurium and S. Enteritidis have been identified, which are characterised by genomic degradation, altered prophage repertoires and novel multidrug resistant plasmids. To understand how these clades are contributing to the burden and severity of iNTS disease, it is crucial to expand genome-based surveillance to cover more countries, and incorporate historical isolates to generate an evolutionary timeline of the development of iNTS. We developed and validated a robust and inexpensive method for large-scale collection and sequencing of bacterial genomes. The “10,000 Salmonella genomes” project established a worldwide research collaboration to generate information relevant to the epidemiology, drug resistance and virulence factors of Salmonellae using a whole-genome sequencing approach. By streamlining collection of isolates and developing an efficient logistics pipeline, we gathered 10,419 clinical and environmental isolates from collections in low and middle-income countries within six months. Genome sequences are now available for isolates from 51 countries/territories dating from 1949 to 2017, with ~80% representing African and Latin-American datasets. Our method can be applied to other large sample collections that require maximisation of resources within a limited timeframe. Detailed genome analyses are in progress and it is hoped that the resulting data will contribute to public health control strategies in low and middle-income countries.