27 research outputs found

    Targeted identification of genomic regions using TAGdb

    Background: The introduction of second generation sequencing technology has enabled the cost-effective sequencing of genomes and the identification of large numbers of genes and gene promoters. However, the assembly of DNA sequences to create a representation of the complete genome sequence remains costly, especially for the larger and more complex plant genomes. Results: We have developed an online database, TAGdb, that enables researchers to identify paired read sequences that share identity with a submitted query sequence. These tags can be used to design oligonucleotide primers for the PCR amplification of the region in the target genome. Conclusions: The ability to produce large numbers of paired read genome tags using second generation sequencing provides a cost-effective method for the identification of genes and promoters in large, complex or orphan species without the need for whole genome assembly.

    A Single Molecule Scaffold for the Maize Genome

    About 85% of the maize genome consists of highly repetitive sequences that are interspersed by low-copy, gene-coding sequences. The maize community has dealt with this genomic complexity by the construction of an integrated genetic and physical map (iMap), but this resource alone was not sufficient for ensuring the quality of the current sequence build. For this purpose, we constructed a genome-wide, high-resolution optical map of the maize inbred line B73 genome containing >91,000 restriction sites (averaging 1 site/∼23 kb) accrued from mapping genomic DNA molecules. Our optical map comprises 66 contigs, averaging 31.88 Mb in size and spanning 91.5% (2,103.93 Mb/∼2,300 Mb) of the maize genome. A new algorithm was created that considered both optical map and unfinished BAC sequence data for placing 60/66 (2,032.42 Mb) optical map contigs onto the maize iMap. The alignment of optical maps against numerous data sources yielded comprehensive results that proved revealing and productive. For example, gaps were uncovered and characterized within the iMap, the FPC (fingerprinted contigs) map, and the chromosome-wide pseudomolecules. Such alignments also suggested amended placements of FPC contigs on the maize genetic map and proactively guided the assembly of chromosome-wide pseudomolecules, especially within complex genomic regions. Lastly, we think that the full integration of B73 optical maps with the maize iMap would greatly facilitate maize sequence finishing efforts that would make it a valuable reference for comparative studies among cereals, or other maize inbred lines and cultivars.

    The development and application of methods and tools for the assembly and analysis of second generation sequence data

    Any modern approach for developing a thorough understanding of any particular organism or group of organisms will at some stage involve determining all or part of their corresponding DNA or RNA sequences. DNA sequencing is commonly used to gain insight into a wide array of biological processes. Improvements in technologies and processes employed to gather information about the biological world have led to the accumulation of enormous amounts of data which must be filtered, sorted and studied; a task beyond the capabilities of the human mind alone. Increasingly, the domain of biology has become fused with the domains of information technology and mathematics. Computational systems designed to shift the burden of data processing away from scientists have evolved into systems of such complexity as to become areas of study in their own right. This thesis describes the design and implementation of a number of sequence-based bioinformatics analyses and tools, and their applications in the fields of genomics and plant genome research. Almost all of the tools described here have been designed to work exclusively with data produced using second generation sequencing (2GS) technologies. Included in this thesis is a description of a novel 2GS de novo assembly algorithm called SaSSY. To demonstrate how SaSSY is being applied in current research, a selection of projects the Author is involved with that have either used, or are currently using, SaSSY are also described. These include the coral genome sequencing project, two comparative genomics projects involving the de novo assembly of BAC sequences from Secale cereale (rye) and Brassica rapa (rapeseed), and a project that aims to compare differences between different mitochondrial and chloroplast sequences in a variety of legumes. Also presented are summaries of the Author's role in the development of three bioinformatics software packages: autoSNPdb, a web-based SNP detection and visualisation application; TagDB, a web-based short read mapping and visualisation application; and BGA, an annotation pipeline developed primarily for annotating plant-derived BAC and cDNA sequences. 2GS technologies have significantly influenced the direction, scope and perceived limitations of biological research as a whole and have particularly influenced the area of bioinformatics. It is becoming increasingly apparent that further revolutions in sequencing technology are expected to occur in the very near future, indicating that research in this area will continue to grow at an ever increasing pace.

    CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes: supplemental material

    Supplementary Results
    Refinement for Gene Loss and Duplication
    Estimates under Opal Stop Codon Recodings
    Supplementary Methods
    Identification of Trusted Reference Genomes
    Refining Marker Sets for Lineage-specific Gene Loss and Duplication
    Determination of Coding Table
    Systematic Bias of Completeness and Contamination Estimates
    Supplemental Figure S1. Distribution of the 104 bacterial and 281 gammaproteobacterial marker genes around the E. coli K12 genome.
    Supplemental Figure S2. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the random contig model.
    Supplemental Figure S3. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the inverse length model.
    Supplemental Figure S4. Maximum-likelihood genome tree inferred from 5656 reference genomes.
    Supplemental Figure S5. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the random fragment model using a window size of 20 kbp.
    Supplemental Figure S6. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the inverse length model.
    Supplemental Figure S7. Error in completeness and contamination estimates on simulated genomes from different phyla.
    Supplemental Figure S8. Bias in completeness and contamination estimates when modelled as a binomial distribution.
    Supplemental Figure S9. GC-distribution plots of the HMP Capnocytophaga sp. oral taxon 329 genome.
    Supplemental Figure S10. Phylogenetic placement of the two genomes (Cluster 0 and Cluster 1) identified within the HMP Capnocytophaga sp. oral taxon 329 genome.
    Supplemental Figure S11. Completeness estimates for 90 putative population genomes recovered from an acetate-amended aquifer.
    Supplemental Figure S12. Contamination estimates for 90 putative population genomes recovered from an acetate-amended aquifer.
    Supplemental Figure S13. Identification of the 213 marker genes within the Meyerdierks et al. (2010) ANME-1 genome.
    Supplemental Figure S14. Refining a marker set for lineage-specific gene loss and duplication.
    Supplemental Tables
    Supplemental Table S1. Mean absolute error of completeness (comp.) and contamination (cont.) estimates determined using different universal- and domain-specific marker gene sets.
    Supplemental Table S2. Number of marker genes and marker sets for taxonomic groups with ≥ 20 reference genomes.
    Supplemental Table S3. Mean absolute error of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker genes treated individually (IM) or organized into collocated marker sets (MS).
    Supplemental Table S4. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker genes treated individually (IM) or organized into collocated marker sets (MS).
    Supplemental Table S5. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker genes treated individually (IM) or organized into collocated marker sets (MS).
    Supplemental Table S6. Phylogenetically informative marker genes used to infer the reference genome tree along with matching PhyloSift genes.
    Supplemental Table S7. Phylogenetically informative genes used in PhyloSift without a matching CheckM gene.
    Supplemental Table S8. Mean absolute error of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms), the lineage-specific marker set selected by CheckM (sms), and the best performing lineage-specific marker set (bms).
    Supplemental Table S9. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms), the lineage-specific marker set selected by CheckM (sms), and the best performing lineage-specific marker set (bms).
    Supplemental Table S10. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms), the lineage-specific marker set selected by CheckM (sms), and the best performing lineage-specific marker set (bms).
    Supplemental Table S11. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms) and the lineage-specific marker set selected by CheckM (sms).
    Supplemental Table S12. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms) and the lineage-specific marker sets selected by CheckM (sms).
    Supplemental Table S13. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms) and the lineage-specific marker sets selected by CheckM (sms).
    Supplemental Table S14. Taxonomic rank of the selected lineage-specific marker set used for evaluating the quality of genomes at different degrees of taxonomic novelty.
    Supplemental Table S15. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates for simulated genomes at different degrees of taxonomic novelty.
    Supplemental Table S16. Lineage-specific completeness and contamination estimates for isolate genomes from large-scale sequencing initiatives. (see Excel file)
    Supplemental Table S17. Completeness and contamination estimates of the Lactobacillus gasseri MV-22 genome for increasingly basal lineage-specific marker sets.
    Supplemental Table S18. Bacterial marker genes identified within the HMP Lactobacillus gasseri genomes. Markers missing from a genome or present in multiple copies are highlighted with a grey background.
    Supplemental Table S19. Lineage-specific completeness and contamination estimates for genomes annotated as finished at IMG, along with predicted translation tables and calculated coding density. (see Excel file)
    Supplemental Table S20: Lineage-specific completeness and contamination estimates for single-cell genomes from the GEBA-MDM initiative along with traditional assembly statistics. (see Excel file)
    Supplemental Table S21: Lineage-specific completeness and contamination estimates for population genomes, plasmids, and phage recovered from metagenomic datasets along with traditional assembly statistics. (see Excel file)
    Supplemental Table S22: Completeness and contamination estimates for population genomes recovered from an acetate-amended aquifer determined using domain-level and lineage-specific marker sets. (see Excel file)

    CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

    Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. While this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of ‘marker’ genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate, single cell and metagenome derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination, and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.

    GroopM: an automated tool for the recovery of population genomes from related metagenomes

    Metagenomic binning methods that leverage differential population abundances in microbial communities (differential coverage) are emerging as a complementary approach to conventional composition-based binning. Here we introduce GroopM, an automated binning tool that primarily uses differential coverage to obtain high fidelity population genomes from related metagenomes. We demonstrate the effectiveness of GroopM using synthetic and real-world metagenomes, and show that GroopM produces results comparable with more time-consuming, labor-intensive methods.
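
The idea of differential coverage can be illustrated with a small sketch: contigs from the same population should show similar read coverage in every sample, so contigs with matching coverage profiles can be grouped together. The code below is only a toy greedy grouping written for illustration; it is not GroopM's algorithm, and the function names, the `tolerance` parameter and the example coverage values are all assumptions.

```python
from typing import Dict, List, Sequence


def similar_profiles(a: Sequence[float], b: Sequence[float], tolerance: float) -> bool:
    """True if two coverage profiles agree in every sample within a relative tolerance."""
    if len(a) != len(b):
        raise ValueError("coverage profiles must cover the same samples")
    for x, y in zip(a, b):
        highest = max(x, y)
        # Relative difference per sample; two zero coverages count as equal.
        if highest > 0 and abs(x - y) / highest > tolerance:
            return False
    return True


def bin_by_coverage(coverage: Dict[str, Sequence[float]],
                    tolerance: float = 0.2) -> List[List[str]]:
    """Greedily group contigs whose per-sample coverage matches a bin's first contig."""
    bins: List[List[str]] = []
    for contig, profile in coverage.items():
        for members in bins:
            # Compare against the first contig placed in the bin (its seed).
            if similar_profiles(coverage[members[0]], profile, tolerance):
                members.append(contig)
                break
        else:
            bins.append([contig])
    return bins


# Hypothetical coverage of four contigs across three related metagenomes.
example = {
    "contig_1": [10.0, 50.0, 5.0],
    "contig_2": [11.0, 48.0, 5.5],
    "contig_3": [40.0, 8.0, 20.0],
    "contig_4": [42.0, 9.0, 19.0],
}
print(bin_by_coverage(example))
```

A real binning tool would also need to handle coverage noise on short contigs and populations with similar abundance patterns, which this sketch ignores.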

    CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

    Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of “marker” genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
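
A marker-based estimate of the kind this abstract describes can be sketched as follows. This is a minimal illustration, assuming that collocated markers are grouped into sets, that completeness is the average fraction of each set that is present, and that contamination is the average number of extra copies per set. The function name `estimate_quality`, the marker names and the example data are hypothetical, and this is not CheckM's own code or interface.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def estimate_quality(marker_sets: List[Set[str]],
                     hits: Iterable[str]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) from marker gene hits.

    marker_sets: groups of marker genes assumed to be collocated.
    hits: one entry per marker gene copy detected in the genome.
    """
    if not marker_sets or any(len(s) == 0 for s in marker_sets):
        raise ValueError("marker_sets must be non-empty sets of markers")
    counts = Counter(hits)
    completeness = 0.0
    contamination = 0.0
    for markers in marker_sets:
        # Markers seen at least once count toward completeness.
        present = sum(1 for m in markers if counts[m] >= 1)
        # Every copy beyond the first counts toward contamination.
        extra = sum(counts[m] - 1 for m in markers if counts[m] > 1)
        completeness += present / len(markers)
        contamination += extra / len(markers)
    n = len(marker_sets)
    return 100.0 * completeness / n, 100.0 * contamination / n


# Hypothetical marker sets and hits: marker C is found twice, E and F are missing.
sets = [{"A", "B"}, {"C"}, {"D", "E", "F"}]
found = ["A", "B", "C", "C", "D"]
comp, cont = estimate_quality(sets, found)
print(f"completeness={comp:.2f}% contamination={cont:.2f}%")
```

Averaging over sets rather than over individual genes means that losing one collocated cluster of markers is not counted as many independent losses.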

    Spatial uniformity of microbial diversity in a continuous bioelectrochemical system

    Bioelectrochemical systems (BESs) are emerging as a technology with diverse future applications. Anode-associated microbial diversity and activity are known to change over time, but the consequences of these dynamics on BES functioning are poorly understood. A novel BES with exchangeable anodic electrodes that facilitates characterisation of microbial communities over time was constructed. The BES received a mixture of volatile fatty acids and produced 0.13 mA cm⁻² of anodic electrode surface, leading to the removal of 14 g chemical oxygen demand per m² electrode per day at a coulombic efficiency of 76%. Pyrosequencing of 16S rRNA genes revealed no differences in the diversity of microbial communities associated with different electrodes within a single time point. This finding validates the design for temporal studies as changes in microbial diversity observed over time can be related to community development rather than spatial variation within the reactor.
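
For readers unfamiliar with the term, coulombic efficiency relates the charge recovered as current to the charge available in the oxidised substrate. The sketch below uses the standard conversion of chemical oxygen demand to electrons (4 mol of electrons per 32 g of O2). It is an illustrative calculation with made-up inputs, not the authors' calculation: the abstract does not state how its 76% figure was derived, and the function names here are assumptions.

```python
FARADAY = 96485.0          # coulombs per mole of electrons
O2_MOLAR_MASS = 32.0       # grams of O2 per mole
ELECTRONS_PER_O2 = 4       # moles of electrons accepted per mole of O2
SECONDS_PER_DAY = 86400.0


def ma_per_cm2_to_a_per_m2(current_density_ma_cm2: float) -> float:
    """Convert mA per cm^2 to A per m^2 (1 mA/cm^2 = 10 A/m^2)."""
    return current_density_ma_cm2 * 10.0


def coulombic_efficiency(current_density_a_m2: float,
                         cod_removed_g_m2_day: float) -> float:
    """Fraction of the electrons in removed COD that is recovered as current."""
    if cod_removed_g_m2_day <= 0:
        raise ValueError("COD removal rate must be positive")
    # Charge delivered per m^2 of electrode per day.
    charge_recovered = current_density_a_m2 * SECONDS_PER_DAY
    # Charge that the removed COD could have supplied per m^2 per day.
    charge_available = (cod_removed_g_m2_day / O2_MOLAR_MASS
                        * ELECTRONS_PER_O2 * FARADAY)
    return charge_recovered / charge_available


# Illustrative inputs only: 1.0 A per m^2 and 10 g COD per m^2 per day.
print(round(coulombic_efficiency(1.0, 10.0), 3))
```

Both rates must refer to the same electrode area for the ratio to be meaningful.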

    Anaerobic oxidation of methane coupled to nitrate reduction in a novel archaeal lineage

    Anaerobic oxidation of methane (AOM) is critical for controlling the flux of methane from anoxic environments. AOM coupled to iron¹, manganese¹ and sulphate² reduction have been demonstrated in consortia containing anaerobic methanotrophic (ANME) archaea. More recently it has been shown that the bacterium Candidatus ‘Methylomirabilis oxyfera’ can couple AOM to nitrite reduction through an intra-aerobic methane oxidation pathway³. Bioreactors capable of AOM coupled to denitrification have resulted in the enrichment of ‘M. oxyfera’ and a novel ANME lineage, ANME-2d⁴,⁵. However, as ‘M. oxyfera’ can independently couple AOM to denitrification, the role of ANME-2d in the process is unresolved. Here, a bioreactor fed with nitrate, ammonium and methane was dominated by a single ANME-2d population performing nitrate-driven AOM. Metagenomic, single-cell genomic and metatranscriptomic analyses combined with bioreactor performance and ¹³C- and ¹⁵N-labelling experiments show that ANME-2d is capable of independent AOM through reverse methanogenesis using nitrate as the terminal electron acceptor. Comparative analyses reveal that the genes for nitrate reduction were transferred laterally from a bacterial donor, suggesting selection for this novel process within ANME-2d. Nitrite produced by ANME-2d is reduced to dinitrogen gas through a syntrophic relationship with an anaerobic ammonium-oxidizing bacterium, effectively outcompeting ‘M. oxyfera’ in the system. We propose the name Candidatus ‘Methanoperedens nitroreducens’ for the ANME-2d population and the family Candidatus ‘Methanoperedenaceae’ for the ANME-2d lineage. We predict that ‘M. nitroreducens’ and other members of the ‘Methanoperedenaceae’ have an important role in linking the global carbon and nitrogen cycles in anoxic environments.

    Erratum: Anaerobic oxidation of methane coupled to nitrate reduction in a novel archaeal lineage

    Nature 500, 567–570 (2013); doi:10.1038/nature12375. In this Letter, equation (1) was inadvertently shown incorrectly, with CO2 missing from the reaction products. The correct equation (1) is shown below: This has been corrected in the HTML and PDF versions of the original manuscript.