13 research outputs found

    A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset

    Background: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. Results: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a “subpopulation aware” 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). Conclusions: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. 
    Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment. © 2024, The Author(s). Open access journal.

    An imputation platform to enhance integration of rice genetic resources

    As sequencing and genotyping technologies evolve, crop genetics researchers accumulate increasing numbers of genomic data sets from various genotyping platforms on different germplasm panels. Imputation is an effective approach to increase marker density of existing data sets toward the goal of integrating resources for downstream applications. While a number of imputation software packages are available, the limitations to utilization for the rice community include high computational demand and lack of a reference panel. To address these challenges, we develop the Rice Imputation Server, a publicly available web application leveraging genetic information from a globally diverse rice reference panel assembled here. This resource allows researchers to benefit from increased marker density without needing to perform imputation on their own machines. We demonstrate improvements that imputed data provide to rice genome-wide association (GWA) results of grain amylose content and show that the major functional nucleotide polymorphism is tagged only in the imputed data set.

    Discovery of genomic variants associated with genebank historical traits for rice improvement: SNP and indel data, phenotypic data, and GWAS results

    This dataset provides supporting information for Sanciangco et al. (submitted) consisting of: A) file list, tables of phenotypes for quantitative and categorical traits and trait descriptions, and tables of SNP/indel numbers for Filtered, LD-pruned and subpopulation datasets (7 files named as "00_*"); B) plink files for Filtered and LD-pruned SNP/indel datasets for all genotypes and for indica, japonica and aus subsets (15 files named as "01_*"); C) EMMAX results on Filtered dataset for 12 quantitative traits on All, Aus, Indica, and Japonica genotypes and corresponding Manhattan and QQ plots (144 files named as "0[2345]_*"); D) EMMAX results on LD-pruned dataset for 12 quantitative traits on All, Aus, Indica, and Japonica genotypes and corresponding Manhattan and QQ plots (72 files named as "0[6789]_*"); E) EMMAX results on LD-pruned dataset for 20 categorical traits treated as numeric on All genotypes and corresponding Manhattan and QQ plots (60 files named as "10_*"); F) ANOVA results obtained on numerically transformed LD-pruned dataset for 20 categorical traits on All genotypes and corresponding Manhattan plots (40 files named as "11_*").
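
The file-naming scheme above can be used to sort a downloaded copy of the dataset into its six groups. The following is a minimal sketch, assuming only the prefix patterns quoted in the description; the example file names are hypothetical and are not taken from the dataset itself.

```python
from fnmatch import fnmatchcase

# Group letters and glob patterns exactly as quoted in the dataset description.
GROUP_PATTERNS = {
    "A": "00_*",        # file list, phenotype tables, SNP/indel count tables
    "B": "01_*",        # plink files
    "C": "0[2345]_*",   # EMMAX results, Filtered dataset, quantitative traits
    "D": "0[6789]_*",   # EMMAX results, LD-pruned dataset, quantitative traits
    "E": "10_*",        # EMMAX results, LD-pruned dataset, categorical traits
    "F": "11_*",        # ANOVA results, LD-pruned dataset, categorical traits
}


def classify(filename):
    """Return the group letter whose pattern matches the file name, else None."""
    for group, pattern in GROUP_PATTERNS.items():
        if fnmatchcase(filename, pattern):
            return group
    return None


# Hypothetical file names, for illustration only.
for name in ["00_file_list.txt", "03_example.txt", "07_example.png", "README"]:
    print(name, "->", classify(name))
```

Because the six patterns do not overlap, the order in which they are checked does not affect the result.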

    A platinum standard pan-genome resource that represents the population structure of Asian rice [Data paper]

    As the human population grows from 7.8 billion to 10 billion over the next 30 years, breeders must do everything possible to create crops that are highly productive and nutritious, while simultaneously having less of an environmental footprint. Rice will play a critical role in meeting this demand and thus, knowledge of the full repertoire of genetic diversity that exists in germplasm banks across the globe is required. To meet this demand, we describe the generation, validation and preliminary analyses of transposable element and long-range structural variation content of 12 near-gap-free reference genome sequences (RefSeqs) from representatives of 12 of 15 subpopulations of cultivated Asian rice. When combined with 4 existing RefSeqs, that represent the 3 remaining rice subpopulations and the largest admixed population, this collection of 16 Platinum Standard RefSeqs (PSRefSeq) can be used as a template to map resequencing data to detect virtually all standing natural variation that exists in the pan-genome of cultivated Asian rice.

    State of ex situ conservation of landrace groups of 25 major crops

    Crop landraces have unique local agroecological and societal functions and offer important genetic resources for plant breeding. Recognition of the value of landrace diversity and concern about its erosion on farms have led to sustained efforts to establish ex situ collections worldwide. The degree to which these efforts have succeeded in conserving landraces has not been comprehensively assessed. Here we modelled the potential distributions of eco-geographically distinguishable groups of landraces of 25 cereal, pulse and starchy root/tuber/fruit crops within their geographic regions of diversity. We then analysed the extent to which these landrace groups are represented in genebank collections, using geographic and ecological coverage metrics as a proxy for genetic diversity. We find that ex situ conservation of landrace groups is currently moderately comprehensive on average, with substantial variation among crops; a mean of 63% ± 12.6% of distributions is currently represented in genebanks. Breadfruit, bananas and plantains, lentils, common beans, chickpeas, barley and bread wheat landrace groups are among the most fully represented, whereas the largest conservation gaps persist for pearl millet, yams, finger millet, groundnut, potatoes and peas. Geographic regions prioritized for further collection of landrace groups for ex situ conservation include South Asia, the Mediterranean and West Asia, Mesoamerica, sub-Saharan Africa, the Andean mountains of South America and Central to East Asia. With further progress to fill these gaps, a high degree of representation of landrace group diversity in genebanks is feasible globally, thus fulfilling international targets for their ex situ conservation.