Human Whole-Exome Genotype Data for Alzheimer's Disease
The heterogeneity of whole-exome sequencing (WES) data generation methods presents a challenge to joint analysis. Here we present a bioinformatics strategy for joint-calling 20,504 WES samples collected across nine studies and sequenced using ten capture kits in fourteen sequencing centers in the Alzheimer's Disease Sequencing Project. The joint-genotyped variant call format (VCF) file contains only positions within the union of capture kits. The VCF was then processed specifically to account for the batch effects arising from the use of different capture kits in different studies. We identified 8.2 million autosomal variants. 96.82% of the variants are high-quality and are located in 28,579 Ensembl transcripts. 41% of the variants are intronic, and 1.8% of the variants have CADD scores > 30, indicating high predicted pathogenicity. Here we show that our new strategy can generate high-quality data from these diversely generated WES samples. The improved ability to combine data sequenced in different batches benefits the whole genomics research community.
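The restriction to positions within the union of capture kits can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each kit is given as a list of half-open, 0-based (start, end) target intervals on a single chromosome (BED-style), and the function names `merge_intervals` and `in_union` are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(kits):
    """Return the union of all kits' (start, end) intervals as a sorted,
    non-overlapping list. Intervals are half-open: start <= pos < end."""
    all_intervals = sorted(iv for kit in kits for iv in kit)
    merged = []
    for start, end in all_intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def in_union(merged, pos):
    """True if pos falls inside any interval of the merged union."""
    starts = [s for s, _ in merged]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < merged[i][1]


# Hypothetical targets for two capture kits on one chromosome.
kit_a = [(100, 200), (500, 600)]
kit_b = [(150, 250), (700, 800)]
union = merge_intervals([kit_a, kit_b])
```

A variant would be kept only when `in_union(union, pos)` is true for its position. A real implementation would key the intervals by chromosome and would typically rely on an established interval tool rather than hand-rolled merging.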
ADSP Whole Genome Sequencing (WGS) Release 3 data update from Genome Center for Alzheimer’s Disease
Background
The Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration and meta‐analysis of all available Alzheimer’s disease (AD) relevant whole genome sequencing (WGS) data with the goal of identifying AD risk or protective genetic variants and eventual therapeutic targets. The WGS datasets are generated via the collaboration of scientists from the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD. With the vision to minimize data heterogeneity, introduced by different sequencing protocols and machines, GCAD processes all samples using identical pipelines and performs quality assurance (QA) checks.
Methods
Raw sequencing data (FASTQs or BAMs) were aligned to GRCh38/hg38 with BWA, and variant calling and joint genotyping were done with GATK. In addition, Smoove, Manta and Strelka were applied to generate structural variant (SV) calls per sample. QA checks including sex, contamination and genotype concordance, as well as the ADSP QC protocol, were performed to evaluate the quality of samples and variants. To facilitate access to and use of the large joint‐genotyped VCF files, we introduced a compact version that stores variant info and sample genotypes only.
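The idea of a compact VCF that keeps variant info and sample genotypes only can be sketched as follows. This is an illustration under stated assumptions, not GCAD's actual compact format: it assumes standard tab-delimited VCF data lines with a FORMAT column, and the helper name `compact_vcf_line` is hypothetical.

```python
def compact_vcf_line(line):
    """Keep the eight fixed VCF columns and reduce every sample column
    to its GT (genotype) subfield. Header lines, and data lines without
    sample columns or without a GT key, are returned unchanged."""
    text = line.rstrip("\n")
    fields = text.split("\t")
    if text.startswith("#") or len(fields) < 10:
        return text
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:AD:DP
    if "GT" not in keys:
        return text
    gt_idx = keys.index("GT")
    genotypes = []
    for sample in fields[9:]:
        parts = sample.split(":")
        # Trailing subfields may be dropped in VCF; fall back to missing.
        genotypes.append(parts[gt_idx] if gt_idx < len(parts) else "./.")
    return "\t".join(fields[:8] + ["GT"] + genotypes)


# Hypothetical data line with two samples.
example = "chr1\t1000\t.\tA\tG\t50\tPASS\tAC=1\tGT:AD:DP\t0/1:10,5:15\t0/0:20,0:20"
compact = compact_vcf_line(example)
```

Dropping per-sample annotations such as allele depths is what makes a file of this kind much smaller than the full joint-genotyped VCF, at the cost of losing the information needed for genotype-level QC.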
Results
We dropped 235 (1.3%) samples of poor coverage (30x). All samples’ CRAMs, gVCFs from GATK, and VCFs from the three SV callers were deposited into NIAGADS Data Sharing Service (DSS) (https://dss.niagads.org/) for public distribution. In addition, joint‐genotype VCFs are available in both compact and QC versions. This joint‐genotype VCF contains >206M bi‐allelic single‐nucleotide variants, 16M bi‐allelic indels and 28M multi‐allelic variants, with 96% of variants remaining after stringent QC.
Conclusion
The ADSP and GCAD generate high quality genotype calls and SV calls. Currently the project is processing ∼37,000 WGS samples sequenced primarily through the ADSP Follow‐Up Study, which will contain a more ancestrally diverse set of populations. We anticipate this 2022 release will continue to benefit the research community studying AD genetics.
Apolipoprotein E Genotype and Sex Risk Factors for Alzheimer Disease: A Meta-analysis
It is unclear whether female carriers of the apolipoprotein E (APOE) ε4 allele are at greater risk of developing Alzheimer disease (AD) than men, and the sex-dependent association of mild cognitive impairment (MCI) and APOE has not been established.
ADSP Whole Genome Sequencing (WGS) Release 4 Data Update from Genome Center for Alzheimer’s Disease
Background
The Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration of all available Alzheimer’s disease (AD) relevant whole genome sequencing (WGS) data with the goal of identifying AD risk or protective genetic variants and eventual therapeutic targets. The WGS datasets are generated through collaboration between investigators from the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD. With the goal of minimizing data heterogeneity introduced by different sequencing protocols and assays, GCAD processes all samples using standardized pipelines and performs quality control (QC)/quality assurance (QA) checks.
Methods
Raw sequencing data (FASTQs or BAMs) were aligned to GRCh38/hg38 with BWA, and variant calling and joint genotyping of single nucleotide variants (SNVs) and insertions and deletions (indels) were done with GATK. Structural variants (SVs) were called per sample using the Smoove, Manta, and Strelka packages. Preliminary QA checks including sex check, contamination, and genotype concordance were performed, followed by QC per ADSP protocol to evaluate the quality of samples and variants. To facilitate access to and use of the massive joint‐genotyped VCF files, a compact version storing variant info and sample genotypes only was released first.
Results
We dropped 275 (0.7%) samples of poor coverage (362M bi‐allelic variants, >58M multi‐allelic variants, with 95% of variants remaining after QC. SV calling is ongoing and data will be ready prior to the conference.
Conclusion
The ADSP and GCAD generate high quality SNV, indel and SV calls. Currently GCAD is preparing the next release of ∼60,000 more ancestrally‐diverse WGS samples sequenced primarily through the ADSP Follow‐Up Study, which we anticipate will be released in 2023 to greatly benefit the AD genetics community.