Comparison of phasing strategies for whole human genomes
<div><p>Humans are a diploid species: each individual inherits one set of chromosomes paternally and one homologous set maternally. Most human sequencing initiatives nevertheless do not directly delineate the nucleotide content of the maternal and paternal copies of the 23 chromosomes an individual possesses (i.e., they do not ‘phase’ the genome), often because of the cost and complexity of doing so. We compared 11 widely used approaches to phasing human genomes, using the publicly available ‘Genome-In-A-Bottle’ (GIAB) phased version of the NA12878 genome as a gold standard. The phasing strategies we compared included laboratory-based assays that prepare DNA in unique ways to facilitate phasing, as well as purely computational approaches that seek to reconstruct phase information either from general sequencing reads and constructs or from population-level haplotype frequency information obtained through a reference panel of haplotypes. To assess the performance of the 11 approaches, we used metrics that included, among others, switch error rates, haplotype block lengths, the proportion of fully phase-resolved genes, and phasing accuracy and yield between pairs of SNVs. Our comparisons suggest that a hybrid approach that combines (1) population-based phasing using the SHAPEIT software suite, (2) either genome-wide sequencing read data or parental genotypes, and (3) a large reference panel of variant and haplotype frequencies provides a fast and efficient way to produce highly accurate phase-resolved individual human genomes. For population-based approaches, we found that phasing performance is enhanced by the addition of genome-wide read data, e.g., whole-genome shotgun and/or RNA sequencing reads. Further, we found that including parental genotype data within a population-based phasing strategy can provide as much as a ten-fold reduction in phasing errors. 
We also considered a majority-voting scheme that combines multiple predictions into a consensus haplotype for enhanced performance and site coverage. Finally, we identified DNA sequence signatures associated with the genomic regions harboring phasing switch errors, which included regions of low polymorphism or low SNV density.</p></div>
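The majority-voting idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and the abstract does not specify their voting rules. The sketch assumes each prediction is a single phase block stored as a dictionary from position to an allele pair, that the overall orientation of each prediction is arbitrary, and that ties are left unphased. The function names and the toy data are hypothetical.

```python
from collections import Counter


def flip(prediction):
    """Swap the two haplotypes of a prediction (the labels are arbitrary)."""
    return {pos: (b, a) for pos, (a, b) in prediction.items()}


def orient_to(reference, prediction):
    """Flip `prediction` if it disagrees with `reference` at most shared sites."""
    shared = [pos for pos in prediction if pos in reference]
    agree = sum(prediction[pos] == reference[pos] for pos in shared)
    if agree * 2 < len(shared):
        return flip(prediction)
    return prediction


def majority_vote(predictions):
    """Build a consensus haplotype from several single-block predictions.

    Assumption: the first prediction fixes the orientation. Sites with a
    tied vote are omitted, i.e. left unphased.
    """
    if not predictions:
        return {}
    reference = predictions[0]
    oriented = [reference] + [orient_to(reference, p) for p in predictions[1:]]
    consensus = {}
    for pos in sorted(set().union(*oriented)):
        votes = Counter(p[pos] for p in oriented if pos in p)
        ranked = votes.most_common(2)
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            consensus[pos] = ranked[0][0]
    return consensus


# Toy data: three hypothetical predictions over heterozygous sites.
pred_a = {100: (0, 1), 200: (1, 0), 300: (0, 1)}
pred_b = {100: (1, 0), 200: (0, 1), 300: (0, 1), 400: (0, 1)}  # flipped labels
pred_c = {100: (0, 1), 200: (1, 0), 300: (0, 1), 400: (0, 1)}
consensus = majority_vote([pred_a, pred_b, pred_c])
print(consensus)
```

On this toy input, positions 100, 200, and 300 receive a majority call, while position 400 is tied between two predictions and is left out. A consensus over real data would also need to resolve orientation separately for each phase block, which this sketch does not attempt.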
Comparing the genomic location of switch errors across phasing approaches.
Switch error rates across phasing strategies as a function of minor allele frequency.
<p>(A) Laboratory-based phasing, (B) Read-based and majority voting, (C) Population-based phasing, (D) Hybrid population and read-based, and (E) Hybrid population and familial data from parental genotype.</p>
Phasing accuracy of SHAPEIT approaches and the choice of reference panels used.
<p>(a) Effect of population supergroups on phasing accuracy. Five supergroups of the same size (n = 347) were collected from the 1000GP and used as the reference panel for SHAPEIT phasing, either alone (no reads) or together with Illumina or PacBio reads for the NA12878 individual. The best switch error rate (SER) was achieved with EUR, the supergroup to which the NA12878 individual belongs. (b) Effect of population subgroups on phasing accuracy. Population subgroups of the same size (n = 85) were collected from the 1000GP, EUR, and each of five subpopulations in EUR and used as the reference panel for SHAPEIT phasing of the NA12878 individual. No major improvement in SER was observed among EUR and its five subgroups, including EUR/CEU, to which the NA12878 individual belongs. (c) Phasing accuracy (SER) as a function of reference panel size, compared with the inclusion of WGS reads or familial information from parental genotypes. Reference panels containing up to 502 individuals from the 1000GP EUR group or 23k individuals from HRC were used as the population background for SHAPEIT phasing of the NA12878 individual.</p>
Performance summary of population-based phasing approaches supplemented with sequence reads and/or parental genotype information.
Phasing accuracy and haplotype diversity.
<p>Switch error rates for various strategies are shown as a function of haplotype diversity of a reference population based on the 1000GP reference panel.</p>
Phasing accuracy of disease-associated genes for the reference individual NA12878.
Phasing performance comparison based on pairwise SNV haplotype assignment.
<p>(A) Phasing accuracy. Probability that a pair of SNVs in the same phasing block is correctly phased with respect to each other, as a function of the distance between the pair. (B) Phasing yield. Probability that a pair of SNVs is phased in the same phasing block, as a function of the distance between the pair.</p>
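As a rough illustration of the two pairwise metrics described in this caption, the following Python sketch estimates accuracy and yield per distance bin. It is a sketch under stated assumptions rather than the authors' evaluation code: the truth set is taken to be fully phased, every listed site is heterozygous, each predicted site carries a hypothetical block label, and the function name, the `bin_size` parameter, and the toy data are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_metrics(truth, predicted, bin_size=1000):
    """Return {bin_start: (accuracy, yield)} over pairs of heterozygous SNVs.

    truth:     {position: (allele_hap1, allele_hap2)}, assumed fully phased
    predicted: {position: (block_id, (allele_hap1, allele_hap2))}
    Yield is the fraction of pairs placed in the same predicted block.
    Accuracy is the fraction of those pairs whose relative phase matches
    the truth. It is None when no pair in the bin was phased together.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # total, phased, correct
    for i, j in combinations(sorted(truth), 2):
        bucket = counts[(j - i) // bin_size * bin_size]
        bucket[0] += 1
        if i not in predicted or j not in predicted:
            continue
        (block_i, alleles_i), (block_j, alleles_j) = predicted[i], predicted[j]
        if block_i != block_j:
            continue
        bucket[1] += 1
        # A pair is correct when both sites match the truth or both are flipped.
        if (alleles_i == truth[i]) == (alleles_j == truth[j]):
            bucket[2] += 1
    return {
        start: (correct / phased if phased else None, phased / total)
        for start, (total, phased, correct) in sorted(counts.items())
    }


# Toy data: four heterozygous sites, two predicted blocks "A" and "B".
truth = {100: (0, 1), 600: (1, 0), 1500: (0, 1), 4000: (1, 0)}
predicted = {
    100: ("A", (0, 1)),
    600: ("A", (1, 0)),
    1500: ("A", (1, 0)),  # flipped relative to the truth
    4000: ("B", (1, 0)),
}
metrics = pairwise_metrics(truth, predicted)
print(metrics)
```

For the toy data, the 0 to 999 bp bin has an accuracy of 0.5 and a yield of 1.0, whereas the pairs involving position 4000 fall in a different block and so contribute a yield of 0.0. Enumerating all pairs is quadratic in the number of sites, so a genome-scale version would need to cap the pair distance.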
Phasing accuracy and SNV density.
<p>(A) Basic phasing approaches. (B) SHAPEIT phasing supplemented with reference panel, sequence read, or parental genotype information. Switch error rates for various phasing strategies are shown as a function of distance between a heterozygous site and its upstream phased site.</p>
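To make the quantity in this caption concrete, the Python sketch below counts switch errors and records, for each phased heterozygous site, the distance to its upstream phased site. It assumes a single phase block and a fully phased truth set. The function name and the toy data are hypothetical, and the authors' exact definition of the switch error rate may differ in detail.

```python
def switch_errors(truth, predicted):
    """Return (switch_error_rate, records) for one phase block.

    truth, predicted: {position: (allele_hap1, allele_hap2)} at heterozygous
    sites. Each record is (position, distance_to_upstream_site, is_switch).
    A switch is counted when the orientation relative to the truth changes
    between a site and the phased site immediately upstream of it.
    """
    sites = sorted(pos for pos in predicted if pos in truth)
    records = []
    for upstream, site in zip(sites, sites[1:]):
        upstream_matches = predicted[upstream] == truth[upstream]
        site_matches = predicted[site] == truth[site]
        records.append((site, site - upstream, upstream_matches != site_matches))
    switches = sum(is_switch for _, _, is_switch in records)
    rate = switches / len(records) if records else 0.0
    return rate, records


# Toy data: five heterozygous sites with two orientation changes.
truth = {100: (0, 1), 250: (1, 0), 900: (0, 1), 5000: (0, 1), 5100: (1, 0)}
predicted = {100: (0, 1), 250: (1, 0), 900: (1, 0), 5000: (1, 0), 5100: (1, 0)}
rate, records = switch_errors(truth, predicted)
print(rate)
for position, distance, is_switch in records:
    print(position, distance, is_switch)
```

With the toy data, two of the four adjacent site pairs change orientation, giving a rate of 0.5. Binning the recorded distances and computing the switch fraction per bin would yield a curve of the kind the caption describes.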
Performance summary across experimental-, population-, read-based, and majority vote phasing approaches.