    High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs

    <div><p>Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. 
We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30–250 CPU hours per sample) remain a significant challenge to practical application.</p></div>
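
    The abstract above describes its third step only as inferring the most likely pair of alleles per locus with "a simple likelihood framework". The following is a minimal Python sketch of one common way such a framework can look, not the published HLA*PRG implementation. It assumes that each read already comes with a likelihood P(read | allele) for every candidate allele, and that a read is equally likely to originate from either chromosome copy. The function name, the allele names and the toy likelihood values are all hypothetical.

```python
import math
from itertools import combinations_with_replacement


def best_allele_pair(read_likelihoods, alleles):
    """Return the allele pair with the highest log-likelihood.

    read_likelihoods: list of dicts, one per read, mapping
    allele name -> P(read | allele). Assumed to be precomputed.
    """
    best_pair, best_ll = None, -math.inf
    # Enumerate unordered pairs, including homozygous ones.
    for a1, a2 in combinations_with_replacement(alleles, 2):
        ll = 0.0
        for r in read_likelihoods:
            # Assumption: each read stems from either copy with probability 0.5.
            ll += math.log(0.5 * r[a1] + 0.5 * r[a2])
        if ll > best_ll:
            best_pair, best_ll = (a1, a2), ll
    return best_pair, best_ll


# Toy input: two reads fit the first allele, two fit the second.
alleles = ["A*01:01", "A*02:01", "A*03:01"]
reads = [
    {"A*01:01": 0.9, "A*02:01": 0.01, "A*03:01": 0.01},
    {"A*01:01": 0.9, "A*02:01": 0.01, "A*03:01": 0.01},
    {"A*01:01": 0.01, "A*02:01": 0.9, "A*03:01": 0.01},
    {"A*01:01": 0.01, "A*02:01": 0.9, "A*03:01": 0.01},
]
pair, ll = best_allele_pair(reads, alleles)
print(pair, round(ll, 3))
```

    The exhaustive loop over pairs is quadratic in the number of candidate alleles, so a sketch like this is only practical for small candidate sets.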

    Runtime and memory requirements comparison of HLA*PRG, PHLAT and HLAReporter on NA12878.

    <p>Upper part: NA12878 2 x 100bp reads from the Platinum cohort; lower part: NA12878 2 x 250bp reads from the 1000 Genomes cohort. We provide a detailed analysis in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1005151#pcbi.1005151.s014" target="_blank">S1 Text</a>.</p>

    Schematic representation of HLA type inference using HLA*PRG.

    <p><b>a</b> Broad-scale structure of the HLA PRG. The included genes are separated by spacer blocks consisting of N characters. <b>b</b> Fine-scale structure of the PRG input sequences. Exons, introns and UTRs are embedded in regional haplotypes (padding sequence). Exon sequences typically outnumber intron sequences. The red line indicates the region covered by IMGT genomic sequences. X-axis not to scale. <b>c</b> For each gene represented in the PRG, multiple sequence alignments representing up to 3 sources of sequence data are merged for PRG construction: exonic sequences, genomic (UTR, exons, introns) sequences, regional haplotypes (“xMHC Ref.”). Using alleles present in both the current and the next-higher-level MSA (identifiers printed in red), the merging algorithm determines consensus boundaries (blue bars) to connect the MSAs of different input sequence types. For each segment so-defined, we use the MSA corresponding to the highest-resolution input sequence type (sequence characters therefore ignored are printed in grey). <b>d</b> The PRG corresponding to the input sequences shown in c, and a seed-and-extend alignment of a sequencing read to the PRG. PRG nodes are represented by boxes and edges by labelled arrows. The four blue markers correspond to the consensus MSA boundaries shown in c. The aligned sequence of the read is displayed below the PRG, and the alignment path (the sequence of edges and nodes traversed in the PRG) is highlighted. The red component of the alignment path corresponds to the exact-match “seed” component of the alignment (spanning a graph-encoded gap), whereas the orange components correspond to the “extend” component of the alignment (where mismatches are allowed).</p>

    <i>HLA</i> Diversity in the 1000 Genomes Dataset

    <div><p>The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation by sequencing at a level that should allow the genome-wide detection of most variants with frequencies as low as 1%. However, in the major histocompatibility complex (MHC), only the top 10 most frequent haplotypes are in the 1% frequency range whereas thousands of haplotypes are present at lower frequencies. Given the limitation of both the coverage and the read length of the sequences generated by the 1000 Genomes Project, the highly variable positions that define <i>HLA</i> alleles may be difficult to identify. We used classical Sanger sequencing techniques to type the <i>HLA-A, HLA-B, HLA-C, HLA-DRB1</i> and <i>HLA-DQB1</i> genes in the available 1000 Genomes samples and combined the results with the 103,310 variants in the MHC region genotyped by the 1000 Genomes Project. Using pairwise identity-by-descent distances between individuals and principal component analysis, we established the relationship between ancestry and genetic diversity in the MHC region. As expected, both the MHC variants and the <i>HLA</i> phenotype can identify the major ancestry lineage, informed mainly by the most frequent <i>HLA</i> haplotypes. To some extent, regions of the genome with similar genetic diversity or similar recombination rate have similar properties. An MHC-centric analysis underlines departures between the ancestral background of the MHC and the genome-wide picture. Our analysis of linkage disequilibrium (LD) decay in these samples suggests that overestimation of pairwise LD occurs due to a limited sampling of the MHC diversity. This collection of <i>HLA</i>-specific MHC variants, available on the <i>db</i>MHC portal, is a valuable resource for future analyses of the role of MHC in population and disease studies.</p></div>

    Spearman’s rank correlation coefficients <i>ρ</i> between regional sub-file sizes and cumulated HF and corresponding <i>p</i> values.

    <p>For <i>n</i> = 50,000 (corresponding to ten regional sub-files with size <i>n</i> = 5,000), correlations between the regional sub-file sizes of the total data set (<i>n</i> = 123,749) and the cumulated HF of the regional random samples are displayed.</p><p>HF = haplotype frequency.</p>
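
    The caption above reports Spearman's rank correlation coefficients without showing how they are obtained. As a minimal sketch, the coefficient can be computed as the Pearson correlation of the ranks, with tied values given their average rank. The function names and the example numbers below are illustrative assumptions and are not values from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: sub-file sizes versus a hypothetical cumulated frequency.
sizes = [5243, 8100, 12000, 15500, 19661]
cumulated_hf = [0.41, 0.39, 0.44, 0.43, 0.47]
print(round(spearman_rho(sizes, cumulated_hf), 3))
```

    The sketch returns only the coefficient. The corresponding <i>p</i> values need a significance test, which it does not implement.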

    Principal component analysis of the pairwise IBD distances between 1000 Genomes samples using MHC region markers (A), genome-wide markers (B), markers of a region with a similar variant density (C, chr9:116,750,000–121,650,000), and markers of a region with a similar recombination rate (D, chr9:800,000–5,700,000).

    <p>(A) <i>The presence of the most frequent ancestry-specific HLA haplotype in the samples of the 1000 Genomes project using MHC region markers</i>. Principal component analysis of the 103 K variants from the MHC region in the 1000 Genomes samples. PC1 captures 6.00% of total variance; PC2 captures 5.05%. The PCA is based on publicly available SNPs. To integrate the SNP-based information with the HLA allele information, individual spots are replaced by letters when a frequent <i>HLA</i> haplotype is predicted after the <i>HLA</i> typing is phased using <i>HLA</i> haplotype frequencies. The so-called “frequent” haplotypes are defined in an ancestry-specific manner: P for frequent <i>HLA</i> haplotypes in Europeans, S for frequent <i>HLA</i> haplotypes in Asians, H for frequent <i>HLA</i> haplotypes in Hispanics and F for frequent haplotypes in Africans. The detailed list of the frequent haplotypes is presented in supplementary information. Frequent haplotypes and definition of overlap between ancestries were documented in a recent modeling effort for the development of haplobank. (B) <i>Principal Component analysis of the pairwise IBD distances between 1000 Genomes samples using genome-wide markers</i>. Principal component analysis of 100 K variants selected at random throughout the genome in the 1000 Genomes samples. PC1 captures 55.16% of total variance; PC2 captures 41.96%. The representation of distances computed from genome-wide SNPs clearly identifies samples of European, Asian and African ancestries. The results are consistent with self-declared ancestry and the admixed nature of several populations. There are however a few notable exceptions: NA20314 from the African Americans from the Southwest US (ASW) clusters with Mexicans (MXL), NA20291 from ASW clusters with LWK, and HG01108 from the Puerto Ricans (PUR) clusters with the majority of African Americans (ASW). 
In addition, four Colombians (CLM: HG01342, HG01390, HG01462, HG01551) and three African Americans (ASW: NA20278, NA20299, NA20414) cluster together away from their groups. These also cluster far from their self-declared ancestry in the MHC-centered analysis. This most likely reflects their genome-wide ancestry rather than a different ancestry of the MHC. (C) <i>Principal Component analysis of the pairwise IBD distances of 1000 Genomes samples using genome-wide markers of a region (chr9:116,750,000–121,650,000) with a variant density that is similar to the</i> MHC <i>region</i>. Principal component analysis of 100 K variants selected at random throughout the genome in the 1000 Genomes samples. PC1 captures 2.98% of total variance; PC2 captures 1.56%. The representation of distances computed from genome-wide SNPs clearly identifies samples of European, Asian and African ancestries. PC1 and PC2 have been flipped to ease the comparison of the patterns in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0097282#pone-0097282-g001" target="_blank">Figures 1A and 1B</a>. (D) <i>Principal Component analysis of the pairwise IBD distances of 1000 Genomes samples using genome-wide markers of a region</i> (chr9:800,000–5,700,000) <i>with an average recombination rate that is similar to the</i> MHC <i>region</i>. Principal component analysis of 100 K variants selected at random throughout the genome in the 1000 Genomes samples. PC1 captures 2.55% of total variance; PC2 captures 1.57%. The representation of distances computed from genome-wide SNPs clearly identifies samples of European, Asian and African ancestries. PC1 and PC2 have been flipped to ease the comparison of the patterns in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0097282#pone-0097282-g001" target="_blank">Figures 1A and 1B</a>.</p>

    Regional HLA Differences in Poland and Their Effect on Stem Cell Donor Registry Planning

    <div><p>Regional HLA frequency differences are of potential relevance for the optimization of stem cell donor recruitment. We analyzed a very large sample (<i>n</i> = 123,749) of registered Polish stem cell donors. Donor figures by 1-digit postal code regions ranged from <i>n</i> = 5,243 (region 9) to <i>n</i> = 19,661 (region 8). Simulations based on region-specific haplotype frequencies showed that donor recruitment in regions 0, 2, 3 and 4 (mainly located in the south-eastern part of Poland) resulted in an above-average increase of matching probabilities for Polish patients. Regions 1, 7, 8, 9 (mainly located in the northern part of Poland) showed an opposite behavior. However, HLA frequency differences between regions were generally small. A strong indication for regionally focused donor recruitment efforts can, therefore, not be derived from our analyses. Results of haplotype frequency estimations showed sample size effects even for sizes between <i>n</i>≈5,000 and <i>n</i>≈20,000. This observation deserves further attention as most published haplotype frequency estimations are based on much smaller samples.</p></div>

    Overview of the 1000 Genomes project samples typed for HLA genes.

    <p>Ibericos from Spain (n = 14) were genotyped in the KGP but were not available for <i>HLA</i> typing at the time of the project. Chinese Han from Denver were typed for <i>HLA</i>; they are currently publicly available for sequencing data.</p>