501 research outputs found

    SNP imputation bias reduces effect size determination

    Imputation is a commonly used technique that exploits linkage disequilibrium to infer missing genotypes in genetic datasets, using a well characterized reference population. While there is agreement that the reference population has to match the ethnicity of the query dataset, it is common practice to use the same reference to impute genotypes for a wide variety of phenotypes. We hypothesized that using a reference composed of samples with a different phenotype than the query dataset would introduce imputation bias. To test this hypothesis we used GWAS datasets from amyotrophic lateral sclerosis, Parkinson disease, and Crohn disease. First, we masked and then performed imputation of 100 disease-associated markers and 100 non-associated markers from each study. Two references for imputation were used in parallel: one consisting of healthy controls and another consisting of patients with the same disease. We assessed the discordance (imprecision) and bias (inaccuracy) of imputation by comparing predicted genotypes to those assayed by SNP-chip. We also assessed the bias on the observed effect size when the predicted genotypes were used in a GWAS study. When healthy controls were used as reference for imputation, a significant bias was observed, particularly in the disease-associated markers. Using cases as reference significantly attenuated this bias. For nearly all markers, the direction of the bias favored the non-risk allele. In GWAS studies of the three diseases (with healthy controls from the 1000 Genomes as reference), the mean OR for disease-associated markers obtained by imputation was lower than that obtained using original assayed genotypes. We found that the bias is inherent to imputation, as using different methods did not alter the results. In conclusion, imputation is a powerful method to predict genotypes and estimate genetic risk for GWAS. However, a careful choice of reference population is needed to minimize biases inherent to this approach.
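
The comparison of predicted and assayed genotypes described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: genotypes are coded as risk-allele counts (0, 1, 2), discordance is taken as the fraction of mismatching calls, and bias as the mean signed difference (imputed minus assayed). The abstract does not give the authors' exact formulas, and the function names and example data are hypothetical.

```python
from typing import Sequence


def discordance(assayed: Sequence[int], imputed: Sequence[int]) -> float:
    """Fraction of genotype calls where the imputed call differs from the assayed one."""
    if len(assayed) != len(imputed) or len(assayed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(assayed, imputed))
    return mismatches / len(assayed)


def dosage_bias(assayed: Sequence[int], imputed: Sequence[int]) -> float:
    """Mean signed difference (imputed - assayed) in risk-allele count.

    A negative value means imputation under-calls the risk allele.
    """
    if len(assayed) != len(imputed) or len(assayed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(b - a for a, b in zip(assayed, imputed)) / len(assayed)


# Hypothetical genotypes for one masked marker in six individuals.
assayed_calls = [2, 1, 1, 0, 2, 1]
imputed_calls = [1, 1, 0, 0, 2, 1]

print(discordance(assayed_calls, imputed_calls))
print(dosage_bias(assayed_calls, imputed_calls))
```

In this toy example two of six calls differ, and both errors lose a risk allele, so the signed bias is negative. That is the direction the abstract reports for nearly all markers.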

    SNP imputation in association studies

    The rationale that underlies imputation methods is that even though the causal SNP may not have been genotyped in the study at hand, it may have been genotyped in the reference population. In this case, simulations have revealed that the imputation of SNPs that appear in the reference population facilitates detection of association. Imputation methods are also invaluable when multiple data sets … The large amount of data generated in whole-genome association studies, involving hundreds of thousands of SNPs genotyped in thousands of individuals, complicates the statistical and computational analysis of that data. The correlation between SNPs (linkage disequilibrium) enables much of the variation to be captured despite the inability to genotype all SNPs, and our previous primer 1 described how tagSNPs and haplotypes have been used as proxies for neighboring associations. However, especially with the advent of high-throughput genotyping technologies, the key challenge has started to shift from identifying tagSNPs that best capture genetic variation in the population to the ability to interrogate SNPs not covered by these technologies. Moreover, how does one consolidate distinct data sets when subsets of the same population are genotyped with slightly different technologies that have different capacities? Imputation methods address these problems by using the linkage disequilibrium structure in a region to infer the alleles of SNPs not directly genotyped in the study (hidden SNPs).
The starting point of imputation methods is a reference data set such as the HapMap, in which a large set of SNPs is being genotyped. The underlying assumption is that the reference samples, the cases and the controls are all sampled from the same population. Under this assumption, the three populations share the same linkage disequilibrium structure and the same haplotype distribution for every set of SNPs. Thus, the structure of the linkage disequilibrium in the reference population, in conjunction with the structure of the linkage disequilibrium of the observed SNPs within both the cases and the controls, can be used to impute the alleles of a hidden SNP. Imputed SNPs can then be tested for association using an appropriate statistical test.
Figure: (a) Every circle is a state, each column corresponds to a SNP and each row corresponds to an ancestral haplotype. According to this model, a haplotype is generated by a random walk on the Markov chain from left to right, where the transition probabilities from one haplotype to another (denoted by the dashed arrows) are determined by the recombination rate and physical distance between the two SNPs. At each position, there is a small probability that the resulting haplotype will be mutated further. A genotype is generated at the conjunction of two such haplotypes. (b) A perfect phylogeny tree explaining the genealogy of the haplotypes, and leading to a test of the hidden SNP 6. Each node in the tree corresponds to a haplotype, and each edge corresponds to a mutating position. A perfect phylogeny model assumes no recurrent mutations or recombination events. The dashed line corresponds to an unobserved SNP (at position 6), which can be tested for association by testing the haplotypes spanned by SNPs 4 and 5.
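
The random-walk model in the figure description can be illustrated with a small simulation. This is a minimal sketch, not the primer's implementation. It assumes biallelic SNPs coded 0/1, a switch probability of 1 - exp(-rho * distance) between adjacent SNPs (a common simplification, assumed here), and a uniform choice of the new ancestral haplotype after a switch. The names `sample_haplotype`, `sample_genotype`, `rho` and `mutation_prob` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_haplotype(
    ancestral: Sequence[Sequence[int]],
    positions: Sequence[float],
    rho: float,
    mutation_prob: float,
    rng: random.Random,
) -> List[int]:
    """Walk left to right over the SNPs, copying alleles from ancestral haplotypes."""
    n_states = len(ancestral)
    state = rng.randrange(n_states)  # start on a random ancestral haplotype (row)
    haplotype = []
    for j, pos in enumerate(positions):
        if j > 0:
            # Assumed transition model: switching gets more likely with
            # recombination rate and physical distance between the two SNPs.
            distance = pos - positions[j - 1]
            p_switch = 1.0 - math.exp(-rho * distance)
            if rng.random() < p_switch:
                state = rng.randrange(n_states)
        allele = ancestral[state][j]
        # Small probability that the copied allele is mutated further.
        if rng.random() < mutation_prob:
            allele = 1 - allele
        haplotype.append(allele)
    return haplotype


def sample_genotype(ancestral, positions, rho, mutation_prob, rng) -> List[int]:
    """A genotype is the allele-count sum of two independently sampled haplotypes."""
    h1 = sample_haplotype(ancestral, positions, rho, mutation_prob, rng)
    h2 = sample_haplotype(ancestral, positions, rho, mutation_prob, rng)
    return [a + b for a, b in zip(h1, h2)]


# Hypothetical ancestral haplotypes (rows) over five SNPs (columns).
ANCESTRAL = [
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
]
POSITIONS = [1000.0, 2500.0, 4000.0, 9000.0, 9500.0]  # base-pair coordinates

rng = random.Random(42)
print(sample_genotype(ANCESTRAL, POSITIONS, rho=1e-4, mutation_prob=0.01, rng=rng))
```

With `rho` and `mutation_prob` both set to zero the walk never switches or mutates, so each sampled haplotype is an exact copy of one ancestral row. That makes a convenient sanity check.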

    Utilizing Genotype Imputation for the Augmentation of Sequence Data

    In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased considerably, with the ability to genotype over 1 million SNP markers across the genome. This advancement in technology has led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have resulted in the implication of over 1500 SNPs associated with disease traits. However, the SNPs identified from these GWAS are not necessarily the functional variants. Therefore, the next phase in GWAS will involve the refining of these putative loci. A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci, followed by the association analysis of the detected variants with the disease trait. However, sequencing a locus in a large number of subjects is still relatively expensive. A more cost-effective approach would be to sequence a portion of the individuals, followed by the application of genotype imputation methods for imputing markers in the remaining individuals. A potentially attractive alternative option would be to impute based on the 1000 Genomes Project; however, this has the drawback of using a reference population that does not necessarily match the disease status and LD pattern of the study population. We explored a variety of approaches for carrying out the imputation using a reference panel consisting of sequence data for a fraction of the study participants, using data from both a candidate gene sequencing study and the 1000 Genomes Project. Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results indicate the following sequencing study design guidelines, which take advantage of the recent advances in genotype imputation methodology: select the largest and most diverse reference panel for sequencing and genotype as many "anchor" markers as possible.

    The prediction of HLA genotypes from next generation sequencing and genome scan data

    Genome-wide association studies have very successfully found highly significant disease associations with single nucleotide polymorphisms (SNP) in the Major Histocompatibility Complex for adverse drug reactions, autoimmune diseases and infectious diseases. However, the extensive linkage disequilibrium in the region has made it difficult to unravel the HLA alleles underlying these diseases. Here I present two methods to comprehensively predict 4-digit HLA types from the two types of experimental genome data widely available. The Virtual SNP Imputation approach was developed for genome scan data and demonstrated a high precision and recall (96% and 97% respectively) for the prediction of HLA genotypes. A reanalysis of 6 genome-wide association studies using the HLA imputation method identified 18 significant HLA allele associations for 6 autoimmune diseases: 2 in ankylosing spondylitis, 2 in autoimmune thyroid disease, 2 in Crohn's disease, 3 in multiple sclerosis, 2 in psoriasis and 7 in rheumatoid arthritis. The EPIGEN consortium also used the Virtual SNP Imputation approach to detect a novel association of HLA-A*31:01 with adverse reactions to carbamazepine. For the prediction of HLA genotypes from next generation sequencing data, I developed a novel approach using a naïve Bayes algorithm called HLA-Genotyper. The validation results covered whole genome, whole exome and RNA-Seq experimental designs in the European and Yoruba population samples available from the 1000 Genomes Project. The RNA-Seq data gave the best results with an overall precision and recall near 0.99 for Europeans and 0.98 for the Yoruba population. I then successfully used the method on targeted sequencing data to detect significant associations of idiopathic membranous nephropathy with HLA-DRB1*03:01 and HLA-DQA1*05:01 using the 1000 Genomes European subjects as controls. 
Using the results reported here, researchers may now readily unravel the association of HLA alleles with many diseases from genome scans and next generation sequencing experiments without the expensive and laborious HLA typing of thousands of subjects. Both algorithms enable the analysis of diverse populations to help researchers pinpoint HLA loci with biological roles in infection, inflammation, autoimmunity, aging, mental illness and adverse drug reactions

    Use of partial least squares regression to impute SNP genotypes in Italian Cattle breeds

    Background: The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low density single nucleotide polymorphism (SNP) panels, i.e. 3K or 7K, to a high density panel with 50K SNP. No pedigree information was used. Methods: Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied by using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained by using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content. Results: In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90% and 94% for the 3K and 7K platforms, respectively; corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, computing time required by the partial least squares regression method was on average around 10 times lower than computing time required by Beagle. Using the partial least squares regression method in the multi-breed analysis resulted in lower imputation accuracies than using single-breed data. The impact of the SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained by using imputed versus actual genotypes was around 0.96 for the 7K chip. Conclusions: Results of the present work suggested that the partial least squares regression imputation method could be useful to impute SNP genotypes when pedigree information is not available.

    Comparison of Haplotype-based and Tree-based SNP Imputation in Association Studies

    Missing single nucleotide polymorphisms (SNPs) are quite common in genetic association studies. Subjects with missing SNPs are often discarded in analyses, which may seriously undermine the inference of SNP-disease association. In this article, we compare two haplotype-based imputation approaches and one regression tree-based imputation approach for association studies. The goal is to assess the imputation accuracy, and to evaluate the impact of imputation on parameter estimation. Haplotype-based approaches build on haplotype reconstruction by the expectation-maximization (EM) algorithm or a weighted EM (WEM) algorithm, depending on whether case-control status is taken into account. The tree-based approach uses a Gibbs sampler to iteratively sample from a full conditional distribution, which is obtained from the classification and regression tree (CART) algorithm. We employ a standard multiple imputation procedure to account for the uncertainty of imputation. We apply the methods to simulated data as well as a case-control study on developmental dyslexia. Our results suggest that imputation generally improves over the standard practice of ignoring missing data in terms of bias and efficiency. The haplotype-based approaches slightly outperform the tree-based approach when there are a small number of SNPs in linkage disequilibrium (LD), but the latter has a computational advantage. Finally, we demonstrate that utilizing the disease status in imputation helps to reduce the bias in the subsequent parameter estimation
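
The "standard multiple imputation procedure" mentioned in this abstract is commonly taken to mean pooling per-imputation estimates with Rubin's rules. That reading is an assumption here, since the abstract does not spell out the procedure. A minimal sketch with a hypothetical function name and made-up numbers:

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Pool m per-imputation estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations with one variance each")
    pooled = mean(estimates)       # average of the per-imputation estimates
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical log odds ratios for one SNP from three imputed data sets.
log_or_estimates = [0.40, 0.50, 0.60]
log_or_variances = [0.04, 0.04, 0.04]

pooled, total_var = pool_rubin(log_or_estimates, log_or_variances)
print(pooled, total_var)
```

The between-imputation term is what carries the uncertainty of the imputation itself into the final standard error. Analysing a single imputed data set would drop that term.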

    Schizophrenia risk loci from xMHC region were associated with antipsychotic response in chronic schizophrenic patients with persistent positive symptom

    We examined whether common variants from the extended major histocompatibility complex (xMHC) region contribute to the response to antipsychotic drugs (APDs) in patients with schizophrenia with persistent psychosis. Subjects who participated in a prospective longitudinal study of the effect of APDs on psychopathology were temporally split into discovery (n = 88) and replication (n = 42) cohorts. The primary endpoint was a change in Brief Psychiatric Rating Scale at 6 weeks or 6 months after treatment. rs204991 (β = 3.917, p = 3.72 × 10⁻⁶), the strongest signal associated with response at 6 weeks, was located near C4A/C4B after a linear regression adjusted for covariates. xMHC SNP imputation disclosed much stronger signals (rs9268469, β = 5.140, p = 1.57 × 10⁻⁷) and other weaker signals (p < 1 × 10⁻⁵) spanning the entire xMHC region. All the variants were previously identified schizophrenia risk loci. Conditional fine-mapping revealed three subgroups of SNPs which were the eQTLs (p < 1 × 10⁻⁷) for C4A, HLA-C, and BTN3A2 in disease-relevant tissue. Epistasis between HLA-C and C4A was observed (p = 0.019). Minor allele (G) carriers of rs204991, an eQTL for C4A, having decreased risk for schizophrenia and lower imputed expression of C4A, had a better response to APDs. Some imputed HLA alleles associated with a decreased risk for schizophrenia had a positive association with improvement in psychotic symptoms. An independent cohort validated the association of change in psychosis with C4A. We provide evidence that genetic risk factors for schizophrenia from the xMHC region are associated with response to APDs and that those variants significantly alter the imputed expression of C4A, HLA-C, and BTN3A2. The minor alleles predicting higher C4A level are associated with diminished improvement in psychotic symptoms after APD treatment.

    Variants at HLA-A, HLA-C, and HLA-DQB1 Confer Risk of Psoriasis Vulgaris in Japanese

    Psoriasis vulgaris (PsV) is an autoimmune disease of skin and joints with heterogeneity in epidemiologic and genetic landscapes of global populations. We conducted an initial genome-wide association study and a replication study of PsV in the Japanese population (606 PsV cases and 2,052 controls). We identified significant associations of the single nucleotide polymorphisms with PsV risk at TNFAIP3-interacting protein 1 and the major histocompatibility complex region (P = 3.7 × 10⁻¹⁰ and 6.6 × 10⁻¹⁵, respectively). By updating the HLA imputation reference panel of Japanese (n = 908) to expand HLA gene coverage, we fine-mapped the HLA variants associated with PsV risk. Although we confirmed the PsV risk of HLA-C*06:02 (odds ratio = 6.36, P = 0.0015), its impact was relatively small compared with those in other populations due to rare allele frequency in Japanese (0.4% in controls). Alternatively, HLA-A*02:07, which corresponds to the cysteine residue at HLA-A amino acid position 99 (HLA-A Cys99), demonstrated the most significant association with PsV (odds ratio = 4.61, P = 1.2 × 10⁻¹⁰). In addition to HLA-A*02:07 and HLA-C*06:02, stepwise conditional analysis identified an independent PsV risk of HLA-DQβ1 Asp57 (odds ratio = 2.19, P = 1.9 × 10⁻⁶). Our PsV genome-wide association study in Japanese highlighted the genetic architecture of PsV, including the identification of HLA risk variants.