Haplotype-Based Association Studies: Approaches to Current Challenges

Abstract

Haplotype-based association studies have greatly aided researchers in their attempts to map genes. However, current designs of haplotype-based association studies raise several statistical challenges. To reduce the number of variants, some researchers have employed hierarchical clustering. This thesis starts by addressing the multiple testing problem that arises when a hierarchical clustering procedure is applied to haplotypes and a statistical test for association is then performed at each step of the resulting hierarchy. Applying our method to a haplotype case-control dataset, we obtain a global p-value that is markedly larger than the minimum p-value over all steps in the hierarchy. The second challenge involves the errors inherent in using prediction programs to assign a haplotype pair to each individual in a haplotype-based association study. We examined the effect of these misclassification errors on the false positive rate and power of two association tests: the standard likelihood ratio test (LRTstd) and a likelihood ratio test that allows for the misclassification inherent in the haplotype inference procedure (LRTae). Our simulations indicate that 1) for each statistic, permutation methods maintain the correct type I error; 2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are misclassified consistently throughout a given dataset; and 3) the LRTae has a significant power gain over the LRTstd for a subset of the parameter settings. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is more likely to occur in a replication study than in a whole-genome association study.
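The multiple testing problem over a clustering hierarchy can be illustrated with a generic min-p permutation scheme. The abstract does not describe the thesis's own method, so the sketch below is a swapped-in illustration, not the author's procedure. It assumes phase-known haplotype labels, a Pearson chi-square test at each step, and a hierarchy supplied as one haplotype-to-cluster mapping per step; the names `chi_square`, `global_p_value`, and `hierarchy` are hypothetical.

```python
import random
from collections import Counter


def chi_square(groups, status):
    """Pearson chi-square statistic for a (group x case/control) table.

    groups: cluster label of each chromosome; status: 1 = case, 0 = control.
    Assumes at least one case and one control.
    """
    n = len(status)
    n_case = sum(status)
    n_ctrl = n - n_case
    totals = Counter(groups)
    cases = Counter(g for g, s in zip(groups, status) if s == 1)
    stat = 0.0
    for g, tot in totals.items():
        exp_case = tot * n_case / n
        exp_ctrl = tot * n_ctrl / n
        obs_case = cases.get(g, 0)
        obs_ctrl = tot - obs_case
        stat += (obs_case - exp_case) ** 2 / exp_case
        stat += (obs_ctrl - exp_ctrl) ** 2 / exp_ctrl
    return stat


def global_p_value(haplotypes, status, hierarchy, n_perm=999, seed=0):
    """Return (minimum per-step p-value, global p-value) by permutation.

    hierarchy: list of dicts, one per step, mapping haplotype -> cluster.
    This is a generic min-p adjustment, assumed here for illustration.
    """
    rng = random.Random(seed)
    # Cluster label of every chromosome at every step of the hierarchy.
    labelled = [[step[h] for h in haplotypes] for step in hierarchy]

    # Row 0 holds the observed statistics; later rows hold permuted ones.
    stats = [[chi_square(g, status) for g in labelled]]
    perm = list(status)
    for _ in range(n_perm):
        rng.shuffle(perm)  # break any haplotype-phenotype association
        stats.append([chi_square(g, perm) for g in labelled])

    n_rows = len(stats)

    def step_p(col, value):
        # Permutation p-value of `value` at step `col`.
        return sum(1 for row in stats if row[col] >= value) / n_rows

    # Minimum p-value over all steps, for the observed and each permuted row.
    min_p = [min(step_p(c, row[c]) for c in range(len(hierarchy)))
             for row in stats]
    observed_min = min_p[0]
    # Global p-value: how often a permuted minimum is at least as extreme.
    global_p = sum(1 for m in min_p if m <= observed_min) / n_rows
    return observed_min, global_p
```

By construction the global p-value returned here is never smaller than the minimum per-step p-value, which is the direction of the difference reported above. The cost is quadratic in the number of permutations, which is acceptable for a sketch but would need a ranking-based implementation for large runs.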
The third challenge addressed by this thesis involves the uncertainty regarding the exact distribution of the likelihood ratio test (LRT) statistic for haplotype-based association tests in which many of the haplotype frequency estimates are zero or very small. By simulating datasets with known haplotype frequencies and comparing the empirical distribution with various theoretical distributions, we characterized the distribution of the LRT statistic as a χ² distribution whose degrees of freedom are related to the number of haplotypes with nonzero frequency estimates. Awareness of these potential pitfalls and of the strategies to address them will increase the effectiveness of haplotype-based association as a gene-mapping tool.
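The simulation idea behind the third challenge can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's simulation design: haplotypes are treated as phase-known, cases and controls are drawn from the same frequencies (the null), and the reference degrees of freedom are taken to be the number of haplotypes observed in the pooled sample minus one. The abstract only says the degrees of freedom are "related to" that number, so this particular choice is an assumption. The names `lrt_statistic` and `simulate_null` are hypothetical.

```python
import math
import random
from collections import Counter


def lrt_statistic(case_haps, control_haps):
    """Return (LRT, df) for equal haplotype frequencies in cases and controls.

    The LRT is the usual likelihood ratio (G) statistic for a 2 x k table.
    df = (number of haplotypes seen in the pooled sample) - 1, which is an
    assumed choice for this sketch.
    """
    case_counts = Counter(case_haps)
    ctrl_counts = Counter(control_haps)
    n_case, n_ctrl = len(case_haps), len(control_haps)
    n = n_case + n_ctrl
    observed = set(case_counts) | set(ctrl_counts)
    lrt = 0.0
    for h in observed:
        pooled = case_counts[h] + ctrl_counts[h]
        for obs, n_group in ((case_counts[h], n_case),
                             (ctrl_counts[h], n_ctrl)):
            if obs > 0:  # empty cells contribute nothing to the statistic
                expected = pooled * n_group / n
                lrt += 2.0 * obs * math.log(obs / expected)
    return lrt, len(observed) - 1


def simulate_null(freqs, n_per_group, n_sims=500, seed=0):
    """Simulate (LRT, df) pairs under the null for known frequencies."""
    rng = random.Random(seed)
    labels = list(range(len(freqs)))
    results = []
    for _ in range(n_sims):
        cases = rng.choices(labels, weights=freqs, k=n_per_group)
        controls = rng.choices(labels, weights=freqs, k=n_per_group)
        results.append(lrt_statistic(cases, controls))
    return results


if __name__ == "__main__":
    # Hypothetical frequencies with several rare haplotypes.
    freqs = [0.40, 0.30, 0.20, 0.05, 0.03, 0.01, 0.005, 0.005]
    sims = simulate_null(freqs, n_per_group=100)
    mean_lrt = sum(l for l, _ in sims) / len(sims)
    mean_df = sum(d for _, d in sims) / len(sims)
    # A chi-square variable has mean equal to its degrees of freedom, so
    # comparing these two averages is a crude first check of the fit.
    print(f"mean LRT = {mean_lrt:.2f}, mean observed df = {mean_df:.2f}")
```

Because rare haplotypes are often absent from a given sample, the observed degrees of freedom vary between replicates and are typically below the nominal value of the number of haplotypes minus one. A fuller comparison would look at quantiles of the empirical distribution rather than only its mean.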
