20 research outputs found
Haplotype-Based Association Studies: Approaches to Current Challenges
Haplotype-based association studies have greatly aided researchers in their attempts to map genes. However, current designs of haplotype-based association studies lead to several challenges from a statistical perspective. To reduce the number of variants, some researchers have employed hierarchical clustering. This thesis starts by addressing the multiple testing problem that results from applying a hierarchical clustering procedure to haplotypes and then performing a statistical test for association at each of the steps in the resulting hierarchy. Applying our method to a haplotype case-control dataset, we find a global p-value. Relative to the minimum p-value over all steps in the hierarchy, the global p-value is markedly inflated. The second challenge involves the inherent errors present when prediction programs are employed to assign haplotype pairs for each individual in a haplotype-based association study. We examined the effect of these misclassification errors on the false positive rate and power for two association tests—the standard likelihood ratio test (LRTstd) and a likelihood ratio test that allows for the misclassification inherent in the haplotype inference procedure (LRTae). Our simulations indicate that 1) for each statistic permutation methods maintain the correct type I error; 2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset; and 3) a significant power gain exists for the LRTae over the LRTstd for a subset of the parameter settings. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole genome association study. 
The third challenge addressed by this thesis involves the uncertainty regarding the exact distribution of the likelihood ratio test (LRT) statistic for haplotype-based association tests in which many of the haplotype frequency estimates are zero or very small. By simulating datasets with known haplotype frequencies and comparing the empirical distribution with various theoretical distributions, we characterized the distribution of the LRT statistic as a χ² distribution where the degrees of freedom are related to the number of haplotypes with nonzero frequency estimates. Awareness of the potential pitfalls and the strategies to address them will increase the effectiveness of haplotype-based association as a gene-mapping tool.
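The role of nonzero haplotype counts in the reference distribution can be illustrated with a minimal sketch. It assumes haplotype counts are directly observed in cases and controls (no phase inference), and it assumes the simplest possible rule, degrees of freedom equal to the number of haplotypes with a nonzero pooled count minus one. The abstract says only that the degrees of freedom are "related to" that number, so this rule and the function name `lrt_statistic` are illustrative assumptions, not the thesis's result.

```python
import math


def lrt_statistic(case_counts, control_counts):
    """Likelihood ratio (G) statistic for equal haplotype frequencies in
    cases and controls, with a candidate degrees-of-freedom value."""
    n_case = sum(case_counts)
    n_control = sum(control_counts)
    n = n_case + n_control
    g = 0.0
    observed_haplotypes = 0
    for a, b in zip(case_counts, control_counts):
        pooled = a + b
        if pooled == 0:
            continue  # haplotype never observed: contributes nothing
        observed_haplotypes += 1
        for obs, group_total in ((a, n_case), (b, n_control)):
            if obs > 0:  # a zero cell contributes 0 to the sum
                expected = group_total * pooled / n
                g += 2.0 * obs * math.log(obs / expected)
    # ASSUMPTION (not stated in the abstract): df is the number of
    # haplotypes with a nonzero pooled count, minus one.
    df = observed_haplotypes - 1
    return g, df
```

Under this assumed rule, haplotypes that never appear in the sample reduce the degrees of freedom instead of being counted as free parameters.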
Predicting functionally important SNP classes based on negative selection
BACKGROUND: With the advent of cost-effective genotyping technologies, genome-wide association studies allow researchers to examine hundreds of thousands of single nucleotide polymorphisms (SNPs) for association with human disease. Recently, many researchers applying this strategy have detected strong associations to disease with SNP markers that are either not in linkage disequilibrium with any nonsynonymous SNP or located at large distances from any annotated gene. In such cases, no well-established standard practice for effective SNP selection for follow-up studies exists. We aim to identify and prioritize groups of SNPs that are more likely to affect phenotypes in order to facilitate efficient SNP selection for follow-up studies. RESULTS: Based on the annotations available in the Ensembl database, we categorized SNPs in the human genome into classes related to regulatory attributes, such as epigenetic modifications and transcription factor binding sites, in addition to classes related to gene structure and cross-species conservation. Using the distribution of derived allele frequencies (DAF) within each class, we assessed the strength of natural selection for each class relative to the genome as a whole. We applied this DAF analysis to Perlegen resequenced SNPs genome-wide. Regulatory elements annotated by Ensembl such as specific histone methylation sites as well as classes defined by cross-species conservation showed negative selection in comparison to the genome as a whole. CONCLUSIONS: These results highlight which annotated classes are under purifying selection, have putative functional importance, and contain SNPs that are strong candidates for follow-up studies after genome-wide association. Such SNP annotation may also be useful in interpreting results of whole-genome sequencing studies.
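One simple way to compare a class's DAF distribution with the genome-wide one is to look at the share of SNPs with a rare derived allele, since negative selection tends to keep derived alleles at low frequency. The sketch below is an illustration only: the abstract does not say which summary of the DAF distribution was used, and the 0.1 threshold and both function names are assumptions.

```python
def rare_derived_fraction(dafs, threshold=0.1):
    """Fraction of SNPs whose derived allele frequency is below threshold."""
    if not dafs:
        raise ValueError("need at least one DAF value")
    return sum(d < threshold for d in dafs) / len(dafs)


def rare_excess(class_dafs, genome_dafs, threshold=0.1):
    """Excess of rare derived alleles in a SNP class relative to the
    genome-wide set. Positive values are consistent with stronger
    negative selection on the class.

    ASSUMPTION: the threshold and this summary statistic are illustrative
    choices, not the method of the paper.
    """
    return (rare_derived_fraction(class_dafs, threshold)
            - rare_derived_fraction(genome_dafs, threshold))
```

A formal comparison would also need a significance assessment, which this sketch leaves out.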
Statistical significance for hierarchical clustering in genetic association and microarray expression studies
BACKGROUND: With the increasing amount of data generated in molecular genetics laboratories, it is often difficult to make sense of results because of the vast number of different outcomes or variables studied. Examples include expression levels for large numbers of genes and haplotypes at large numbers of loci. It is then natural to group observations into smaller numbers of classes that allow for an easier overview and interpretation of the data. This grouping is often carried out in multiple steps with the aid of hierarchical cluster analysis, each step leading to a smaller number of classes by combining similar observations or classes. At each step, either implicitly or explicitly, researchers tend to interpret results and eventually focus on that set of classes providing the "best" (most significant) result. While this approach makes sense, the overall statistical significance of the experiment must include the clustering process, which modifies the grouping structure of the data and often removes variation. RESULTS: For hierarchically clustered data, we propose considering the strongest result or, equivalently, the smallest p-value as the experiment-wise statistic of interest and evaluating its significance level for a global assessment of statistical significance. We apply our approach to datasets from haplotype association and microarray expression studies where hierarchical clustering has been used. CONCLUSION: In all of the cases we examine, we find that relying on one set of classes in the course of clustering leads to significance levels that are too small when compared with the significance level associated with an overall statistic that incorporates the process of clustering. In other words, relying on one step of clustering may furnish a formally significant result while the overall experiment is not significant.
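The idea of treating the smallest p-value over all clustering steps as the experiment-wise statistic can be sketched with a permutation scheme. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a case-control setting with one haplotype label per observation, a fixed list of clustering steps (each a hypothetical mapping from haplotype to cluster), a Pearson chi-square statistic at each step, and per-step p-values obtained by ranking against the same set of label permutations.

```python
import random
from collections import Counter


def chi_square(labels, groups):
    """Pearson chi-square statistic for the label-by-group table."""
    n = len(labels)
    cell = Counter(zip(labels, groups))
    row = Counter(labels)
    col = Counter(groups)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat


def global_min_p(labels, haplotypes, steps, n_perm=500, seed=0):
    """Return (observed minimum p-value, global p-value of that minimum)."""
    rng = random.Random(seed)
    # grouped[s][i] is the cluster of observation i at clustering step s.
    grouped = [[step[h] for h in haplotypes] for step in steps]
    # Row 0 holds the observed statistics, the rest hold permuted ones.
    stats = [[chi_square(labels, g) for g in grouped]]
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stats.append([chi_square(shuffled, g) for g in grouped])
    total = len(stats)
    # For every row, rank each step's statistic against all rows to get a
    # per-step p-value, then keep the smallest one over the steps.
    min_p = []
    for row in stats:
        per_step = [sum(other[s] >= row[s] for other in stats) / total
                    for s in range(len(steps))]
        min_p.append(min(per_step))
    observed = min_p[0]
    # Global p-value: how often a dataset's minimum p-value is at least as
    # small as the observed minimum.
    global_p = sum(m <= observed for m in min_p) / total
    return observed, global_p
```

In this scheme the global p-value can never be smaller than the observed minimum p-value, which mirrors the abstract's point that the significance level from a single step is too small. A fuller treatment would redo the clustering for each permuted dataset when the clustering itself uses the case-control labels. This sketch keeps the hierarchy fixed.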
Are Molecular Haplotypes Worth the Time and Expense? A Cost-Effective Method for Applying Molecular Haplotypes
Because current molecular haplotyping methods are expensive and not amenable to automation, many researchers rely on statistical methods to infer haplotype pairs from multilocus genotypes, and subsequently treat these inferred haplotype pairs as observations. These procedures are prone to haplotype misclassification. We examine the effect of these misclassification errors on the false-positive rate and power for two association tests. These tests include the standard likelihood ratio test (LRT(std)) and a likelihood ratio test that employs a double-sampling approach to allow for the misclassification inherent in the haplotype inference procedure (LRT(ae)). We aim to determine the cost–benefit relationship of increasing the proportion of individuals with molecular haplotype measurements in addition to genotypes to raise the power gain of the LRT(ae) over the LRT(std). This analysis should provide a guideline for determining the minimum number of molecular haplotypes required for desired power. Our simulations under the null hypothesis of equal haplotype frequencies in cases and controls indicate that (1) for each statistic, permutation methods maintain the correct type I error; and (2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each entire dataset. Our simulations under the alternative hypothesis showed a significant power gain for the LRT(ae) over the LRT(std) for a subset of the parameter settings. Permutation methods should be used exclusively to determine significance for each statistic. For fixed cost, the power gain of the LRT(ae) over the LRT(std) varied depending on the relative costs of genotyping, molecular haplotyping, and phenotyping. The LRT(ae) showed the greatest benefit over the LRT(std) when the cost of phenotyping was very high relative to the cost of genotyping. This situation is likely to occur in a replication study as opposed to a whole-genome association study.
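The fixed-cost trade-off can be made concrete with a little arithmetic. The sketch below is a simplified cost model assumed for illustration (every individual is phenotyped and genotyped, and a fraction is additionally molecularly haplotyped). The function name, parameters, and cost figures are hypothetical and are not taken from the paper.

```python
def affordable_sample_size(budget, cost_phenotype, cost_genotype,
                           cost_haplotype, haplotyped_fraction):
    """Number of individuals a fixed budget covers when a fraction of them
    also receive molecular haplotyping (double sampling).

    ASSUMPTION: a linear per-person cost model, chosen for illustration.
    """
    if not 0.0 <= haplotyped_fraction <= 1.0:
        raise ValueError("haplotyped_fraction must be between 0 and 1")
    per_person = (cost_phenotype + cost_genotype
                  + haplotyped_fraction * cost_haplotype)
    return int(budget // per_person)


# Expensive phenotyping: haplotyping half the sample costs few individuals.
n_plain = affordable_sample_size(100000, 500, 10, 50, 0.0)   # 196
n_double = affordable_sample_size(100000, 500, 10, 50, 0.5)  # 186

# Cheap phenotyping: the same haplotyping fraction shrinks the sample a lot.
m_plain = affordable_sample_size(100000, 5, 10, 50, 0.0)     # 6666
m_double = affordable_sample_size(100000, 5, 10, 50, 0.5)    # 2500
```

With these made-up costs, double sampling gives up about 5% of the sample when phenotyping dominates the budget but more than 60% when phenotyping is cheap. This is one plausible reading of why the LRT(ae) benefits most when phenotyping is expensive relative to genotyping.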
Developmental dysplasia of the hip: Linkage mapping and whole exome sequencing identify a shared variant in CX3CR1 in all affected members of a large multi-generation family.
Developmental Dysplasia of the Hip (DDH) is a debilitating condition characterized by incomplete formation of the acetabulum leading to dislocation of the femur, suboptimal joint function, and accelerated wear of the articular cartilage resulting in arthritis. DDH affects 1 in 1000 newborns in the United States, with well-defined pockets of high prevalence in Japan, Italy and other Mediterranean countries. Although reasonably accurate for detecting gross forms of hip dysplasia, existing techniques fail to find milder forms of dysplasia. Undetected hip dysplasia is the leading cause of osteoarthritis of the hip in young individuals, causing over 40% of cases in this age group. A sensitive and specific test for DDH has long remained a desirable yet elusive goal in orthopaedics. A 72-member, four-generation affected family has been recruited, and DNA from its members retrieved. Genome-wide linkage analysis revealed a 2.61 Mb candidate region (38.7-41.31 Mb from the p terminus of chromosome 3) co-inherited by all affected members with a maximum LOD score of 3.31. Whole exome sequencing and analysis of this candidate region in four severely affected family members revealed one shared variant, rs3732378, that causes a threonine (polar) to methionine (non-polar) alteration at position 280 in the trans-membrane domain of CX3CR1. This mutation is predicted to have a deleterious effect on its encoded protein, which functions as a receptor for the ligand fractalkine. By Sanger sequencing, this variant was found to be present in the DNA of all affected individuals and obligate heterozygotes. CX3CR1 mediates cellular adhesive and migratory functions and is known to be expressed in mesenchymal stem cells destined to become chondrocytes. A genetic risk factor that might be among the etiologic factors for the family in this study has been identified, along with other possible aggravating mutations shared by four severely affected family members. These findings might illuminate the molecular pathways affecting chondrocyte maturation and bone formation.
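For readers unfamiliar with LOD scores, a score of 3.31 corresponds to odds of roughly 10^3.31, or about 2000 to 1, in favor of linkage. The sketch below shows the textbook two-point calculation for phase-known, fully informative meioses. It is an illustration of the quantity only. The study's score came from a genome-wide linkage analysis of a pedigree, and the function name and the simplified setting are assumptions.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """LOD score for phase-known, fully informative meioses at
    recombination fraction theta (0 <= theta <= 0.5)."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must be between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    nonrecombinants = meioses - recombinants
    if recombinants > 0 and theta == 0.0:
        return float("-inf")  # a recombinant is impossible at theta = 0
    log_likelihood = 0.0
    if recombinants > 0:
        log_likelihood += recombinants * math.log10(theta)
    if nonrecombinants > 0:
        log_likelihood += nonrecombinants * math.log10(1.0 - theta)
    # Compare with the likelihood under free recombination (theta = 0.5).
    return log_likelihood - meioses * math.log10(0.5)
```

As a piece of arithmetic, eleven such meioses with no recombinants would give `two_point_lod(0, 11, 0.0)`, about 3.31. This is not a statement about how the score in the study arose.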
An analysis of Methylenetetrahydrofolate reductase and Glutathione S-transferase omega-1 genes as modifiers of the cerebral response to ischemia
BACKGROUND: Cerebral ischemia involves a series of reactions which ultimately influence the final volume of a brain infarction. We hypothesize that polymorphisms in genes encoding proteins involved in these reactions could act as modifiers of the cerebral response to ischemia and impact the resultant stroke volume. The final volume of a cerebral infarct is important as it correlates with the morbidity and mortality associated with non-lacunar ischemic strokes. METHODS: The proteins encoded by the methylenetetrahydrofolate reductase (MTHFR) and glutathione S-transferase omega-1 (GSTO-1) genes are, through oxidative mechanisms, key participants in the cerebral response to ischemia. On the basis of these biological activities, they were selected as candidate genes for further investigation. We analyzed the C677T polymorphism in the MTHFR gene and the C419A polymorphism in the GSTO-1 gene in 128 patients with non-lacunar ischemic strokes. RESULTS: We found no significant association of either the MTHFR (p = 0.72) or GSTO-1 (p = 0.58) polymorphisms with cerebral infarct volume. CONCLUSION: Our study shows no major gene effect of either the MTHFR or GSTO-1 genes as a modifier of ischemic stroke volume. However, given the relatively small sample size, a minor gene effect is not excluded by this investigation.
Integration of Linkage Analysis and Next-Generation Sequencing Data
Genetic mapping by linkage analysis has been for many years the first step in the identification of genes responsible for rare Mendelian disorders. When the focus of genetic research shifted toward the study of the more complex common disorders, alternative approaches such as association studies were shown to be more successful in identifying common variants of small effect that are in part responsible for susceptibility to such conditions. Recent advances in technologies that make feasible the sequencing of whole exomes or genomes have renewed interest in the identification of rare variants, which are in principle amenable to being detected by linkage analysis. As a result, linkage analysis and family-based studies in general are being reexamined as an aid to filter and validate results of whole exome and whole genome sequencing experiments. This chapter will describe a few representative papers that have incorporated linkage analysis and its results in the design, execution, and interpretation of whole genome or whole exome sequencing studies.