Identification of genomic factors using family-based association studies

Abstract

Genome-wide association studies (GWAS) have become increasingly popular and important for detecting genetic associations with complex traits. However, it is well known that spurious associations can arise when the statistical analysis does not properly account for genetic relatedness among samples. Many methods have been proposed to guard against these spurious associations. Here we focus on multi-locus association studies of quantitative traits and of case-control status, and propose algorithms that take genetically related samples into consideration to address possible confounding. As supervised dimension-reduction methods, these algorithms perform well in association studies with a large number of biomarkers but a relatively small number of samples.

Recently, linear mixed models have demonstrated their efficiency in GWAS of quantitative traits with multiple levels of sample structure. Most current mixed-model-based methods, such as EMMA, EMMAX, and GEMMA, can be viewed as single-locus methods because they test each SNP separately. Complex traits, however, are known to be controlled by multiple loci, so including multiple loci in the statistical model seems more appropriate. In the first part of my dissertation, we propose an algorithm that extends penalized orthogonal-components regression to family-based association studies of continuous traits (fPOCRE). While multiple loci are investigated at the same time, sample relatedness is modeled through the kinship matrix, and the shared confounding effects are included as random effects in the linear mixed model. The proposed algorithm simultaneously selects biomarkers and constructs linear combinations of them as components that optimally account for variation in the trait. We compare fPOCRE with EMMAX, one of the most frequently used single-locus approaches, and with MLMM, a recently developed multi-locus approach. Our simulation study shows that fPOCRE compares favorably with both EMMAX and MLMM, with higher power and fewer false positives, when causal effects come from clusters of correlated SNPs. Real data are analyzed to illustrate the proposed approach and provide further comparisons.

The case-control association study is a widely used design in genetic epidemiology and pharmacology, and it is also susceptible to confounding by sample structure. In the second part of my dissertation, we employ a multi-locus generalized estimating equation (GEE) model to study genetic associations of binary traits, capturing multiple levels of sample structure with a working correlation matrix. The kinship matrix is used to model the working correlation matrix, and the penalized orthogonal-components regression method is developed to build such a multi-locus GEE model (GEE-POCRE). GEE-POCRE is compared with gPOCRE, a multi-locus method that does not consider pedigree information, and with TDT, FBAT, and ROADTRIPS, which are single-locus methods that account for sample structure. In our simulation studies, GEE-POCRE demonstrates good performance in terms of protecting against spurious associations caused by the sample structure as well as having increased power
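
As a rough illustration of how a kinship matrix can carry sample relatedness into a mixed-model analysis, the sketch below builds a marker-based kinship estimate and uses it to decorrelate the trait and predictors. It is not the dissertation's fPOCRE or GEE-POCRE algorithm. The function names `kinship_matrix` and `whiten`, the particular kinship estimator (centered genotypes scaled by total expected heterozygosity), and the heritability-like weight `h2` treated as already known are all assumptions made for this example.

```python
import numpy as np

def kinship_matrix(genotypes):
    """Marker-based kinship from an (n_samples, n_snps) array of 0/1/2 allele counts.

    Assumed estimator: K = Z Z^T / (2 * sum_j p_j (1 - p_j)), where Z holds
    genotypes centered at twice the estimated allele frequency p_j.
    """
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0              # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)              # drop monomorphic SNPs (no information)
    Z = G[:, keep] - 2.0 * p[keep]        # center each retained SNP column
    denom = 2.0 * np.sum(p[keep] * (1.0 - p[keep]))
    return Z @ Z.T / denom

def whiten(y, X, K, h2):
    """Decorrelate y and X under the assumed covariance V = h2*K + (1-h2)*I.

    h2 (0 <= h2 < 1) is a hypothetical, pre-estimated share of variance
    attributed to the kinship-structured random effect.
    """
    n = K.shape[0]
    V = h2 * K + (1.0 - h2) * np.eye(n)
    L = np.linalg.cholesky(V)             # V = L L^T
    # Multiplying by L^{-1} gives residuals with identity covariance (up to scale).
    return np.linalg.solve(L, y), np.linalg.solve(L, X)
```

After whitening, an ordinary regression or a penalized multi-locus method can be run on the transformed data as if the samples were unrelated. In practice the variance share would be estimated from the data rather than supplied by hand.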
