Joint Multivariate Modelling and Prediction for Genetic and Biomedical Data

Abstract

In statistical genetics, classical genome-wide association studies (GWAS) assess the association between a biological characteristic and genetic variants one variant at a time in a regression model and report the most significant associations. These studies test genetic markers individually, even though the data may exhibit multivariate structure because genes are transmitted together from parents to offspring. Although covariates such as age and sex are included in the model, classical GWAS does not account for the joint effects of genetic variants. Moreover, when multiple genetic variants within a gene each have a small effect on a phenotype, testing them individually can lack statistical power, whereas testing them together in a joint model can pool the evidence more effectively.

In this thesis, I reviewed different multivariate testing procedures in joint multivariate model settings, explored their properties, and demonstrated them in real-data applications, such as enhancing statistical power by conditioning on major variants. I studied the mathematical properties of various multivariate test procedures, particularly in the context of multiple linear regression. Considering both their theoretical properties and their availability in the literature, I adapted several multivariate test procedures based on canonical correlation to the multiple regression setting. These procedures are shown to follow a chi-square distribution asymptotically. Importantly, they are asymptotically equivalent to one another and to the Wald test statistic, which suggests that the Wald statistic may be sufficient for future studies.

In many cases, databases of major genetic variants with a substantial effect on a trait are already available. In such situations, it makes statistical sense to condition on these major variants to improve the power to detect associations with new variants, yet this is not common practice in GWAS applications. In this study, I also showed theoretically and computationally how a joint analysis of the genetic variants in a multiple regression model, in which the estimated effect of a new variant is conditioned on major variants, can improve performance by reducing the standard error and increasing power. The gain in power depends on the correlation between the response and the covariates, as well as on the correlation among the covariates. I further showed that conditional results can sometimes be obtained from publicly available summary statistics reported for univariate associations in published GWAS, even when individual-level data are unavailable. A prominent example of such a trait is skin color, for which many studies have consistently identified a handful of major genes. I analyzed a dataset of over 6,500 mixed-ethnicity Latin Americans to examine how conditioning can improve the detection power of GWAS and identify new genetic variants in this setting. In practical applications, the statistical models used here for association testing can be carried forward for predictive purposes in new datasets.
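As an illustration of the conditioning argument, the following minimal simulation sketch (not taken from the thesis; the sample size, allele frequencies, and effect sizes are assumptions chosen for illustration) compares the standard error and p-value of a new variant's estimated effect in a marginal single-variant model and in a joint model that conditions on a known major variant.

```python
# A minimal simulation sketch (illustrative assumptions only, not the thesis
# analysis): conditioning on a known major variant removes its contribution
# from the residual variance, shrinking the standard error of a new variant's
# estimated effect and thereby improving the power to detect it.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2024)
n = 5_000                                   # assumed sample size
g_major = rng.binomial(2, 0.30, size=n)     # major variant, assumed MAF 0.30
g_new   = rng.binomial(2, 0.20, size=n)     # new variant, assumed MAF 0.20
y = 1.2 * g_major + 0.08 * g_new + rng.normal(size=n)   # assumed effect sizes

# Marginal (classical single-variant GWAS) model: y ~ g_new.
marginal = sm.OLS(y, sm.add_constant(g_new)).fit()
# Joint model: the new variant's effect is conditioned on the major variant.
joint = sm.OLS(y, sm.add_constant(np.column_stack([g_major, g_new]))).fit()

print("new-variant SE, marginal:", marginal.bse[1], " joint:", joint.bse[-1])
print("new-variant p,  marginal:", marginal.pvalues[1], " joint:", joint.pvalues[-1])
```

In this sketch the genotypes are simulated independently, so the gain comes entirely from the reduction in residual variance; with correlated variants the size of the gain also depends on the genotype correlation, as noted above.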
In this thesis, I have also presented mathematical formulations of prediction error in different linear models, including the simple linear regression model as well as shrinkage methods such as ridge regression and lasso regression. These expressions for prediction error make explicit the inherent trade-off between bias and variance, both at individual data points and across a set of observations. Moreover, these formulations reveal connections between prediction error and genetic heritability that can be used to enhance prediction performance in genetic association studies. Additionally, I reviewed various statistical and machine learning predictive models and, using a dental morphology dataset, compared their performance with classification metrics such as the average error rate and the maximum classification error rate per specimen.
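As a rough illustration of the bias-variance trade-off behind these prediction-error formulations, the sketch below (an assumption-laden toy example with many small effects in the spirit of a polygenic trait, not code or results from the thesis) compares the out-of-sample prediction error of ordinary least squares with ridge regression at a few penalty values.

```python
# A small numerical sketch (illustrative assumptions only): ridge regression
# accepts some bias in exchange for lower variance, which can reduce
# out-of-sample prediction error when there are many weak predictors.
import numpy as np

rng = np.random.default_rng(7)
n_train, n_test, p = 200, 2_000, 100
beta = rng.normal(scale=0.1, size=p)          # many small true effects (assumed)

def simulate(n):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    return X, y

def ridge_fit(X, y, lam):
    # Closed-form ridge estimator (X'X + lam*I)^{-1} X'y; lam = 0 gives OLS.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

for lam in [0.0, 10.0, 100.0]:
    b_hat = ridge_fit(X_tr, y_tr, lam)
    mse = np.mean((y_te - X_te @ b_hat) ** 2)
    print(f"lambda = {lam:6.1f}   test MSE = {mse:.3f}")
```

Lasso behaves similarly in spirit but has no closed-form estimator, so it is omitted from this sketch.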
