Genome-wide linkage scans and basic bioinformatics implemented using Stata/SE

By Toby Andrew


Searches for genes using linkage analyses with genetic markers placed across the entire human genome are hypothesis-free experiments, which represent an extreme form of multiple testing. As such, the low p-values required to obtain nominal significance make accurate diagnostics essential to assess model fit and to eliminate naive, incorrect results. In hypothesis-driven single tests, researchers usually take good care to assess model fit and the validity of model assumptions, but such concerns are usually ignored when it comes to linkage analysis. This is particularly problematic because low significance thresholds (p < 0.0001) can result in extreme sensitivity to outlying observations and, for some models (e.g. standard variance component analysis), greater sensitivity to violation of model assumptions. Here we attempt to address these problems for genomic data based on 1300 healthy sib-pairs (dizygotic twins) using modified Haseman-Elston regression-based linkage analysis for quantitative traits, in which sib-pair phenotypic covariance is correlated with genetic marker covariance. The statistical theory underpinning the implementation of tests for linkage using generalized linear models (GLM) (glm in Stata) is documented in detail elsewhere. In brief, the advantage of analysing sib-pairs using GLM is that the approach shares all of the strengths of OLS and variance components, but none of their weaknesses. Specifically, (1) unlike OLS, the residual errors are correctly specified with a gamma distribution and known heteroscedasticity is accounted for; (2) unlike standard variance components, by freely estimating the coefficient of variation, GLM is robust to phenotypic deviations from multivariate normality. Just as important are the practical advantages. With the release of Stata 8 Special Edition (Stata/SE) for large datasets, we have been able to store and check genetic markers for all 22 pairs of autosomal chromosomes plus the sex chromosomes.
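As a rough illustration of the regression idea described above, the following minimal Python sketch regresses the mean-centred sib-pair trait cross-product on the proportion of alleles shared IBD at one marker. It is an assumption-laden simplification: it uses plain OLS rather than the gamma-family GLM the abstract describes, and the function name `he_slope` and the toy data are hypothetical.

```python
from statistics import mean

def he_slope(trait_a, trait_b, ibd_share):
    """OLS regression of the sib-pair cross-product on IBD sharing.

    trait_a, trait_b: trait values for the first and second sib of each pair.
    ibd_share: estimated proportion of alleles shared IBD per pair (0 to 1).
    Returns (slope, intercept). A positive slope is the pattern expected
    under linkage. NOTE: plain OLS is used here for illustration only; it
    is not the gamma GLM described in the abstract.
    """
    grand_mean = mean(trait_a + trait_b)  # pooled mean over all sibs
    cross = [(a - grand_mean) * (b - grand_mean)
             for a, b in zip(trait_a, trait_b)]
    x_bar, y_bar = mean(ibd_share), mean(cross)
    sxx = sum((x - x_bar) ** 2 for x in ibd_share)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ibd_share, cross))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical toy data: pairs sharing more alleles IBD have more similar traits.
toy_a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
toy_b = [-1.0, 1.0, 0.0, 0.0, 1.0, -1.0]
toy_ibd = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
slope, intercept = he_slope(toy_a, toy_b, toy_ibd)
```

In a genome-wide scan, a function of this kind would be called once per marker inside a loop, with the slope and its test statistic saved for each position.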
In addition, we have generated 2-point and multipoint allele-sharing identical-by-descent (IBD) estimates elsewhere and imported these into Stata. Using Stata scripts with a simple loop structure that calls on the glm command, we are able to perform genome-wide scans and save any summary statistics to file. We have been able to utilise the following features in Stata:
1. correct diagnostics on a genome-wide basis that are not normally made available to users of applied linkage packages
2. robust estimates of significance, such as Huber sandwich estimates, bootstrap routines, permutation tests, etc.
3. probability weighting to utilise the full probability distribution of the number of alleles shared IBD
4. analyses that are computationally fast and easy to implement
Finally, we can also perform basic but powerful bioinformatics tasks such as:
1. using the xpose command to summarise marker information by chromosome and sib-pair
2. resolving marker order more accurately, which is essential for correct multipoint IBD generation, by interpolating genetic distance using the latest physical and genetic marker maps
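The interpolation of genetic distance mentioned in the last item can be sketched as follows. This is a minimal Python illustration, assuming simple linear interpolation between flanking reference markers whose physical (bp) and genetic (cM) positions are both known; the abstract does not state which interpolation scheme was used, and the function name, marker names and map values below are hypothetical.

```python
from bisect import bisect_left

def interpolate_cm(physical_bp, map_bp, map_cm):
    """Linearly interpolate a genetic position (cM) from a physical one (bp).

    map_bp: strictly increasing physical positions of reference markers.
    map_cm: genetic positions of the same reference markers.
    """
    if not map_bp[0] <= physical_bp <= map_bp[-1]:
        raise ValueError("position lies outside the reference map")
    i = bisect_left(map_bp, physical_bp)
    if map_bp[i] == physical_bp:
        return map_cm[i]  # exact hit on a reference marker
    left_bp, right_bp = map_bp[i - 1], map_bp[i]
    frac = (physical_bp - left_bp) / (right_bp - left_bp)
    return map_cm[i - 1] + frac * (map_cm[i] - map_cm[i - 1])

# Hypothetical reference map for one chromosome.
ref_bp = [1_000_000, 2_000_000, 4_000_000]
ref_cm = [0.0, 1.5, 3.5]

# Hypothetical markers with known physical position only.
markers = {"mA": 3_000_000, "mB": 1_500_000}

# Order the markers by their interpolated genetic position.
ordered = sorted(markers, key=lambda m: interpolate_cm(markers[m], ref_bp, ref_cm))
```

Ordering markers by interpolated position in this way gives a consistent marker order to pass to whichever program generates the multipoint IBD estimates.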
