Genetic analyses such as linkage and genome-wide association studies (GWAS) have
been extremely successful at identifying genomic regions that harbour genetic
variants contributing to complex disorders. Over 90% of disease-associated variants
from GWAS fall within non-coding regions (Maurano et al., 2012). However,
pinpointing the causal variants has proven a major bottleneck in genetic research.
To address this I have developed SuRFR, an R package for the ranked prioritisation
of candidate causal variants by predicted function. SuRFR produces rank orderings
of variants based upon functional genomic annotations, including DNase
hypersensitivity signal, chromatin state, minor allele frequency, and conservation.
The ranks for each annotation are combined into a final prioritisation rank using a
weighting system that has been parametrised and tested through ten-fold cross-validation.
SuRFR has been tested extensively upon a combination of synthetic and real datasets
and has been shown to perform with high sensitivity and specificity. These analyses
have provided insight into which classes of functional annotation are most useful
for identifying known regulatory variants: across all classes of regulatory variant,
the most important factor for identifying a true variant is position relative to
genes. I have also shown that SuRFR performs at least
as well as its nearest competitors whilst benefiting from the advantages that come
from being part of the R environment.
I have applied SuRFR to several genomics projects, particularly in the study of
psychiatric illness, including genome sequencing of a large Scottish family with
bipolar disorder. This has resulted in the prioritisation of candidate causal
variants for future study.
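
The rank-combination step described above (per-annotation ranks merged into a final prioritisation rank through a weighting system) can be illustrated with a minimal sketch. This is not SuRFR's implementation, which is an R package: the function names, the example scores, and the weights below are all hypothetical, and the sketch assumes each annotation has already been oriented so that a higher score means a variant is more likely to be functional.

```python
def rank_descending(scores):
    """Return 1-based ranks where the highest score gets rank 1.

    Tied scores share the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of variants tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def combine_ranks(annotations, weights):
    """Combine per-annotation ranks into one final rank per variant.

    annotations: dict mapping annotation name -> list of scores (one per variant)
    weights:     dict mapping annotation name -> non-negative weight (hypothetical values)
    """
    n_variants = len(next(iter(annotations.values())))
    weighted_sum = [0.0] * n_variants
    for name, scores in annotations.items():
        for i, r in enumerate(rank_descending(scores)):
            weighted_sum[i] += weights[name] * r
    # A lower weighted rank sum is better, so negate before re-ranking.
    return rank_descending([-s for s in weighted_sum])


# Hypothetical scores for three variants; these are not real data.
example_annotations = {
    "dnase_hs": [0.9, 0.1, 0.5],
    "conservation": [0.8, 0.2, 0.3],
}
# Hypothetical weights; the source states only that weights were tuned by cross-validation.
example_weights = {"dnase_hs": 2.0, "conservation": 1.0}

final_ranks = combine_ranks(example_annotations, example_weights)
```

In this toy example the first variant ranks best on both annotations and so receives the top final rank. With real data the weights would determine how disagreements between annotations are resolved, which is why they need to be fitted and validated rather than chosen by hand.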