35 research outputs found

    Fast Principal Component Analysis of Large-Scale Genome-Wide Data

    <div><p>Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.</p></div>

    Algorithm 1.

    <p>Pseudocode for the eigen-decomposition variant of the fast PCA, based on the randomized algorithm of <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766-Halko2" target="_blank">[5]</a> for the case where . is the standardization in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766.e077" target="_blank">Equation 4</a>. is a function generating an iid multivariate normal matrix, is the user-defined number of extra dimensions, is the QR decomposition, is a function that divides each column by its norm , is the eigen-decomposition producing the -top eigenvectors and vector of -top eigenvalues . is the matrix of principal components.</p>
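The steps named in this caption (standardization, a random normal matrix with a few extra dimensions, column normalization, a QR decomposition and a small eigen-decomposition) can be sketched in NumPy roughly as follows. This is a minimal illustration of a generic randomized PCA, not the flashpca implementation. The function names, the z-score standardization, the default of 10 extra dimensions and the 3 power iterations are all assumptions made for the example; the paper's Equation 4 may define a different standardization.

```python
import numpy as np


def standardize(X):
    """Z-score each column (assumed stand-in for the paper's Equation 4)."""
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return Xc / sd


def randomized_pca(X, d=2, extra=10, n_iter=3, seed=0):
    """Approximate the top-d principal components of X (samples x features).

    Hypothetical sketch: returns (P, evals), where P holds the component
    scores and evals the top-d eigenvalues of Xs @ Xs.T.
    """
    rng = np.random.default_rng(seed)
    Xs = standardize(X)
    n, p = Xs.shape
    k = min(d + extra, n)  # working dimension: d plus the extra dimensions

    # Project onto a random iid normal matrix to sample the range of Xs.
    Y = Xs @ rng.standard_normal((p, k))

    # Power iterations sharpen the spectrum; dividing each column by its
    # norm keeps the values from overflowing.
    for _ in range(n_iter):
        Y /= np.linalg.norm(Y, axis=0)
        Y = Xs @ (Xs.T @ Y)

    # Orthonormal basis for the sampled range.
    Q, _ = np.linalg.qr(Y)

    # Eigen-decomposition of the small k x k matrix instead of the n x n one.
    B = Q.T @ Xs
    evals, evecs = np.linalg.eigh(B @ B.T)
    top = np.argsort(evals)[::-1][:d]
    evals, evecs = evals[top], evecs[:, top]

    U = Q @ evecs            # approximate top-d eigenvectors of Xs @ Xs.T
    P = U * np.sqrt(evals)   # scores: eigenvectors scaled by singular values
    return P, evals
```

The saving comes from decomposing a matrix whose side is the number of requested components plus the extra dimensions, rather than the number of individuals.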

    Performance of the genomic risk score in external validation, when compared to other approaches, and on other related diseases.

    <p>ROC curves for models trained in the UK2 dataset and tested on (a) four other CD datasets, (b) the Immunochip CD dataset, comparing the GRS approach with that of Romanos et al. <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004137#pgen.1004137-Romanos1" target="_blank">[21]</a>, and (c) three other autoimmune diseases (Crohn's disease, Rheumatoid Arthritis, and Type 1 Diabetes). We did not re-tune the models on the test data. For (b) and (c), we used a reduced set of SNPs for training, from the intersection of the UK2 SNPs with the Immunochip or WTCCC SNPs (18,252 SNPs and 76,847 SNPs, respectively). In (c), the same reduced set of SNPs was used for the CD-Finn dataset, in order to maintain the same SNPs across all target datasets.</p>

    Clinical interpretation as a function of threshold and prevalence.

    <p>The number of non-CD cases “misdiagnosed” (wrongly implicated by GRS) per true CD case “diagnosed” (correctly implicated by GRS), for different levels of sensitivity. The risk score is based on a model trained on the UK2 dataset, and tested on the combined Finn+NL+IT dataset. The results were threshold-averaged over 50 independent replications. Note that the curve for <i>K</i> = 1% does not span the entire range due to averaging over a small number of cases in that dataset.</p>
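The quantity in this caption, false positives per true positive at a given sensitivity and prevalence <i>K</i>, follows from the standard relationship between sensitivity, specificity and prevalence. The sketch below illustrates that arithmetic only. The function name and the example operating point (sensitivity 0.8, specificity 0.9) are hypothetical and are not taken from the paper, whose curves come from threshold-averaged empirical results.

```python
def false_per_true_positive(sensitivity, specificity, prevalence):
    """Expected non-cases flagged per true case flagged.

    Hypothetical helper. In a population with prevalence K:
      true positives  per person = K * sensitivity
      false positives per person = (1 - K) * (1 - specificity)
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    if not 0.0 < sensitivity <= 1.0:
        raise ValueError("sensitivity must lie in (0, 1]")
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must lie in [0, 1]")
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return false_pos / true_pos


# Same (made-up) operating point, two prevalence settings.
for k in (0.01, 0.10):
    ratio = false_per_true_positive(0.8, 0.9, k)
    print(f"K = {k:.0%}: {ratio:.3f} false positives per true positive")
```

At a fixed operating point, the ratio rises sharply as prevalence falls, which is why the clinical reading of a risk score depends on the population it is applied to.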

    Interactions within the MHC contribute to the genetic architecture of celiac disease

    <div><p>Interaction analysis of GWAS can detect signal that would be ignored by single-variant analysis, yet few robust interactions in humans have been detected. Recent work has highlighted interactions in the MHC region between known HLA risk haplotypes for various autoimmune diseases. To better understand the genetic interactions underlying celiac disease (CD), we have conducted exhaustive genome-wide scans for pairwise interactions in five independent CD case-control studies, using a rapid model-free approach to examine over 500 billion SNP pairs in total. We found 14 independent interaction signals within the MHC region that achieved stringent replication criteria across multiple studies and were independent of known CD risk HLA haplotypes. The strongest independent CD interaction signal corresponded to genes in the HLA class III region, in particular <i>PRRC2A</i> and <i>GPANK1/C6orf47</i>, which are known to contain variants for non-Hodgkin's lymphoma and early menopause, co-morbidities of celiac disease. Replicable evidence for statistical interaction outside the MHC was not observed. Both within and between European populations, we observed striking consistency of two-locus models and model distribution. Within the UK population, models of CD based on both interactions and additive single-SNP effects increased explained CD variance by approximately 1% over those of single SNPs. The interaction signal detected across the five cohorts indicates the presence of novel associations in the MHC region that cannot be detected using additive models. Our findings have implications for the determination of genetic architecture and, by extension, the use of human genetics for validation of therapeutic targets.</p></div>