22 research outputs found

    Fast Principal Component Analysis of Large-Scale Genome-Wide Data

    <div><p>Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.</p></div>

    Figure 1

    <p>(a) The first two principal components from analyzing the HapMap3 dataset. (b) Scatter plots showing near-perfect absolute Pearson correlation (lower left-hand corner) between the 1st PCs estimated by smartpca, flashpca, shellfish, and R’s prcomp (using the standardization from <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766.e077" target="_blank">Equation 4</a>). Note that since eigenvectors are only defined up to sign, the correlations may be negative as well as positive. In addition, the scale of the PCs may differ between the methods; however, this has no bearing on the interpretation of the PCs.</p>
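The sign and scale ambiguity described in this caption can be illustrated with a minimal sketch. The vectors below are synthetic stand-ins, not output from smartpca, flashpca, shellfish, or prcomp; `abs_pearson` is a hypothetical helper name.

```python
import numpy as np

def abs_pearson(a, b):
    """Absolute Pearson correlation between two 1-D score vectors."""
    return abs(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
pc1_tool_a = rng.standard_normal(200)   # synthetic 1st PC from one tool
pc1_tool_b = -3.0 * pc1_tool_a          # same axis, flipped sign, different scale

# The raw correlation is negative, but its absolute value is 1:
# the two vectors describe the same principal axis.
raw = np.corrcoef(pc1_tool_a, pc1_tool_b)[0, 1]
agreement = abs_pearson(pc1_tool_a, pc1_tool_b)
```

Comparing absolute correlations rather than raw values avoids reporting a spurious disagreement when two tools return the same eigenvector with opposite sign.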

    Algorithm 1.

    <p>Pseudocode for the eigen-decomposition variant of the fast PCA, based on the randomized algorithm of <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766-Halko2" target="_blank">[5]</a>, for the case where […]. The pseudocode uses, in order: the standardization in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766.e077" target="_blank">Equation 4</a>; a function generating an iid multivariate normal matrix; the user-defined number of extra dimensions; the QR decomposition; a function that divides each column by its norm; and the eigen-decomposition producing the top eigenvectors and the vector of top eigenvalues. Its output is the matrix of principal components.</p>
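The components named in this caption can be assembled into a rough NumPy sketch of a randomized, eigen-decomposition-based PCA. This is an illustration under stated assumptions, not flashpca's code: the function name `randomized_pca`, the parameters `extra` and `n_iter`, and the plain column standardization (used here in place of Equation 4, which is not reproduced in this listing) are all hypothetical choices.

```python
import numpy as np

def randomized_pca(X, d=2, extra=10, n_iter=5, seed=0):
    """Approximate the top-d principal components of X (samples x features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Assumption: plain mean/SD column standardization stands in for Equation 4.
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    Z = (X - X.mean(axis=0)) / sd

    # Random iid normal start matrix with d + extra columns.
    k = min(d + extra, n)
    Y = rng.standard_normal((n, k))

    # Power iterations on Z Z^T, dividing each column by its norm each round.
    for _ in range(n_iter):
        Y = Z @ (Z.T @ Y)
        Y /= np.linalg.norm(Y, axis=0)

    # Orthonormal basis for the captured subspace.
    Q, _ = np.linalg.qr(Y)

    # Eigen-decomposition of the small k x k projected covariance.
    B = Q.T @ Z
    evals, evecs = np.linalg.eigh(B @ B.T / (n - 1))
    order = np.argsort(evals)[::-1][:d]
    evals, evecs = evals[order], evecs[:, order]

    # Map back to sample space and scale by the singular values.
    U = Q @ evecs
    P = U * np.sqrt(np.clip(evals, 0.0, None) * (n - 1))
    return P, evals
```

The expensive eigen-decomposition is applied only to a matrix of size (d + extra) × (d + extra), which is why this family of methods scales to datasets where a full decomposition would be impractical.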

    Performance of the genomic risk score in external validation, when compared to other approaches, and on other related diseases.

    <p>ROC curves for models trained in the UK2 dataset and tested on (a) four other CD datasets, (b) the Immunochip CD dataset, comparing the GRS approach with that of Romanos et al. <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004137#pgen.1004137-Romanos1" target="_blank">[21]</a>, and (c) three other autoimmune diseases (Crohn's disease, rheumatoid arthritis, and type 1 diabetes). We did not re-tune the models on the test data. For (b) and (c), we used a reduced set of SNPs for training, from the intersection of the UK2 SNPs with the Immunochip or WTCCC SNPs (18,252 SNPs and 76,847 SNPs, respectively). In (c), the same reduced set of SNPs was used for the CD-Finn dataset, in order to maintain the same SNPs across all target datasets.</p>

    Distribution of genomic risk scores in cases and controls.

    <p>(a) Kernel density estimates of the risk scores predicted using models trained on UK2 and tested in the combined dataset Finn+NL+IT, for cases and controls. (b) Thresholds for risk scores in terms of population percent, with the top more likely to be CD and the bottom more likely to be non-CD.</p>

    Building genomic models predictive of celiac disease.

    <p>LOESS-smoothed (a) AUC and (b) phenotypic variance explained, from 10×10 cross-validation, with differing model sizes, within each celiac dataset. The grey bands represent 95% confidence intervals about the mean LOESS smooth.</p>

    Clinical interpretation as a function of threshold and prevalence.

    <p>The number of non-CD cases “misdiagnosed” (wrongly implicated by GRS) per true CD case “diagnosed” (correctly implicated by GRS), for different levels of sensitivity. The risk score is based on a model trained on the UK2 dataset, and tested on the combined Finn+NL+IT dataset. The results were threshold-averaged over 50 independent replications. Note that the curve for <i>K</i> = 1% does not span the entire range due to averaging over a small number of cases in that dataset.</p>
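The ratio plotted in this figure can be related to sensitivity, specificity, and prevalence <i>K</i> with a short sketch. It uses the textbook expected false-positive and true-positive rates; whether the authors computed the ratio exactly this way is an assumption, and the function name and the numbers in the example are illustrative only.

```python
def misdiagnosed_per_diagnosed(sensitivity, specificity, prevalence):
    """Expected false positives per true positive at a given threshold."""
    true_pos = sensitivity * prevalence                   # CD cases flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # non-CD flagged
    return false_pos / true_pos

# Illustrative values, not taken from the paper: the same test looks
# far worse when the disease is rarer in the tested population.
ratio_k10 = misdiagnosed_per_diagnosed(0.8, 0.9, 0.10)
ratio_k01 = misdiagnosed_per_diagnosed(0.8, 0.9, 0.01)
```

Because the false-positive term is weighted by 1 − <i>K</i>, lowering the prevalence raises the number of non-CD individuals implicated per true case even when sensitivity and specificity are unchanged.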

    Example clinical scenarios.

    <p>The GRS can be employed in different clinical scenarios and tuned to optimize outcomes. The GRS can be employed in a comparable manner to HLA testing (left table) to confidently exclude CD. In this scenario, we selected a GRS threshold based on NPV = 99.6%; however, a range of thresholds can be selected to achieve a high NPV (see note below). The GRS can also stratify CD risk (right table). Confirmatory testing (such as small bowel biopsy) would be reserved for those at high risk. In this example, we present two scenarios: optimization of PPV or of sensitivity. In comparison to the GRS, all HLA-susceptible patients will need to undergo further confirmatory testing for CD. For more information on GRS performance across a range of thresholds, see <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004137#pgen.1004137.s007" target="_blank">Table S2</a>. Prospective validation of the GRS in local populations would enable the settings for NPV, PPV and sensitivity that provide the optimal diagnostic outcomes to be identified. <sup>+</sup> The highest achievable NPV at 10% prevalence was 99.4%.</p>
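The dependence of NPV and PPV on prevalence, which underlies the threshold choices in this caption, follows from Bayes' rule. The sketch below uses the standard definitions; the function name and the input values are illustrative assumptions and do not reproduce the paper's tables.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Positive and negative predictive value from test characteristics."""
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

# Illustrative operating point, not a value from the paper.
ppv, npv = ppv_npv(sensitivity=0.9, specificity=0.8, prevalence=0.10)
```

A threshold tuned for exclusion favors high sensitivity, which pushes NPV up at the cost of PPV; a threshold tuned for risk stratification does the opposite.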

    Quantitative links between coronary artery disease risk and selection signals in <i>BCAS3</i>.

    <p><b>A.</b> Correlation between selection signals (iHS) and coronary artery disease (CAD) genetic risk (log odds, ln(OR)), both represented as absolute values. Red line/upper right value, <i>β</i> from mixed effects regression. <b>B.</b> Base pair positional comparison of selection signals and CAD genetic risk across <i>BCAS3</i>. Blue points, CAD log odds values; grey/orange points, non-significant/significant iHS scores. Horizontal bar shows <i>BCAS3</i> gene (and intron) span and location of lead index SNP. Blue/orange lines are smoothed lines estimated with the loess function in R. <b>C.</b> LD plots, <i>r</i><sup><i>2</i></sup>. Populations: CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; YRI, Yoruba from Ibadan, Nigeria.</p>