
22 research outputs found

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Author: Gad Abraham (3829777)
Michael Inouye (103694)
Publication venue
Publication date: 09/04/2014
Field of study

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

University of Melbourne Institutional Repository

FigShare

Figure 1

Author: Gad Abraham (3829777)
Michael Inouye (103694)

(a) The first two principal components from analyzing the HapMap3 dataset. (b) Scatter plots showing near-perfect absolute Pearson correlation (lower left-hand corner) between the 1st PCs estimated by smartpca, flashpca, shellfish, and R’s prcomp (using the standardization from <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766.e077" target="_blank">Equation 4</a>). Note that since eigenvectors are only defined up to sign, the correlations may be negative as well as positive. In addition, the scale of the PCs may differ between the methods; however, this has no bearing on the interpretation of the PCs.
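The sign and scale remarks in this caption can be illustrated with a small sketch. The helper names `pearson` and `abs_pc_correlation` are hypothetical and are not taken from any of the tools named above; the point is only that flipping the sign of a PC, or rescaling it, leaves the absolute Pearson correlation unchanged.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def abs_pc_correlation(pc_a, pc_b):
    """Compare two estimates of the same PC, ignoring sign."""
    return abs(pearson(pc_a, pc_b))

# Illustrative values only: the second vector is the first one,
# sign-flipped and rescaled, as two PCA tools might report it.
pc_tool_a = [0.3, -1.2, 0.8, 2.0, -0.5]
pc_tool_b = [-2.0 * v for v in pc_tool_a]
```

Here the raw correlation between `pc_tool_a` and `pc_tool_b` is negative, while the absolute correlation still indicates agreement between the two estimates.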

FigShare

Algorithm 1.

Author: Gad Abraham (3829777)
Michael Inouye (103694)

Pseudocode for the eigen-decomposition variant of the fast PCA, based on the randomized algorithm of <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766-Halko2" target="_blank">[5]</a>. The algorithm uses the standardization in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093766#pone.0093766.e077" target="_blank">Equation 4</a>, a function generating an iid multivariate normal matrix, a user-defined number of extra dimensions, the QR decomposition, a function that divides each column by its norm, and the eigen-decomposition, which produces the top eigenvectors and the vector of top eigenvalues. The output is the matrix of principal components.
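The building blocks named in this caption (random normal matrix, extra dimensions, column normalization, QR decomposition, eigen-decomposition) can be combined into a minimal NumPy sketch of a randomized eigen-decomposition PCA. This is not the flashpca implementation or the published Algorithm 1: the function name `randomized_pca`, its parameters, the number of power iterations, and the plain z-score standardization (standing in for Equation 4, which is not reproduced in this record) are all assumptions made for illustration.

```python
import numpy as np

def randomized_pca(X, n_components, n_extra=10, n_iter=5, seed=0):
    """Approximate the top principal components of X (samples x SNPs)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed standardization: plain z-scoring of each column.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # n x n matrix whose top eigenvectors give the sample-space PCs.
    K = Xs @ Xs.T / (n - 1)
    # Random starting block with a few extra dimensions for accuracy.
    k = min(n, n_components + n_extra)
    Y = K @ rng.standard_normal((n, k))
    for _ in range(n_iter):
        Y /= np.linalg.norm(Y, axis=0)  # divide each column by its norm
        Y = K @ Y                       # power iteration step
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the range
    B = Q.T @ K @ Q                     # small k x k projected matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:n_components]
    evals = evals[order]
    U = Q @ evecs[:, order]             # approximate top eigenvectors of K
    P = U * np.sqrt(np.clip(evals, 0.0, None))  # PCs, defined up to scale
    return P, evals
```

The expensive eigen-decomposition is applied only to the small projected matrix `B`, which is where the speed-up of randomized approaches comes from. The sketch still forms the full n x n matrix, so it illustrates the idea rather than the memory behaviour of a production tool.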

FigShare

Performance of the genomic risk score in external validation, when compared to other approaches, and on other related diseases.

Author: Adam Kowalczyk (90079)
Gad Abraham (3829777)
Jason A. Tye-Din (523005)
Justin Zobel (241587)
Michael Inouye (103694)
Oneil G. Bhalala (523006)

ROC curves for models trained in the UK2 dataset and tested on (a) four other CD datasets, (b) the Immunochip CD dataset, comparing the GRS approach with that of Romanos et al. <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004137#pgen.1004137-Romanos1" target="_blank">[21]</a>, and (c) three other autoimmune diseases (Crohn's disease, Rheumatoid Arthritis, and Type 1 Diabetes). We did not re-tune the models on the test data. For (b) and (c), we used a reduced set of SNPs for training, from the intersection of the UK2 SNPs with the Immunochip or WTCCC SNPs (18,252 SNPs and 76,847 SNPs, respectively). In (c), the same reduced set of SNPs was used for the CD-Finn dataset, in order to maintain the same SNPs across all target datasets.

FigShare

Distribution of genomic risk scores in cases and controls.

Author: Adam Kowalczyk (90079)
Gad Abraham (3829777)
Jason A. Tye-Din (523005)
Justin Zobel (241587)
Michael Inouye (103694)
Oneil G. Bhalala (523006)

(a) Kernel density estimates of the risk scores predicted using models trained on UK2 and tested in the combined dataset Finn+NL+IT, for cases and controls. (b) Thresholds for risk scores in terms of population percent, with individuals at the top more likely to have CD and those at the bottom more likely to be non-CD.

FigShare

Building genomic models predictive of celiac disease.

Author: Adam Kowalczyk (90079)
Gad Abraham (3829777)
Jason A. Tye-Din (523005)
Justin Zobel (241587)
Michael Inouye (103694)
Oneil G. Bhalala (523006)

LOESS-smoothed (a) AUC and (b) phenotypic variance explained, from 10×10 cross-validation, with differing model sizes, within each celiac dataset. The grey bands represent 95% confidence intervals about the mean LOESS smooth.

FigShare

The analysis workflow.

Author: Adam Kowalczyk (90079)
Gad Abraham (3829777)
Jason A. Tye-Din (523005)
Justin Zobel (241587)
Michael Inouye (103694)
Oneil G. Bhalala (523006)

The analysis workflow.

FigShare

Clinical interpretation as a function of threshold and prevalence.

Author: Adam Kowalczyk (90079)
Gad Abraham (3829777)
Jason A. Tye-Din (523005)
Justin Zobel (241587)
Michael Inouye (103694)
Oneil G. Bhalala (523006)

The number of non-CD cases “misdiagnosed” (wrongly implicated by GRS) per true CD case “diagnosed” (correctly implicated by GRS), for different levels of sensitivity. The risk score is based on a model trained on the UK2 dataset, and tested on the combined Finn+NL+IT dataset. The results were threshold-averaged over 50 independent replications. Note that the curve for K = 1% does not span the entire range due to averaging over a small number of cases in that dataset.
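The quantity plotted here can be related to sensitivity, specificity, and prevalence with a short sketch. It assumes that K denotes disease prevalence (the caption does not define it explicitly) and uses a hypothetical helper name; the input values in the usage note are illustrative and are not results from the study.

```python
def misdiagnosed_per_diagnosed(sensitivity, specificity, prevalence):
    """Expected false positives per true positive at a given threshold.

    In a population with the given prevalence, the fraction of false
    positives is (1 - prevalence) * (1 - specificity) and the fraction
    of true positives is prevalence * sensitivity.
    """
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    true_pos = prevalence * sensitivity
    return false_pos / true_pos
```

For example, with an illustrative sensitivity of 0.8, specificity of 0.9, and prevalence of 0.1, the function gives about 1.125 non-cases flagged per true case. Lowering the prevalence while holding the other two fixed raises this ratio, which is why the caption reports separate curves per prevalence level.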

FigShare

Example clinical scenarios.

Author: Adam Kowalczyk (90079)
Gad Abraham (3829777)
Jason A. Tye-Din (523005)
Justin Zobel (241587)
Michael Inouye (103694)
Oneil G. Bhalala (523006)

The GRS can be employed in different clinical scenarios and tuned to optimize outcomes. The GRS can be employed in a comparable manner to HLA testing (left table) to confidently exclude CD. In this scenario, we selected a GRS threshold based on NPV = 99.6%; however, a range of thresholds can be selected to achieve a high NPV (see note below). The GRS can also stratify CD risk (right table). Confirmatory testing (such as small bowel biopsy) would be reserved for those at high risk. In this example, we present two scenarios: optimization of PPV or of sensitivity. In comparison to the GRS, all HLA-susceptible patients will need to undergo further confirmatory testing for CD. For more information on GRS performance across a range of thresholds, see <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1004137#pgen.1004137.s007" target="_blank">Table S2</a>. Prospective validation of the GRS in local populations would enable the most appropriate settings for NPV, PPV and sensitivity to be identified, providing the optimal diagnostic outcomes. Note: the highest achievable NPV at 10% prevalence was 99.4%.
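The dependence of NPV and PPV on prevalence mentioned in this description follows from Bayes' rule. The sketch below uses hypothetical function names, and the values in the usage note are illustrative rather than taken from the study.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(disease | test positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sensitivity, specificity, prevalence):
    """Negative predictive value: P(no disease | test negative)."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)
```

With an illustrative sensitivity of 0.9 and specificity of 0.8 at 10% prevalence, PPV is one third while NPV is 0.72 / 0.73, roughly 98.6%. At fixed sensitivity and specificity, NPV falls as prevalence rises, which is consistent with the highest achievable NPV being lower in a higher-prevalence setting.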

FigShare

Quantitative links between coronary artery disease risk and selection signals in BCAS3.

Author: Andrew Bakshi (3408611)
Gad Abraham (3829777)
Lesley-Ann Gray (4166863)
Michael Inouye (103694)
Qin Qin Huang (4166860)
Samuli Ripatti (144251)
Sean G. Byars (381873)
Stephen C. Stearns (4166866)

A. Correlation between selection signals (iHS) and coronary artery disease (CAD) genetic risk (log odds, ln(OR)), both represented as absolute values. Red line/upper right value, β from mixed effects regression. B. Base pair positional comparison of selection signals and CAD genetic risk across BCAS3. Blue points, CAD log odds values; grey and orange points, non-significant and significant iHS scores, respectively. Horizontal bar shows BCAS3 gene (and intron) span and location of lead index SNP. Blue/orange lines are smoothed lines estimated with the loess function in R. C. LD plots, r². Populations: CEU, Utah residents with ancestry from northern and western Europe from the CEPH collection; YRI, Yoruba from Ibadan, Nigeria.

FigShare