    Estimation of kinship coefficient in structured and admixed populations using sparse sequencing data

    <div><p>Knowledge of biological relatedness between samples is important for many genetic studies. In large-scale human genetic association studies, the estimated kinship is used to remove cryptic relatedness, control for family structure, and estimate trait heritability. However, estimation of kinship is challenging for sparse sequencing data, such as those from off-target regions in target sequencing studies, where genotypes are largely uncertain or missing. Existing methods often assume accurate genotypes at a large number of markers across the genome. We show that these methods, without accounting for the genotype uncertainty in sparse sequencing data, can yield a strong downward bias in kinship estimation. We develop a computationally efficient method called SEEKIN to estimate kinship for both homogeneous samples and heterogeneous samples with population structure and admixture. Our method models genotype uncertainty and leverages linkage disequilibrium through imputation. We test SEEKIN on a whole-exome sequencing (WES) dataset of Singapore Chinese and Malays, which involves substantial population structure and admixture. We show that SEEKIN can accurately estimate kinship coefficients and classify genetic relatedness using off-target sequencing data downsampled to ~0.15X depth. In application to the full WES dataset without downsampling, SEEKIN also outperforms existing methods by properly analyzing shallow off-target data (~0.75X). Using both simulated and real phenotypes, we further illustrate how our method improves estimation of trait heritability for WES studies.</p></div>

    Performance of heterogeneous kinship estimators in ~0.15X sequencing data of 762 Chinese and Malays.

    <p>In each panel, we compared sequence-based estimates (<i>ϕ</i><sub>seq</sub>, y-axis) with the array-based estimates from PC-Relate (<i>ϕ</i><sub>array</sub>, x-axis). Colored circles represent kinship coefficients between two individuals, and different types of relatedness were determined in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1007021#pgen.1007021.g002" target="_blank">Fig 2</a>. Grey crosses represent self-kinship coefficients. We evaluated SEEKIN (A, E), PC-Relate (B, F), REAP (C, G), and RelateAdmix (D, H) using the BEAGLE call set (A-D), and the BEAGLE+1KG3 call set (E-H). We only included SNPs overlapping with the SGVP dataset in the analyses, because we used the SGVP dataset as the reference panel to estimate individual-specific allele frequencies for SEEKIN, REAP and RelateAdmix.</p>

    Population structure of 2,452 individuals in the Singapore Living Biobank Project.

    <p>(A) Reference ancestry space derived from PCA on the genotypes of Chinese (CHS), Malays (MAS) and Indians (INS) from SGVP. (B) Estimated ancestry in the SGVP reference space based on LASER analysis. Colored symbols represent study individuals of self-reported Chinese and Malays. Grey symbols represent the SGVP reference individuals. (C) Estimated admixture proportion based on supervised ADMIXTURE analysis with the SGVP data as the reference. We specified K = 3 clusters in the ADMIXTURE analysis, which represent Chinese (blue), Malay (green), and Indian (orange) ancestry components.</p>

    Heritability estimation for simulated traits in 762 Chinese and Malays.

    <p>We simulated quantitative traits of heritability <i>h</i><sup>2</sup> = 0.5 using a linear mixed model <i>Y</i>∼<i>N</i>(0,2<b>Φ</b> + <b><i>I</i></b>), where <b>Φ</b> is the array-based kinship matrix from PC-Relate and <b><i>I</i></b> is the identity matrix. We used the REML method in GEMMA to estimate heritability based on kinship matrices derived from WES data with or without off-target data using SEEKIN or PC-Relate (<a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1007021#pgen.1007021.g006" target="_blank">Fig 6</a>). We also considered a case where the off-target data were down-sampled to ~0.15X but the target data remained the same. Each box represents heritability estimates of 1,000 replicates.</p>
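
The simulation model in this caption, Y ∼ N(0, 2Φ + I), can be illustrated with a minimal sketch. It assumes unit genetic and residual variances (which gives h² = 0.5) and uses a toy 3×3 kinship matrix; the function names and the toy matrix are hypothetical and are not taken from the study, which used the PC-Relate kinship matrix and GEMMA for REML estimation (not shown here).

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor L of a symmetric positive-definite matrix a (a = L L^T)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def trait_covariance(phi, sigma_g2=1.0, sigma_e2=1.0):
    """Covariance sigma_g2 * 2*Phi + sigma_e2 * I; the defaults give 2*Phi + I."""
    n = len(phi)
    return [
        [2.0 * sigma_g2 * phi[i][j] + (sigma_e2 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def heritability(sigma_g2=1.0, sigma_e2=1.0):
    """Narrow-sense heritability implied by the two variance components."""
    return sigma_g2 / (sigma_g2 + sigma_e2)


def simulate_trait(phi, rng, sigma_g2=1.0, sigma_e2=1.0):
    """Draw one trait vector Y ~ N(0, sigma_g2 * 2*Phi + sigma_e2 * I) as Y = L z."""
    low = cholesky(trait_covariance(phi, sigma_g2, sigma_e2))
    z = [rng.gauss(0.0, 1.0) for _ in range(len(phi))]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(phi))]


# Toy kinship matrix (assumption): individuals 0 and 1 are full siblings
# (kinship 0.25), individual 2 is unrelated; self-kinship is 0.5 throughout.
toy_phi = [
    [0.50, 0.25, 0.00],
    [0.25, 0.50, 0.00],
    [0.00, 0.00, 0.50],
]

y = simulate_trait(toy_phi, random.Random(1))
```

Repeating `simulate_trait` with different seeds would produce the replicate traits; each replicate would then be passed to a REML tool together with a candidate kinship matrix.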

    Off-target sequencing data improve kinship estimation in WES of 762 Chinese and Malays.

    <p>In each panel, we plotted the difference between sequence-based estimates and array-based estimates (<i>ϕ</i><sub>seq</sub>–<i>ϕ</i><sub>array</sub>, y-axis) versus the array-based estimates from PC-Relate (<i>ϕ</i><sub>array</sub>, x-axis). Colored circles represent kinship coefficients between two individuals, and different types of relatedness were determined in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1007021#pgen.1007021.g002" target="_blank">Fig 2</a>. Grey crosses represent self-kinship coefficients. The analyses were based on the BEAGLE+1KG3 call set at SNPs overlapping with the SGVP dataset. We evaluated SEEKIN (A, C) and PC-Relate (B, D) using 40,824 SNPs within the WES target regions or 1,054,229 SNPs across both target and off-target regions.</p>

    Ancestry and individual-specific allele frequency estimation using array data or ~0.15X sequencing data of 762 Chinese and Malays.

    <p>(A-B) LASER ancestry estimates based on array genotypes across 435,314 SNPs overlapping with the SGVP reference dataset (A) or ~0.15X sequence reads scattered genome-wide (B). Colored symbols represent study individuals and grey symbols represent the SGVP reference individuals. The Procrustes similarity between (A) and (B) is t<sub>0</sub> = 0.9976 for 762 study individuals. (C-D) Comparison of individual-specific allele frequencies derived from LASER analysis of either array data (C) or ~0.15X sequencing data (D) to the gold standard based on ADMIXTURE analysis of array data. The two-way allele frequency space is evenly divided into 100×100 grids and the number of data points within each grid is color-coded according to the logarithmic scale in the color bar. The Pearson correlation is r = 0.9980 across all data points in (C) and is r = 0.9976 across all data points in (D).</p>
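
The two summaries used in panels (C-D), per-cell counts on an evenly divided 100×100 grid and the Pearson correlation, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the inputs are assumed to be allele frequencies in [0, 1], and the sample values are made up rather than drawn from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def grid_counts(x, y, bins=100):
    """Count points per cell after dividing [0, 1] x [0, 1] evenly into bins x bins cells."""
    counts = [[0] * bins for _ in range(bins)]
    for a, b in zip(x, y):
        # A frequency of exactly 1.0 is placed in the last cell.
        i = min(int(a * bins), bins - 1)
        j = min(int(b * bins), bins - 1)
        counts[i][j] += 1
    return counts


# Made-up allele frequencies: estimated values versus a reference standard.
estimated = [0.10, 0.20, 0.30, 0.995, 1.00]
reference = [0.12, 0.19, 0.33, 0.995, 1.00]

r = pearson_r(estimated, reference)
counts = grid_counts(estimated, reference)
```

For plotting, each nonzero count would typically be mapped to color on a logarithmic scale, as the caption describes.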