    Resolution for varying relatedness using GRM, encGRM and encG-reg.

    The figure shows the resolution for detecting relatives or overlapping samples as a function of the number of markers in each row (for clearer illustration, m_e was twice that of Eq 3) and of the degree of relatives to be detected (r = 0, 1, and 2). The y axis is the relatedness calculated from GRM and the x axis is the relatedness estimated by encG-reg (A) and encGRM (B). Each point represents an individual pair between cohort 1 and cohort 2 (200 × 200 = 40,000 pairs in total), given the simulated relatedness. The dotted lines indicate the 95% confidence intervals of the relatedness estimated directly from the original genotypes (blue) and from the encrypted genotypes (red). The table shows how m and k are estimated. The columns “under minimal m_e” provide a benchmark for each parameter; in practice, one chooses 2×m_e and then estimates k as shown under the column “practical m_e”.

    Cohort-level genetic background analyses for Chinese cohorts under parsimony encG-reg analysis.

    (A) Overview of the intersected SNPs across cohorts. Each row represents one cohort and each column represents one combination of cohorts; a black dot indicates that the corresponding cohort is included, and dots linked by lines mark the cohorts in that combination. The height of the bars represents each cohort’s SNP number (rows) or the SNP intersection number (columns). Inset histograms show the distribution of the 7,009 intersected SNPs and of the 500 SNPs randomly chosen from the 7,009 SNPs for encG-reg analysis. (B) The 7,009 SNPs in the intersection of the 9 cohorts were used to estimate fPC. Each triangle represents one Chinese cohort and is placed according to its first two principal component scores (fPC1 and fPC2) derived from the received allele frequencies. (C) Five private datasets are pinned onto the base map from GADM (https://gadm.org/data.html) using R. The size of each point indicates the sample size of the dataset. (D) The global fStructure plot shows the global-level Fst-derived genetic composition projected onto three external reference populations: 1KG-CHN (CHB and CHS), 1KG-EUR (CEU and TSI), and 1KG-AFR (YRI); 4,296 of the 7,009 SNPs, those intersected with the three reference populations, were used. (E) The within-China fStructure plot shows the within-China genetic composition. The three external references are 1KG-CHB (North Chinese), 1KG-CHS (South Chinese), and 1KG-CDX (Southwest minority Chinese Dai); 4,809 of the 7,009 SNPs, those intersected with these three reference populations, were used. The x axis lists the 9 Chinese cohorts, and the height of each bar represents the cohort’s proportional genetic composition with respect to the three reference populations. Cohort codes: YRI, Yoruba in Ibadan, representing African samples; CHB, Han Chinese in Beijing; CHS, Southern Han Chinese; CHN, CHB and CHS together; CEU, Utah Residents with Northern and Western European Ancestry; TSI, Toscani in Italia; CDX, Chinese Dai in Xishuangbanna.

    m_e estimation in UK Biobank and Chinese cohorts.

    Explicitly sharing individual-level data in genomics studies has many merits compared with sharing summary statistics, including stricter QC, common statistical analyses, relative identification, and improved statistical power in GWAS, but it is hampered by privacy or ethical constraints. In this study, we developed encG-reg, a regression approach that can detect relatives of various degrees based on encrypted genomic data, which is immune to ethical constraints. The encryption properties of encG-reg are based on random matrix theory: the original genotype matrix is masked without sacrificing the precision of individual-level genotype data. We established a connection between the dimension of the random matrix that masks the genotype matrices and the precision required of a study using encrypted genotype data. encG-reg has false positive and false negative rates equivalent to those obtained by sharing original individual-level data, and is computationally efficient when searching for relatives. We split the UK Biobank into its respective centers and then encrypted the genotype data. We observed that the relatives estimated using encG-reg were as accurate as those estimated by KING, a widely used software package that requires original genotype data. In a more complex application, we launched a finely devised multi-center collaboration across 5 research institutes in China, covering 9 cohorts of 54,092 GWAS samples. encG-reg again identified true relatives across the cohorts, even with different ethnic backgrounds and genotypic qualities. Our study clearly demonstrates that encrypted genomic data can be used for data sharing without loss of information or data-sharing barriers.
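The masking idea in the abstract can be illustrated with a small numerical sketch. This is not the authors' encG-reg estimator; it only shows, under stated assumptions, why multiplying standardized genotypes by a shared random matrix approximately preserves cross-cohort relatedness (the quantity the source refers to as encGRM). The dimensions, allele frequencies, Gaussian entries of `S`, and the use of known frequencies for standardization are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not the values used in the study).
n1, n2 = 50, 50      # individuals in cohort 1 and cohort 2
m, k = 200, 5000     # number of SNPs, number of columns of the random matrix

# Simulate unrelated genotypes (0/1/2 allele counts) and plant one overlapping sample.
freq = rng.uniform(0.1, 0.9, size=m)
G1 = rng.binomial(2, freq, size=(n1, m)).astype(float)
G2 = rng.binomial(2, freq, size=(n2, m)).astype(float)
G2[0] = G1[0]  # individual 0 is present in both cohorts

def standardize(G, freq):
    """Center and scale each SNP by its (assumed known) allele frequency."""
    return (G - 2 * freq) / np.sqrt(2 * freq * (1 - freq))

X1 = standardize(G1, freq)
X2 = standardize(G2, freq)

# Shared m x k random matrix with i.i.d. standard normal entries (an assumption).
S = rng.standard_normal((m, k))

# Each cohort shares only its masked matrix, never X1 or X2.
E1 = X1 @ S
E2 = X2 @ S

# Relatedness from the original data versus from the masked data.
grm = X1 @ X2.T / m
enc_grm = E1 @ E2.T / (m * k)

print("largest absolute difference:", np.abs(enc_grm - grm).max())
print("planted overlap, original vs masked:", grm[0, 0], enc_grm[0, 0])
```

Because the expectation of `S @ S.T / k` is the identity matrix, the masked cross product converges to the original one as `k` grows, with sampling noise for unrelated pairs of roughly `1 / sqrt(k)`. This is the sense in which the dimension of the random matrix is tied to the precision of the encrypted analysis.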

    Workflow of encG-reg and its practical timeline as exercised in Chinese cohorts.

    The mathematical details of encG-reg are simple algebra, but its inter-cohort implementation requires coordination. (A) We illustrate the key steps, with time costs taken from the present exercise for 9 Chinese datasets (here simplified as three cohorts). Cohort assembly: it took about a week to contact our collaborators and receive positive responses (see Table 3) agreeing to our research plan. Inter-cohort QC: we received allele frequency reports from each cohort and carried out inter-cohort QC according to the “geo-geno” analysis (see Fig 6). This step took about two weeks. Encrypt genotypes: depending on the exercise, one may use an exhaustive design (see the UKB example), which may maximize statistical power but increases logistics such as generating pairwise Sij; in the Chinese cohorts study we used a parsimony design and generated a unique S for 500 SNPs chosen from the 7,009 common SNPs. It took about a week to determine the number of SNPs and the dimension k according to Eq 3 and 4, and to evaluate the effective number of markers. Perform encG-reg and validation: we conducted inter-cohort encG-reg and validated the results (see Fig 7 and Table 4). This step took one week. (B) Two interactions between data owners and the central analyst, including example data for exchange, possible attacks, and corresponding preventative strategies.
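The SNP selection step of the parsimony design described above can be sketched as follows. This is a minimal illustration under assumptions, not the study's pipeline: the function name `pick_shared_snps`, the cohort names, the SNP identifiers, and the seed are hypothetical, and the sketch covers only intersecting SNP lists and drawing a reproducible random subset.

```python
import random

def pick_shared_snps(cohort_snps, n_pick, seed):
    """Intersect SNP IDs across cohorts and draw n_pick of them reproducibly.

    cohort_snps maps a cohort name to an iterable of SNP identifiers.
    """
    common = sorted(set.intersection(*(set(ids) for ids in cohort_snps.values())))
    if n_pick > len(common):
        raise ValueError(f"only {len(common)} shared SNPs, cannot pick {n_pick}")
    # A seeded generator lets every participant reproduce the same subset.
    return sorted(random.Random(seed).sample(common, n_pick))

# Hypothetical toy SNP lists for three cohorts.
cohort_snps = {
    "cohort_A": [f"snp{i}" for i in range(0, 80)],
    "cohort_B": [f"snp{i}" for i in range(20, 100)],
    "cohort_C": [f"snp{i}" for i in range(10, 90)],
}

chosen = pick_shared_snps(cohort_snps, n_pick=10, seed=2024)
print(len(chosen), "SNPs chosen from the shared set")
```

In the study, the analogous step drew 500 SNPs from the 7,009 common SNPs; the numbers in this sketch are scaled down for readability.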

    Summary of cross-cohort quality control.
