Scatter plot of the mean allele frequencies per genome (n = 50) vs. age of the genomes (calibrated years before present).
Red line represents the linear model. We found that there is no significant correlation between the age of the individuals and the mean allele frequency (Spearman's rank correlation ρ = -0.12, P = 0.41). (TIF)
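As an illustration of the statistic quoted in this caption, Spearman's ρ can be computed as the Pearson correlation of the rank-transformed values. The following is a minimal sketch using only the Python standard library; the function names and the toy numbers are hypothetical and are not taken from the study, and the sketch does not compute the P value.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Toy values (made up for illustration): genome ages in calibrated years
# before present, and a mean allele frequency per genome.
ages = [9500, 7200, 4800, 3100, 2800]
mean_frequencies = [0.21, 0.24, 0.20, 0.23, 0.22]
rho = spearman_rho(ages, mean_frequencies)
```

In practice a statistics package would normally be used to obtain both ρ and its P value; the sketch only makes the rank-based definition explicit.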
Heatmaps of CONGA-genotyped n = 10,002 human-derived deletions across ancient genomes.
The color key includes 0 (gray) for reference allele, 1 (green) for heterozygous, 2 (magenta) for homozygous state and NA (white) for missing value. (A) Heatmap of deletions per genome on the raw dataset (n = 71 genomes). (B) Heatmap of deletions per genome on the refined dataset (with n = 50 genomes after removing divergent genomes). (TIF)
Precision-Recall plots for simulations.
Precision-Recall curves for deletion (A) and duplication (B) predictions of CONGA, GenomeSTRiP, FREEC, and CNVnator using coverages of 0.05×, 0.1×, 0.5×, 1× and 5×. mrCaNaVaR was used only in the analysis of large variants. (TIF)
Overall workflow of CONGA.
The first step involves initialization, where we create the input (reference) CNV file using deletions and duplications identified in high quality genome sets. We apply CONGA-genotyping in the second step and create the initial CNV call set. We then perform filtering and refining steps, and thus generate the final CNV call set.
Correlations between SNP- and deletion-based deleterious load estimates in 50 ancient genomes.
(A) From left to right: histograms of the number of SIFT-predicted "deleterious" SNPs over "tolerated" SNPs per genome, CONGA-predicted total deletion length in kb per genome, and the number of genes that overlap with CONGA-predicted deletions per genome. (B) Correlations between each pair of variables. The RHS triangle shows the scatter plots between two variables, and the LHS triangle shows the Spearman rank correlation estimates. The significance of the ρ values is also shown. **** ρ [...] (TIF)
Effect of minimum read-pair support on the F-Score for duplications using various depths of coverage values in simulated genomes.
Medium-sized CNVs are between 1,000 bps and 10,000 bps, and large CNVs are between 10,000 bps and 100,000 bps. Here, we used a relaxed C-score threshold of 10 in order to observe the effect of read-pair support only. The figure shows that read-pair support is effective when the coverage is above 0.5× and also when the duplication sizes are larger. (TIF)
Performance comparison of multiple tools and CONGA performance analysis.
Table A shows the CNV predictions of CONGA, GenomeSTRiP, FREEC, CNVnator and mrCaNaVaR on simulated genomes at depths 0.05×, 0.1×, 0.5×, 1× and 5× for deletions and duplications of multiple CNV size intervals including 100 bps–1 kbps (small), 1 kbps–10 kbps (medium) and 10 kbps–100 kbps (large). Here, "T" and "F" refer to correct and incorrect predictions respectively, "Miss" is the number of missed true events, "Recall" (TPR) is the true positive rate, and "FDR" is the false discovery rate (1 − "Precision") for each run. The F-Score is calculated as (2 * Precision * Recall) / (Precision + Recall). Note that for CONGA, we included the performance for both C-score [...]
Table B shows a comparison between CONGA and GenomeSTRiP predictions on simulated genomes at depths 0.05×, 0.1×, 0.5×, 1× and 5× for deletions and duplications of 1 kbps–10 kbps (medium) and 10 kbps–100 kbps (large) CNV size intervals.
Table C shows the copy-number (homozygous or heterozygous) predictions of CONGA on simulated genomes at depths 0.05×, 0.1×, 0.5×, 1× and 5× for deletions and duplications of multiple CNV size intervals including 100 bps–1 kbps (small), 1 kbps–10 kbps (medium) and 10 kbps–100 kbps (large). Here, "T" and "F" refer to correct and incorrect predictions respectively, "Miss" is the number of missed true events, "Recall" (TPR) is the true positive rate, and "FDR" is the false discovery rate (1 − "Precision") for each run. The F-Score is calculated as (2 * Precision * Recall) / (Precision + Recall). Note that for CONGA, we included the performance for both C-score [...]
Table D shows deletion and duplication predictions of CONGA using Mota, Saqqaq and Yamnaya genomes down-sampled to various depths from their original coverages of 9.6×, 13.1× and 23.3×, respectively. Here, "T" and "F" refer to correct and incorrect predictions respectively, "Miss" is the number of missed true events, "Recall" (TPR) is the true positive rate, and "FDR" is the false discovery rate (1 − "Precision") for each run. The F-Score is calculated as (2 * Precision * Recall) / (Precision + Recall). We calculated "True", "False", "Miss", "Recall", "Precision", FDR and F-Score of down-sampled genomes assuming that our CONGA-based predictions with the original genomes (full data) reflect the ground truth. These predictions, in turn, were made using modern-day CNVs as the candidate CNV list. The purpose of the experiment was to evaluate accuracy at lower coverage relative to the full data.
Table E shows CONGA's running time and memory consumption on genomes of various depths of coverage calculated using the down-sampled 23× Yamnaya genome (with coverages between 23× and 0.07×) as well as a comparison of CONGA, GenomeSTRiP, FREEC and CNVnator using a 5× simulated genome.
Table F shows the results of CONGA runs on simulated genomes at a range of parameters, performed in order to determine the optimum parameters to be used. We tested multiple parameter combinations of C-Score, minimum read-pair support, mappability and minimum mapping quality (MAPQ) using simulated genomes. (XLSX)
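The metrics named in these table descriptions can be derived from the three counts reported per run. The sketch below assumes the standard reading of the caption: Precision = T / (T + F) and Recall = T / (T + Miss), with FDR and F-Score following the formulas quoted above. The function name, the handling of undefined values (returned as None), and the example counts are illustrative assumptions, not part of the original tables.

```python
def cnv_metrics(true_calls, false_calls, missed):
    """Derive Precision, Recall, FDR and F-Score from per-run counts.

    true_calls  -- "T": correct predictions
    false_calls -- "F": incorrect predictions
    missed      -- "Miss": true events that were not predicted
    Metrics that are undefined (zero denominator) are returned as None;
    this convention is an assumption made for the sketch.
    """
    predicted = true_calls + false_calls
    actual = true_calls + missed

    precision = true_calls / predicted if predicted else None
    recall = true_calls / actual if actual else None
    fdr = 1.0 - precision if precision is not None else None

    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        # F-Score as given in the caption.
        f_score = (2 * precision * recall) / (precision + recall)

    return {"precision": precision, "recall": recall,
            "fdr": fdr, "f_score": f_score}


# Hypothetical counts for one run (not taken from the tables).
example = cnv_metrics(true_calls=80, false_calls=20, missed=20)
```

For the down-sampling experiment in Table D, the same arithmetic would apply with the full-coverage CONGA call set standing in for the ground truth, as the caption describes.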
A sample deletion missed due to poor signal.
An inserted deletion in a simulated ancient genome at 1× depth of coverage. The event breakpoint is chr22:39,386,521-39,391,930. CONGA missed this deletion due to the poor signal. (TIF)
Multivariate analysis of deletion frequencies reveals outlier genomes.
Left panels: Multidimensional scaling plots (MDS) calculated with k = 2 using the R "cmdscale" function on a Euclidean distance matrix of deletion frequencies. Middle panels: Principal component analysis plots (PCA) summarizing deletion frequencies after removing any NAs. Right panels: Hierarchical clustering trees summarizing Manhattan distance matrices, calculated using the R "dist" and "hclust" functions. The color codes indicate the laboratory-of-origin of each genome, shown in the legend of the top right panel. (A) Results based on the full dataset with 10,002 human-derived deletions (n = 8,780 genotyped in any state in at least one genome) and n = 71 genomes. In the PCA we use nD = 580 deletions after removing loci with at least one missing value. (B) Results based on n = 60 genomes after removing 11 outlier genomes (and nD = 3,460 deletions in the PCA). (C) Results based on n = 50 genomes after removing 21 outlier genomes (and nD = 3,472 deletions in the PCA). We note that the MDS here differs from that shown in Fig 5, in that the latter is calculated using outgroup-f3 statistics. (TIF)
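The locus filtering described for the PCA (dropping any deletion with at least one missing value) and a Manhattan distance between genomes can be sketched as follows. This is a simplified illustration in plain Python, not the R code used for the figure: the helper names and the toy genotype matrix are hypothetical, missing values are represented as None, and restricting the distance to complete loci is an assumption of the sketch (the caption does not state how missing values were treated in the distance matrices).

```python
def complete_loci(genotypes):
    """Indices of loci (columns) that are genotyped in every genome (row)."""
    n_loci = len(genotypes[0])
    return [j for j in range(n_loci)
            if all(row[j] is not None for row in genotypes)]


def manhattan_matrix(genotypes, loci):
    """Pairwise Manhattan distances between genomes over the given loci."""
    n = len(genotypes)
    return [[sum(abs(genotypes[a][j] - genotypes[b][j]) for j in loci)
             for b in range(n)]
            for a in range(n)]


# Toy matrix: 3 genomes x 4 deletions, coded 0 (reference), 1 (heterozygous),
# 2 (homozygous) and None (missing). Values are made up for illustration.
toy_genotypes = [
    [0, 1, None, 2],
    [0, 2, 1, 2],
    [1, None, 0, 0],
]
kept = complete_loci(toy_genotypes)           # loci with no missing value
distances = manhattan_matrix(toy_genotypes, kept)
```

The strictness of this filter is consistent with the counts in the caption: as genomes are removed from the dataset, more loci remain complete across all remaining genomes.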