28 research outputs found

    Genomic selection in farm animals: accuracy of prediction and applications with imputed whole-genome sequencing data in chicken

    Get PDF
    Methoden zur genomischen Vorhersage basierend auf Genotypinformationen von Single Nucleotide Polymorphism (SNP)-Arrays mit unterschiedlicher Markeranzahl sind mittlerweile in vielen Zuchtprogrammen für Nutztiere fest implementiert. Mit der zunehmenden Verfügbarkeit von vollständigen Genomsequenzdaten, die auch kausale Mutationen enthalten, werden mehr und mehr Studien veröffentlicht, bei denen genomische Vorhersagen beruhend auf Sequenzdaten durchgeführt werden. Das Hauptziel dieser Arbeit war zu untersuchen, inwieweit SNP-Array-Daten mit statistischen Verfahren bis zum Sequenzlevel ergänzt werden können (sogenanntes „Imputing“) (Kapitel 2) und ob die genomische Vorhersage mit imputeten Sequenzdaten und zusätzlicher Information über die genetische Architektur eines Merkmals verbessert werden kann (Kapitel 3). Um die Genauigkeit der genomischen Vorhersage besser verstehen und eine neue Methode zur Approximation dieser Genauigkeit ableiten zu können, wurde außerdem eine Simulationsstudie durchgeführt, die den Grad der Überschätzung der Genauigkeit der genomischen Vorhersage verschiedener bereits bekannter Ansätze überprüfte (Kapitel 4). Der technische Fortschritt im letzten Jahrzehnt hat es ermöglicht, in relativ kurzer Zeit Millionen von DNA-Abschnitten zu sequenzieren. Mehrere auf unterschiedlichen Algorithmen basierende Software-Programme zur Auffindung von Sequenzvarianten (sogenanntes „Variant Calling“) haben sich etabliert und es möglich gemacht, SNPs in den vollständigen Genomsequenzdaten zu detektieren detektieren. Oft werden nur wenige Individuen einer Population vollständig sequenziert und die Genotypen der anderen Individuen, die mit einem SNP-Array an einer Teilmenge dieser SNPs typisiert wurden, imputet. In Kapitel 2 wurden deshalb anhand von 50 vollständig sequenzierten Weiß- und Braunleger-Individuen die mit drei unterschiedlichen Variant-Calling-Programmen (GATK, freebayes and SAMtools) detektierten Genomvarianten verglichen und die Qualität der Genotypen überprüft. Auf den untersuchten Chromosomen 3,6 und 26 wurden 1.741.573 SNPs von allen drei Variant Callers detektiert was 71,6% (81,6%, 88,0%) der Anzahl der von GATK (SAMtools, freebayes) detektierten Varianten entspricht. Die Kenngröße der Konkordanz der Genotypen („genotype concordance“), die durch den Anteil der Individuen definiert ist, deren Array-basierte Genotypen mit den Sequenz-basierten Genotypen an allen auch auf dem Array vorhandenen SNPs übereinstimmt, betrug 0,98 mit GATK, 0,98 mit SAMtools und 0,97 mit freebayes (Werte gemittelt über SNPs auf den untersuchten Chromosomen). Des Weiteren wiesen bei Nutzung von GATK (SAMtools, freebayes) 90% (88 %, 75%) der Varianten hohe Werte (>0.9) anderer Qualitätsmaße (non-reference sensitivity, non-reference genotype concordance und precision) auf. Die Leistung aller untersuchten Variant-Calling-Programme war im Allgemeinen sehr gut, besonders die von GATK und SAMtools. In dieser Studie wurde außerdem in einem Datensatz von ungefähr 1000 Individuen aus 6 Generationen die Güte des Imputings von einem hochdichten SNP-Array zum Sequenzlevel untersucht. Die Güte des Imputings wurde mit Hilfe der Korrelationen zwischen imputeten und wahren Genotypen pro SNP oder pro Individuum und der Anzahl an Mendelschen Konflikten bei Vater-Nachkommen-Paaren beschrieben. Drei unterschiedliche Imputing-Programme (Minimac, FImpute und IMPUTE2) wurden in unterschiedlichen Szenarien validiert. Bei allen Imputing-Programmen betrug die Korrelation zwischen wahren und imputeten Genotypen bei 1000 Array-SNPs, die zufällig ausgewählt und deren Genotypen im Imputing-Prozess als unbekannt angenommen wurden, durchschnittlich mehr als 0.95 sowie mehr als 0.85 bei einer Leave-One-Out-Kreuzvalidierung, die mit den sequenzierten Individuen durchgeführt wurde. Hinsichtlich der Genotypenkorrelation zeigten Minimac und IMPUTE2 etwas bessere Ergebnisse als FImpute. Dies galt besonders für SNPs mit niedriger Frequenz des selteneren Allels. FImpute wies jedoch die kleinste Anzahl von Mendelschen Konflikten in verfügbaren Vater-Nachkommen-Paaren auf. Die Korrelation zwischen wahren und imputeten Genotypen blieb auf hohem Niveau, auch wenn die Individuen, deren Genotypen imputet wurden, einige Generationen jünger waren als die sequenzierten Individuen. Zusammenfassend zeigte in dieser Studie GATK die beste Leistung unter den getesteten Variant-Calling-Programmen, während Minimac sich unter den untersuchten Imputing-Programmen als das beste erwies. Aufbauend auf den Ergebnissen aus Kapitel 2 wurden in Kapitel 3 Studien zur genomischen Vorhersage mit imputeten Sequenzdaten durchgeführt. Daten von 892 Individuen aus 6 Generationen einer kommerziellen Braunlegerlinie standen hierfür zur Verfügung. Diese Tiere waren alle mit einem hochdichten SNP-Array genotypisiert. Unter der Nutzung der Daten von 25 vollständig sequenzierten Individuen wurden jene Tiere ausgehend von den Array-Genotypen bis zum Sequenzlevel hin imputet. Das Imputing wurde mit Minimac3 durchgeführt, das bereits haplotypisierte Daten (in dieser Studie mit Beagle4 erzeugt) als Input benötigt. Die Genauigkeit der genomischen Vorhersage wurde durch die Korrelation zwischen de-regressierten konventionellen Zuchtwerten und direkt genomischen Zuchtwerten für die Merkmale Bruchfestigkeit, Futteraufnahme und Legerate gemessen. Neben dem Vergleich der Genauigkeit der auf SNP-Array-Daten und Sequenzdaten basierenden genomischen Vorhersage wurde in dieser Studie auch untersucht, wie sich die Verwendung verschiedener genomischer Verwandtschaftsmatrizen, die die genetische Architektur berücksichtigen, auf die Vorhersagegenauigkeit auswirkt. Hierbei wurden neben dem Basisszenario mit gleichgewichteten SNPs auch Szenarien mit Gewichtungsfaktoren, nämlich den -(〖log〗_10 P)-Werten eines t-Tests basierend auf einer genomweiten Assoziationsstudie und den quadrierten geschätzten SNP-Effekten aus einem Random Regression-BLUP-Modell, sowie die Methode BLUP|GA („best linear unbiased prediction given genetic architecture“) überprüft. Das Szenario GBLUP mit gleichgewichteten SNPs wurde sowohl mit einer Verwandtschaftsmatrix aus allen verfügbaren SNPs oder nur derer in Genregionen, jeweils ausgehend von der Grundmenge aller imputeten SNPs in der Sequenz oder der Array-SNPs, getestet. Gemittelt über alle untersuchten Merkmale war die Vorhersagegenauigkeit mit SNPs aus Genregionen, die aus den imputeten Sequenzdaten extrahiert wurden, mit 0,366 ± 0,075 am höchsten. Den zweithöchsten Wert erreichte die genomische Vorhersage mit SNPs aus Genregionen, die im SNP-Array erhalten sind (0,361 ± 0,072). Weder die Verwendung gewichteter genomischer Verwandtschaftsmatrizen noch die Anwendung von BLUP|GA führten im Vergleich zum normalen GBLUP-Ansatz zu höheren Vorhersagegenauigkeiten. Diese Beobachtung war unabhängig davon, ob SNP-Array- oder imputete Sequenzdaten verwendet wurden. Die Ergebnisse dieser Studie zeigten, dass kaum oder kein Zusatznutzen durch die Verwendung von imputeten Sequenzdaten generiert werden kann. Eine Erhöhung der Vorhersagegenauigkeit konnte jedoch erreicht werden, wenn die Verwandschaftsmatrix nur aus den SNPs in Genregionen gebildet wurde, die aus den Sequenzdaten extrahiert wurden. Die Auswahl der Selektionskandidaten erfolgt in genomischen Selektionsprogrammen mit Hilfe der geschätzten genomischen Zuchtwerte (GBVs). Die Genauigkeit des GBV ist hierbei ein relevanter Parameter, weil sie die Stabilität der geschätzten Zuchtwerte beschreibt und zeigen kann, wie sich der GBV verändern kann, wenn mehr Informationen verfügbar werden. Des Weiteren ist sie einer der entscheidenden Faktoren beim erwarteten Zuchtfortschritt (auch als so genannte „Züchtergleichung“ beschrieben). Diese Genauigkeit der genomischen Vorhersage ist jedoch in realen Daten schwer zu quantifizieren, da die wahren Zuchtwerte (TBV) nicht verfügbar sind. In früheren Studien wurden mehrere Methoden vorgeschlagen, die es ermöglichen, die Genauigkeit von GBV durch Populations- und Merkmalsparameter (z.B. effektive Populationsgröße, Sicherheit der verwendeten Quasi-Phänotypen, Anzahl der unabhängigen Chromosomen-Segmente) zu approximieren. Weiterhin kann die Genauigkeit bei Verwendung von gemischten Modellen mit Hilfe der Varianz des Vorhersagefehlers abgeleitet werden. In der Praxis wiesen die meisten dieser Ansätze eine Überschätzung der Genauigkeit der Vorhersage auf. Deshalb wurden in Kapitel 4 mehrere methodische Ansätze aus früheren Arbeiten in simulierten Daten mit unterschiedlichen Parametern, mit Hilfe derer verschiedene Tierzuchtprogramme (neben einem Basisszenario ein Rinder- und ein Schweinezuchtschema) abgebildet wurden, überprüft und die Höhe der Überschätzung gemessen. Außerdem wurde in diesem Kapitel eine neue und leicht rechenbare Methode zur Approximation der Genauigkeit vorgestellt Die Ergebnisse des Vergleichs der methodischen Ansätze in Kapitel 4 zeigten, dass die Genauigkeit der GBV durch den neuen Ansatz besser vorhergesagt werden kann. Der vorgestellte Ansatz besitzt immer noch einen unbekannten Parameter, für den jedoch eine Approximation möglich ist, wenn in einem geeigneten Datensatz Ergebnisse von Zuchtwertschätzungen zu zwei verschiedenen Zeitpunkten vorliegen. Zusammenfassend kann gesagt werden, dass diese neue Methode die Approximation der Genauigkeit des GBV in vielen Fällen verbessert.Genomic prediction has been successfully applied in many livestock breeding schemes, based on different densities of single nucleotide polymorphism (SNP) array data. With the availability of whole-genome sequencing (WGS) data, which may contain the causal mutations, there are a growing number of studies to conducting genomic prediction with WGS data. The main objective of this thesis was to investigate the possibility of imputing SNP array data up to the whole genome sequence level (Chapter 2) and then perform genomic prediction based on the imputed WGS data and SNP array data with different genomic relationship matrices to account for genetic architecture (Chapter 3). To further understand the accuracy of genomic prediction, a simulation study was performed to determine the degree of overestimation of the accuracy of genomic prediction, in order to propose a new method (Chapter 4). The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract SNPs out of the whole-genome sequence. Often, only a few individuals of a population are sequenced completely and imputation is used to obtain genotypes for all sequence-based SNP loci for other individuals that have been genotyped for a subset of SNPs using a genotyping array. Thus, in Chapter 2 we first compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. There were 1,741,573 SNPs detected by all three callers on the studied chromosomes 3, 6, and 28, which was 71.6% (81.6%, 88.0%) of SNPs detected by GATK (SAMtools, freebayes) in total. Genotype concordance (GC), defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array, was 0.98 with GATK, 0.98 with SAMtools, and 0.97 with freebayes averaged over all SNPs on the studied chromosomes, respectively. Furthermore, for GATK (SAMtools, freebayes) 90 (88, 75) percent of variants had high values (>0.9) for other quality measures (non-reference sensitivity, non-reference genotype concordance and precision). Performance of all variant callers studied was very good in general, particularly for GATK and SAMtools. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotype per SNP and per individual and genotype conflict between father-progeny pairs) when imputing from high density SNP array data to whole-genome sequence using data from approximately 1000 individuals from six generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios. Across all imputation programs, correlation between true and imputed genotypes was >0.95 on average with randomly masked 1000 SNPs from the SNP array and >0.85 for a leave-one-out cross-validation within sequenced individuals. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency, however, it did have the lowest numbers in Mendelian conflicts in available father-progeny pairs. Correlations of real and imputed genotypes remained constantly high even if individuals to be imputed were several generations away from the sequenced individuals. In conclusion, among three variant callers tested GATK proved the relatively better performance; Minimac proved the relatively better performance comparing to the other two imputation programs tested. Based on the conclusions in Chapter 2, we applied a genomic prediction with imputed WGS in Chapter 3. A commercial brown layer line comprising of 892 chickens from 6 generations was used in the study. These chickens were genotyped with a high density array data. Using the WGS data of 25 individuals, those array data were imputed up to the sequence level. The imputation was done with Minimac3, which needs pre-phased data generated with Beagle4. Accuracy of genomic prediction was measured as the correlation between de-regressed proofs and direct genomic breeding values of eggshell strength, feed intake and laying rate. In this study, besides the accuracy of genomic prediction based on array data and WGS data, accuracy based on different genomic relationship matrices to account for genetic architecture was investigated. The alternative weighting factors used were uniform, -(〖log〗_10 P) from a t-test of genome wide association study, and the square of estimated SNP effects from random regression BLUP. Best linear unbiased prediction given genetic architecture (BLUP|GA) was investigated as well. Prediction with uniform weights (the original GBLUP) was implemented with all SNPs or with only genic SNPs, both based on array and imputed whole sequence data. Averaging over the studied traits, predictive ability with only genic SNPs in WGS data was 0.366 ± 0.075, which was the highest predictive ability observed in the current study. Genomic prediction with genic SNPs in high density array data provided the second highest accuracy (0.361 ± 0.072). The prediction with -(〖log〗_10 P) or squares of SNP effects as weighting factors for building a genomic relationship matrix or BLUP|GA did not lead to higher accuracy, compared to that with uniform weights, regardless of the SNP set used. The results from this study showed that little or no benefit was gained when using all imputed WGS data to perform genomic prediction compared to using HD array data, regardless of the different SNP weightings tested. However, higher predictive ability was observed when using only genic SNPs extracted from the WGS data for genomic prediction. Decisions of genomic selection schemes are made based on the genomic breeding values (GBV) of selection candidates. Thus, the accuracy of GBV is a relevant parameter, as it reflects the stability of the prediction and the possibility that the GBV might change when more information becomes available. It is also one of the key factors in expected response to selection, which is also known as breeders’ equation. Accuracy of genomic prediction, however, is difficult to assess, considering true breeding values (TBV) of the candidates are not available in reality. In previous studies, several methods are proposed to assess the accuracy of GBV by using population and trait parameters (e.g. the effective population size, the reliability of quasi-phenotypes used, the number of independent chromosome segments) or parameters inferred from the mixed model equations. In practice, most approaches were found to overestimate the accuracy of genomic prediction. Thus, in Chapter 4 we tested several approaches used in previous studies based on simulated data under a variety of parameters mimicking different livestock breeding programs (i.e. a cattle-like and a pig-like as well as a basic scenario) and measured the magnitude of overestimation. Then we proposed a novel and computationally feasible method. Based on the comparison in Chapter 4, the new method provided a better prediction for the accuracy of GBV. The method still had one unknown parameter, for which we suggested an approach to approximate its value from a suitable data set reflecting two separate time points. In conclusion, the new approach provided a better assessment of the accuracy of GBVs in many cases

    Detecting Genotype-Population Interaction Effects by Ancestry Principal Components

    Get PDF
    Heterogeneity in the phenotypic mean and variance across populations is often observed for complex traits. One way to understand heterogeneous phenotypes lies in uncovering heterogeneity in genetic effects. Previous studies on genetic heterogeneity across populations were typically based on discrete groups in populations stratified by different countries or cohorts, which ignored the difference of population characteristics for the individuals within each group and resulted in loss of information. Here, we introduce a novel concept of genotype-by-population (G × P) interaction where population is defined by the first and second ancestry principal components (PCs), which are less likely to be confounded with country/cohort-specific factors. We applied a reaction norm model fitting each of 70 complex traits with significant SNP-heritability and the PCs as covariates to examine G × P interactions across diverse populations including white British and other white Europeans from the UK Biobank (N = 22,229). Our results demonstrated a significant population genetic heterogeneity for behavioral traits such as age at first sexual intercourse and academic qualification. Our approach may shed light on the latent genetic architecture of complex traits that underlies the modulation of genetic effects across different populations

    The genetic relationship between female reproductive traits and six psychiatric disorders

    Get PDF
    Female reproductive behaviours have important implications for evolutionary fitness and health of offspring. Here we used the second release of UK Biobank data (N = 220,685) to evaluate the association between five female reproductive traits and polygenic risk scores (PRS) projected from genome-wide association study summary statistics of six psychiatric disorders (N = 429,178). We found that the PRS of attention-deficit/hyperactivity disorder (ADHD) were strongly associated with age at first birth (AFB) (genetic correlation of -0.68 ± 0.03), age at first sexual intercourse (AFS) (-0.56 ± 0.03), number of live births (NLB) (0.36 ± 0.04) and age at menopause (-0.27 ± 0.04). There were also robustly significant associations between the PRS of eating disorder (ED) and AFB (0.35 ± 0.06), ED and AFS (0.19 ± 0.06), major depressive disorder (MDD) and AFB (-0.27 ± 0.07), MDD and AFS (-0.27 ± 0.03) and schizophrenia and AFS (-0.10 ± 0.03). These associations were mostly explained by pleiotropic effects and there was little evidence of causal relationships. Our findings can potentially help improve reproductive health in women, hence better child outcomes. Our findings also lend partial support to the evolutionary hypothesis that causal mutations underlying psychiatric disorders have positive effects on reproductive success

    Genotype-covariate correlation and interaction disentangled by a whole-genome multivariate reaction norm model

    Get PDF
    The genomics era has brought useful tools to dissect the genetic architecture of complex traits. Here we propose a multivariate reaction norm model (MRNM) to tackle genotype-covariate (G-C) correlation and interaction problems. We apply MRNM to the UK Biobank data in analysis of body mass index using smoking quantity as a covariate, finding a highly significant G-C correlation, but only weak evidence for G-C interaction. In contrast, G-C interaction estimates are inflated in existing methods. It is also notable that there is significant heterogeneity in the estimated residual variances (i.e., variances not attributable to factors in the model) across different covariate levels, i.e., residual-covariate (R-C) interaction. We also show that the residual variances estimated by standard additive models can be inflated in the presence of G-C and/or R-C interactions. We conclude that it is essential to correctly account for both interaction and correlation in complex trait analyses

    Cellular heterogeneity of pluripotent stem cell-derived cardiomyocyte grafts is mechanistically linked to treatable arrhythmias

    Full text link
    Preclinical data have confirmed that human pluripotent stem cell-derived cardiomyocytes (PSC-CMs) can remuscularize the injured or diseased heart, with several clinical trials now in planning or recruitment stages. However, because ventricular arrhythmias represent a complication following engraftment of intramyocardially injected PSC-CMs, it is necessary to provide treatment strategies to control or prevent engraftment arrhythmias (EAs). Here, we show in a porcine model of myocardial infarction and PSC-CM transplantation that EAs are mechanistically linked to cellular heterogeneity in the input PSC-CM and resultant graft. Specifically, we identify atrial and pacemaker-like cardiomyocytes as culprit arrhythmogenic subpopulations. Two unique surface marker signatures, signal regulatory protein α (SIRPA)+CD90−CD200+ and SIRPA+CD90−CD200−, identify arrhythmogenic and non-arrhythmogenic cardiomyocytes, respectively. Our data suggest that modifications to current PSC-CM-production and/or PSC-CM-selection protocols could potentially prevent EAs. We further show that pharmacologic and interventional anti-arrhythmic strategies can control and potentially abolish these arrhythmias

    Age at first birth in women is genetically associated with increased risk of schizophrenia

    Get PDF
    Prof. Paunio on PGC:n jäsenPrevious studies have shown an increased risk for mental health problems in children born to both younger and older parents compared to children of average-aged parents. We previously used a novel design to reveal a latent mechanism of genetic association between schizophrenia and age at first birth in women (AFB). Here, we use independent data from the UK Biobank (N = 38,892) to replicate the finding of an association between predicted genetic risk of schizophrenia and AFB in women, and to estimate the genetic correlation between schizophrenia and AFB in women stratified into younger and older groups. We find evidence for an association between predicted genetic risk of schizophrenia and AFB in women (P-value = 1.12E-05), and we show genetic heterogeneity between younger and older AFB groups (P-value = 3.45E-03). The genetic correlation between schizophrenia and AFB in the younger AFB group is -0.16 (SE = 0.04) while that between schizophrenia and AFB in the older AFB group is 0.14 (SE = 0.08). Our results suggest that early, and perhaps also late, age at first birth in women is associated with increased genetic risk for schizophrenia in the UK Biobank sample. These findings contribute new insights into factors contributing to the complex bio-social risk architecture underpinning the association between parental age and offspring mental health.Peer reviewe

    Estimation of Genetic Correlation via Linkage Disequilibrium Score Regression and Genomic Restricted Maximum Likelihood

    Get PDF
    J. Lönnqvist on työryhmän Psychiat Genomics Consortium jäsen.Genetic correlation is a key population parameter that describes the shared genetic architecture of complex traits and diseases. It can be estimated by current state-of-art methods, i.e., linkage disequilibrium score regression (LDSC) and genomic restricted maximum likelihood (GREML). The massively reduced computing burden of LDSC compared to GREML makes it an attractive tool, although the accuracy (i.e., magnitude of standard errors) of LDSC estimates has not been thoroughly studied. In simulation, we show that the accuracy of GREML is generally higher than that of LDSC. When there is genetic heterogeneity between the actual sample and reference data from which LD scores are estimated, the accuracy of LDSC decreases further. In real data analyses estimating the genetic correlation between schizophrenia (SCZ) and body mass index, we show that GREML estimates based on similar to 150,000 individuals give a higher accuracy than LDSC estimates based on similar to 400,000 individuals (from combinedmeta-data). A GREML genomic partitioning analysis reveals that the genetic correlation between SCZ and height is significantly negative for regulatory regions, which whole genome or LDSC approach has less power to detect. We conclude that LDSC estimates should be carefully interpreted as there can be uncertainty about homogeneity among combined meta-datasets. We suggest that any interesting findings from massive LDSC analysis for a large number of complex traits should be followed up, where possible, with more detailed analyses with GREML methods, even if sample sizes are lesser.Peer reviewe

    Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken.

    No full text
    The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract single nucleotide polymorphisms (SNPs) out of the whole-genome sequence. Often, only a few individuals of a population are sequenced completely and imputation is used to obtain genotypes for all sequence-based SNP loci for other individuals, which have been genotyped for a subset of SNPs using a genotyping array.First, we compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotype per SNP and per individual, and genotype conflict between father-progeny pairs) when imputing from high density SNP array data to whole-genome sequence using data from around 1000 individuals from six different generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios.There were 1,741,573 SNPs detected by all three callers on the studied chromosomes 3, 6, and 28, which was 71.6 % (81.6 %, 88.0 %) of SNPs detected by GATK (SAMtools, freebayes) in total. Genotype concordance (GC) defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array were 0.98 (GATK), 0.97 (freebayes) and 0.98 (SAMtools). Furthermore, the percentage of variants that had high values (>0.9) for another three measures (non-reference sensitivity, non-reference genotype concordance and precision) were 90 (88, 75) for GATK (SAMtools, freebayes). With all imputation programs, correlation between original and imputed genotypes was >0.95 on average with randomly masked 1000 SNPs from the SNP array and >0.85 for a leave-one-out cross-validation within sequenced individuals.Performance of all variant callers studied was very good in general, particularly for GATK and SAMtools. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency, while it had lowest numbers in Mendelian conflicts in available father-progeny pairs. Correlations of real and imputed genotypes remained constantly high even if individuals to be imputed were several generations away from the sequenced individuals
    corecore