Impact of imputation methods on the amount of genetic variation captured by a single-nucleotide polymorphism panel in soybeans
Background
Success in genome-wide association studies and marker-assisted selection depends on good phenotypic and genotypic data. The more complete these data are, the more powerful the results of the analysis will be. Nevertheless, some next-generation technologies seek to provide genotypic information despite large proportions of missing data. The procedures used to impute these genetic data therefore greatly affect downstream analyses. This study aims to (1) compare the genetic variance in a single-nucleotide polymorphism panel of soybean with missing data imputed using various methods, (2) evaluate the imputation accuracy and post-imputation quality associated with these methods, and (3) evaluate the impact of imputation method on heritability and the accuracy of genome-wide prediction of soybean traits. The imputation methods we evaluated were as follows: multivariate mixed model, hidden Markov model, logical algorithm, k-nearest neighbor, singular value decomposition, and random forest. We used raw genotypes from the SoyNAM project and the following phenotypes: plant height, days to maturity, grain yield, and seed protein composition.
Results
We propose an imputation method based on multivariate mixed models using pedigree information. Our comparison of methods indicates that the heritability of traits can be affected by the imputation method. Genotypes with missing values imputed with methods that make use of genealogical information can favor genetic analysis of highly polygenic traits, but not genome-wide prediction accuracy. The genotypic matrix captured the highest amount of genetic variance when missing loci were imputed by the method proposed in this paper.
Conclusions
We concluded that hidden Markov models and random forest imputation are more suitable for studies that aim to analyze highly heritable traits, while pedigree-based methods are best suited to analyzing traits with low heritability. Despite the notable contribution to heritability, changing the imputation method did not improve genomic prediction. We identified significant differences across imputation methods in a dataset missing 20% of the genotypic values. This means that genotypic data from genotyping technologies that yield a high proportion of missing values, such as GBS, should be handled carefully because the imputation method will affect downstream analysis.
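As a rough illustration of how imputation accuracy can be estimated, the sketch below hides a fraction of observed genotype calls, imputes them with a simple k-nearest-neighbor vote, and scores the hidden calls. It is a minimal example under stated assumptions, not the procedure used in the study: the 0/1/2 coding, the distance measure, and the function names `knn_impute` and `masked_accuracy` are all hypothetical.

```python
import random

def knn_impute(geno, k=3):
    """Fill missing calls (None) using the k most similar lines.

    geno: list of lines, each a list of 0/1/2 allele counts or None.
    The coding and the distance below are illustrative assumptions.
    """
    def distance(a, b):
        # Mean absolute difference over markers observed in both lines.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum(abs(x - y) for x, y in shared) / len(shared)

    imputed = [row[:] for row in geno]
    for i, row in enumerate(geno):
        # Rank every other line by similarity to line i.
        neighbors = sorted((j for j in range(len(geno)) if j != i),
                           key=lambda j: distance(row, geno[j]))
        for m, call in enumerate(row):
            if call is not None:
                continue
            # Take the k nearest lines that have an observed call at marker m.
            votes = [geno[j][m] for j in neighbors if geno[j][m] is not None][:k]
            if votes:
                # Majority vote; ties go to the smaller allele count.
                imputed[i][m] = max(sorted(set(votes)), key=votes.count)
    return imputed

def masked_accuracy(geno, rate=0.2, k=3, seed=0):
    """Hide a fraction of observed calls, impute, and score the hidden ones."""
    rng = random.Random(seed)
    observed = [(i, m) for i, row in enumerate(geno)
                for m, call in enumerate(row) if call is not None]
    hidden = rng.sample(observed, max(1, int(rate * len(observed))))
    masked = [row[:] for row in geno]
    for i, m in hidden:
        masked[i][m] = None
    imputed = knn_impute(masked, k)
    hits = sum(imputed[i][m] == geno[i][m] for i, m in hidden)
    return hits / len(hidden)
```

Calling `masked_accuracy(genotypes, rate=0.2)` mirrors the 20% missing rate mentioned in the abstract, although the masking scheme here is purely random and may not reflect how data go missing in GBS.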
Analysis of nucleotide and structural variation in soybean using a resequencing approach
Next-generation sequencing (NGS) has revolutionized plant and animal research in several ways, including through the development of new high-throughput genotyping methods that considerably accelerate the study of genome composition and function. Within the SoyaGen project, funded by Genome Canada, we seek to better understand the genetic diversity and underlying architecture governing the main agronomic traits in soybean. Soybean is the most important oilseed crop in the world in economic terms. In this study, we sought to exploit NGS technologies to help elucidate the genomic characteristics of soybean. To this end, three research axes formed the core of this thesis: 1) low-cost genome-wide genotyping, 2) exhaustive characterization of genetic variants through whole-genome resequencing, and 3) identification of mutations with a strong functional impact on the basis of strong selection within elite lines. A first challenge in genetic or genomic analysis is to enable rapid and inexpensive characterization of a large number of lines at a very large number of markers distributed across the whole genome. Genotyping-by-sequencing (GBS) allows the simultaneous identification and genotyping of several thousand SNPs at the genome scale. One of the major challenges in GBS analysis is to extract a large catalogue of high-quality SNPs from a mountain of sequencing data and to minimize the impact of missing data. In a first step, we greatly improved GBS by developing a new bioinformatics analysis pipeline, Fast-GBS, designed to produce more accurate and faster genotype calls than existing tools. In addition, we optimized tools for imputing missing data.
We were thus able to obtain a catalogue of 60K SNP markers within a collection of 301 accessions intended to be representative of soybean diversity in Canada. Subsequently, all missing data (~50%) were imputed with a very high degree of accuracy (98%). This genetic characterization was achieved at a modest cost of less than $15 per line. Second, to fully characterize the nucleotide and structural variations (SNV and SV, respectively) in the soybean genome, we sequenced the whole genome of 102 Canadian soybean accessions. We identified nearly 5M nucleotide variants (SNPs, MNPs, and indels) with a high level of accuracy (98.6%). Then, using a combination of three different approaches, we detected ~92K SVs (deletions, insertions, inversions, duplications, CNVs, and translocations) and estimated that more than 90% were accurate. This is the first time that a complete description of the diversity of SNP and SV haplotypes has been carried out in a cultivated species. Finally, we developed a systematic analytical approach to greatly facilitate the identification of genes whose alleles have undergone very strong selection during domestication and selection. This approach is based on two recent advances in genomics: (1) whole-genome sequencing and (2) prediction of mutations resulting in loss of function (LOF). Using this approach, we identified 130 candidate genes related to domestication or selection in soybean. This catalogue contains all of the previously well-characterized domestication genes in soybean, as well as some orthologues from other domesticated crop species. This list of genes provides many avenues of investigation for studies aimed at better understanding the genes that contribute strongly to shaping cultivated soybeans. This thesis ultimately leads to a better understanding of the genomic characteristics of soybeans.
In addition, it provides several tools and genomic resources that could easily be used in future genomic research in soybeans as well as in other species.
Learning from data: Plant breeding applications of machine learning
Increasingly, new sources of data are being incorporated into plant breeding pipelines. Enormous amounts of data from field phenomics and genotyping technologies place data mining and analysis on a completely different level that is challenging from practical and theoretical standpoints. Intelligent decision-making relies on our capability to extract useful information from data that may help us achieve our goals more efficiently. Many plant breeders, agronomists, and geneticists perform analyses without knowing the relevant underlying assumptions, strengths, or pitfalls of the methods employed. This study endeavors to assess the statistical learning properties and plant breeding applications of supervised and unsupervised machine learning techniques. A soybean nested association panel (a.k.a. SoyNAM) was the base population for experiments designed in situ and in silico. We used mixed models and Markov random fields to evaluate phenotypic-genotypic-environmental associations among traits and the learning properties of genome-wide prediction methods. Alternative methods for analyses were proposed.
Normalizing Flows for Knockoff-free Controlled Feature Selection
Controlled feature selection aims to discover the features a response depends
on while limiting the false discovery rate (FDR) to a predefined level.
Recently, multiple deep-learning-based methods have been proposed to perform
controlled feature selection through the Model-X knockoff framework. We
demonstrate, however, that these methods often fail to control the FDR for two
reasons. First, these methods often learn inaccurate models of features.
Second, the "swap" property, which is required for knockoffs to be valid, is
often not well enforced. We propose a new procedure called FlowSelect that
remedies both of these problems. To more accurately model the features,
FlowSelect uses normalizing flows, the state-of-the-art method for density
estimation. To circumvent the need to enforce the swap property, FlowSelect
uses a novel MCMC-based procedure to calculate p-values for each feature
directly. Asymptotically, FlowSelect computes valid p-values. Empirically,
FlowSelect consistently controls the FDR on both synthetic and semi-synthetic
benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also
demonstrates greater power on these benchmarks. Additionally, FlowSelect
correctly infers the genetic variants associated with specific soybean traits
from GWAS data.
Comment: 20 pages, 9 figures, 3 tables
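Once a p-value is available for each feature, a standard way to limit the FDR is the Benjamini-Hochberg step-up rule. The sketch below shows that rule only; the abstract does not state which selection rule FlowSelect applies to its p-values, so the use of Benjamini-Hochberg here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return the sorted indices of features selected at the target FDR level.

    Assumption: a plain Benjamini-Hochberg step-up rule, used for illustration.
    """
    m = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Example: three of five features pass at a 10% FDR target.
selected = benjamini_hochberg([0.001, 0.8, 0.012, 0.04, 0.6], fdr=0.1)
print(selected)  # [0, 2, 3]
```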
Single Nucleotide Polymorphisms (SNPs) in Plant Genetics and Breeding
Recent advances in genome technology have revealed numerous single nucleotide polymorphisms (SNPs), the most common form of DNA sequence variation between alleles, in several plant species. The discovery and application of SNPs have increased our knowledge of genetic diversity and our understanding of crop improvement. The natural breeding process, which takes a very long time through collection, cultivation, and domestication, has been accelerated by detecting dozens of SNPs in various species using advanced biotechnological techniques such as next-generation sequencing. This will result in the improvement of economically important traits. We therefore focus on the discovery, current technologies, and applications of SNPs in breeding. The chapter covers the following topics: (1) introduction, (2) application of SNPs, (3) techniques to detect SNPs, (4) importance of SNPs for crop improvement, and (5) conclusion.
Genomic and Physiological Approaches to Improve Drought Tolerance in Soybean
Drought stress is a major global constraint for crop production, and improving crop tolerance to drought is of critical importance. Direct selection of drought tolerance among genotypes for yield is limited because of low heritability, polygenic control, epistasis effects, and genotype by environment interactions. Crop physiology can play a major role in improving drought tolerance through the identification of traits associated with drought tolerance that can be used as indirect selection criteria in a breeding program. Carbon isotope ratio (δ13C, associated with water use efficiency), oxygen isotope ratio (δ18O, associated with transpiration), canopy temperature (CT), canopy wilting, and canopy coverage (CC) are promising physiological traits associated with improvement of drought tolerance. Genome-wide association studies (GWAS) are one of the genomic approaches that provide high mapping resolution for complex trait variation such as that related to drought tolerance. The objectives of this research were to identify genomic regions and favorable alleles that contribute to drought-tolerant traits. A diverse panel consisting of 373 maturity group (MG) IV soybean accessions was evaluated for δ13C, δ18O, canopy wilting, canopy coverage, and canopy temperature in multiple environments. A set of 31,260 polymorphic SNPs with a minor allele frequency (MAF) ≥ 5% was used for association mapping of CT using the FarmCPU model. Association mapping identified 54 significant SNPs associated with δ13C, 47 significant SNPs associated with δ18O, 61 significant SNPs associated with canopy wilting, 41 and 56 significant SNPs associated with CC for the first and second measurement dates, respectively, and 52 significant SNPs associated with CT.
Several genes were identified using these significant SNPs, and those genes had reported functions related to transpiration, water transport, growth, development, root development, response to abscisic acid stimulus, and stomatal complex morphogenesis. Favorable alleles from significant SNPs may be an important resource for pyramiding genes to improve drought tolerance and for identifying parental genotypes for use in breeding programs.
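The MAF ≥ 5% filter mentioned above can be sketched as follows. This is a minimal example assuming diploid genotypes coded as 0, 1, or 2 copies of the alternate allele, with `None` for missing calls; the coding and the function names are illustrative assumptions, not details taken from the study.

```python
def minor_allele_frequency(calls):
    """MAF of one diploid SNP coded as 0/1/2 copies of the alternate allele."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    # Each diploid accession carries two alleles at the locus.
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)

def filter_by_maf(snps, threshold=0.05):
    """Keep SNPs (name -> list of calls) whose MAF meets the threshold."""
    return {name: calls for name, calls in snps.items()
            if minor_allele_frequency(calls) >= threshold}
```

A monomorphic marker has a MAF of 0 and is dropped by any positive threshold, which is why such a filter also leaves only polymorphic SNPs.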
Genomic selection and quantitative trait loci mapping for Fusarium head blight resistance in wheat (Triticum aestivum L.)
Fusarium head blight (FHB) is a destructive disease of wheat (Triticum aestivum L.) occurring in most growing areas. In North America, the disease is primarily caused by Fusarium graminearum Schwabe [teleomorph: Gibberella zeae (Schw.) Petch], and the majority of current wheat cultivars are susceptible to it. Significant economic losses are associated with FHB since it results in yield reduction, poor kernel quality, and grain contamination by mycotoxins, such as deoxynivalenol. Resistance to FHB has been identified in the wheat gene pool, but breeding for it remains a challenge for several reasons, including the complex genetic control of resistance and poor adaptability of the traditional sources. In this context, molecular markers could contribute to the identification of genomic regions associated with FHB resistance. In addition, markers could be used to calculate breeding values for wheat lines for traits related to disease resistance. In the first study of this dissertation, genomic selection (GS) models were compared for predicting traits associated with resistance to FHB, using 273 breeding lines in use at the University of Illinois’ soft red winter wheat breeding program. Genotyping-by-sequencing (GBS) was used to identify 5,054 single nucleotide polymorphisms (SNPs), which were then treated as predictor variables in GS analysis.
Different parameters affecting the prediction accuracy of the genomic estimated breeding values (GEBVs) were tested, including: i) five genotypic imputation methods (random forest imputation – RFI, expectation maximization imputation – EMI, k-nearest neighbor imputation – KNNI, singular value decomposition imputation – SVDI, and mean imputation – MNI); ii) three statistical models (ridge regression best linear unbiased predictor – RR-BLUP, least absolute shrinkage and selection operator – LASSO, and elastic net); iii) marker density (p = 500, 1500, 3000, and 4500 SNPs); and iv) training population size (nTP = 96, 144, 192, and 218). No discernible differences in prediction accuracy were observed among imputation methods. For five of six traits, RR-BLUP outperformed the other statistical models (LASSO and elastic net), and a significant reduction in prediction accuracy was observed when marker number decreased to 3000 or 1500 SNPs, depending on the trait. Lastly, prediction accuracies decreased significantly when the sample size of the training set was less than 192. The second study consisted of a genome-wide association study (GWAS) performed on the same panel used in the first study. A total of 19,992 SNPs were obtained with GBS, and ten significant SNP-trait associations were detected for multiple parameters associated with FHB resistance on chromosomes 4A, 6A, 7A, 1D, 4D, and 7D; in addition, multiple SNPs were associated with Fhb-1 on chromosome 3B. Fhb-1 is a major-effect QTL identified in China, and it is very popular among wheat breeders worldwide. The genomic region on chromosome 6A appears to be new, as no other study has reported a QTL for that region. In addition, combining favorable alleles of these SNPs resulted in lower levels of disease. The third study compared marker-assisted selection (MAS) with GS using different sets of genotypic data, including the QTL identified in the second study and Fhb-1.
GS greatly outperformed MAS, with cross-validated prediction accuracy varying from 0.24 to 0.74 and from 0.59 to 0.98 for MAS and GS, respectively. Treating QTL as fixed effects in GS models resulted in higher prediction accuracy when compared with a GS model with only random effects. For the same selection intensity, GS resulted in higher selection differentials than MAS for all traits. This study indicates that GS is a more appropriate strategy than MAS for FHB resistance. The last study of this dissertation was a linkage mapping study using a population of 233 recombinant inbred lines obtained from IL97-181 (resistant) and Clark (susceptible). Neither parent possesses the traditional Asian sources of resistance to FHB in its pedigree. A total of 2275 single nucleotide polymorphisms (SNPs) were detected using genotyping-by-sequencing (GBS), and a genetic map was built covering all 21 wheat chromosomes. Inclusive composite interval mapping (ICIM) analysis identified four genomic regions associated with multiple FHB parameters across all environments. Four QTL were detected for FHB resistance under field conditions on chromosomes 1B, 2D, 6D, and 7B. Two QTL were associated with type I resistance (6D and 7B), and two were associated with type II resistance (1B and 2D). The percentage of the phenotypic variation explained by these QTL varied between 6.7 and 12.5%. For QTL on sub-genome B, intervals smaller than 2 cM were obtained. The results show that elite germplasm can contribute to FHB resistance.
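Cross-validated prediction accuracy, as reported above, is commonly computed as the correlation between observed phenotypes and out-of-fold predictions. The sketch below does this with a plain ridge regression on marker codes, which is closely related to RR-BLUP when the shrinkage parameter is fixed in advance. It is an illustration under assumptions, not the dissertation's analysis: the fixed `lam`, the fold assignment, the -1/0/1 marker coding, and all function names are hypothetical, and `simulate` only produces toy data.

```python
import random

def ridge_effects(Z, y, lam=1.0):
    """Solve (Z'Z + lam*I) u = Z'y for marker effects u by Gaussian elimination."""
    n, p = len(Z), len(Z[0])
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    rhs = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(p)]
    for col in range(p):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    u = [0.0] * p
    for r in reversed(range(p)):
        # Back substitution on the upper-triangular system.
        u[r] = (rhs[r] - sum(A[r][c] * u[c] for c in range(r + 1, p))) / A[r][r]
    return u

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def cv_accuracy(Z, y, folds=5, lam=1.0):
    """Correlation between observed phenotypes and out-of-fold predictions."""
    n = len(y)
    predicted = [0.0] * n
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        # Only the phenotype is centered; the training mean acts as intercept.
        mean_y = sum(y[i] for i in train) / len(train)
        u = ridge_effects([Z[i] for i in train],
                          [y[i] - mean_y for i in train], lam)
        for i in test:
            predicted[i] = mean_y + sum(z * e for z, e in zip(Z[i], u))
    return pearson(predicted, y)

def simulate(n, effects, seed=1):
    """Toy data: random -1/0/1 marker codes and noise-free additive phenotypes."""
    rng = random.Random(seed)
    Z = [[rng.choice((-1, 0, 1)) for _ in effects] for _ in range(n)]
    y = [sum(z * e for z, e in zip(row, effects)) for row in Z]
    return Z, y
```

For example, `Z, y = simulate(60, [1.0, -0.5, 0.8, 0.3, -1.2])` followed by `cv_accuracy(Z, y, lam=0.1)` gives an accuracy close to 1 because the toy phenotypes contain no noise; real traits would give lower values.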
Searching for the Genetic Basis of Hygienic Behavior and Overwintering in the Honeybee (Apis mellifera)
The recent decline in honeybee populations can be mitigated through genomics and marker-assisted selection. Current techniques, such as chemical treatment to prevent disease, are only short-term solutions. The ability to breed honeybees that are disease-resistant and winter-hardy would be ideal. Current breeding techniques lack knowledge of predictive markers that may improve these traits. Here we perform a genome-wide association study on 925 colonies by measuring the hygienic and overwintering behavior of the colonies, followed by sequencing their genomes. L1 regression is a technique developed to pick the best Single Nucleotide Polymorphisms that explain the variance in the phenotype. Using L1 regression, we found 27 Single Nucleotide Polymorphisms for hygiene and 32 Single Nucleotide Polymorphisms for overwintering behavior that could be used to breed for healthier and winter-hardy honeybees.
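L1 regression selects markers because its penalty shrinks small effects exactly to zero. The sketch below illustrates this with a basic coordinate-descent solver; it is an assumption-laden example rather than the study's pipeline, and the penalty value, the number of sweeps, and the function names are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso(X, y, penalty, sweeps=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + penalty*||b||_1.

    X: list of samples, each a list of marker codes; y: centered phenotypes.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # Equals y - X b, with b starting at zero.
    norms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if norms[j] == 0.0:
                continue  # A constant-zero column carries no information.
            # Covariance of column j with the residual excluding its own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, penalty) / norms[j]
            if new != beta[j]:
                for i in range(n):
                    residual[i] += X[i][j] * (beta[j] - new)
                beta[j] = new
    return beta

def selected_markers(beta):
    """Indices of markers with a non-zero estimated effect."""
    return [j for j, b in enumerate(beta) if b != 0.0]
```

Raising `penalty` selects fewer markers; in practice the value is usually chosen by cross-validation, which this sketch leaves out.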