86 research outputs found

    Impact of imputation methods on the amount of genetic variation captured by a single-nucleotide polymorphism panel in soybeans

    Background: Success in genome-wide association studies and marker-assisted selection depends on good phenotypic and genotypic data. The more complete these data are, the more powerful the results of the analysis. Nevertheless, there are next-generation technologies that seek to provide genotypic information in spite of large proportions of missing data. The procedures these technologies use to impute genetic data, therefore, greatly affect downstream analyses. This study aims to (1) compare the genetic variance in a single-nucleotide polymorphism panel of soybean with missing data imputed using various methods, (2) evaluate the imputation accuracy and post-imputation quality associated with these methods, and (3) evaluate the impact of imputation method on heritability and the accuracy of genome-wide prediction of soybean traits. The imputation methods we evaluated were as follows: multivariate mixed model, hidden Markov model, logical algorithm, k-nearest neighbor, singular value decomposition, and random forest. We used raw genotypes from the SoyNAM project and the following phenotypes: plant height, days to maturity, grain yield, and seed protein composition. Results: We propose an imputation method based on multivariate mixed models using pedigree information. Our comparison of methods indicates that the heritability of traits can be affected by the imputation method. Genotypes with missing values imputed with methods that make use of genealogic information can favor genetic analysis of highly polygenic traits, but not genome-wide prediction accuracy. The genotypic matrix captured the highest amount of genetic variance when missing loci were imputed by the method proposed in this paper. Conclusions: We concluded that hidden Markov models and random forest imputation are more suitable for studies that aim to analyze highly heritable traits, while pedigree-based methods can be used to best analyze traits with low heritability. 
Despite the notable contribution to heritability, advantages in genomic prediction were not observed by changing the imputation method. We identified significant differences across imputation methods in a dataset missing 20 % of the genotypic values. This means that genotypic data from genotyping technologies that provide a high proportion of missing values, such as GBS, should be handled carefully because the imputation method will impact downstream analysis.
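
    One common way to estimate imputation accuracy, as mentioned in the abstract above, is to hide a share of the known genotype calls, impute them, and count how many hidden calls are recovered. The sketch below illustrates that idea only. The toy genotype matrix, the 0/2 coding, the function names, and the per-marker mode imputation are all assumptions made for illustration; mode imputation is a simple baseline and is not one of the methods the study compared.

```python
import random
from collections import Counter

MISSING = None  # placeholder used here for a missing genotype call


def mask_genotypes(matrix, fraction, rng):
    """Return a copy with `fraction` of the observed calls hidden, plus their coordinates."""
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, g in enumerate(row) if g is not MISSING]
    hidden = rng.sample(observed, round(fraction * len(observed)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = MISSING
    return masked, hidden


def impute_mode(matrix):
    """Fill each missing call with the most common observed genotype at that marker."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        calls = [row[j] for row in matrix if row[j] is not MISSING]
        mode = Counter(calls).most_common(1)[0][0] if calls else 0
        for row in filled:
            if row[j] is MISSING:
                row[j] = mode
    return filled


def imputation_accuracy(truth, imputed, hidden):
    """Share of hidden calls that the imputation recovered exactly."""
    correct = sum(truth[i][j] == imputed[i][j] for i, j in hidden)
    return correct / len(hidden)


rng = random.Random(0)
# Hypothetical panel: 40 inbred lines x 50 markers, coded 0 or 2.
truth = [[rng.choice([0, 0, 0, 2]) for _ in range(50)] for _ in range(40)]
masked, hidden = mask_genotypes(truth, 0.20, rng)  # hide 20 % of the calls
imputed = impute_mode(masked)
accuracy = imputation_accuracy(truth, imputed, hidden)
print(f"hidden calls: {len(hidden)}, accuracy: {accuracy:.2f}")
```

    Swapping `impute_mode` for another imputation routine while keeping the same masked cells gives a like-for-like comparison between methods.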

    Analyse de la variation nucléotidique et structurale chez le soja par une approche de re-séquençage

    Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways, including through the development of new high-throughput genotyping methods that considerably accelerate the study of genome composition and function. As part of the SoyaGen project, funded by Genome Canada, we are seeking to better understand the genetic diversity and underlying architecture governing major agronomic traits in soybeans. Soybean is the world's largest oilseed crop in economic terms. In this study, we sought to exploit NGS technologies to help elucidate the genomic characteristics of soybeans. To this end, three main research topics have formed the core of this thesis: 1) low-cost genome-wide genotyping, 2) exhaustive characterization of genetic variants by whole-genome resequencing, and 3) identification of mutations with high functional impact on the basis of strong selection within elite lines. A first challenge in genetic or genomic analysis is to make possible a rapid and inexpensive characterization of a large number of lines with a very large number of markers distributed throughout the genome. Genotyping-by-sequencing (GBS) allows simultaneous identification and genotyping of several thousand SNPs on a genome-wide scale.
One of the major challenges in GBS analysis is to extract a large catalog of high-quality SNPs from a mountain of sequencing data and to minimize the impact of missing data. As a first step, we greatly improved GBS by developing a new bioinformatics analysis pipeline, Fast-GBS, designed to call genotypes more accurately and faster than existing tools. In addition, we optimized tools for imputing missing data. We were thus able to obtain a catalog of 60K SNP markers from a collection of 301 accessions intended to be representative of soybean diversity in Canada. Subsequently, all missing data (~50%) were imputed with a very high degree of accuracy (98%). This genetic characterization was performed at a low cost, less than $15 per line. Second, to fully characterize the nucleotide and structural variations (SNV and SV, respectively) in the soybean genome, we sequenced the whole genome of 102 Canadian soybean accessions. We identified nearly 5M nucleotide variants (SNPs, MNPs and indels) with a high level of accuracy (98.6%). Then, using a combination of three different approaches, we detected ~92K SVs (deletions, insertions, inversions, duplications, CNVs and translocations) and estimated that more than 90% were accurate. This is the first time that a complete description of the diversity of SNP and SV haplotypes has been carried out in a cultivated species. Finally, we developed a systematic analytical approach to greatly facilitate the identification of genes whose alleles have undergone very strong selection during domestication and breeding. This approach is based on two recent advances in genomics: (1) whole-genome sequencing and (2) predicting mutations resulting in loss of function (LOF). Using this approach, we identified 130 candidate genes related to domestication or selection in soybean. 
This catalogue contains all of the previously well-characterized domestication genes in soybean, as well as some orthologues from other domesticated crop species. This list of genes provides many avenues of investigation for studies aimed at better understanding the genes that contribute strongly to shaping cultivated soybeans. This thesis ultimately leads to a better understanding of the genomic characteristics of soybeans. In addition, it provides several tools and genomic resources that could easily be used in future genomic research in soybeans as well as in other species.

    Learning from data: Plant breeding applications of machine learning

    Increasingly, new sources of data are being incorporated into plant breeding pipelines. Enormous amounts of data from field phenomics and genotyping technologies place data mining and analysis on a completely different level that is challenging from practical and theoretical standpoints. Intelligent decision-making relies on our capability to extract useful information from data that may help us achieve our goals more efficiently. Many plant breeders, agronomists and geneticists perform analyses without knowing the relevant underlying assumptions, strengths or pitfalls of the methods employed. This study endeavors to assess the statistical learning properties and plant breeding applications of supervised and unsupervised machine learning techniques. A soybean nested association panel (a.k.a. SoyNAM) was the base population for experiments designed in situ and in silico. We used mixed models and Markov random fields to evaluate phenotypic-genotypic-environmental associations among traits and the learning properties of genome-wide prediction methods. Alternative methods for analyses were proposed.

    Normalizing Flows for Knockoff-free Controlled Feature Selection

    Controlled feature selection aims to discover the features a response depends on while limiting the false discovery rate (FDR) to a predefined level. Recently, multiple deep-learning-based methods have been proposed to perform controlled feature selection through the Model-X knockoff framework. We demonstrate, however, that these methods often fail to control the FDR for two reasons. First, these methods often learn inaccurate models of features. Second, the "swap" property, which is required for knockoffs to be valid, is often not well enforced. We propose a new procedure called FlowSelect that remedies both of these problems. To more accurately model the features, FlowSelect uses normalizing flows, the state-of-the-art method for density estimation. To circumvent the need to enforce the swap property, FlowSelect uses a novel MCMC-based procedure to calculate p-values for each feature directly. Asymptotically, FlowSelect computes valid p-values. Empirically, FlowSelect consistently controls the FDR on both synthetic and semi-synthetic benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also demonstrates greater power on these benchmarks. Additionally, FlowSelect correctly infers the genetic variants associated with specific soybean traits from GWAS data. Comment: 20 pages, 9 figures, 3 tables

    Single Nucleotide Polymorphisms (SNPs) in Plant Genetics and Breeding

    Recent advances in genome technology have revealed various single nucleotide polymorphisms (SNPs), the most common form of DNA sequence variation between alleles, in several plant species. The discovery and application of SNPs have increased our knowledge of genetic diversity and improved our understanding of crop improvement. The natural breeding process, which takes a very long time across collection, cultivation, and domestication, has been accelerated by detecting dozens of SNPs in various species using advanced biotechnological techniques such as next-generation sequencing. This will result in the improvement of economically important traits. Therefore, we focus on the discovery, current technologies, and applications of SNPs in breeding. The chapter covers the following topics: (1) introduction, (2) application of SNPs, (3) techniques to detect SNPs, (4) importance of SNPs for crop improvement, and (5) conclusion.

    Genomic and Physiological Approaches to Improve Drought Tolerance in Soybean

    Drought stress is a major global constraint for crop production, and improving crop tolerance to drought is of critical importance. Direct selection of drought tolerance among genotypes for yield is limited because of low heritability, polygenic control, epistasis effects, and genotype by environment interactions. Crop physiology can play a major role in improving drought tolerance through the identification of traits associated with drought tolerance that can be used as indirect selection criteria in a breeding program. Carbon isotope ratio (δ13C, associated with water use efficiency), oxygen isotope ratio (δ18O, associated with transpiration), canopy temperature (CT), canopy wilting, and canopy coverage (CC) are promising physiological traits associated with improvement of drought tolerance. Genome-wide association studies (GWAS) are one of the genomic approaches that provide high mapping resolution for complex trait variation such as that related to drought tolerance. The objectives of this research were to identify genomic regions and favorable alleles that contribute to drought-tolerant traits. A diverse panel consisting of 373 maturity group (MG) IV soybean accessions was evaluated for δ13C, δ18O, canopy wilting, canopy coverage, and canopy temperature in multiple environments. A set of 31,260 polymorphic SNPs with a minor allele frequency (MAF) ≥ 5% was used for association mapping of CT using the FarmCPU model. Association mapping identified 54 significant SNPs associated with δ13C, 47 significant SNPs associated with δ18O, 61 significant SNPs associated with canopy wilting, 41 and 56 significant SNPs associated with CC for the first and second measurement dates, respectively, and 52 significant SNPs associated with CT. 
Several genes were identified using these significant SNPs, and those genes had reported functions related to transpiration, water transport, growth, development, root development, response to abscisic acid stimulus, and stomatal complex morphogenesis. Favorable alleles from significant SNPs may be an important resource for pyramiding genes to improve drought tolerance and for identifying parental genotypes for use in breeding programs.
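
    The abstract above keeps only SNPs with a minor allele frequency (MAF) of at least 5%. A minimal sketch of such a filter follows, assuming genotypes are coded as the count of the alternate allele (0, 1 or 2) with `None` for a missing call. The marker names, toy data, and function names are hypothetical and are not taken from the study.

```python
def minor_allele_frequency(calls):
    """MAF for one marker, given alternate-allele counts (0/1/2) and None for missing."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))  # two alleles per diploid line
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(markers, threshold=0.05):
    """Keep the markers whose MAF is at or above the threshold."""
    return {name: calls for name, calls in markers.items()
            if minor_allele_frequency(calls) >= threshold}


# Hypothetical toy markers.
markers = {
    "snp_a": [0, 0, 1, 2, 0, 1, 0, 0, 2, 0],     # alt frequency 6/20 = 0.30
    "snp_b": [0] * 19 + [1],                     # alt frequency 1/40 = 0.025
    "snp_c": [2, 2, None, 2, 2, 2, 2, 2, 2, 1],  # alt frequency 17/18, MAF = 1/18
}
kept = filter_by_maf(markers)
print(sorted(kept))
```

    Taking the smaller of the two allele frequencies means a marker is treated the same way whichever allele is labeled as alternate.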

    Genomic selection and quantitative trait loci mapping for Fusarium head blight resistance in wheat (Triticum aestivum L.)

    Fusarium head blight (FHB) is a destructive disease of wheat (Triticum aestivum L.) occurring in most growing areas. In North America, the disease is primarily caused by Fusarium graminearum Schwabe [teleomorph: Gibberella zeae Schw. (Petch)], and the majority of current wheat cultivars are susceptible to it. Significant economic losses are associated with FHB since it results in yield reduction, poor kernel quality and grain contamination by mycotoxins, such as deoxynivalenol. Resistance to FHB has been identified in the wheat gene pool, but breeding for it remains a challenge for several reasons, including the complex genetic control of resistance and poor adaptability of the traditional sources. In this context, molecular markers could contribute to the identification of genomic regions associated with FHB resistance. In addition, markers could be used to calculate breeding values for wheat lines, for traits related to disease resistance. In the first study of this dissertation, genomic selection (GS) models were compared for predicting traits associated with resistance to FHB, using 273 breeding lines in use at the University of Illinois’ soft red winter wheat breeding program. Genotyping-by-sequencing (GBS) was used to identify 5,054 single nucleotide polymorphisms (SNPs) which were then treated as predictor variables in GS analysis. 
Different parameters affecting the prediction accuracy of the genomic estimated breeding values (GEBVs) were tested, including: i) five genotypic imputation methods (random forest imputation – RFI, expectation maximization imputation – EMI, k-nearest neighbor imputation – KNNI, singular value decomposition imputation – SVDI and the mean imputation – MNI); ii) three statistical models (ridge regression best linear unbiased predictor – RR-BLUP, least absolute shrinkage and selection operator – LASSO, and elastic net); iii) marker density (p = 500, 1500, 3000, and 4500 SNPs); and iv) training population size (nTP = 96, 144, 192, and 218). No discernible differences in prediction accuracy were observed among imputation methods. For five of six traits, RR-BLUP outperformed other statistical models (LASSO and elastic net), and a significant reduction in prediction accuracy was observed when marker number decreased to 3000 or 1500 SNPs, depending on the trait. Lastly, prediction accuracies decreased significantly when the sample size of the training set was less than 192. The second study consisted of a genome-wide association study (GWAS) performed on the same panel used in the first study. A total of 19,992 SNPs were obtained with GBS and ten significant SNP-trait associations were detected for multiple parameters associated with FHB resistance on chromosomes 4A, 6A, 7A, 1D, 4D, and 7D, and multiple SNPs were associated with Fhb-1 on chromosome 3B. Fhb-1 is a major effect QTL identified in China, and it is very popular among wheat breeders worldwide. The genomic region on chromosome 6A appears to be new, as no other study reported QTL for that region. In addition, combination of favorable alleles of these SNPs resulted in lower levels of disease. The third study compared marker-assisted selection (MAS) with GS using different sets of genotypic data, including the QTL identified in the second study and Fhb-1. 
GS greatly outperformed MAS, with cross-validated prediction accuracy varying from 0.24 to 0.74 and from 0.59 to 0.98 for MAS and GS, respectively. Treating QTL as fixed effects in GS models resulted in higher prediction accuracy when compared with a GS model with only random effects. For the same selection intensity, GS resulted in higher selection differentials than MAS for all traits. This study indicates that GS is a more appropriate strategy than MAS for FHB resistance. The last study of this dissertation was a linkage mapping study using a population of 233 recombinant inbred lines obtained from IL97-181 (resistant) and Clark (susceptible). Neither parent possesses the traditional Asian sources of resistance to FHB in its pedigree. A total of 2275 single nucleotide polymorphisms (SNPs) were detected using genotyping-by-sequencing (GBS) and a genetic map was built covering all 21 wheat chromosomes. Inclusive composite interval mapping (ICIM) analysis identified four genomic regions associated with multiple FHB parameters, across all environments. Four QTL were detected for FHB resistance under field conditions on chromosomes 1B, 2D, 6D, and 7B. Two QTL were associated with type I resistance (6D and 7B), and two were associated with type II resistance (1B and 2D). The percentage of the phenotypic variation explained by these QTL varied between 6.7 and 12.5%. For QTL on sub-genome B, intervals smaller than 2 cM were obtained. The results show that elite germplasm can contribute to FHB resistance.

    Searching for the Genetic Basis of Hygienic Behavior and Overwintering in the Honeybee (Apis mellifera)

    The recent decline in honeybee populations can be mitigated through genomics and marker-assisted selection. Current techniques, such as chemical treatment to prevent disease, are only short-term solutions. The ability to breed honeybees that are disease resistant and winter hardy would be ideal. Current breeding techniques lack knowledge of predictive markers that may improve these traits. Here we perform a genome-wide association study on 925 colonies by measuring the hygienic and overwintering behavior of the colonies, followed by sequencing their genomes. L1 regression is a technique developed to pick the single nucleotide polymorphisms (SNPs) that best explain the variance in the phenotype. Using L1 regression, we found 27 SNPs for hygiene and 32 SNPs for overwintering behavior that could be used to breed for healthier and winter-hardy honeybees.
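
    L1-penalized regression selects markers because the penalty shrinks small coefficients to exactly zero. The sketch below shows this with a plain cyclic coordinate-descent solver. It is an illustration under assumptions, not the procedure used in the study: the toy design matrix (three mutually orthogonal columns coded +1/-1), the penalty value, and the function names are all hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`, returning exactly 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            scale = sum(c * c for c in col) / n
            if scale == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of column j with the residual once j's own effect is added back.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_wj = soft_threshold(rho, penalty) / scale
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_wj
    return w


# Hypothetical toy data: the phenotype depends strongly on marker 0,
# weakly on marker 1, and not at all on marker 2.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]  # 2.0 * marker0 + 0.1 * marker1
weights = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(weights, selected)
```

    With this penalty, only the strong marker keeps a nonzero coefficient, and that coefficient is shrunk below its true effect. This shrinkage is the usual price of L1 selection.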