50 research outputs found

    Haplotype estimation in polyploids using DNA sequence data

    Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, including in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.

    Genotype imputation as a genomic strategy for the SA Drakensberger beef breed

    Indigenous breeds such as the South African (SA) Drakensberger are economically important genetic resources in local beef production because of their adaptive traits and ability to perform competitively at a commercial level. Genomic selection (GS) is a promising technology to accelerate genetic progress in traits relevant to commercial beef production. A major obstacle in applying this methodology has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNPs). Cost reduction can be achieved by exploiting genotype imputation in GS workflows by means of genotyping at lower densities and imputing upwards. The overarching aim of this study was to conduct an investigation into the practicality of applying imputation in such a workflow utilizing genotypic data for 1 135 SA Drakensberger animals genotyped for 139 480 SNPs. As a pre-imputation step, the objective was firstly to elucidate inter- and intra-chromosomal patterns in genomic characteristics that may contribute to variability in achievable imputation accuracy across the genome. Inter-chromosomal differences in the proportion of low minor allele frequency (MAF) SNPs estimated varied from 6.6% for Bos Taurus autosome (BTA) 23 to 16.0% for BTA14. Pairwise linkage disequilibrium (LD), between adjacent SNPs, ranged from r2=0.11 (BTA28) to 0.17 (BTA14). The largest run of homozygosity (ROH), located on BTA13, was 225.82 kilobases (kb) in length and was identified in 23% of the animals sampled. The ROH-based inbreeding coefficients (FROH) estimated (e.g. FROH>1Mb=0.07, where FROH>1Mb denotes FROH calculated for all ROH longer than 1 megabase pair), indicated sufficient within-breed relatedness to achieve accurate imputation. During the imputation step, imputation accuracy from several custom-derived lower density panels varying in SNP density and the SNP selection strategy were compared. 
Imputation accuracy increased as SNP density increased; a genotyping panel consisting of 10 000 SNPs, selected based on a combination of their MAF and LD with neighbouring SNPs, could be used to achieve <3% imputation error on average. At this density of SNPs, a mean correlation coefficient (±standard deviation) between true and imputed SNPs of 0.972±0.024 was achieved in a set of validation animals (n=235). Low MAF SNPs were imputed with lower accuracy; a difference of 0.071 units was observed between the mean accuracy of imputed SNPs categorized into low (0.01<MAF≤0.1) versus high MAF (0.4<MAF<0.5) classes. Post-imputation, the utility of imputed genotypes in genomic breeding value (GEBV) estimation was evaluated by comparing prediction accuracies achieved from the use of true versus imputed SNPs in generating the H-inverse matrix applied in single-step GS. Breeding values were estimated for two growth traits, considering direct and maternal components. Prediction accuracies were improved by using genomic information in addition to traditional pedigree information; the largest improvement (0.026 units increase in accuracy) was observed for maternal birth weight. Marginal differences were observed between GEBV accuracies produced from true (GEBV_TRUE) versus imputed genotypes (GEBV_IMPUTED); for example, a mean±standard deviation accuracy of GEBV_TRUE=0.774±0.056 versus GEBV_IMPUTED=0.773±0.055 was observed for direct birth weight, suggesting that imputation errors had an almost negligible influence. Results presented in this thesis demonstrated the usefulness of imputation as a viable genomic strategy towards low-cost implementation of genomically enhanced prediction of EBVs for a breed such as the SA Drakensberger. Thesis (PhD), University of Pretoria, 2020. Animal and Wildlife Sciences. PhD (Animal Science).
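The abstract above reports imputation quality as an error rate and as a correlation between true and imputed genotypes, stratified by minor allele frequency (MAF). A minimal sketch of how such per-SNP metrics could be computed is given here, assuming genotypes are coded as 0, 1 or 2 copies of one allele. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
from math import sqrt


def minor_allele_frequency(genotypes):
    """MAF of one SNP from genotypes coded as 0/1/2 allele copies."""
    allele_freq = sum(genotypes) / (2 * len(genotypes))
    return min(allele_freq, 1 - allele_freq)


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def imputation_error_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotypes that differ between the true and imputed calls."""
    mismatches = sum(t != i for t, i in zip(true_genotypes, imputed_genotypes))
    return mismatches / len(true_genotypes)


# Toy data (assumed): one SNP observed in eight validation animals.
true_g = [0, 1, 2, 1, 0, 2, 1, 0]
imputed_g = [0, 1, 2, 1, 0, 2, 0, 0]

print("MAF:", minor_allele_frequency(true_g))
print("error rate:", imputation_error_rate(true_g, imputed_g))
print("correlation:", pearson(true_g, imputed_g))
```

In practice these metrics would be averaged over all imputed SNPs, and the correlation is undefined for a SNP that is monomorphic in the validation set, which this sketch does not handle.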

    Genomic evaluation considering the mosaic genome of the crossbred pig

    In pigs, the breeding goal is to improve performance of crossbred (CB) animals in commercial farms. The best purebred (PB) animals to produce CB animals can be selected based on their genomic estimated breeding value (GEBV) for CB performance. GEBVs are the result of combining estimated effects of single nucleotide polymorphisms (SNPs) with the animal's genotype. Using CB genomic information makes it possible to estimate SNP allele effects while accounting for the CB genetic background. The genome of CB animals is a mosaic of genomic regions inherited from the different parental breeds. This thesis therefore aimed to investigate whether SNP alleles have different effects depending on the parental breed from which the allele was inherited, and to study the impact on GEBV of PB animals for CB performance when breed-specific allele effects were taken into consideration. Firstly, I showed that around 94 % of alleles of three-way CB pigs can be assigned a breed of origin. Knowing this allowed me to implement a model that accounts for breed-specific effects of all SNP alleles. Using results of this model, I showed that estimated effects and explained variance of SNPs strongly associated with CB performance differ depending on the parental breed from which they were inherited; however, the majority of the genomic regions are not or only weakly associated with CB performance. Therefore, I implemented a new model that estimates breed-specific effects only for alleles of SNPs strongly associated with CB performance, and assumes for the rest of the SNPs that allele effects are the same across breeds. Differences in prediction accuracy between models were generally small. When the estimated genetic correlation between the performance of PB and CB animals per breed of origin differed largely across models, it was better to use models that distinguish alleles according to their breed of origin and whether or not they were associated with the trait.

    Methods and Applications for Collection, Contamination Estimation, and Linkage Analysis of Large-scale Human Genotype Data

    In recent decades statistical genetics has contributed substantially to our knowledge of human health and biology. This research has many facets: from collecting data, to cleaning, to analyzing. As the scope of the scientific questions considered and the scale of the data continue to increase, they bring additional challenges to every step of the process. In this dissertation, I describe novel approaches for each of these three steps, focused on the specific problems of participant recruitment and engagement, DNA contamination estimation, and linkage analysis with large data sets. In Chapter 1, we introduce the subject of this dissertation and how it fits with other developments in the generation, analysis and interpretation of human genetic data. In Chapter 2, we describe Genes for Good, a new platform for engaging a large, diverse participant pool in genetics research through social media. We developed a Facebook application where participants can sign up, take surveys related to their health, and easily invite interested friends to join. After participants complete a required number of these surveys, we send them a spit kit to collect their DNA. In a statistical analysis of 27,000 individuals from all over the United States genotyped in our study, we replicated health trends and genetic associations, showing the utility of our approach and the accuracy of the self-reported phenotypes we collected. In Chapter 3, we introduce VICES (Verify Intensity Contamination from Estimated Sources), a statistical method for joint estimation of DNA contamination and its sources in genotyping arrays. Genotyping array data are typically highly accurate but sensitive to mixing of DNA samples from multiple individuals before or during genotyping. VICES jointly estimates the total proportion of contaminating DNA and identifies which samples it came from by regressing deviations in probe intensity for a sample being tested on the genotypes of another sample. 
Through analysis of array intensity and genotype data from HapMap samples and the Michigan Genomics Initiative, we show that our method reliably estimates contamination more accurately than existing methods and implicates problematic steps to guide process improvements. In Chapter 4, we propose Population Linkage, a novel approach to perform linkage analysis on genome-wide genotype data from tens of thousands of arbitrarily related individuals. Our method estimates kinship and identical-by-descent (IBD) segments between all pairs of individuals, fits them as variance components using Haseman-Elston regression, and tests for linkage. This chapter addresses how to iteratively assess evidence of linkage in large numbers of individuals across the genome, reduce repeated calculations, model relationships without pedigrees, and determine segregation of genomic segments between relatives using single-nucleotide polymorphism (SNP) genotypes. After applying our method to 6,602 individuals from the National Institute on Aging (NIA) SardiNIA study and 69,716 individuals from the Trøndelag Health Study (HUNT), we show that most of our signals overlapped with known GWAS loci and many of these could explain a greater proportion of the trait variance than the top GWAS SNP. In Chapter 5, we discuss the impact and future directions for the work presented in this dissertation. We have proposed novel approaches for gathering useful research data, checking its quality, and detecting associations in the investigation of human genetics. Also, this work serves as an example for thinking about the process of human genetic discovery from beginning to end as a whole and understanding the role of each part. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162998/1/gzajac_1.pd

    Arrays and beyond: Evaluation of marker technologies for chicken genomics

    A key research question in livestock research is how livestock's phenotypic diversity is shaped by its genomic diversity. Genomic diversity is assessed through genomic markers. The use and definition of genomic markers are strongly technology driven and therefore change over time. In recent years, single nucleotide polymorphisms (SNPs) have become the main marker class. Additionally, SNP arrays have been the genotyping technology of choice in recent years because of their early availability. They are, however, currently being partially displaced by whole-genome sequencing (WGS) for SNP calling. Further, structural variants (SVs) are increasingly moving into the focus of researchers. In this context, the thesis aims to evaluate the value of SNP markers in various ways, with its main focus on chickens as a diverse livestock species of major agricultural value. In Chapter 1, the current knowledge of genomic variation, marker technologies, and their use in livestock sciences, especially in chickens, is reviewed. 
Chapters 2 and 3 then address a systematic error of SNP arrays, the SNP ascertainment bias. SNP ascertainment bias is a systematic shift of the allele frequency spectrum of SNP arrays towards more common SNPs due to the pre-selection of SNPs in a limited number of individuals from few populations. Chapter 2 aims to assess the magnitude of the bias for a standard chicken SNP array and the steps of array design that created the bias. In the study, we therefore reconstructed the design process of the chicken array based on (pooled) WGS of various chicken populations. This revealed a sequential reduction of rare alleles during the design process, which was mainly caused by the initial limitation of the discovery set and a later within-population selection of common SNPs while aiming for equidistant spacing. Increasing the discovery set had the largest impact on limiting ascertainment bias. Other steps, such as validation of the SNPs in a broader set of populations, did not show relevant effects. Correction methods for ascertainment bias have so far mostly been infeasible in studies. Chapter 3 therefore proposes to use imputation of the array data to WGS level as an in silico correction method for the allele frequency spectrum. The study revealed that imputation is able to strongly reduce the effects of ascertainment bias, even when a very small reference panel was used. However, it also became obvious that the reference panel then has the same effect as the discovery panel during array design. It is therefore crucial to select samples for the reference panel evenly distributed across the intended range of populations. SVs are harder to call and genotype than SNPs. Therefore, the question arises whether effects of SVs are captured by SNP-based studies due to strong linkage disequilibrium (LD) between SNPs and SVs. This is assessed in Chapter 4 for three commercial chicken breeds, based on WGS data. 
The study showed that LD between deletions and SNPs was on the same level as LD between SNPs and other SNPs, indicating that deletion effects are captured by SNP marker panels as well as SNP effects are. LD between SNPs and other SVs was strongly reduced. The main factor for this reduction was local differences to SNPs in terms of minor allele frequency. However, a reduction of homozygous variant calls for non-deletion SVs compared to the Hardy-Weinberg expectation may indicate problems with the SV genotypers used. In the last chapter (Chapter 5), the impact of ascertainment bias and possibilities to deal with it in chicken genomics (and also more generally in livestock genomics) are discussed. Further, the potential of including SVs in studies is evaluated. It also discusses what is necessary to combine the information of different genomic data sets to increase the value of analyses. Finally, an outlook is given on what additional information will become available in the near future based on recent technological advances. 2022-01-1
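The comparison with the Hardy-Weinberg expectation mentioned above can be illustrated with a short calculation. This is a minimal sketch, assuming a biallelic variant and simple genotype counts. The function name and the counts are hypothetical and do not come from the thesis.

```python
def hardy_weinberg_expected(n_hom_ref, n_het, n_hom_alt):
    """Expected genotype counts under Hardy-Weinberg equilibrium.

    Returns (expected_hom_ref, expected_het, expected_hom_alt) for the
    allele frequencies implied by the observed counts.
    """
    n = n_hom_ref + n_het + n_hom_alt
    q = (n_het + 2 * n_hom_alt) / (2 * n)  # variant allele frequency
    p = 1 - q                              # reference allele frequency
    return p * p * n, 2 * p * q * n, q * q * n


# Toy counts (assumed): fewer homozygous variant calls than expected.
observed = (50, 46, 4)
expected = hardy_weinberg_expected(*observed)

for label, obs, exp in zip(("hom ref", "het", "hom alt"), observed, expected):
    print(f"{label}: observed {obs}, expected {exp:.2f}")
```

A deficit of observed homozygous variant genotypes relative to the expected count, as in the toy data, is the kind of pattern the abstract describes as a possible sign of genotyping problems.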

    CONTRIBUTION TO LINKAGE AND ASSOCIATION MAPPING OF TRAIT LOCI IN LIVESTOCK.

    Until recently, breeding values were estimated based on phenotypes measured on the individual and its relatives, and the notion that the covariance between breeding values is proportional to the kinship coefficient. Advances in genomics now allow for direct analysis of the genome and identification of the loci that determine the breeding values of individuals. As a consequence, marker-assisted selection and genomic selection have become more effective and are replacing conventional selection. The identification of loci influencing the traits of interest requires the use of advanced statistical methods that are constantly evolving. In the context of this thesis, we have (i) contributed to the development of gene mapping methods, (ii) applied these methods to map loci influencing both metric and meristic traits, and (iii) contributed to the development of methods for the integration of genomic information in livestock breeding and management. The mapping methods that we have helped develop distinguish themselves mainly by the fact that (i) they exploit haplotype information (by means of a hidden Markov model), which should increase the linkage disequilibrium with causative variants and hence detection power, (ii) they can simultaneously extract linkage information within families and linkage disequilibrium information across the population, (iii) they correct for population stratification by means of a random polygenic effect, and (iv) they can be applied to binary as well as quantitative traits. We have applied these and other methods to map loci influencing (i) quantitative hematological parameters in a porcine line-cross, and (ii) binary traits including diseases in bovine and non-syntenic Copy Number Variants in cattle, horse and human. Finally, we have contributed to the development of methods for the utilization of marker information in animal selection and production. 
We have extended the haplotype-based mapping method to allow imputation and have evaluated the utility of this approach in scenarios mimicking reality. We have also contributed to the development of a method to quantify somatic cell counts in the milk of individual cows by genotyping a sample of milk from the farm's tank (hence a mixture of milk from all cows on the farm). Our work has resulted in the development of a software package ("GLASCOW") that is increasingly used by the community to map genes influencing complex traits, primarily binary. By using this tool, we have contributed to the localization of several trait loci in pig, cattle, horse and human. We have contributed to the development of approaches that reduce the costs of genomic analyses in livestock by, on the one hand, complementing real SNP genotypes with genotypes obtained in silico by means of imputation and, on the other hand, developing a method to deconvolute genotypes obtained on DNA pools.

    Analyse de la variation nucléotidique et structurale chez le soja par une approche de re-séquençage

    Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways, including through the development of new high-throughput genotyping methods that considerably accelerate the study of genome composition and function. As part of the SoyaGen project, funded by Genome Canada, we are seeking to better understand the genetic diversity and underlying architecture governing major agronomic traits in soybeans. Soybean is the world's largest oilseed crop in economic terms. In this study, we sought to exploit NGS technologies to help elucidate the genomic characteristics of soybeans. To this end, three main research topics have formed the core of this thesis: 1) low-cost genome-wide genotyping, 2) exhaustive characterization of genetic variants by whole-genome resequencing, and 3) identification of mutations with high functional impact on the basis of strong selection within the elite lines. A first challenge in genetic or genomic analysis is to make possible a rapid and inexpensive characterization of a large number of lines with a very large number of markers distributed throughout the genome. 
Genotyping-by-sequencing (GBS) allows simultaneous identification and genotyping of several thousand SNPs on a genome-wide scale. One of the major challenges in GBS analysis is to extract a large catalog of high-quality SNPs from a mountain of sequencing data and to minimize the impact of missing data. As a first step, we have greatly improved GBS by developing a new bioinformatics analysis pipeline, Fast-GBS, designed to call genotypes more accurately and faster than existing tools. In addition, we have optimized tools for imputing missing data. We were thus able to obtain a catalog of 60K SNP markers from a collection of 301 accessions intended to be representative of soybean diversity in Canada. Subsequently, all missing data (~50%) were imputed with a very high degree of accuracy (98%). This genetic characterization was performed at a low cost, less than $15 per line. Second, to fully characterize the nucleotide and structural variations (SNV and SV, respectively) in the soybean genome, we sequenced the whole genome of 102 Canadian soybean accessions. We identified nearly 5M nucleotide variants (SNPs, MNPs and indels) with a high level of accuracy (98.6%). Then, using a combination of three different approaches, we detected ~92K SVs (deletions, insertions, inversions, duplications, CNVs and translocations) and estimated that more than 90% were accurate. This is the first time that a complete description of the diversity of SNP and SV haplotypes has been carried out in a cultivated species. Finally, we have developed a systematic analytical approach to greatly facilitate the identification of genes whose alleles have undergone very strong selection during domestication and breeding. This approach is based on two recent advances in genomics: (1) whole-genome sequencing and (2) the prediction of mutations resulting in loss of function (LOF). Using this approach, we identified 130 candidate genes related to domestication or selection in soybean. 
This catalogue contains all of the previously well-characterized domestication genes in soybean, as well as some orthologues from other domesticated crop species. This list of genes provides many avenues of investigation for studies aimed at better understanding the genes that contribute strongly to shaping cultivated soybeans. This thesis ultimately leads to a better understanding of the genomic characteristics of soybeans. In addition, it provides several tools and genomic resources that could easily be used in future genomic research in soybeans as well as in other species.

    Application of genomic technologies to the horse

    The publication of a draft equine genome sequence and the release by Illumina of a 50,000 marker single-nucleotide polymorphism (SNP) genotyping chip has provided equine researchers with the opportunity to use new approaches to study the relationships between genotype and phenotype. In particular, it is hoped that the use of high-density markers applied to population samples will enable progress to be made with regard to more complex diseases. The first objective of this thesis is to explore the potential for the equine SNP chip to enable such studies to be performed in the horse. The second objective is to investigate the genetic background of osteochondrosis (OC) in the horse. These objectives have been tackled using 348 Thoroughbreds from the US, divided into cases and controls, and a further 836 UK Thoroughbreds, the majority with no phenotype data. All horses had been genotyped with the Illumina Equine SNP50 BeadChip. Linkage disequilibrium (LD) is the non-random association of alleles at neighbouring loci. The reliance of many genomic methodologies on LD between neutral markers and causal variants makes it an important characteristic of genome structure. In this thesis, the genomic data has been used to study the extent of LD in the Thoroughbred and the results considered in terms of genome coverage. Results suggest that the SNP chip offers good coverage of the genome. Published theoretical relationships between LD and historical effective population size (Ne) were exploited to enable accuracy predictions for genome-wide evaluation (GWE) to be made. A subsequent in-depth exploration of this theory cast some doubt on the reliability of this approach in the estimation of Ne, but the general conclusion that the Thoroughbred population has a small Ne which should enable GWE to be carried out efficiently in this population, remains valid. 
In the course of these studies, possible errors embedded within the current sequence assembly were identified using empirical approaches. Osteochondrosis is a developmental orthopaedic disease which affects the joints of young horses. Osteochondrosis is considered multifactorial in origin, with a variety of environmental factors and heredity having been implicated. In this thesis, a genome-wide association study was carried out to identify quantitative trait loci (QTL) associated with OC. A single SNP was found to be significantly associated with OC. The low heritability of OC combined with the apparent lack of major QTL suggests GWE as an alternative approach to tackle this disease. A GWE analysis was carried out on the same dataset but the resulting genomic breeding values had no predictive ability for OC status. This, combined with the small number of significant QTL, indicates a lack of power which could be addressed in the future by increasing sample size. An alternative to genotyping more horses with the 50K SNP chip would be to use a low-density SNP panel and impute the remaining genotypes. The final chapter of this thesis examines the feasibility of this approach in the Thoroughbred. Results suggest that genotyping only a subset of samples at high density and the remainder at lower density could be an effective strategy to enable greater progress to be made in the arena of equine genomics. Finally, this thesis provides an outlook on the future for genomics in the horse. L.J. Corbin, J.A. Woolliams, "Data relating to Laura Corbin PhD" (2016), Edinburgh DataVault.
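The abstract above defines linkage disequilibrium (LD) as the non-random association of alleles at neighbouring loci. One common way to quantify it is the squared correlation r² between two biallelic loci, computed from phased haplotypes as D² / (pA(1-pA)pB(1-pB)) with D = pAB - pA·pB. The sketch below assumes haplotypes are already phased and coded 0/1. The function name and the toy haplotypes are hypothetical, and the thesis may have used a different estimator.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes.

    haplotypes: list of (allele_at_locus_a, allele_at_locus_b) pairs,
    each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Toy haplotypes (assumed).
complete_ld = [(0, 0), (1, 1), (0, 0), (1, 1)]
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]

print("complete LD:", ld_r_squared(complete_ld))
print("no LD:", ld_r_squared(no_ld))
```

With unphased SNP chip genotypes, haplotype frequencies would first have to be estimated, or r² approximated from the genotype correlation, which this sketch does not cover.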