Haplotype estimation in polyploids using DNA sequence data
Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, including in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.
Quantifying recent variation and relatedness in human populations
Advances in the genetic analysis of humans have revealed a surprising abundance of local relatedness between purportedly unrelated individuals. Where common mutations classically inform us of ancient relationships, such segments of pairwise identical-by-descent (IBD) sharing from a common ancestor are the observable traces of recent inter-mating. Combining these two distinct sources of information can help disentangle the complex genetic structure and flux in human populations. When considered together with a heritable trait, the segments can also be used to interrogate unascertained rare variation and help in locating trait-affecting loci. This work presents methods for comprehensive analysis of population-wide IBD and explores applications to disease and the understanding of recent genetic variation. We propose several strategies for efficient detection of IBD segments in population genotype data. Our novel seed-based algorithm, GERMLINE, can reduce the computational burden of finding pairwise segments from quadratic to nearly linear time in a general population. We demonstrate that this approach is several orders of magnitude faster than the available all-pairs methods while maintaining higher accuracy. Next, we extended the GERMLINE technique to process cohorts of unlimited size by adaptively adjusting the search mechanism to meet resource restrictions. We confirm its effectiveness with an analysis of 50,000 individuals where contemporary methods can only process a few thousand. One drawback of these two algorithms is the dependence on phased haplotype data as input - a constraint that becomes harder to satisfy with large populations. We propose a solution to this problem with an algorithm that analyzes genotype data directly by exploring all potential haplotypes and scoring each putative segment based on linkage disequilibrium.
This solution significantly outperforms available methods when applied to full sequence data and is computationally efficient enough to analyze thousands of sequenced genomes where current methods can only determine haplotypes for several hundred. Secondly, we outline two algorithms for analyzing available IBD segments to increase our understanding of rare variation and complex disease. Motivated by whole-genome sequencing, we present the INFOSTIP algorithm, which uses IBD segments to optimize the selection of individuals for complete population ascertainment. In simulations, we show that INFOSTIP selection can significantly increase variant inference accuracy over random sampling and posit inference of 60% of an isolated population from 1% optimally selected individuals. Seeking to move beyond pairwise IBD segment analysis, we describe the DASH algorithm, which groups shared segments into IBD "clusters" that are likely to be commonly co-inherited and uses them as proxies for un-typed variation. In simulated disease studies, we show this reference-free approach to be much more powerful for detecting rare causal variants than either traditional single-marker analysis or imputation from a general reference panel. Applying the DASH algorithm to disease traits from different populations, we identify multiple novel loci of association. Together, these novel techniques integrate the power of population and disease genetics
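The seed-based search described in this abstract can be illustrated with a minimal sketch. This is not the GERMLINE implementation; it only shows, under simplifying assumptions, how hashing fixed-width windows of phased haplotypes yields candidate pairs without comparing every pair directly. The function name `seed_matches`, the `window` parameter, and the toy haplotypes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, window):
    """Map each haplotype pair (i, j) to the windows where both are identical.

    haplotypes: list of equal-length strings of '0'/'1' alleles (phased).
    window: number of markers per seed (hypothetical parameter).
    """
    n_markers = len(haplotypes[0])
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_markers - window + 1, window)):
        buckets = defaultdict(list)
        # Hash each haplotype by its allele string within this window.
        for idx, hap in enumerate(haplotypes):
            buckets[hap[start:start + window]].append(idx)
        # Only haplotypes that fall in the same bucket become candidate pairs.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                matches[(i, j)].append(w)
    return dict(matches)


# Toy data (assumed): three phased haplotypes over eight markers.
toy = [
    "01100110",
    "01101111",
    "10010110",
]
result = seed_matches(toy, window=4)
print(result)
```

Work stays close to linear in the number of haplotypes as long as the buckets are small. A practical tool would also extend matching seeds into longer segments and tolerate genotyping errors, which this sketch omits.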
Genotype imputation as a genomic strategy for the SA Drakensberger beef breed
Indigenous breeds such as the South African (SA) Drakensberger are economically important genetic resources in local beef production because of their adaptive traits and ability to perform competitively at a commercial level. Genomic selection (GS) is a promising technology to accelerate genetic progress in traits relevant to commercial beef production. A major obstacle in applying this methodology has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNPs). Cost reduction can be achieved by exploiting genotype imputation in GS workflows by means of genotyping at lower densities and imputing upwards. The overarching aim of this study was to investigate the practicality of applying imputation in such a workflow, utilizing genotypic data for 1 135 SA Drakensberger animals genotyped for 139 480 SNPs. As a pre-imputation step, the objective was firstly to elucidate inter- and intra-chromosomal patterns in genomic characteristics that may contribute to variability in achievable imputation accuracy across the genome. Inter-chromosomal differences in the estimated proportion of low minor allele frequency (MAF) SNPs varied from 6.6% for Bos taurus autosome (BTA) 23 to 16.0% for BTA14. Pairwise linkage disequilibrium (LD) between adjacent SNPs ranged from r²=0.11 (BTA28) to 0.17 (BTA14). The largest run of homozygosity (ROH), located on BTA13, was 225.82 kilobases (kb) in length and was identified in 23% of the animals sampled. The estimated ROH-based inbreeding coefficients (FROH) (e.g. FROH>1Mb=0.07, where FROH>1Mb denotes FROH calculated for all ROH longer than 1 megabase pair) indicated sufficient within-breed relatedness to achieve accurate imputation. During the imputation step, imputation accuracy was compared across several custom-derived lower density panels varying in SNP density and SNP selection strategy.
Imputation accuracy increased as SNP density increased; a genotyping panel consisting of 10 000 SNPs, selected based on a combination of their MAF and LD with neighbouring SNPs, could be used to achieve <3% imputation error on average. At this density of SNPs, a mean correlation coefficient (±standard deviation) between true and imputed SNPs of 0.972±0.024 was achieved in a set of validation animals (n=235). Low MAF SNPs were imputed with lower accuracy; a difference of 0.071 units was observed between the mean accuracy of imputed SNPs categorized into low (0.01<MAF≤0.1) versus high MAF (0.4<MAF<0.5) classes. Post-imputation, the utility of imputed genotypes in genomic breeding value (GEBV) estimation was evaluated by comparing prediction accuracies achieved from the use of true versus imputed SNPs in generating the H-inverse matrix applied in single-step GS. Breeding values were estimated for two growth traits, considering direct and maternal components. Prediction accuracies were improved by using genomic information in addition to traditional pedigree information; the largest improvement (0.026 units increase in accuracy) was observed for maternal birth weight. Marginal differences were observed between GEBV accuracies produced from true (GEBV_TRUE) versus imputed genotypes (GEBV_IMPUTED); for example, a mean±standard deviation accuracy of GEBV_TRUE=0.774±0.056 versus GEBV_IMPUTED=0.773±0.055 was observed for direct birth weight, suggesting that imputation errors had an almost negligible influence. Results presented in this thesis demonstrated the usefulness of imputation as a viable genomic strategy towards low-cost implementation of genomically enhanced prediction of EBVs for a breed such as the SA Drakensberger. Thesis (PhD)--University of Pretoria, 2020. Animal and Wildlife Sciences. PhD (Animal Science). Unrestricte
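The two accuracy measures quoted in this abstract, the imputation error rate and the correlation between true and imputed genotypes, can be sketched as follows. This is an illustrative calculation, not the thesis's pipeline; the function names and the toy genotype vectors (coded 0/1/2 as allele counts) are assumptions.

```python
from math import sqrt


def imputation_error_rate(true_gt, imputed_gt):
    """Fraction of genotypes (coded 0/1/2) that were imputed incorrectly."""
    wrong = sum(t != i for t, i in zip(true_gt, imputed_gt))
    return wrong / len(true_gt)


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy data (assumed): ten genotypes for one SNP, one of them mis-imputed.
true_gt = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
imputed_gt = [0, 1, 2, 1, 0, 2, 1, 2, 0, 2]

error = imputation_error_rate(true_gt, imputed_gt)
r = pearson_r(true_gt, imputed_gt)
print(f"error rate = {error:.2f}, correlation = {r:.3f}")
```

The correlation is more sensitive than the raw error rate for low-MAF SNPs, where a few wrong calls among rare genotypes change the covariance substantially. That is consistent with the lower accuracy the abstract reports for the low-MAF class.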
Genomic evaluation considering the mosaic genome of the crossbred pig
In pigs, the breeding goal is to improve performance of crossbred (CB) animals in commercial farms. The best purebred (PB) animals to produce CB animals can be selected based on their genomic estimated breeding value (GEBV) for CB performance. GEBVs are the result of combining estimated effects of single nucleotide polymorphisms (SNPs) with the animal's genotype. Using CB genomic information allows SNP allele effects to be estimated while accounting for the CB genetic background. The genome of CB animals is a mosaic of genomic regions inherited from the different parental breeds. This thesis therefore aimed to investigate whether SNP alleles have different effects depending on the parental breed from which the allele was inherited, and to study the impact on GEBVs of PB animals for CB performance when breed-specific allele effects were taken into consideration. Firstly, I showed that around 94% of alleles of three-way CB pigs can be assigned a breed of origin. Knowing this allowed me to implement a model that accounts for breed-specific effects of all SNP alleles. Using results of this model, I showed that estimated effects and explained variance of SNPs strongly associated with CB performance differ depending on the parental breed from which they were inherited; however, the majority of the genomic regions are not or only weakly associated with CB performance. Therefore, I implemented a new model that estimates breed-specific effects only for alleles of SNPs strongly associated with CB performance, and for the rest of the SNPs assumes that allele effects are the same across breeds. Differences in prediction accuracies between models were generally small. When the estimated genetic correlation between the performance of PB and CB animals per breed of origin differed largely across models, it was better to use models that distinguish alleles according to their breed of origin and whether or not they were associated with the trait.
Methods and Applications for Collection, Contamination Estimation, and Linkage Analysis of Large-scale Human Genotype Data
In recent decades statistical genetics has contributed substantially to our knowledge of human health and biology. This research has many facets: from collecting data, to cleaning, to analyzing. As the scope of the scientific questions considered and the scale of the data continue to increase, these bring additional challenges to every step of the process. In this dissertation, I describe novel approaches for each of these three steps, focused on the specific problems of participant recruitment and engagement, DNA contamination estimation, and linkage analysis with large data sets. In Chapter 1, we introduce the subject of this dissertation and how it fits with other developments in the generation, analysis and interpretation of human genetic data.
In Chapter 2, we describe Genes for Good, a new platform for engaging a large, diverse participant pool in genetics research through social media. We developed a Facebook application where participants can sign up, take surveys related to their health, and easily invite interested friends to join. After completing a required number of these surveys, we send participants a spit kit to collect their DNA. In a statistical analysis of 27,000 individuals from all over the United States genotyped in our study, we replicated health trends and genetic associations, showing the utility of our approach and accuracy of self-reported phenotypes we collected.
In Chapter 3, we introduce VICES (Verify Intensity Contamination from Estimated Sources), a statistical method for joint estimation of DNA contamination and its sources in genotyping arrays. Genotyping array data are typically highly accurate but sensitive to mixing of DNA samples from multiple individuals before or during genotyping. VICES jointly estimates the total proportion of contaminating DNA and identifies which samples it came from by regressing deviations in probe intensity for a sample being tested on the genotypes of another sample. Through analysis of array intensity and genotype data from HapMap samples and the Michigan Genomics Initiative, we show that our method estimates contamination more accurately than existing methods and implicates problematic steps to guide process improvements.
In Chapter 4, we propose Population Linkage, a novel approach to perform linkage analysis on genome-wide genotype data from tens of thousands of arbitrarily related individuals. Our method estimates kinship and identical-by-descent (IBD) segments between all pairs of individuals, fits them as variance components using Haseman-Elston regression, and tests for linkage. This chapter addresses how to iteratively assess evidence of linkage in large numbers of individuals across the genome, reduce repeated calculations, model relationships without pedigrees, and determine segregation of genomic segments between relatives using single-nucleotide polymorphism (SNP) genotypes. After applying our method to 6,602 individuals from the National Institute on Aging (NIA) SardiNIA study and 69,716 individuals from the Trøndelag Health Study (HUNT), we show that most of our signals overlapped with known GWAS loci and many of these could explain a greater proportion of the trait variance than the top GWAS SNP.
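The core idea of Haseman-Elston regression can be sketched in a few lines. This is a simplified, single-predictor illustration of the cross-product form, not the Population Linkage method itself, which fits kinship and IBD together as variance components. The function name `he_slope` and the toy data are assumptions.

```python
def he_slope(phenotypes, ibd):
    """Least-squares slope of phenotype cross-products on pairwise IBD sharing.

    phenotypes: list of trait values, one per individual.
    ibd: dict mapping a pair (i, j) to the proportion shared IBD at one locus.
    A positive slope suggests that pairs sharing more of the locus are more
    alike in the trait, i.e. evidence consistent with linkage.
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    std = (sum((y - mean) ** 2 for y in phenotypes) / n) ** 0.5
    z = [(y - mean) / std for y in phenotypes]  # standardized trait values

    xs = list(ibd.values())
    ys = [z[i] * z[j] for (i, j) in ibd]  # cross-product for each pair
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy data (assumed): individuals 0/1 and 2/3 share the locus IBD
# and also have similar trait values.
phenotypes = [2.0, 2.1, -1.9, -2.0]
ibd = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 0.0, (0, 3): 0.0, (1, 2): 0.0, (1, 3): 0.0}
slope = he_slope(phenotypes, ibd)
print(f"HE slope = {slope:.3f}")
```

Because the regression operates on pairs rather than pedigrees, it applies to arbitrarily related individuals. The number of pairs grows quadratically with sample size, which is why the abstract emphasizes reducing repeated calculations.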
In Chapter 5, we discuss the impact and future directions for the work presented in this dissertation. We have proposed novel approaches for gathering useful research data, checking its quality, and detecting associations in the investigation of human genetics. Also, this work serves as an example for thinking about the process of human genetic discovery from beginning to end as a whole and understanding the role of each part. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162998/1/gzajac_1.pd
Computational methods for single cell RNA and genome assembly resolution using genetic variation
Genetic variation and natural selection have driven the evolutionary history on this planet and are responsible for creating us and all other life as we know it. Over the past several decades, the genomic revolution has allowed us to assess population variation across humans and other species and use that to link genotypes with phenotypes and infer evolutionary histories. In this thesis, I explore computational methods for using genetic variation to demultiplex and disambiguate complex data.
In single cell RNAseq, batch effects, doublets, and ambient RNA are each sources of noise that impede our ability to infer the functional states of cells and compare them between experiments. One popular new experimental design that promises to solve each of these while also reducing experimental costs is mixing multiple individuals' cells into a single experiment. In chapter 2, I present a method for clustering cells by genotype, calling doublets, and using the cross-genotype signal in singletons to estimate and remove ambient RNA. I compare this method to other existing methods, including one that requires a priori information about the genotypes and two that do not. I find that my method outperforms each of these methods across a wide range of data parameters and sample types.
In genome assembly, the recent higher throughput and lower cost of long read sequencing has revolutionized our ability to create reference quality genomes and has revitalized the assembly community. Now, massive efforts are taking place in the Darwin Tree of Life project and the Earth Biogenome project to create reference genomes for all multicellular eukaryotic life. This will create a scientific resource for the next generation of biological science, will conserve data that could otherwise be lost in this time of mass extinction, and will allow for a much broader understanding of evolution and the evolutionary history of life on Earth. While much progress has been made in data quality and assembly algorithms, some problems still exist. Until recently, the DNA input requirements for long read sequencing technologies made it impossible to sequence single individuals of these species with long reads. Also, high heterozygosity makes assembly more difficult due to the inherent ambiguity between heterozygous sequence and paralogous sequence when confronted with inexact homology. One solution to the DNA input requirements would be to pool individuals, but this only increases the heterozygosity of the sample and reduces assembly quality. In chapter 3, we present the first high quality assembly of a single mosquito using new library preparation methods with reduced DNA requirements. This reduces the number of haplotypes to two, improving the assembly quality. In chapter 4, we further address the problems brought on by heterozygosity in assembly. I present a suite of tools that use the phasing consistency of multiple heterozygous sequences as a signal for physical linkage, thus using genetic variation to our advantage rather than as a challenge to overcome. This suite creates phased, linked assemblies and phasing-aware scaffolding. Further, I provide a tool for phasing-aware scaffolding on existing assemblies.
This includes a novel haplotype phasing algorithm with some unique beneficial properties. It is robust to non-heterozygous variants as input and can detect and correct those genotypes. It also naturally extends to polyploid genomes. Wellcome Trus
Arrays and beyond: Evaluation of marker technologies for chicken genomics
A key research question in livestock research is how livestock's phenotypic diversity is shaped by its genomic diversity. Genomic diversity is assessed through genomic markers. The use and definition of genomic markers is strongly technology-driven and therefore changes over time. In recent years, single nucleotide polymorphisms (SNPs) have become the main marker class. Additionally, SNP arrays have been the genotyping technology of choice in recent years due to their early availability. They are, however, currently being partially displaced by whole-genome sequencing (WGS) for SNP calling. Further, structural variants (SVs) are moving more and more into the focus of researchers. In this context, the thesis aims to evaluate the value of SNP markers in various ways, with its main focus on chickens as a diverse livestock species with major agricultural value.
In Chapter 1, the current knowledge of genomic variation, marker technologies, and their use in livestock sciences, especially in chickens, is reviewed. Chapters 2 and 3 then address a systematic error of SNP arrays, the SNP ascertainment bias. SNP ascertainment bias is a systematic shift of the allele frequency spectrum of SNP arrays towards more common SNPs due to the pre-selection of SNPs in a limited number of individuals from few populations.
Chapter 2 aims to assess the magnitude of the bias for a standard chicken SNP array and the steps of array design that created the bias. In the study, we therefore remodeled the design process of the chicken array based on (pooled) WGS of various chicken populations. This revealed a sequential reduction of rare alleles during the design process, which was mainly caused by the initial limitation of the discovery set and a later within-population selection of common SNPs while aiming for equidistant spacing. Increasing the discovery set had the largest impact on limiting ascertainment bias. Other steps, such as validation of the SNPs in a broader set of populations, did not show relevant effects.
Correction methods for ascertainment bias have so far often been infeasible in studies. Chapter 3 therefore proposes to use imputation of the array data to WGS level as an in silico correction method for the allele frequency spectrum. The study revealed that imputation is able to strongly reduce the effects of ascertainment bias, even when a very sparse reference panel was used. However, it also became clear that the reference panel then has the same effect as the discovery panel during array design. It is therefore crucial to select samples for the reference panel evenly across the intended range of populations.
SVs are harder to call and genotype than SNPs. Therefore, the question arises whether effects of SVs are captured by SNP-based studies due to strong linkage disequilibrium (LD) between SNPs and SVs. This is assessed in Chapter 4 for three commercial chicken breeds, based on WGS data. The study showed that LD between deletions and SNPs was on the same level as LD between SNPs and other SNPs, indicating that deletion effects are captured by SNP marker panels as well as SNP effects. LD between SNPs and other SVs was strongly reduced. The main factor for this reduction was local differences to SNPs in terms of minor allele frequency. However, a reduction of homozygous variant calls for non-deletion SVs compared to the Hardy-Weinberg expectation may indicate problems with the SV genotypers used.
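The comparison against the Hardy-Weinberg expectation mentioned in this paragraph can be sketched as follows. This only illustrates the standard expectation (p², 2pq, q²) applied to genotype counts; it is not the thesis's analysis. The function name and the toy counts are assumptions.

```python
def hwe_expected_counts(n_ref_hom, n_het, n_alt_hom):
    """Expected genotype counts under Hardy-Weinberg equilibrium.

    Returns (reference homozygotes, heterozygotes, variant homozygotes)
    given the observed counts of the three genotype classes.
    """
    n = n_ref_hom + n_het + n_alt_hom
    q = (n_het + 2 * n_alt_hom) / (2 * n)  # variant allele frequency
    p = 1.0 - q
    return (n * p * p, 2 * n * p * q, n * q * q)


# Toy counts (assumed): 100 animals genotyped for one structural variant.
observed = (60, 38, 2)
expected = hwe_expected_counts(*observed)
print("observed variant homozygotes:", observed[2])
print("expected variant homozygotes: %.2f" % expected[2])
```

In this toy case fewer variant homozygotes are observed than expected. That is the kind of deficit the abstract interprets as a possible genotyping problem rather than a biological signal.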
In the last chapter (Chapter 5), the impact of ascertainment bias and possibilities to deal with it in chicken genomics (and more generally in livestock genomics) are discussed. Further, the potential of including SVs in studies is evaluated. The chapter also discusses what is necessary to combine the information of different genomic data sets to increase the value of analyses. Finally, an outlook is given on what additional information will become available in the near future based on recent technological advances. 2022-01-1
CONTRIBUTION TO LINKAGE AND ASSOCIATION MAPPING OF TRAIT LOCI IN LIVESTOCK.
Until recently, breeding values were estimated based on phenotypes measured on the individual and its relatives, and the notion that the covariance between breeding values is proportionate to the kinship coefficient. Advances in genomics now allow for direct analysis of the genome and identification of the loci that determine the breeding values of individuals. As a consequence, marker assisted selection and genomic selection have become more effective and are replacing conventional selection.
The identification of loci influencing the traits of interest requires the use of advanced statistical methods that are constantly evolving. In the context of this thesis, we have (i) contributed to the development of gene mapping methods, (ii) applied these methods to map loci influencing both metric and meristic traits, and (iii) contributed to the development of methods for the integration of genomic information in livestock breeding and management.
The mapping methods that we have helped develop distinguish themselves mainly by the fact that (i) they exploit haplotype information (by means of a hidden Markov model), which should increase the linkage disequilibrium with causative variants and hence detection power, (ii) they can simultaneously extract linkage information within families and linkage disequilibrium information across the population, (iii) they correct for population stratification by means of a random polygenic effect, and (iv) they can be applied to binary as well as quantitative traits.
We have applied these and other methods to map loci influencing (i) quantitative hematological parameters in a porcine line-cross, and (ii) binary traits including diseases in bovine and non-syntenic Copy Number Variants in cattle, horse and human.
Finally, we have contributed to the development of methods for the utilization of marker information in animal selection and production. We have extended the haplotype-based mapping method to allow imputation and have evaluated the utility of this approach in scenarios mimicking reality. We have also contributed to the development of a method to quantify somatic cell counts in the milk of individual cows by genotyping a sample of milk from the farm's tank (hence a mixture of milk from all cows on the farm).
Our work has resulted in the development of a software package ("GLASCOW") that is increasingly used by the community to map genes influencing complex traits, primarily binary. By using this tool, we have contributed to the localization of several trait loci in pig, cattle, horse and human. We have contributed to the development of approaches that reduce the costs of genomic analyses in livestock by, on the one hand, complementing real SNP genotypes with genotypes obtained in silico by means of imputation, and, on the other hand, by developing a method to deconvolute genotypes obtained on DNA pools.
Analyse de la variation nucléotidique et structurale chez le soja par une approche de re-séquençage
Le sĂ©quençage de nouvelle gĂ©nĂ©ration (NGS) a rĂ©volutionnĂ© la recherche chez les plantes et les animaux de plusieurs façons, y compris via le dĂ©veloppement de nouvelles mĂ©thodes de gĂ©notypage Ă haut dĂ©bit pour accĂ©lĂ©rer considĂ©rablement l'Ă©tude de la composition des gĂ©nomes et de leurs fonctions. Dans le cadre du projet SoyaGen, financĂ© par GĂ©nome Canada, nous cherchons Ă mieux comprendre la diversitĂ© gĂ©nĂ©tique et l'architecture sous-jacente rĂ©gissant les principaux caractĂšres agronomiques chez le soja. Le soja est la plus importante culture olĂ©agineuse au monde en termes Ă©conomiques. Dans cette Ă©tude, nous avons cherchĂ© Ă exploiter les technologies NGS afin de contribuer Ă l'Ă©lucidation des caractĂ©ristiques gĂ©nomiques du soja. Pour ce faire, trois axes de recherche ont formĂ© le cĆur de cette thĂšse : 1) le gĂ©notypage pan-gĂ©nomique Ă faible coĂ»t, 2) la caractĂ©risation exhaustive des variants gĂ©nĂ©tiques par resĂ©quençage complet et 3) lâidentification de mutations Ă fort impact fonctionnel sur la base dâune forte sĂ©lection au sein des lignĂ©es Ă©lites. Un premier dĂ©fi en analyse gĂ©nĂ©tique ou gĂ©nomique est de rendre possible une caractĂ©risation rapide et peu coĂ»teuse dâun grand nombre de lignĂ©es Ă un trĂšs grand nombre de marqueurs rĂ©partis sur tout le gĂ©nome. Le gĂ©notypage par sĂ©quençage (GBS) permet d'effectuer simultanĂ©ment lâidentification et le gĂ©notypage de plusieurs milliers de SNP Ă l'Ă©chelle du gĂ©nome. Un des grands dĂ©fis en analyse GBS est dâextraire, dâune montagne de donnĂ©es issues du sĂ©quençage, un grand catalogue de SNP de haute qualitĂ© et de minimiser lâimpact des donnĂ©es manquantes. Dans une premiĂšre Ă©tape, nous avons grandement amĂ©liorĂ© le GBS en dĂ©veloppant un nouveau pipeline dâanalyse bio-informatique, Fast-GBS, conçu pour produire un appel de gĂ©notypes plus prĂ©cis et plus rapide que les outils existants. 
De plus, nous avons optimisĂ© des outils permettant dâeffectuer l'imputation des donnĂ©es manquantes. Ainsi, nous avons pu obtenir un catalogue de 60K marqueurs SNP au sein dâune collection de 301 accessions qui se voulait reprĂ©sentative de la diversitĂ© du soja au Canada. Dans un second temps, toutes les donnĂ©es manquantes (~50%) ont Ă©tĂ© imputĂ©es avec un trĂšs grand degrĂ© dâexactitude (98 %). Cette caractĂ©risation gĂ©nĂ©tique a Ă©tĂ© rĂ©alisĂ©e pour un coĂ»t modique, soit moins de 15 15 per line. Second, to fully characterize the nucleotide and structural variations (SNV and SV, respectively) in the soybean genome, we sequenced the whole genome of 102 Canadian soybean accessions. We have identified nearly 5M of nucleotide variants (SNP, MNP and Indels) with a high level of accuracy (98.6%). Then, using a combination of three different approaches, we detected ~ 92K SV (deletions, insertions, inversions, duplications, CNVs and translocations) and estimated that more than 90% were accurate. This is the first time that a complete description of the diversity of SNP and SV haplotypes has been carried out in a cultivated species. Finally, we have developed a systematic analytical approach to greatly facilitate the identification of genes whose alleles have undergone a very strong selection during domestication and selection. This approach is based on two recent advances in genomics: (1) whole-genome sequencing and (2) predicting mutations resulting in loss of function (LOF). Using this approach, we identified 130 candidate genes related to domestication or selection in soybean. This catalogue contains all of the previously well-characterized domestication genes in soybean, as well as some orthologues from other domesticated crop species. This list of genes provides many avenues of investigation for studies aimed at better understanding the genes that contribute strongly to shaping cultivated soybeans. 
This thesis ultimately leads to a better understanding of the genomic characteristics of soybeans. In addition, it provides several tools and genomic resources that could easily be used in future genomic research in soybeans as well as in other species.
Application of genomic technologies to the horse
The publication of a draft equine genome sequence and the release by Illumina of a
50,000 marker single-nucleotide polymorphism (SNP) genotyping chip has provided
equine researchers with the opportunity to use new approaches to study the
relationships between genotype and phenotype. In particular, it is hoped that the use
of high-density markers applied to population samples will enable progress to be
made with regard to more complex diseases. The first objective of this thesis is to
explore the potential for the equine SNP chip to enable such studies to be performed
in the horse. The second objective is to investigate the genetic background of
osteochondrosis (OC) in the horse. These objectives have been tackled using 348
Thoroughbreds from the US, divided into cases and controls, and a further 836 UK
Thoroughbreds, the majority with no phenotype data. All horses had been genotyped
with the Illumina Equine SNP50 BeadChip.
Linkage disequilibrium (LD) is the non-random association of alleles at
neighbouring loci. The reliance of many genomic methodologies on LD between
neutral markers and causal variants makes it an important characteristic of genome
structure. In this thesis, the genomic data has been used to study the extent of LD in
the Thoroughbred and the results considered in terms of genome coverage. Results
suggest that the SNP chip offers good coverage of the genome. Published theoretical
relationships between LD and historical effective population size (Ne) were exploited
to enable accuracy predictions for genome-wide evaluation (GWE) to be made. A
subsequent in-depth exploration of this theory cast some doubt on the reliability of
this approach in the estimation of Ne, but the general conclusion that the
Thoroughbred population has a small Ne, which should enable GWE to be carried out
efficiently in this population, remains valid. In the course of these studies, possible
errors embedded within the current sequence assembly were identified using
empirical approaches.
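As a rough illustration of the quantities discussed above, the following Python sketch computes the LD coefficient D and the squared correlation r2 between two biallelic loci from phased haplotypes coded 0/1. It also inverts one commonly cited approximation, Sved's E[r2] ≈ 1/(1 + 4·Ne·c), to obtain an Ne value from a mean r2 at recombination distance c (in Morgans). The function names and the toy data are hypothetical, no sample-size correction is applied, and whether the thesis used this particular relationship is an assumption, not something stated in the abstract.

```python
import numpy as np


def ld_r2(hap_a, hap_b):
    """Return (D, r2) for two biallelic loci given phased 0/1 haplotype vectors."""
    a = np.asarray(hap_a, dtype=float)
    b = np.asarray(hap_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("haplotype vectors must have the same length")
    p_a = a.mean()         # frequency of allele 1 at locus A
    p_b = b.mean()         # frequency of allele 1 at locus B
    p_ab = (a * b).mean()  # frequency of the haplotype carrying 1 at both loci
    d = p_ab - p_a * p_b   # deviation from independence (D)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


def ne_from_r2(mean_r2, c_morgans):
    """Invert the assumed approximation E[r2] = 1 / (1 + 4 * Ne * c)."""
    return (1.0 / mean_r2 - 1.0) / (4.0 * c_morgans)


# Toy data (hypothetical): four haplotypes, two loci in complete association.
d, r2 = ld_r2([0, 0, 1, 1], [0, 0, 1, 1])
print(d, r2)
print(ne_from_r2(0.2, 0.01))
```

Under this approximation, a higher mean r2 at a given recombination distance implies a smaller Ne, in line with the general conclusion reported for the Thoroughbred.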
Osteochondrosis is a developmental orthopaedic disease which affects the joints of
young horses. Osteochondrosis is considered multifactorial in origin with a variety
of environmental factors and heredity having been implicated. In this thesis, a
genome-wide association study was carried out to identify quantitative trait loci
(QTL) associated with OC. A single SNP was found to be significantly associated
with OC. The low heritability of OC combined with the apparent lack of major QTL
suggests GWE as an alternative approach to tackle this disease. A GWE analysis
was carried out on the same dataset but the resulting genomic breeding values had no
predictive ability for OC status. This, combined with the small number of significant
QTL, indicates a lack of power which could be addressed in the future by increasing
sample size. An alternative to genotyping more horses on the 50K SNP chip would
be to use a low-density SNP panel and impute remaining genotypes. The final
chapter of this thesis examines the feasibility of this approach in the Thoroughbred.
Results suggest that genotyping only a subset of samples at high density and the
remainder at lower density could be an effective strategy to enable greater progress
to be made in the arena of equine genomics. Finally, this thesis provides an outlook
on the future for genomics in the horse.
L.J. Corbin, J.A. Woolliams, "Data relating to Laura Corbin PhD" (2016), Edinburgh DataVault [see 2nd link below