Haplotype estimation in polyploids using DNA sequence data

Author: Motazedi Ehsan
Publication venue: Wageningen University
Publication date: 01/01/2019

Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy is common in the plant kingdom, including in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are linked on the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.
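
To make the idea of scoring haplotypes against overlapping reads concrete, the following is a minimal sketch using the minimum error correction (MEC) criterion, a common objective in read-based phasing. It is an illustration under stated assumptions, not the thesis's algorithm: the read encoding, the function names `mismatches` and `mec_score`, and the toy data are all hypothetical, and the score ignores read-count variation and population information.

```python
from typing import Dict, List, Sequence

# A read maps variant index -> observed allele (0 = reference, 1 = alternative).
Read = Dict[int, int]


def mismatches(read: Read, haplotype: Sequence[int]) -> int:
    """Count the positions where a read disagrees with a haplotype."""
    return sum(1 for pos, allele in read.items() if haplotype[pos] != allele)


def mec_score(reads: List[Read], haplotypes: List[Sequence[int]]) -> int:
    """Assign each read to its best-matching haplotype and sum the mismatches."""
    return sum(min(mismatches(read, hap) for hap in haplotypes) for read in reads)


# Hypothetical tetraploid example (k=4) over four variants.
reads = [{0: 0, 1: 0}, {1: 0, 2: 1}, {0: 1, 1: 1}, {2: 0, 3: 0}, {2: 0, 3: 1}]
candidate_a = [(0, 0, 1, 1), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 1)]
candidate_b = [(0, 1, 0, 1), (1, 0, 1, 0), (0, 1, 1, 0), (1, 0, 0, 1)]

score_a = mec_score(reads, candidate_a)
score_b = mec_score(reads, candidate_b)
print(score_a, score_b)
```

A lower score means the candidate set of k haplotypes explains the reads with fewer assumed sequencing errors, so in this toy case `candidate_a` would be preferred over `candidate_b`.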

Wageningen University & Research Publications

Quantifying recent variation and relatedness in human populations

Author: Gusev Alexander
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2012

Advances in the genetic analysis of humans have revealed a surprising abundance of local relatedness between purportedly unrelated individuals. Where common mutations classically inform us of ancient relationships, such segments of pairwise identical by descent (IBD) sharing from a common ancestor are the observable traces of recent inter-mating. Combining these two distinct sources of information can help disentangle the complex genetic structure and flux in human populations. When considered together with a heritable trait, the segments can also be used to interrogate unascertained rare variation and help in locating trait-affecting loci. This work presents methods for comprehensive analysis of population-wide IBD and explores applications to disease and the understanding of recent genetic variation. We propose several strategies for efficient detection of IBD segments in population genotype data. Our novel seed-based algorithm, GERMLINE, can reduce the computational burden of finding pairwise segments from quadratic to nearly linear time in a general population. We demonstrate that this approach is several orders of magnitude faster than the available all-pairs methods while maintaining higher accuracy. Next, we extended the GERMLINE technique to process cohorts of unlimited size by adaptively adjusting the search mechanism to meet resource restrictions. We confirm its effectiveness with an analysis of 50,000 individuals where contemporary methods can only process a few thousand. One drawback of these two algorithms is the dependence on phased haplotype data as input, a constraint that becomes more difficult to satisfy with large populations. We propose a solution to this problem with an algorithm that analyzes genotype data directly by exploring all potential haplotypes and scoring each putative segment based on linkage-disequilibrium. 
This solution significantly outperforms available methods when applied to full sequence data and is computationally efficient enough to analyze thousands of sequenced genomes where current methods can only determine haplotypes for several hundred. Secondly, we outline two algorithms for analyzing available IBD segments to increase our understanding of rare variation and complex disease. Motivated by whole-genome sequencing, we present the INFOSTIP algorithm, which uses IBD segments to optimize the selection of individuals for complete population ascertainment. In simulations, we show that INFOSTIP selection can significantly increase variant inference accuracy over random sampling and posit inference of 60% of an isolated population from 1% optimally selected individuals. Seeking to move beyond pairwise IBD segment analysis, we describe the DASH algorithm, which groups shared segments into IBD "clusters" that are likely to be commonly co-inherited and uses them as proxies for un-typed variation. In simulated disease studies, we show this reference-free approach to be much more powerful for detecting rare causal variants than either traditional single-marker analysis or imputation from a general reference panel. Applying the DASH algorithm to disease traits from different populations, we identify multiple novel loci of association. Together, these novel techniques integrate the power of population and disease genetics
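
The seed-based idea described in this abstract can be sketched as hashing fixed-size windows of phased haplotypes so that only samples falling into the same bucket are compared. This is a simplified illustration, not the GERMLINE implementation: the function name `seed_matches`, the string encoding of haplotypes, and the sample data are assumptions, and the sketch omits the extension of seeds into full segments and any tolerance for mismatches.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def seed_matches(haplotypes: Dict[str, str], window: int) -> Dict[Tuple[str, str], List[int]]:
    """Return, for each pair of samples, the window start positions at which
    their haplotypes are identical (candidate IBD seeds)."""
    n_markers = len(next(iter(haplotypes.values())))
    matches: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for start in range(0, n_markers - window + 1, window):
        # Bucket samples by the exact allele string in this window.
        buckets: Dict[str, List[str]] = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        # Only samples sharing a bucket are paired, avoiding an all-pairs scan.
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].append(start)
    return dict(matches)


# Hypothetical phased haplotypes over eight biallelic markers.
example = {"A": "01100101", "B": "01101010", "C": "11110101"}
seeds = seed_matches(example, window=4)
print(seeds)
```

Bucketing is linear in the number of samples per window; pairs are enumerated only within a bucket, which stays cheap as long as few samples share any given window.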

Columbia University Academic Commons

Genotype imputation as a genomic strategy for the SA Drakensberger beef breed

Author: Lashmar Simon Frederick
Publication venue: University of Pretoria - Department of Philosophy
Publication date: 01/01/2020

Indigenous breeds such as the South African (SA) Drakensberger are economically important genetic resources in local beef production because of their adaptive traits and ability to perform competitively at a commercial level. Genomic selection (GS) is a promising technology to accelerate genetic progress in traits relevant to commercial beef production. A major obstacle in applying this methodology has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNPs). Cost reduction can be achieved by exploiting genotype imputation in GS workflows by means of genotyping at lower densities and imputing upwards. The overarching aim of this study was to investigate the practicality of applying imputation in such a workflow, utilizing genotypic data for 1 135 SA Drakensberger animals genotyped for 139 480 SNPs. As a pre-imputation step, the first objective was to elucidate inter- and intra-chromosomal patterns in genomic characteristics that may contribute to variability in achievable imputation accuracy across the genome. Inter-chromosomal differences in the estimated proportion of low minor allele frequency (MAF) SNPs varied from 6.6% for Bos taurus autosome (BTA) 23 to 16.0% for BTA14. Pairwise linkage disequilibrium (LD) between adjacent SNPs ranged from r²=0.11 (BTA28) to 0.17 (BTA14). The largest run of homozygosity (ROH), located on BTA13, was 225.82 kilobases (kb) in length and was identified in 23% of the animals sampled. The estimated ROH-based inbreeding coefficients (F_ROH) (e.g. F_ROH>1Mb=0.07, where F_ROH>1Mb denotes F_ROH calculated for all ROH longer than 1 megabase pair) indicated sufficient within-breed relatedness to achieve accurate imputation. During the imputation step, imputation accuracies from several custom-derived lower-density panels varying in SNP density and SNP selection strategy were compared. 
Imputation accuracy increased as SNP density increased; a genotyping panel consisting of 10 000 SNPs, selected based on a combination of their MAF and LD with neighbouring SNPs, could be used to achieve <3% imputation error on average. At this density of SNPs, a mean correlation coefficient (±standard deviation) between true and imputed SNPs of 0.972±0.024 was achieved in a set of validation animals (n=235). Low MAF SNPs were imputed with lower accuracy; a difference of 0.071 units was observed between the mean accuracy of imputed SNPs categorized into low (0.01<MAF≤0.1) versus high MAF (0.4<MAF<0.5) classes. Post-imputation, the utility of imputed genotypes in genomic breeding value (GEBV) estimation was evaluated by comparing prediction accuracies achieved from the use of true versus imputed SNPs in generating the H-inverse matrix applied in single-step GS. Breeding values were estimated for two growth traits, considering direct and maternal components. Prediction accuracies were improved by using genomic information in addition to traditional pedigree information; the largest improvement (0.026 units increase in accuracy) was observed for maternal birth weight. Marginal differences were observed between GEBV accuracies produced from true (GEBV_TRUE) versus imputed genotypes (GEBV_IMPUTED); for example, a mean±standard deviation accuracy of GEBV_TRUE=0.774±0.056 versus GEBV_IMPUTED=0.773±0.055 was observed for direct birth weight, suggesting that imputation errors had an almost negligible influence. Results presented in this thesis demonstrated the usefulness of imputation as a viable genomic strategy towards low-cost implementation of genomically enhanced prediction of EBVs for a breed such as the SA Drakensberger. Thesis (PhD), University of Pretoria, 2020. Animal and Wildlife Sciences. PhD (Animal Science). Unrestricte
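
The two accuracy measures mentioned in this abstract, the imputation error rate and the correlation between true and imputed genotypes, can be sketched as follows. This is a minimal illustration assuming genotypes coded as allele counts (0, 1, 2); the function names and the toy data are hypothetical, and the thesis may compute these quantities per SNP or per animal rather than over a single pooled vector.

```python
from math import sqrt
from typing import Sequence


def imputation_error_rate(true: Sequence[int], imputed: Sequence[int]) -> float:
    """Fraction of genotypes where the imputed call differs from the true call."""
    wrong = sum(1 for t, i in zip(true, imputed) if t != i)
    return wrong / len(true)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


# Hypothetical genotypes (allele counts) for ten SNPs of one validation animal.
true_genotypes = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
imputed_genotypes = [0, 1, 2, 1, 0, 2, 1, 2, 0, 2]

error_rate = imputation_error_rate(true_genotypes, imputed_genotypes)
accuracy = pearson(true_genotypes, imputed_genotypes)
print(error_rate, round(accuracy, 3))
```

The correlation penalises errors at low-MAF SNPs more heavily than the raw error rate does, which is one reason both measures are often reported together.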

UPSpace at the University of Pretoria

Genomic evaluation considering the mosaic genome of the crossbred pig

Author: Sevillano Claudia A.
Publication venue: Wageningen University

In pigs, the breeding goal is to improve performance of crossbred (CB) animals in commercial farms. The best purebred (PB) animals to produce CB animals can be selected based on their genomic estimated breeding value (GEBV) for CB performance. GEBVs are the result of combining estimated effects of single nucleotide polymorphisms (SNPs) with the animal’s genotype. Using CB genomic information allows SNP allele effects to be estimated while accounting for the CB genetic background. The genome of CB animals is a mosaic of genomic regions inherited from the different parental breeds. This thesis therefore aimed to investigate whether SNP alleles have different effects depending on the parental breed from which the allele was inherited, and to study the impact on GEBVs of PB animals for CB performance when breed-specific allele effects were taken into consideration. Firstly, I showed that around 94% of alleles of three-way CB pigs can be assigned a breed of origin. Knowing this allowed me to implement a model that accounts for breed-specific effects of all SNP alleles. Using results of this model, I showed that estimated effects and explained variance of SNPs strongly associated with CB performance differ depending on the parental breed from which they were inherited; however, the majority of the genomic regions are not or only weakly associated with CB performance. Therefore, I implemented a new model that estimates breed-specific effects only for alleles of SNPs strongly associated with CB performance, and for the rest of the SNPs assumes that allele effects are the same across breeds. Differences in prediction accuracies between models were generally small. When the estimated genetic correlation between the performance of PB and CB animals per breed of origin differed largely across models, it was better to use models that distinguish alleles according to their breed of origin and whether or not they were associated with the trait.
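
The statement that a GEBV combines estimated SNP effects with an animal's genotype, and the extension to breed-specific allele effects, can be illustrated with a short sketch. It assumes a purely additive model with allele counts and known breed of origin per allele; the function names, the breed labels "A" and "B", and all effect values are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence, Tuple


def gebv(genotype: Sequence[int], effects: Sequence[float]) -> float:
    """Sum of allele count times estimated allele effect over all SNPs."""
    return sum(count * effect for count, effect in zip(genotype, effects))


def gebv_breed_specific(
    alleles: List[Tuple[Tuple[int, str], Tuple[int, str]]],
    effects_by_breed: Dict[str, Sequence[float]],
) -> float:
    """Each SNP carries two alleles, each tagged with its breed of origin;
    the effect applied to an allele depends on that breed."""
    total = 0.0
    for snp_index, pair in enumerate(alleles):
        for allele, breed in pair:
            total += allele * effects_by_breed[breed][snp_index]
    return total


# Hypothetical effects for three SNPs, identical across breeds.
common_effects = [0.5, -0.25, 0.125]
value_common = gebv([2, 1, 0], common_effects)

# The same animal with alleles tagged by hypothetical breed of origin "A" or "B".
tagged = [((1, "A"), (1, "B")), ((1, "A"), (0, "B")), ((0, "A"), (0, "B"))]
breed_effects = {"A": [0.5, -0.25, 0.125], "B": [0.25, -0.5, 0.0]}
value_breed = gebv_breed_specific(tagged, breed_effects)
print(value_common, value_breed)
```

The two values differ only because the allele inherited from breed "B" at the first SNP is given a smaller effect, which is the kind of difference the breed-of-origin models are designed to capture.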

Wageningen University & Research Publications

Methods and Applications for Collection, Contamination Estimation, and Linkage Analysis of Large-scale Human Genotype Data

Author: Zajac Gregory
Publication date: 01/01/2020

In recent decades statistical genetics has contributed substantially to our knowledge of human health and biology. This research has many facets: from collecting data, to cleaning, to analyzing. As the scope of the scientific questions considered and the scale of the data continue to increase, these bring additional challenges to every step of the process. In this dissertation, I describe novel approaches for each of these three steps, focused on the specific problems of participant recruitment and engagement, DNA contamination estimation, and linkage analysis with large data sets. In Chapter 1, we introduce the subject of this dissertation and how it fits with other developments in the generation, analysis and interpretation of human genetic data. In Chapter 2, we describe Genes for Good, a new platform for engaging a large, diverse participant pool in genetics research through social media. We developed a Facebook application where participants can sign up, take surveys related to their health, and easily invite interested friends to join. After participants complete a required number of these surveys, we send them a spit kit to collect their DNA. In a statistical analysis of 27,000 individuals from all over the United States genotyped in our study, we replicated health trends and genetic associations, showing the utility of our approach and the accuracy of the self-reported phenotypes we collected. In Chapter 3, we introduce VICES (Verify Intensity Contamination from Estimated Sources), a statistical method for joint estimation of DNA contamination and its sources in genotyping arrays. Genotyping array data are typically highly accurate but sensitive to mixing of DNA samples from multiple individuals before or during genotyping. VICES jointly estimates the total proportion of contaminating DNA and identifies which samples it came from by regressing deviations in probe intensity for a sample being tested on the genotypes of another sample. 
Through analysis of array intensity and genotype data from HapMap samples and the Michigan Genomics Initiative, we show that our method reliably estimates contamination more accurately than existing methods and implicates problematic steps to guide process improvements. In Chapter 4, we propose Population Linkage, a novel approach to perform linkage analysis on genome-wide genotype data from tens of thousands of arbitrarily related individuals. Our method estimates kinship and identical-by-descent (IBD) segments between all pairs of individuals, fits them as variance components using Haseman-Elston regression, and tests for linkage. This chapter addresses how to iteratively assess evidence of linkage in large numbers of individuals across the genome, reduce repeated calculations, model relationships without pedigrees, and determine segregation of genomic segments between relatives using single-nucleotide polymorphism (SNP) genotypes. After applying our method to 6,602 individuals from the National Institute on Aging (NIA) SardiNIA study and 69,716 individuals from the Trøndelag Health Study (HUNT), we show that most of our signals overlapped with known GWAS loci and many of these could explain a greater proportion of the trait variance than the top GWAS SNP. In Chapter 5, we discuss the impact and future directions for the work presented in this dissertation. We have proposed novel approaches for gathering useful research data, checking its quality, and detecting associations in the investigation of human genetics. This work also serves as an example for thinking about the process of human genetic discovery from beginning to end as a whole and understanding the role of each part. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162998/1/gzajac_1.pd
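
The Haseman-Elston regression mentioned in Chapter 4 can be illustrated in its classic pairwise form: the squared trait difference of each pair of individuals is regressed on the proportion of the genome they share IBD at a locus, and a negative slope suggests linkage. This is a sketch of that textbook form only, not the Population Linkage implementation (which fits kinship and IBD as variance components); the function names and the toy pairs are hypothetical.

```python
from typing import List, Sequence, Tuple


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0:
        raise ValueError("IBD sharing does not vary across pairs")
    return sxy / sxx


def haseman_elston_slope(pairs: List[Tuple[float, float, float]]) -> float:
    """pairs holds (ibd_proportion, trait_i, trait_j) for each pair of individuals."""
    ibd = [p[0] for p in pairs]
    squared_diff = [(p[1] - p[2]) ** 2 for p in pairs]
    return ols_slope(ibd, squared_diff)


# Hypothetical pairs: the more IBD sharing, the more similar the trait values.
toy_pairs = [
    (0.0, 1.0, 3.0), (0.5, 1.0, 2.0), (1.0, 2.0, 2.0),
    (0.0, 0.0, 2.0), (0.5, 3.0, 2.0), (1.0, 1.0, 1.0),
]
slope = haseman_elston_slope(toy_pairs)
print(slope)
```

In a genome scan the same regression would be repeated at each locus with locus-specific IBD estimates, which is where reusing intermediate calculations becomes important at the sample sizes described above.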

Deep Blue Documents at the University of Michigan

Computational methods for single cell RNA and genome assembly resolution using genetic variation

Author: Heaton William
Publication venue: University of Cambridge
Publication date: 09/12/2021

Genetic variation and natural selection have driven the evolutionary history on this planet and are responsible for creating us and all other life as we know it. Over the past several decades, the genomic revolution has allowed us to assess population variation across humans and other species and use that to link genotypes with phenotypes and infer evolutionary histories. In this thesis, I explore computational methods for using genetic variation to demultiplex and disambiguate complex data. In single cell RNAseq, problems of batch effects, doublets, and ambient RNA are each sources of noise that impede our ability to infer the functional states of cells and compare them between experiments. One popular new experimental design promising to solve each of these while also reducing experimental costs is mixing multiple individuals' cells into a single experiment. In chapter 2, I present a method for clustering cells by genotype, calling doublets, and using the cross-genotype signal in singletons to estimate and remove ambient RNA. I compare this method to other existing methods including one that requires a priori information about the genotypes, and two which do not. I find that my method outperforms each of these methods across a wide range of data parameters and sample types. In genome assembly, the recent higher throughput and lower cost of long read sequencing has revolutionized our ability to create reference quality genomes and has revitalized the assembly community. Now, massive efforts are taking place in the Darwin Tree of Life project and the Earth Biogenome project to create reference genomes for all multicellular eukaryotic life. This will create a scientific resource for the next generation of biological science, will serve as a conservation of data that could otherwise be lost in this time of mass extinction, and will allow for a much broader understanding of evolution and the evolutionary history of life on Earth. 
While much progress has been made in data quality and assembly algorithms, some problems still exist. Until recently, the DNA input requirements for long read sequencing technologies made it impossible to sequence single individuals of these species with long reads. Also, high heterozygosity makes assembly more difficult due to the inherent ambiguity between heterozygous sequence versus paralogous sequence when confronted with inexact homology. One solution to the DNA input requirements would be to pool individuals, but this only increases the heterozygosity of the sample and reduces assembly quality. In chapter 3, we present the first high quality assembly of a single mosquito using new library preparation methods with reduced DNA requirements. This reduces the number of haplotypes to two, improving the assembly quality. In chapter 4, we further address the problems brought on by heterozygosity in assembly. I present a suite of tools that use the phasing consistency of multiple heterozygous sequences as a signal for physical linkage, thus using genetic variation to our advantage rather than as a challenge to overcome. These tools create phased, linked assemblies and phasing-aware scaffolding. Further, I provide a tool for phasing-aware scaffolding on existing assemblies. This includes a novel haplotype phasing algorithm with some unique beneficial properties. It is robust to non-heterozygous variants as input and can detect and correct those genotypes. It also extends naturally to polyploid genomes. Wellcome Trus

Apollo (Cambridge)

Arrays and beyond: Evaluation of marker technologies for chicken genomics

Author: Geibel Johannes
Publication venue: University Goettingen Repository
Publication date: 12/11/2021

A key research question in livestock research is how the phenotypic diversity of livestock is shaped by its genomic diversity. Genomic diversity is assessed through genomic markers. The use and definition of genomic markers is strongly technology-driven and therefore changes over time. In recent years, single nucleotide polymorphisms (SNPs) have become the main marker class. In addition, SNP arrays have been the genotyping technology of choice in recent years because of their early availability. They are, however, currently being partially displaced by whole-genome sequencing (WGS) for SNP calling. Further, structural variants (SVs) are increasingly moving into the focus of research. In this context, the thesis aims to evaluate the informative value of SNP markers in various ways, with its main focus on chickens as a diverse livestock species of major agricultural importance. In Chapter 1, the current knowledge of genomic variation, marker technologies, and their use in livestock sciences, especially in chickens, is reviewed. 
Chapters 2 and 3 then address a systematic error of SNP arrays, the SNP ascertainment bias. SNP ascertainment bias is a systematic shift of the allele frequency spectrum of SNP arrays towards more common SNPs due to the pre-selection of SNPs in a limited number of individuals from few populations. Chapter 2 aims to assess the magnitude of the bias for a standard chicken SNP array and the steps of array design that created the bias. In the study, we therefore re-enacted the design process of the chicken array based on (pooled) WGS of various chicken populations. This revealed a sequential reduction of rare alleles during the design process, which was mainly caused by the initial limitation of the discovery set and a later within-population selection of common SNPs while aiming for equidistant spacing. Increasing the size of the discovery set had the largest impact on limiting ascertainment bias. Other steps, such as validation of the SNPs in a broader set of populations, did not show relevant effects. Correction methods for ascertainment bias have so far mostly been infeasible in studies. Chapter 3 therefore proposes to use imputation of the array data to WGS level as an in silico correction method for the allele frequency spectrum. The study revealed that imputation is able to strongly reduce the effects of ascertainment bias, even when a very small reference panel was used. However, it also became clear that the reference panel then has the same effect as the discovery panel during array design. It is therefore crucial to select samples for the reference panel evenly across the intended range of populations. SVs are harder to call and genotype than SNPs. Therefore, the question arises whether effects of SVs are also captured by SNP-based studies, which would be the case if there is strong linkage disequilibrium (LD) between SNPs and SVs. This is assessed in Chapter 4 for three commercial chicken breeds, based on WGS data. 
The study showed that LD between deletions and SNPs was on the same level as LD between SNPs and other SNPs, indicating that deletion effects are captured by SNP marker panels as well as SNP effects are. LD between SNPs and other SVs was strongly reduced. The main factor for this reduction was local differences from SNPs in terms of minor allele frequency. However, a reduction of homozygous variant calls for non-deletion SVs compared to the Hardy-Weinberg expectation may indicate problems with the SV genotypers used. In the last chapter (Chapter 5), the impact of ascertainment bias and possibilities to deal with it in chicken genomics (and also more generally in livestock genomics) are discussed. Further, the potential of including SVs in studies is evaluated. The chapter also discusses what is necessary to combine the information from different genomic data sets to increase the informative value of studies. Finally, an outlook is given on what additional information will be available in the near future based on recent technological advances. 2022-01-1
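
The comparison of LD between SNPs and SVs rests on a pairwise LD statistic. The sketch below computes the standard r² between two biallelic variants from phased haplotypes, treating a deletion as just another 0/1 variant. It is an illustration under assumptions: the thesis does not state which estimator was used here, and the function name `ld_r2` and the example haplotypes are hypothetical.

```python
from typing import Sequence


def ld_r2(variant_a: Sequence[int], variant_b: Sequence[int]) -> float:
    """r^2 between two biallelic variants, given 0/1 alleles on the same haplotypes."""
    n = len(variant_a)
    p_a = sum(variant_a) / n
    p_b = sum(variant_b) / n
    p_ab = sum(a * b for a, b in zip(variant_a, variant_b)) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic variant")
    return d * d / denom


# Hypothetical alleles on eight haplotypes.
snp = [1, 1, 0, 0, 1, 0, 1, 0]
deletion = [1, 1, 0, 0, 1, 0, 1, 0]   # perfectly tagged by the SNP
other_sv = [1, 0, 1, 0, 1, 0, 1, 0]   # only weakly correlated with the SNP

r2_deletion = ld_r2(snp, deletion)
r2_other = ld_r2(snp, other_sv)
print(r2_deletion, r2_other)
```

Because r² is bounded by how closely the two allele frequencies match, differences in minor allele frequency between SNPs and SVs alone can lower the achievable LD, consistent with the explanation given in the abstract.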

Georg-August-University Göttingen

CONTRIBUTION TO LINKAGE AND ASSOCIATION MAPPING OF TRAIT LOCI IN LIVESTOCK.

Author: Zhang Zhiyan
Publication venue: Université de Liège, Liège, Belgique
Publication date: 01/11/2013

Until recently, breeding values were estimated based on phenotypes measured on the individual and its relatives, and the notion that the covariance between breeding values is proportional to the kinship coefficient. Advances in genomics now allow for direct analysis of the genome and identification of the loci that determine the breeding values of individuals. As a consequence, marker assisted selection and genomic selection have become more effective and are replacing conventional selection. The identification of loci influencing the traits of interest requires the use of advanced statistical methods that are constantly evolving. In the context of this thesis, we have (i) contributed to the development of gene mapping methods, (ii) applied these methods to map loci influencing both metric and meristic traits, and (iii) contributed to the development of methods for the integration of genomic information in livestock breeding and management. The mapping methods that we have helped develop are distinguished mainly by the fact that (i) they exploit haplotype information (by means of a hidden Markov model), which should increase the linkage disequilibrium with causative variants and hence detection power, (ii) they can simultaneously extract linkage information within families and linkage disequilibrium information across the population, (iii) they correct for population stratification by means of a random polygenic effect, and (iv) they can be applied to binary as well as quantitative traits. We have applied these and other methods to map loci influencing (i) quantitative hematological parameters in a porcine line-cross, and (ii) binary traits including diseases in bovine and non-syntenic Copy Number Variants in cattle, horse and human. Finally, we have contributed to the development of methods for the utilization of marker information in animal selection and production. 
We have extended the haplotype-based mapping method to allow imputation and have evaluated the utility of this approach in scenarios mimicking reality. We have also contributed to the development of a method to quantify somatic cell counts in the milk of individual cows by genotyping a sample of milk from the farm’s tank (hence a mixture of milk from all cows on the farm). Our work has resulted in the development of a software package (“GLASCOW”) that is increasingly used by the community to map genes influencing complex traits, primarily binary. By using this tool, we have contributed to the localization of several trait loci in pig, cattle, horse and human. We have contributed to the development of approaches that reduce the costs of genomic analyses in livestock by, on the one hand, complementing real SNP genotypes with genotypes obtained in silico by means of imputation, and, on the other hand, developing a method to deconvolute genotypes obtained on DNA pools

Open Repository and Bibliography - Liège

Analyse de la variation nucléotidique et structurale chez le soja par une approche de re-séquençage

Author: Torkamaneh Davoud
Publication venue: Bibliotheque de l' Universite Laval
Publication date: 01/01/2017

Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways, including through the development of new high-throughput genotyping methods that considerably accelerate the study of genome composition and function. As part of the SoyaGen project, funded by Genome Canada, we seek to better understand the genetic diversity and underlying architecture governing the major agronomic traits in soybean. Soybean is the world's most important oilseed crop in economic terms. In this study, we sought to exploit NGS technologies to help elucidate the genomic characteristics of soybean. To this end, three research axes formed the core of this thesis: 1) low-cost genome-wide genotyping, 2) exhaustive characterization of genetic variants by whole-genome resequencing, and 3) identification of mutations with high functional impact on the basis of strong selection within elite lines. A first challenge in genetic or genomic analysis is to enable rapid and inexpensive characterization of a large number of lines at a very large number of markers distributed throughout the genome. Genotyping-by-sequencing (GBS) allows simultaneous identification and genotyping of several thousand SNPs on a genome-wide scale. One of the major challenges in GBS analysis is to extract a large catalogue of high-quality SNPs from a mountain of sequencing data while minimizing the impact of missing data. As a first step, we greatly improved GBS by developing a new bioinformatics analysis pipeline, Fast-GBS, designed to call genotypes more accurately and faster than existing tools. In addition, we optimized tools for imputing missing data. We were thus able to obtain a catalogue of 60K SNP markers in a collection of 301 accessions intended to be representative of soybean diversity in Canada. Subsequently, all missing data (~50%) were imputed with a very high degree of accuracy (98%). This genetic characterization was performed at low cost, less than $15 per line. Second, to exhaustively characterize nucleotide and structural variation (SNVs and SVs, respectively) in the soybean genome, we sequenced the whole genome of 102 Canadian soybean accessions. We identified nearly 5M nucleotide variants (SNPs, MNPs and indels) with a high level of accuracy (98.6%). Then, using a combination of three different approaches, we detected ~92K SVs (deletions, insertions, inversions, duplications, CNVs and translocations) and estimated that more than 90% were accurate. This is the first time that a complete description of SNP and SV haplotype diversity has been produced for a cultivated species. Finally, we developed a systematic analytical approach to greatly facilitate the identification of genes whose alleles have undergone very strong selection during domestication and selection. This approach is based on two recent advances in genomics: 1) whole-genome sequencing and 2) prediction of mutations resulting in loss of function (LOF). Using this approach, we identified 130 candidate genes related to domestication or selection in soybean. This catalogue contains all of the previously characterized domestication genes in soybean, as well as some orthologues from other domesticated crop species. This list of genes provides many avenues of investigation for studies aimed at better understanding the genes that contribute strongly to shaping cultivated soybean. This thesis ultimately leads to a better understanding of the genomic characteristics of soybean. In addition, it provides several tools and genomic resources that could readily be used in future genomic research in soybean as well as in other species.
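The abstract above reports an imputation accuracy of 98% without saying how it was measured. One common way to obtain such a figure is to hide genotypes that are actually known, impute them, and count how often the imputed call matches the hidden one. The sketch below illustrates only that comparison step; the function name, the 0/1/2 genotype coding and the masking scheme are assumptions made for illustration, not the procedure used in the thesis.

```python
def imputation_concordance(true_calls, imputed_calls, masked_idx):
    """Fraction of deliberately masked genotypes that were imputed correctly.

    true_calls    -- known genotypes (hypothetical coding: 0, 1 or 2 alternate alleles)
    imputed_calls -- genotypes after imputation, same length and order
    masked_idx    -- positions that were hidden from the imputation tool
    """
    if not masked_idx:
        raise ValueError("at least one masked position is required")
    if len(true_calls) != len(imputed_calls):
        raise ValueError("call vectors must have the same length")
    # Count masked positions where the imputed call equals the known call.
    matches = sum(1 for i in masked_idx if true_calls[i] == imputed_calls[i])
    return matches / len(masked_idx)


# Toy data: positions 1 and 3 were masked; only position 1 was recovered.
truth = [0, 1, 2, 1, 0]
imputed = [0, 1, 2, 2, 0]
accuracy = imputation_concordance(truth, imputed, [1, 3])
```

Only the masked positions enter the denominator, so the untouched genotypes cannot inflate the estimate.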

CorpusUL

Application of genomic technologies to the horse

Author: Corbin Laura Jayne
Publication venue: The University of Edinburgh
Publication date: 01/01/2013
Field of study

The publication of a draft equine genome sequence and the release by Illumina of a 50,000 marker single-nucleotide polymorphism (SNP) genotyping chip has provided equine researchers with the opportunity to use new approaches to study the relationships between genotype and phenotype. In particular, it is hoped that the use of high-density markers applied to population samples will enable progress to be made with regard to more complex diseases. The first objective of this thesis is to explore the potential for the equine SNP chip to enable such studies to be performed in the horse. The second objective is to investigate the genetic background of osteochondrosis (OC) in the horse. These objectives have been tackled using 348 Thoroughbreds from the US, divided into cases and controls, and a further 836 UK Thoroughbreds, the majority with no phenotype data. All horses had been genotyped with the Illumina Equine SNP50 BeadChip. Linkage disequilibrium (LD) is the non-random association of alleles at neighbouring loci. The reliance of many genomic methodologies on LD between neutral markers and causal variants makes it an important characteristic of genome structure. In this thesis, the genomic data has been used to study the extent of LD in the Thoroughbred and the results considered in terms of genome coverage. Results suggest that the SNP chip offers good coverage of the genome. Published theoretical relationships between LD and historical effective population size (Ne) were exploited to enable accuracy predictions for genome-wide evaluation (GWE) to be made. A subsequent in-depth exploration of this theory cast some doubt on the reliability of this approach in the estimation of Ne, but the general conclusion that the Thoroughbred population has a small Ne which should enable GWE to be carried out efficiently in this population, remains valid. 
In the course of these studies, possible errors embedded within the current sequence assembly were identified using empirical approaches. Osteochondrosis is a developmental orthopaedic disease which affects the joints of young horses. Osteochondrosis is considered multifactorial in origin, with a variety of environmental factors and heredity having been implicated. In this thesis, a genome-wide association study was carried out to identify quantitative trait loci (QTL) associated with OC. A single SNP was found to be significantly associated with OC. The low heritability of OC combined with the apparent lack of major QTL suggests GWE as an alternative approach to tackle this disease. A GWE analysis was carried out on the same dataset, but the resulting genomic breeding values had no predictive ability for OC status. This, combined with the small number of significant QTL, indicates a lack of power, which could be addressed in the future by increasing sample size. An alternative to genotyping more horses on the 50K SNP chip would be to use a low-density SNP panel and impute the remaining genotypes. The final chapter of this thesis examines the feasibility of this approach in the Thoroughbred. Results suggest that genotyping only a subset of samples at high density and the remainder at lower density could be an effective strategy to enable greater progress to be made in the arena of equine genomics. Finally, this thesis provides an outlook on the future for genomics in the horse. L.J. Corbin, J.A. Woolliams, “Data relating to Laura Corbin PhD” (2016), Edinburgh DataVault [see 2nd link below
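The abstract above defines linkage disequilibrium (LD) as the non-random association of alleles at neighbouring loci. A standard way to quantify it between two biallelic loci is r², computed from the haplotype frequency p_AB and the allele frequencies p_A and p_B as D = p_AB - p_A·p_B and r² = D² / (p_A(1 - p_A)·p_B(1 - p_B)). The sketch below assumes phased haplotypes coded 0/1; the function name and the toy data are hypothetical, and the thesis may have estimated LD differently (for example from unphased SNP chip genotypes).

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic loci from phased haplotypes coded 0/1.

    hap_a, hap_b -- allele carried at locus A and locus B on each haplotype,
                    in the same haplotype order.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("need two non-empty haplotype vectors of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n  # frequency of allele 1 at locus B
    # Frequency of haplotypes carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # deviation from independent association
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Alleles always co-occur: complete LD.
complete = ld_r2([0, 1, 0, 1], [0, 1, 0, 1])
# Alleles combine independently: no LD.
none = ld_r2([0, 0, 1, 1], [0, 1, 0, 1])
```

Values near 1 mean a marker is a good proxy for its neighbour, which is why the extent of LD determines how well a fixed SNP panel covers the genome.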

Edinburgh Research Archive

Explore Bristol Research