
    Spatial normalization improves the quality of genotype calling for Affymetrix SNP 6.0 arrays

    Background: Microarray measurements are susceptible to a variety of experimental artifacts, some of which give rise to systematic biases that are spatially dependent in a unique way on each chip. It is likely that such artifacts affect many SNP arrays, but the normalization methods used in currently available genotyping algorithms make no attempt at spatial bias correction. Here, we propose an effective single-chip spatial bias removal procedure for Affymetrix 6.0 SNP arrays or platforms with similar design features. This procedure deals with both extreme and subtle biases and is intended to be applied before standard genotype calling algorithms. Results: Application of the spatial bias adjustments on HapMap samples resulted in higher genotype call rates with equal or even better accuracy for thousands of SNPs. Consequently, the normalization procedure is expected to lead to more meaningful biological inferences and could be valuable for genome-wide SNP analysis. Conclusions: Spatial normalization can potentially rescue thousands of SNPs in a genetic study at the small cost of computational time. The approach is implemented in R and available from the authors upon request.
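
    As an illustration of the kind of correction this abstract describes, the following Python sketch estimates a smooth spatial trend in per-probe log intensities over the chip grid and subtracts it before genotype calling. The median-filter smoother and the window size are assumptions chosen for illustration, not the authors' published procedure.

        # Minimal sketch of single-chip spatial bias removal (illustrative only):
        # estimate a slowly varying spatial trend across the chip grid and remove it.
        import numpy as np
        from scipy.ndimage import median_filter

        def spatially_normalize(log_intensity, window=51):
            """log_intensity: 2D array of per-probe log2 intensities laid out in
            chip (row, column) coordinates. Returns bias-corrected values."""
            # Centre the chip so the trend is estimated around zero.
            centred = log_intensity - np.nanmedian(log_intensity)
            # A wide median filter captures slowly varying spatial artifacts while
            # staying robust to the probe-to-probe signal itself.
            spatial_trend = median_filter(np.nan_to_num(centred), size=window)
            return log_intensity - spatial_trend

        # Hypothetical usage on a chip-shaped intensity matrix:
        # corrected = spatially_normalize(chip_log_intensities)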

    Accuracy of whole-genome sequence imputation using hybrid peeling in large pedigreed livestock populations

    Background: The coupling of appropriate sequencing strategies and imputation methods is critical for assembling large whole-genome sequence datasets from livestock populations for research and breeding. In this paper, we describe and validate the coupling of a sequencing strategy with the imputation method hybrid peeling in real animal breeding settings. Methods: We used data from four pig populations of different size (18,349 to 107,815 individuals) that were widely genotyped at densities between 15,000 and 75,000 markers genome-wide. Around 2% of the individuals in each population were sequenced, most of them at 1× or 2×, plus 37 to 92 individuals per population (284 in total) at 15-30×. We imputed whole-genome sequence data with hybrid peeling. We evaluated the imputation accuracy by removing the sequence data of the 284 individuals with high coverage, using a leave-one-out design. We also simulated data that mimicked the sequencing strategy used in the real populations and used regression trees to quantify the factors that affected the individual-wise and variant-wise imputation accuracies. Results: Imputation accuracy was high for the majority of individuals in all four populations (median individual-wise dosage correlation: 0.97). Imputation accuracy was lower for individuals in the earliest generations of each population than for the rest, due to the lack of marker array data for themselves and their ancestors. The main factors that determined the individual-wise imputation accuracy were the genotyping status, the availability of marker array data for immediate ancestors, and the degree of connectedness to the rest of the population, but sequencing coverage of the relatives had no effect. The main factors that determined variant-wise imputation accuracy were the minor allele frequency and the number of individuals with sequencing coverage at each variant site. The simulation results were validated against the empirical observations. Conclusions: We demonstrate that the coupling of an appropriate sequencing strategy and hybrid peeling is a powerful strategy for generating whole-genome sequence data with high accuracy in large pedigreed populations where only a small fraction of individuals (2%) had been sequenced, mostly at low coverage. This is a critical step for the successful implementation of whole-genome sequence data for genomic prediction and fine-mapping of causal variants. The authors acknowledge the financial support from the BBSRC ISPG to The Roslin Institute (BBS/E/D/30002275), from Genus plc, Innovate UK (Grant 102271), and from Grant numbers BB/N004736/1, BB/N015339/1, BB/L020467/1, and BB/M009254/1.
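
    A minimal Python sketch of the individual-wise accuracy metric used above: the dosage correlation between imputed genotypes and the genotypes called from the held-out high-coverage sequence data, one value per animal. The variable names and the leave-one-out loop in the trailing comment are hypothetical.

        # Individual-wise imputation accuracy as a dosage correlation (sketch).
        import numpy as np

        def dosage_correlation(imputed, truth):
            """imputed, truth: 1D arrays of allele dosages (0..2) over the same
            variant sites for one individual. Sites missing in either are dropped."""
            keep = ~(np.isnan(imputed) | np.isnan(truth))
            if keep.sum() < 2:
                return np.nan
            return np.corrcoef(imputed[keep], truth[keep])[0, 1]

        # Hypothetical leave-one-out evaluation over the high-coverage individuals:
        # accuracies = {ind: dosage_correlation(imputed_dosages[ind], sequence_calls[ind])
        #               for ind in high_coverage_ids}
        # np.nanmedian(list(accuracies.values()))  # compare with the reported median of 0.97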

    Copynumber: Efficient algorithms for single- and multi-track copy number segmentation.

    BACKGROUND: Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number. RESULTS: A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented. CONCLUSIONS: The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.
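
    A generic Python sketch of the principle stated above: segmentation that is optimal in the least squares sense conditional on the number of breakpoints. This plain O(K*n^2) dynamic program is for illustration only and is not the package's vectorized R algorithm; the toy example at the end is hypothetical.

        # Optimal K-segment least-squares segmentation by dynamic programming (sketch).
        import numpy as np

        def segment(y, n_segments):
            """Split y into n_segments pieces minimizing the within-segment sum of
            squared deviations from the segment means. Returns breakpoint start indices."""
            y = np.asarray(y, dtype=float)
            n = len(y)
            s1 = np.concatenate(([0.0], np.cumsum(y)))      # prefix sums
            s2 = np.concatenate(([0.0], np.cumsum(y * y)))  # prefix sums of squares

            def cost(i, j):  # sum of squared deviations on y[i..j], inclusive, 0-based
                m = j - i + 1
                seg_sum = s1[j + 1] - s1[i]
                return (s2[j + 1] - s2[i]) - seg_sum * seg_sum / m

            dp = np.full((n_segments + 1, n), np.inf)
            back = np.zeros((n_segments + 1, n), dtype=int)
            dp[1] = [cost(0, j) for j in range(n)]
            for k in range(2, n_segments + 1):
                for j in range(k - 1, n):
                    for i in range(k - 1, j + 1):  # segment k covers y[i..j]
                        c = dp[k - 1][i - 1] + cost(i, j)
                        if c < dp[k][j]:
                            dp[k][j], back[k][j] = c, i
            # Trace back the start index of each segment after the first.
            starts, j, k = [], n - 1, n_segments
            while k > 1:
                i = back[k][j]
                starts.append(i)
                j, k = i - 1, k - 1
            return sorted(starts)

        # e.g. segment(np.r_[np.zeros(50), np.ones(30), np.zeros(40)]
        #              + 0.1 * np.random.randn(120), 3)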

    Genotype Imputation with Thousands of Genomes

    Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package
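
    A minimal Python sketch of the core approximation described above: for each study haplotype, rank reference haplotypes by local sequence similarity within a genomic window and keep only the closest ones as a custom panel. The Hamming-distance measure and the panel size k are illustrative assumptions, not the IMPUTE2 implementation.

        # Per-window custom reference panel selection by local similarity (sketch).
        import numpy as np

        def custom_panel(study_hap, reference_haps, k=500):
            """study_hap: 1D 0/1 array over the window's SNPs.
            reference_haps: 2D array (n_reference x n_snps) of 0/1 alleles.
            Returns indices of the k most similar reference haplotypes."""
            hamming = (reference_haps != study_hap).sum(axis=1)
            return np.argsort(hamming)[:k]

        # Hypothetical usage per window, per study haplotype:
        # panel_idx = custom_panel(hap, ref_matrix, k=500)
        # run the HMM-based imputation against ref_matrix[panel_idx] only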

    Arrays and beyond: Evaluation of marker technologies for chicken genomics

    A key research question in livestock research is how livestock's phenotypic diversity is shaped by its genomic diversity. Genomic diversity is assessed through genomic markers, and the use and definition of genomic markers is strongly technology driven and therefore changes over time. In recent years, single nucleotide polymorphisms (SNPs) have become the main marker class, and SNP arrays have been the genotyping technology of choice due to their early availability. They are, however, currently being partially displaced by whole-genome sequencing (WGS) for SNP calling, and structural variants (SVs) are moving more and more into the focus of research. In this context, the thesis aims to evaluate the value of SNP markers in various ways, with its main focus on chickens as a diverse livestock species of major agricultural value. Chapter 1 reviews the current knowledge of genomic variation, marker technologies, and their use in livestock science, especially in chickens. Chapters 2 and 3 then address a systematic error of SNP arrays, the SNP ascertainment bias: a systematic shift of the allele frequency spectrum of SNP arrays towards more common SNPs caused by the pre-selection of SNPs in a limited number of individuals from few populations. Chapter 2 assesses the magnitude of the bias for a standard chicken SNP array and identifies the steps of the array design that created it. In the study, we therefore remodeled the design process of the chicken array based on (pooled) WGS of various chicken populations. This revealed a sequential reduction of rare alleles during the design process, mainly caused by the initial limitation of the discovery set and a later within-population selection of common SNPs while aiming for equidistant spacing. Increasing the discovery set had the largest impact on limiting ascertainment bias, whereas other steps, such as validating the SNPs in a broader set of populations, showed no relevant effects. Correction methods for ascertainment bias are often not feasible in practice. Chapter 3 therefore proposes imputation of the array data to WGS level as an in silico correction method for the allele frequency spectrum. The study revealed that imputation can strongly reduce the effects of ascertainment bias even when a very sparse reference panel is used. However, it also became clear that the reference panel then has the same effect as the discovery panel during array design, so it is crucial to select the samples for the reference panel evenly across the intended range of populations. SVs are harder to call and genotype than SNPs, which raises the question whether the effects of SVs are nevertheless captured by SNP-based studies through strong linkage disequilibrium (LD) between SNPs and SVs. This is assessed in Chapter 4 for three commercial chicken breeds based on WGS data. The study showed that LD between deletions and SNPs was on the same level as LD between SNPs and other SNPs, indicating that deletion effects are captured by SNP marker panels as well as SNP effects are. LD between SNPs and other SVs was strongly reduced, mainly because of local differences to SNPs in minor allele frequency. However, a reduction of homozygous variant calls for non-deletion SVs compared with the Hardy-Weinberg expectation may indicate problems with the SV genotypers used. The last chapter (Chapter 5) discusses the impact of ascertainment bias and ways to deal with it in chicken genomics (and in livestock genomics more generally), evaluates the potential of including SVs in studies, discusses what is needed to combine information from different genomic data sets to increase the value of analyses, and gives an outlook on what additional information will become available in the near future based on recent technological advances.
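
    A small Python sketch of how the allele-frequency shift described above can be quantified: compare the folded (minor-allele-frequency) spectrum of array SNPs with that of all WGS variants. The bin count and the input frequency vectors are hypothetical.

        # Folded site-frequency spectrum comparison to expose ascertainment bias (sketch).
        import numpy as np

        def folded_sfs(allele_freqs, bins=10):
            """allele_freqs: array of alternative-allele frequencies in (0, 1).
            Returns the proportion of variants per minor-allele-frequency bin."""
            maf = np.minimum(allele_freqs, 1.0 - allele_freqs)
            counts, _ = np.histogram(maf, bins=bins, range=(0.0, 0.5))
            return counts / counts.sum()

        # Ascertainment bias shows up as a deficit in the low-MAF bins of the array:
        # sfs_wgs   = folded_sfs(wgs_alt_freqs)      # hypothetical WGS frequencies
        # sfs_array = folded_sfs(array_alt_freqs)    # hypothetical array SNP frequencies
        # print(sfs_array[0] / sfs_wgs[0])           # ratio << 1 means rare alleles are missing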

    Identification of gene-gene interactions for Alzheimer's disease using co-operative game theory

    Thesis (Ph.D.)--Boston University. The multifactorial nature of Alzheimer's disease (AD) suggests that complex gene-gene interactions are present in AD pathways. Contemporary approaches to detecting such interactions in genome-wide data are mathematically and computationally challenging. We investigated gene-gene interactions for AD using a novel algorithm based on cooperative game theory in 15 genome-wide association study (GWAS) datasets comprising a total of 11,840 AD cases and 10,931 cognitively normal elderly controls from the Alzheimer Disease Genetics Consortium (ADGC). We adapted this approach, which was developed originally for solving multi-dimensional problems in economics and the social sciences, to compute a Shapley value statistic that identifies the genetic markers contributing most to coalitions of SNPs in predicting AD risk. Treating each GWAS dataset as an independent discovery set, markers were ranked according to their contribution to coalitions formed with other markers. Using a backward elimination strategy, markers with low Shapley values were eliminated and the statistic was recalculated iteratively. We tested all two-way interactions between top Shapley markers in regression models that included the two SNPs (main effects) and a term for their interaction. Models yielding a p-value < 0.05 for the interaction term were evaluated in each of the other datasets, and the results from all datasets were combined by meta-analysis. Statistically significant interactions were observed with multiple marker combinations in the APOE region. My analyses also revealed statistically strong interactions between markers in six regions: CTNNA3-ATP11A (p=4.1E-07), CSMD1-PRKCQ (p=3.5E-08), DCC-UNC5CL (p=5.9E-08), CNTNAP2-RFC3 (p=1.16E-07), AACS-TSHZ3 (p=2.64E-07), and CAMK4-MMD (p=3.3E-07). The Shapley value algorithm outperformed chi-square and ReliefF in detecting the known interaction between APOE and GAB2 in a previously published GWAS dataset. It was also more accurate than competing filtering methods in identifying simulated epistatic SNPs that are additive in nature, but its accuracy was low in identifying non-linear interactions. The game theory algorithm revealed strong interactions between markers in novel genes with weak main effects, which would have been overlooked if only markers with strong marginal association with AD had been tested. This method will be a valuable tool for identifying gene-gene interactions for complex diseases and other traits.
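
    A Python sketch of the coalition idea described above: Shapley values for SNP markers approximated by Monte Carlo sampling over permutations, where a coalition's value is the cross-validated accuracy of a simple classifier. The classifier, the permutation count, and the chance-level baseline are placeholder assumptions, not the pipeline used in the thesis.

        # Monte Carlo approximation of per-SNP Shapley values (sketch).
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        def coalition_value(genotypes, labels, snps):
            """Value of a coalition of SNP columns: cross-validated case/control accuracy."""
            if not snps:
                return 0.5  # no markers: chance-level prediction (assumed baseline)
            X = genotypes[:, list(snps)]
            return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=3).mean()

        def shapley_values(genotypes, labels, n_permutations=50, seed=None):
            rng = np.random.default_rng(seed)
            n_snps = genotypes.shape[1]
            phi = np.zeros(n_snps)
            for _ in range(n_permutations):
                order = rng.permutation(n_snps)
                coalition, prev = [], coalition_value(genotypes, labels, [])
                for snp in order:
                    coalition.append(snp)
                    value = coalition_value(genotypes, labels, coalition)
                    phi[snp] += value - prev  # marginal contribution of this SNP
                    prev = value
            return phi / n_permutations

        # Backward elimination as in the abstract: iteratively drop the lowest-phi
        # markers and recompute until a small set of high-contribution SNPs remains.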

    Deep-coverage whole genome sequences and blood lipids among 16,324 individuals.

    Large-scale deep-coverage whole-genome sequencing (WGS) is now feasible and offers potential advantages for locus discovery. We perform WGS in 16,324 participants from four ancestries at mean depth >29X and analyze genotypes with four quantitative traits: plasma total cholesterol, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol, and triglycerides. Common variant association yields known loci, except for a few variants that were previously poorly imputed. Rare coding variant association yields known Mendelian dyslipidemia genes, but rare non-coding variant association detects no signals. A high 2M-SNP LDL-C polygenic score (top 5th percentile) confers a similar effect size to a monogenic mutation (~30 mg/dl higher for each); however, among those with severe hypercholesterolemia, 23% have a high polygenic score and only 2% carry a monogenic mutation. At these sample sizes and for these phenotypes, the incremental value of WGS for discovery is limited, but WGS permits simultaneous assessment of monogenic and polygenic models of severe hypercholesterolemia.
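
    A minimal Python sketch of the polygenic-score comparison described above: score each individual as a weighted sum of allele dosages and flag the top 5th percentile. The weight vector and dosage matrix names are hypothetical.

        # Polygenic score as a weighted sum of dosages, with a top-5% flag (sketch).
        import numpy as np

        def polygenic_score(dosages, weights):
            """dosages: (n_individuals x n_variants) allele dosages (0..2).
            weights: per-variant effect sizes (e.g. GWAS betas for LDL-C)."""
            return dosages @ weights

        # scores = polygenic_score(dosage_matrix, ldl_betas)    # hypothetical inputs
        # high_pgs = scores >= np.percentile(scores, 95)        # top 5th percentile
        # compare prevalence of high_pgs vs. monogenic carriers among severe cases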