
    Methodological Issues in Multistage Genome-Wide Association Studies

    Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of "promising" SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Typical cost savings of about 50% are possible with this design, at comparable levels of overall type I error and power, by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage; the optimal design depends primarily on the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed odds ratios in current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a "replication" panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly significant associations has been discovered, a truly independent "exact replication" study of the same promising SNPs, using similar methods in a similar population, is still needed. Comment: Published in Statistical Science (http://dx.doi.org/10.1214/09-STS288) by the Institute of Mathematical Statistics (http://www.imstat.org/sts/).
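The cost trade-off described above can be illustrated with a short sketch. The per-genotype prices, sample size, and panel size below are hypothetical; only the design fractions (about half the sample in stage I, about 0.1% of SNPs carried forward) come from the abstract.

```python
# Illustrative cost comparison of one-stage vs. two-stage GWAS designs.
# All prices and sizes are hypothetical, chosen only to show the shape
# of the trade-off driven by the stage I / stage II cost-per-genotype ratio.
def two_stage_cost(n, m, pi_samples=0.5, pi_markers=0.001,
                   cost_stage1=0.01, cost_stage2=0.10):
    """Total genotyping cost of a two-stage design.

    n           -- total number of subjects
    m           -- number of SNPs on the commercial panel
    pi_samples  -- fraction of subjects genotyped in stage I
    pi_markers  -- fraction of SNPs carried forward to stage II
    cost_stage1 -- assumed cost per genotype on the commercial panel
    cost_stage2 -- assumed cost per genotype for custom genotyping
    """
    stage1 = pi_samples * n * m * cost_stage1
    stage2 = (1 - pi_samples) * n * (pi_markers * m) * cost_stage2
    return stage1 + stage2

def one_stage_cost(n, m, cost_stage1=0.01):
    """Cost of genotyping everyone on the commercial panel."""
    return n * m * cost_stage1

n, m = 10_000, 500_000
print(two_stage_cost(n, m) / one_stage_cost(n, m))  # ~0.5: roughly half the cost
```

With these assumed prices, stage II adds only a small custom-genotyping cost on top of half the commercial-panel cost, reproducing the roughly 50% savings the abstract quotes.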

    Kernel-based aggregation of marker-level genetic association tests involving copy-number variation

    Genetic association tests involving copy-number variants (CNVs) are complicated by the fact that CNVs span multiple markers at which measurements are taken. The power of an association test at a single marker is typically low, and it is desirable to pool information across the markers spanned by the CNV. However, CNV boundaries are not known in advance, and the best way to proceed with this pooling is unclear. In this article, we propose a kernel-based method for aggregation of marker-level tests and explore several aspects of its implementation. In addition, we explore some of the theoretical aspects of marker-level test aggregation, proposing a permutation-based approach that preserves the family-wise error rate of the testing procedure, while demonstrating that several simpler alternatives fail to do so. The empirical power of the approach is studied in a number of simulations constructed from real data involving a pharmacogenomic study of gemcitabine, and compares favorably with several competing approaches.
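A minimal sketch of the general idea, not the authors' implementation: marker-level statistics are smoothed with a Gaussian kernel and the maximum is referred to a permutation null. Permuting phenotypes preserves the inter-marker correlation, which is what controls the family-wise error rate. The marker-level test, kernel, and all parameter values here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def marker_stats(geno, pheno):
    """Per-marker association z-scores (a simple correlation-based test,
    standing in for whatever marker-level test is actually used)."""
    g = (geno - geno.mean(0)) / geno.std(0)
    p = (pheno - pheno.mean()) / pheno.std()
    return g.T @ p / np.sqrt(len(pheno))

def kernel_aggregate(z, bandwidth=3.0):
    """Smooth squared z-scores with a Gaussian kernel over marker index,
    then take the maximum as the region-level statistic."""
    idx = np.arange(len(z))
    w = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / bandwidth) ** 2)
    w /= w.sum(1, keepdims=True)
    return (w @ z**2).max()

def permutation_pvalue(geno, pheno, n_perm=200):
    """Permuting phenotype labels leaves the marker correlation structure
    intact, so the max-statistic null controls the family-wise error rate."""
    obs = kernel_aggregate(marker_stats(geno, pheno))
    null = [kernel_aggregate(marker_stats(geno, rng.permutation(pheno)))
            for _ in range(n_perm)]
    return (1 + sum(t >= obs for t in null)) / (1 + n_perm)

geno = rng.integers(0, 3, size=(100, 50)).astype(float)  # toy genotype dosages
pheno = rng.normal(size=100)                             # toy phenotype
print(permutation_pvalue(geno, pheno))
```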

    Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications

    Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for the design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, the latter computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE but required more computing time; nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide the location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function, which was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and the imputation error rate was extremely low for chips with >3,000 SNPs. Assuming that genotyping or imputation errors occur at random, the imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of the imputation error rate was propagated to genomic prediction in an Angus population.
The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNPs selected based on SNP-trait associations in U.S. Holstein animals. With this MOLO algorithm, both the imputation error rate and the genomic prediction error rate were minimal.
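Two ingredients of the design criterion can be sketched simply. The first function is a plain locus-averaged Shannon entropy (LASE); the second shows how a U-shaped Beta distribution enriches chromosome ends with frame SNPs. Both are simplified stand-ins for the MOLO objective; the function names and the shape parameter are assumptions, not the published values.

```python
import numpy as np

def locus_averaged_entropy(allele_freqs):
    """Locus-averaged Shannon entropy (LASE): mean per-SNP entropy of the
    two allele frequencies. One ingredient of the MOLO objective; the
    published criterion also adjusts for uniformity of SNP spacing."""
    p = np.asarray(allele_freqs, dtype=float)
    q = 1.0 - p
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + q * np.log2(q))
    return float(np.nan_to_num(h).mean())

def frame_snp_positions(n_frame, chrom_length, a=0.5):
    """Non-uniform frame SNP locations drawn from a symmetric Beta(a, a)
    with a < 1, which is U-shaped and so enriches both chromosome ends
    (the shape parameter here is an illustrative choice)."""
    rng = np.random.default_rng(1)
    return np.sort(rng.beta(a, a, size=n_frame)) * chrom_length

print(locus_averaged_entropy([0.5, 0.5]))   # SNPs at MAF 0.5 are maximally informative
print(locus_averaged_entropy([0.05, 0.5]))  # rarer alleles lower the average entropy
```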

    Power analysis for genome-wide association studies

    Background: Genome-wide association studies are a promising new tool for deciphering the genetics of complex diseases. To choose the proper sample size and genotyping platform for such studies, power calculations that take into account genetic model, tag SNP selection, and the population of interest are required. Results: The power of genome-wide association studies can be computed using a set of tag SNPs and a large number of genotyped SNPs in a representative population, such as those available through the HapMap project. As expected, power increases with increasing sample size and effect size. Power also depends on the tag SNPs selected; in some cases, more power is obtained by genotyping more individuals at fewer SNPs than fewer individuals at more SNPs. Conclusion: Genome-wide association studies should be designed thoughtfully, with the choice of genotyping platform and sample size determined from careful power calculations.
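A hedged sketch of the simplest such calculation: power for a single, perfectly tagged SNP under a normal approximation to the allele-frequency comparison. A real genome-wide calculation, as the abstract notes, must additionally model the genetic model, tag-SNP r², and the population; all numbers below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def gwas_power(n_cases, n_controls, p_case, p_control, alpha=5e-8):
    """Approximate power of a two-sided allele-frequency comparison at one
    SNP (normal approximation). p_case / p_control are risk-allele
    frequencies in cases and controls; each subject contributes 2 alleles.
    The genome-wide alpha of 5e-8 is a conventional illustrative choice."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    # pooled-frequency SE under the null, group-specific SE under the alternative
    pbar = (2*n_cases*p_case + 2*n_controls*p_control) / (2*n_cases + 2*n_controls)
    se0 = sqrt(pbar * (1 - pbar) * (1/(2*n_cases) + 1/(2*n_controls)))
    se1 = sqrt(p_case*(1-p_case)/(2*n_cases) + p_control*(1-p_control)/(2*n_controls))
    delta = abs(p_case - p_control)
    return (nd.cdf((delta - z_alpha*se0) / se1)
            + nd.cdf((-delta - z_alpha*se0) / se1))

print(gwas_power(2000, 2000, 0.30, 0.25))  # power grows with n and effect size
print(gwas_power(4000, 4000, 0.30, 0.25))
```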

    Estimates of array and pool-construction variance for planning efficient DNA-pooling genome wide association studies

    Background: Until recently, genome-wide association studies (GWAS) have been restricted to research groups with the budget necessary to genotype hundreds, if not thousands, of samples. Replacing individual genotyping with genotyping of DNA pools in Phase I of a GWAS has proven successful, and dramatically altered the financial feasibility of this approach. When conducting a pool-based GWAS, how well SNP allele frequency is estimated from a DNA pool will influence a study's power to detect associations. Here we address how to control the variance in allele frequency estimation when DNAs are pooled, and how to plan and conduct the most efficient well-powered pool-based GWAS. Methods: By examining the variation in allele frequency estimation on SNP arrays between and within DNA pools, we determine how array variance [var(e_array)] and pool-construction variance [var(e_construction)] contribute to the total variance of allele frequency estimation. This information is useful in deciding whether replicate arrays or replicate pools are most useful in reducing variance. Our analysis is based on 27 DNA pools ranging in size from 74 to 446 individual samples, genotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad. Results: For all three Illumina SNP array types our estimates of var(e_array) were similar, between 3-4 × 10^-4 for normalized data. Var(e_construction) accounted for between 20-40% of pooling variance across 27 pools in normalized data. Conclusions: We conclude that relative to var(e_array), var(e_construction) is of less importance in reducing the variance in allele frequency estimation from DNA pools; however, our data suggests that on average it may be more important than previously thought.
We have prepared a simple online tool, PoolingPlanner (available at http://www.kchew.ca/PoolingPlanner/), which calculates the effective sample size (ESS) of a DNA pool given a range of replicate array values. ESS can be used in a power calculator to perform pool-adjusted calculations. This allows one to quickly calculate the loss of power associated with a pooling experiment to make an informed decision on whether a pool-based GWAS is worth pursuing.
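The abstract does not give PoolingPlanner's formulas, but the ESS idea can be sketched with a simple variance decomposition: the ESS is the number of individually genotyped subjects whose binomial sampling variance equals the total variance of the pooled estimate. The function name, the decomposition, and the default variance components below are assumptions that only echo the magnitudes reported above.

```python
def pooling_ess(n_subjects, p=0.5, var_array=3.5e-4, var_construction=1.5e-4,
                n_replicate_arrays=1):
    """Effective sample size (ESS) of a DNA pool under an assumed variance
    model: total variance = allele-sampling variance + pool-construction
    variance + array variance averaged over replicate arrays. Defaults
    are illustrative, in the range reported for normalized data."""
    binom = p * (1 - p) / (2 * n_subjects)   # binomial allele-sampling variance
    total = binom + var_construction + var_array / n_replicate_arrays
    # ESS solves: p(1-p) / (2 * ESS) == total
    return p * (1 - p) / (2 * total)

for k in (1, 2, 4, 8):
    # replicate arrays shrink var(e_array), so ESS climbs toward n_subjects
    print(k, round(pooling_ess(400, n_replicate_arrays=k), 1))
```

Under this model the ESS is always below the physical pool size, and adding replicate arrays recovers part of the gap; that lost fraction is the power cost of pooling.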

    Performance evaluation of DNA copy number segmentation methods

    A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. We have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling real SNP microarray data from genomic regions with known copy-number state. The original real data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. In this paper, we describe this framework and illustrate some of the benefits of the proposed data generation approach on a practical use case: a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. Availability: R package jointSeg: http://r-forge.r-project.org/R/?group_id=156
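The resampling idea might be sketched as follows: draw probe-level signals for each segment from a pool of real measurements taken in regions whose copy-number state is known, so the generated profile carries a known ground truth. The data structures and names here are illustrative, not the published framework's API.

```python
import numpy as np

def generate_profile(real_signals, region_states, region_lengths, seed=0):
    """Build a synthetic copy-number profile with known truth by resampling
    real probe-level measurements from regions of known copy-number state.
    real_signals maps state name -> 1-D array of observed signals from
    annotated regions (a hypothetical structure for this sketch)."""
    rng = np.random.default_rng(seed)
    profile, truth = [], []
    for state, length in zip(region_states, region_lengths):
        pool = real_signals[state]
        profile.append(rng.choice(pool, size=length, replace=True))
        truth.extend([state] * length)
    return np.concatenate(profile), np.array(truth)

# Toy signal pools standing in for real annotated SNP-array data;
# in the actual framework these come from dilution series with known
# tumor-cell fraction, which sets the signal-to-noise ratio.
pools = {"normal": np.random.default_rng(1).normal(2.0, 0.3, 5000),
         "gain":   np.random.default_rng(2).normal(3.0, 0.3, 5000)}
y, truth = generate_profile(pools, ["normal", "gain", "normal"], [100, 50, 100])
print(y.shape, (truth == "gain").sum())
```

A segmentation method can then be scored against `truth`, which is exactly the known-truth benchmarking the paper describes.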

    Statistical and Computational Methods for Genome-Wide Association Analysis

    Technological and scientific advances in recent years have revolutionized genomics. For example, decreases in whole genome sequencing (WGS) costs have enabled larger WGS studies as well as larger imputation reference panels, which in turn provide more comprehensive genomic coverage from lower-cost genotyping methods. In addition, new technologies and large collaborative efforts such as ENCODE and GTEx have shed new light on regulatory genomics and the function of non-coding variation, and produced expansive publicly available data sets. These advances have introduced data of unprecedented size and dimension, unique statistical and computational challenges, and numerous opportunities for innovation. In this dissertation, we develop methods to leverage functional genomics data in post-GWAS analysis, to expedite routine computations with increasingly large genetic data sets, and to address limitations of current imputation reference panels for understudied populations. In Chapter 2, we propose strategies to improve imputation and increase power in GWAS of understudied populations. Genotype imputation is instrumental in GWAS, providing increased genomic coverage from low-cost genotyping arrays. Imputation quality depends crucially on reference panel size and the genetic distance between reference and target haplotypes. Current reference panels provide excellent imputation quality in many European populations, but lower quality in non-European, admixed, and isolate populations. We consider a GWAS strategy in which a subset of participants is sequenced and the rest are imputed using a reference panel that comprises the sequenced participants together with individuals from an external reference panel. Using empirical data from the HRC and TOPMed WGS Project, simulations, and asymptotic analysis, we identify powerful and cost-effective study designs for GWAS of non-European, admixed, and isolated populations. 
    In Chapter 3, we develop efficient methods to estimate linkage disequilibrium (LD) with large data sets. Motivated by practical and logistical constraints, a variety of statistical methods and tools have been developed for analysis of GWAS summary statistics rather than individual-level data. These methods often rely on LD estimates from an external reference panel, which are ideally calculated on-the-fly rather than precomputed and stored. We develop efficient algorithms to estimate LD exploiting sparsity and haplotype structure and implement our methods in an open-source C++ tool, emeraLD. We benchmark performance using genotype data from the 1KGP, HRC, and UK Biobank, and find that emeraLD is up to two orders of magnitude faster than existing tools while using comparable or less memory. In Chapter 4, we develop methods to identify causative genes and biological mechanisms underlying associations in post-GWAS analysis by leveraging regulatory and functional genomics databases. Many gene-based association tests can be viewed as instrumental variable methods in which intermediate phenotypes, e.g. tissue-specific expression or protein alteration, are hypothesized to mediate the association between genotype and GWAS trait. However, LD and pleiotropy can confound these statistics, which complicates their mechanistic interpretation. We develop a hierarchical Bayesian model that accounts for multiple potential mechanisms underlying associations using functional genomic annotations derived from GTEx, Roadmap/ENCODE, and other sources. We apply our method to analyze twenty-five complex traits using GWAS summary statistics from UK Biobank, and provide an open-source implementation of our methods. In Chapter 5, we review our work, discuss its relevance and prospects as new resources emerge, and suggest directions for future research.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147697/1/corbinq_1.pd
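The sparsity idea behind fast LD estimation can be sketched as follows (this is not emeraLD's actual algorithm, and it ignores the haplotype-structure tricks): storing only carrier genotypes lets the cross-product sum run over the carriers of one variant instead of all samples, which is a large saving for rare variants.

```python
import numpy as np

def ld_r2_sparse(g1, g2, n):
    """Squared correlation (r^2) between two variants stored sparsely as
    {sample index: dosage} for carriers only. Since most genotypes at
    low-frequency variants are 0, sums over carriers suffice."""
    s1, s2 = sum(g1.values()), sum(g2.values())
    ss1 = sum(v * v for v in g1.values())
    ss2 = sum(v * v for v in g2.values())
    # cross-product needs only samples carrying variant 1
    cross = sum(v * g2.get(i, 0.0) for i, v in g1.items())
    cov = cross - s1 * s2 / n
    var1 = ss1 - s1 * s1 / n
    var2 = ss2 - s2 * s2 / n
    return cov * cov / (var1 * var2)

# Agreement check against the dense computation on toy data:
rng = np.random.default_rng(3)
d1 = rng.binomial(2, 0.05, 1000).astype(float)
d2 = rng.binomial(2, 0.05, 1000).astype(float)
sp1 = {i: v for i, v in enumerate(d1) if v}
sp2 = {i: v for i, v in enumerate(d2) if v}
print(np.isclose(ld_r2_sparse(sp1, sp2, 1000), np.corrcoef(d1, d2)[0, 1] ** 2))
```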

    Genetic Association Testing of Copy Number Variation

    Copy-number variation (CNV) has been implicated in many complex diseases. It is of great interest to detect and locate such regions through genetic association testing. However, association testing is complicated by the fact that CNVs usually span multiple markers, so that these markers are correlated with one another. To overcome this difficulty, it is desirable to pool information across the markers. In this thesis, we propose a kernel-based method for aggregation of marker-level tests, in which we first obtain a set of p-values through association tests at every marker and then base the CNV association test on a statistic that combines these p-values. In addition, we explore several aspects of its implementation. Since p-values among markers are correlated, obtaining the null distribution of the test statistic for kernel-based aggregation of marker-level tests is complicated. To solve this problem, we develop two methods, a permutation-based and a correlation-based approach, both of which are demonstrated to preserve the family-wise error rate of the testing procedure. Many implementation aspects of the kernel-based method are compared through empirical power studies in a number of simulations constructed from real data involving a pharmacogenomic study of gemcitabine, and further performance comparisons are made between the permutation-based and correlation-based approaches. We also apply the two approaches to the real data. The main contribution of the dissertation is the development of marker-level test aggregation, a powerful approach to detecting phenotype-associated CNVs. Furthermore, the approach is extended to high-dimensional settings with high efficiency.
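A sketch of what a correlation-based null might look like: instead of permuting, simulate marker-level z-scores from a multivariate normal with the observed inter-marker correlation and compare the combined statistic against that simulated null. Fisher's combination stands in here for the thesis's kernel aggregate (an assumption; the exact statistic and procedure are not specified in the abstract).

```python
import numpy as np
from math import erfc, sqrt, log

def two_sided_p(z):
    """Two-sided normal p-value for a z-score."""
    return erfc(abs(z) / sqrt(2.0))

def correlation_based_pvalue(z_obs, corr, n_sim=5000, seed=0):
    """Correlation-based null: draw null z-vectors from N(0, corr) via the
    Cholesky factor, combine each with Fisher's -2 * sum(log p), and
    compare the observed combined statistic against that distribution."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)

    def combine(z):
        return -2.0 * sum(log(max(two_sided_p(zi), 1e-300)) for zi in z)

    obs = combine(z_obs)
    null = [combine(L @ rng.normal(size=len(z_obs))) for _ in range(n_sim)]
    return (1 + sum(t >= obs for t in null)) / (1 + n_sim)

# Toy inter-marker correlation matrix and observed marker z-scores:
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.6],
                 [0.3, 0.6, 1.0]])
print(correlation_based_pvalue(np.array([2.5, 2.2, 1.9]), corr))
```

Relative to permutation, this trades re-running the marker-level tests for an estimate of the correlation matrix, which is typically much cheaper at scale.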

    Concept, Design and Implementation of a Cardiovascular Gene-Centric 50 K SNP Array for Large-Scale Genomic Association Studies

    A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a "cosmopolitan" tagging approach to capture the genetic diversity across ∼2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array, with a significant portion of the generated data being released into the academic domain, facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.
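Pairwise tagging of the kind underlying such arrays is often implemented greedily: repeatedly pick the SNP that covers the most not-yet-tagged SNPs at r² above a threshold. The sketch below is a generic greedy tagger, not the IBC array's actual pipeline; the "cosmopolitan" approach additionally runs such selection across multiple reference populations and merges the results.

```python
import numpy as np

def greedy_tags(genotypes, r2_threshold=0.8):
    """Greedy pairwise tag-SNP selection: each round, choose the SNP that
    tags (r^2 >= threshold) the largest number of remaining SNPs, then
    drop everything it tags. genotypes is samples x SNPs dosages."""
    r2 = np.corrcoef(genotypes.T) ** 2
    m = genotypes.shape[1]
    untagged, tags = set(range(m)), []
    while untagged:
        best = max(untagged,
                   key=lambda j: sum(r2[j, k] >= r2_threshold for k in untagged))
        tags.append(best)
        untagged -= {k for k in untagged if r2[best, k] >= r2_threshold}
    return sorted(tags)

# Toy data: 5 independent SNPs, each duplicated, so pairs are perfectly
# correlated and 5 tags should cover all 10 columns.
rng = np.random.default_rng(4)
base = rng.binomial(2, 0.4, size=(200, 5)).astype(float)
geno = np.repeat(base, 2, axis=1)
print(len(greedy_tags(geno)))
```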