
    Methodological Issues in Multistage Genome-Wide Association Studies

    Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of "promising" SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Typical cost savings of about 50% are possible with this design, at comparable levels of overall type I error and power, by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage; the optimal design depends primarily on the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed odds ratios in current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a "replication" panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly significant associations has been discovered, a truly independent "exact replication" study of the same promising SNPs, using similar methods in a similar population, is still needed. Comment: Published in Statistical Science (http://dx.doi.org/10.1214/09-STS288) by the Institute of Mathematical Statistics (http://www.imstat.org/sts/).
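The cost trade-off described above can be illustrated with a short sketch. The per-genotype prices, sample size, and panel size below are hypothetical; only the design fractions (about half the sample in stage I, about 0.1% of SNPs carried forward) come from the abstract.

```python
# Illustrative cost comparison of one-stage vs. two-stage GWAS designs.
# All prices and sizes are hypothetical, chosen only to show the shape
# of the trade-off driven by the stage I / stage II cost-per-genotype ratio.
def two_stage_cost(n, m, pi_samples=0.5, pi_markers=0.001,
                   cost_stage1=0.01, cost_stage2=0.10):
    """Total genotyping cost of a two-stage design.

    n           -- total number of subjects
    m           -- number of SNPs on the commercial panel
    pi_samples  -- fraction of subjects genotyped in stage I
    pi_markers  -- fraction of SNPs carried forward to stage II
    cost_stage1 -- assumed cost per genotype on the commercial panel
    cost_stage2 -- assumed cost per genotype for custom genotyping
    """
    stage1 = pi_samples * n * m * cost_stage1
    stage2 = (1 - pi_samples) * n * (pi_markers * m) * cost_stage2
    return stage1 + stage2

def one_stage_cost(n, m, cost_stage1=0.01):
    """Cost of genotyping everyone on the commercial panel."""
    return n * m * cost_stage1

n, m = 10_000, 500_000
print(two_stage_cost(n, m) / one_stage_cost(n, m))  # ~0.5: roughly half the cost
```

With these assumed prices, stage II adds only a small custom-genotyping cost on top of half the commercial-panel cost, reproducing the roughly 50% savings the abstract quotes.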

    Kernel-based aggregation of marker-level genetic association tests involving copy-number variation

    Genetic association tests involving copy-number variants (CNVs) are complicated by the fact that CNVs span multiple markers at which measurements are taken. The power of an association test at a single marker is typically low, and it is desirable to pool information across the markers spanned by the CNV. However, CNV boundaries are not known in advance, and the best way to proceed with this pooling is unclear. In this article, we propose a kernel-based method for aggregation of marker-level tests and explore several aspects of its implementation. In addition, we explore some of the theoretical aspects of marker-level test aggregation, proposing a permutation-based approach that preserves the family-wise error rate of the testing procedure, while demonstrating that several simpler alternatives fail to do so. The empirical power of the approach is studied in a number of simulations constructed from real data involving a pharmacogenomic study of gemcitabine, and compares favorably with several competing approaches.
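A minimal sketch of the general idea, not the authors' implementation: marker-level statistics are smoothed with a Gaussian kernel and the maximum is referred to a permutation null. Permuting phenotypes preserves the inter-marker correlation, which is what controls the family-wise error rate. The marker-level test, kernel, and all parameter values here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def marker_stats(geno, pheno):
    """Per-marker association z-scores (a simple correlation-based test,
    standing in for whatever marker-level test is actually used)."""
    g = (geno - geno.mean(0)) / geno.std(0)
    p = (pheno - pheno.mean()) / pheno.std()
    return g.T @ p / np.sqrt(len(pheno))

def kernel_aggregate(z, bandwidth=3.0):
    """Smooth squared z-scores with a Gaussian kernel over marker index,
    then take the maximum as the region-level statistic."""
    idx = np.arange(len(z))
    w = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / bandwidth) ** 2)
    w /= w.sum(1, keepdims=True)
    return (w @ z**2).max()

def permutation_pvalue(geno, pheno, n_perm=200):
    """Permuting phenotype labels leaves the marker correlation structure
    intact, so the max-statistic null controls the family-wise error rate."""
    obs = kernel_aggregate(marker_stats(geno, pheno))
    null = [kernel_aggregate(marker_stats(geno, rng.permutation(pheno)))
            for _ in range(n_perm)]
    return (1 + sum(t >= obs for t in null)) / (1 + n_perm)

geno = rng.integers(0, 3, size=(100, 50)).astype(float)  # toy genotype dosages
pheno = rng.normal(size=100)                             # toy phenotype
print(permutation_pvalue(geno, pheno))
```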

    Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications

    Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for the design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, the latter computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE but required more computing time; nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide the location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function, which was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and the imputation error rate was extremely low for chips with >3,000 SNPs. Assuming that genotyping or imputation errors occur at random, the imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of the imputation error rate was propagated to genomic prediction in an Angus population.
The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNPs selected based on SNP-trait associations in U.S. Holstein animals. With this MOLO algorithm, both the imputation error rate and the genomic prediction error rate were minimal.
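Two ingredients of the design criterion can be sketched simply. The first function is a plain locus-averaged Shannon entropy (LASE); the second shows how a U-shaped Beta distribution enriches chromosome ends with frame SNPs. Both are simplified stand-ins for the MOLO objective; the function names and the shape parameter are assumptions, not the published values.

```python
import numpy as np

def locus_averaged_entropy(allele_freqs):
    """Locus-averaged Shannon entropy (LASE): mean per-SNP entropy of the
    two allele frequencies. One ingredient of the MOLO objective; the
    published criterion also adjusts for uniformity of SNP spacing."""
    p = np.asarray(allele_freqs, dtype=float)
    q = 1.0 - p
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + q * np.log2(q))
    return float(np.nan_to_num(h).mean())

def frame_snp_positions(n_frame, chrom_length, a=0.5):
    """Non-uniform frame SNP locations drawn from a symmetric Beta(a, a)
    with a < 1, which is U-shaped and so enriches both chromosome ends
    (the shape parameter here is an illustrative choice)."""
    rng = np.random.default_rng(1)
    return np.sort(rng.beta(a, a, size=n_frame)) * chrom_length

print(locus_averaged_entropy([0.5, 0.5]))   # SNPs at MAF 0.5 are maximally informative
print(locus_averaged_entropy([0.05, 0.5]))  # rarer alleles lower the average entropy
```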

    Power analysis for genome-wide association studies

    Background: Genome-wide association studies are a promising new tool for deciphering the genetics of complex diseases. To choose the proper sample size and genotyping platform for such studies, power calculations that take into account genetic model, tag SNP selection, and the population of interest are required. Results: The power of genome-wide association studies can be computed using a set of tag SNPs and a large number of genotyped SNPs in a representative population, such as those available through the HapMap project. As expected, power increases with increasing sample size and effect size. Power also depends on the tag SNPs selected; in some cases, more power is obtained by genotyping more individuals at fewer SNPs than fewer individuals at more SNPs. Conclusion: Genome-wide association studies should be designed thoughtfully, with the choice of genotyping platform and sample size determined from careful power calculations.
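A hedged sketch of the simplest such calculation: power for a single, perfectly tagged SNP under a normal approximation to the allele-frequency comparison. A real genome-wide calculation, as the abstract notes, must additionally model the genetic model, tag-SNP r², and the population; all numbers below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def gwas_power(n_cases, n_controls, p_case, p_control, alpha=5e-8):
    """Approximate power of a two-sided allele-frequency comparison at one
    SNP (normal approximation). p_case / p_control are risk-allele
    frequencies in cases and controls; each subject contributes 2 alleles.
    The genome-wide alpha of 5e-8 is a conventional illustrative choice."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    # pooled-frequency SE under the null, group-specific SE under the alternative
    pbar = (2*n_cases*p_case + 2*n_controls*p_control) / (2*n_cases + 2*n_controls)
    se0 = sqrt(pbar * (1 - pbar) * (1/(2*n_cases) + 1/(2*n_controls)))
    se1 = sqrt(p_case*(1-p_case)/(2*n_cases) + p_control*(1-p_control)/(2*n_controls))
    delta = abs(p_case - p_control)
    return (nd.cdf((delta - z_alpha*se0) / se1)
            + nd.cdf((-delta - z_alpha*se0) / se1))

print(gwas_power(2000, 2000, 0.30, 0.25))  # power grows with n and effect size
print(gwas_power(4000, 4000, 0.30, 0.25))
```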

    Estimates of array and pool-construction variance for planning efficient DNA-pooling genome wide association studies

    Background: Until recently, genome-wide association studies (GWAS) have been restricted to research groups with the budget necessary to genotype hundreds, if not thousands, of samples. Replacing individual genotyping with genotyping of DNA pools in Phase I of a GWAS has proven successful, and dramatically altered the financial feasibility of this approach. When conducting a pool-based GWAS, how well SNP allele frequency is estimated from a DNA pool will influence a study's power to detect associations. Here we address how to control the variance in allele frequency estimation when DNAs are pooled, and how to plan and conduct the most efficient well-powered pool-based GWAS. Methods: By examining the variation in allele frequency estimation on SNP arrays between and within DNA pools, we determine how array variance [var(e_array)] and pool-construction variance [var(e_construction)] contribute to the total variance of allele frequency estimation. This information is useful in deciding whether replicate arrays or replicate pools are most useful in reducing variance. Our analysis is based on 27 DNA pools ranging in size from 74 to 446 individual samples, genotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad. Results: For all three Illumina SNP array types our estimates of var(e_array) were similar, between 3-4 × 10^-4 for normalized data. Var(e_construction) accounted for between 20-40% of pooling variance across 27 pools in normalized data. Conclusions: We conclude that relative to var(e_array), var(e_construction) is of less importance in reducing the variance in allele frequency estimation from DNA pools; however, our data suggests that on average it may be more important than previously thought.
We have prepared a simple online tool, PoolingPlanner (available at http://www.kchew.ca/PoolingPlanner/), which calculates the effective sample size (ESS) of a DNA pool given a range of replicate array values. ESS can be used in a power calculator to perform pool-adjusted calculations. This allows one to quickly calculate the loss of power associated with a pooling experiment to make an informed decision on whether a pool-based GWAS is worth pursuing.
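The abstract does not give PoolingPlanner's formulas, but the ESS idea can be sketched with a simple variance decomposition: the ESS is the number of individually genotyped subjects whose binomial sampling variance equals the total variance of the pooled estimate. The function name, the decomposition, and the default variance components below are assumptions that only echo the magnitudes reported above.

```python
def pooling_ess(n_subjects, p=0.5, var_array=3.5e-4, var_construction=1.5e-4,
                n_replicate_arrays=1):
    """Effective sample size (ESS) of a DNA pool under an assumed variance
    model: total variance = allele-sampling variance + pool-construction
    variance + array variance averaged over replicate arrays. Defaults
    are illustrative, in the range reported for normalized data."""
    binom = p * (1 - p) / (2 * n_subjects)   # binomial allele-sampling variance
    total = binom + var_construction + var_array / n_replicate_arrays
    # ESS solves: p(1-p) / (2 * ESS) == total
    return p * (1 - p) / (2 * total)

for k in (1, 2, 4, 8):
    # replicate arrays shrink var(e_array), so ESS climbs toward n_subjects
    print(k, round(pooling_ess(400, n_replicate_arrays=k), 1))
```

Under this model the ESS is always below the physical pool size, and adding replicate arrays recovers part of the gap; that lost fraction is the power cost of pooling.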

    Performance evaluation of DNA copy number segmentation methods

    A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. We have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling real SNP microarray data from genomic regions with known copy-number state. The original real data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. In this paper, we describe this framework and illustrate some of the benefits of the proposed data generation approach on a practical use case: a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. Availability: R package jointSeg: http://r-forge.r-project.org/R/?group_id=156
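The resampling idea might be sketched as follows: draw probe-level signals for each segment from a pool of real measurements taken in regions whose copy-number state is known, so the generated profile carries a known ground truth. The data structures and names here are illustrative, not the published framework's API.

```python
import numpy as np

def generate_profile(real_signals, region_states, region_lengths, seed=0):
    """Build a synthetic copy-number profile with known truth by resampling
    real probe-level measurements from regions of known copy-number state.
    real_signals maps state name -> 1-D array of observed signals from
    annotated regions (a hypothetical structure for this sketch)."""
    rng = np.random.default_rng(seed)
    profile, truth = [], []
    for state, length in zip(region_states, region_lengths):
        pool = real_signals[state]
        profile.append(rng.choice(pool, size=length, replace=True))
        truth.extend([state] * length)
    return np.concatenate(profile), np.array(truth)

# Toy signal pools standing in for real annotated SNP-array data;
# in the actual framework these come from dilution series with known
# tumor-cell fraction, which sets the signal-to-noise ratio.
pools = {"normal": np.random.default_rng(1).normal(2.0, 0.3, 5000),
         "gain":   np.random.default_rng(2).normal(3.0, 0.3, 5000)}
y, truth = generate_profile(pools, ["normal", "gain", "normal"], [100, 50, 100])
print(y.shape, (truth == "gain").sum())
```

A segmentation method can then be scored against `truth`, which is exactly the known-truth benchmarking the paper describes.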

    Statistical and Computational Methods for Genome-Wide Association Analysis

    Technological and scientific advances in recent years have revolutionized genomics. For example, decreases in whole genome sequencing (WGS) costs have enabled larger WGS studies as well as larger imputation reference panels, which in turn provide more comprehensive genomic coverage from lower-cost genotyping methods. In addition, new technologies and large collaborative efforts such as ENCODE and GTEx have shed new light on regulatory genomics and the function of non-coding variation, and produced expansive publicly available data sets. These advances have introduced data of unprecedented size and dimension, unique statistical and computational challenges, and numerous opportunities for innovation. In this dissertation, we develop methods to leverage functional genomics data in post-GWAS analysis, to expedite routine computations with increasingly large genetic data sets, and to address limitations of current imputation reference panels for understudied populations. In Chapter 2, we propose strategies to improve imputation and increase power in GWAS of understudied populations. Genotype imputation is instrumental in GWAS, providing increased genomic coverage from low-cost genotyping arrays. Imputation quality depends crucially on reference panel size and the genetic distance between reference and target haplotypes. Current reference panels provide excellent imputation quality in many European populations, but lower quality in non-European, admixed, and isolate populations. We consider a GWAS strategy in which a subset of participants is sequenced and the rest are imputed using a reference panel that comprises the sequenced participants together with individuals from an external reference panel. Using empirical data from the HRC and TOPMed WGS Project, simulations, and asymptotic analysis, we identify powerful and cost-effective study designs for GWAS of non-European, admixed, and isolated populations. 
    In Chapter 3, we develop efficient methods to estimate linkage disequilibrium (LD) with large data sets. Motivated by practical and logistical constraints, a variety of statistical methods and tools have been developed for analysis of GWAS summary statistics rather than individual-level data. These methods often rely on LD estimates from an external reference panel, which are ideally calculated on-the-fly rather than precomputed and stored. We develop efficient algorithms to estimate LD exploiting sparsity and haplotype structure and implement our methods in an open-source C++ tool, emeraLD. We benchmark performance using genotype data from the 1KGP, HRC, and UK Biobank, and find that emeraLD is up to two orders of magnitude faster than existing tools while using comparable or less memory. In Chapter 4, we develop methods to identify causative genes and biological mechanisms underlying associations in post-GWAS analysis by leveraging regulatory and functional genomics databases. Many gene-based association tests can be viewed as instrumental variable methods in which intermediate phenotypes, e.g. tissue-specific expression or protein alteration, are hypothesized to mediate the association between genotype and GWAS trait. However, LD and pleiotropy can confound these statistics, which complicates their mechanistic interpretation. We develop a hierarchical Bayesian model that accounts for multiple potential mechanisms underlying associations using functional genomic annotations derived from GTEx, Roadmap/ENCODE, and other sources. We apply our method to analyze twenty-five complex traits using GWAS summary statistics from UK Biobank, and provide an open-source implementation of our methods. In Chapter 5, we review our work, discuss its relevance and prospects as new resources emerge, and suggest directions for future research.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147697/1/corbinq_1.pd
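The sparsity idea behind fast LD estimation can be sketched as follows (this is not emeraLD's actual algorithm, and it ignores the haplotype-structure tricks): storing only carrier genotypes lets the cross-product sum run over the carriers of one variant instead of all samples, which is a large saving for rare variants.

```python
import numpy as np

def ld_r2_sparse(g1, g2, n):
    """Squared correlation (r^2) between two variants stored sparsely as
    {sample index: dosage} for carriers only. Since most genotypes at
    low-frequency variants are 0, sums over carriers suffice."""
    s1, s2 = sum(g1.values()), sum(g2.values())
    ss1 = sum(v * v for v in g1.values())
    ss2 = sum(v * v for v in g2.values())
    # cross-product needs only samples carrying variant 1
    cross = sum(v * g2.get(i, 0.0) for i, v in g1.items())
    cov = cross - s1 * s2 / n
    var1 = ss1 - s1 * s1 / n
    var2 = ss2 - s2 * s2 / n
    return cov * cov / (var1 * var2)

# Agreement check against the dense computation on toy data:
rng = np.random.default_rng(3)
d1 = rng.binomial(2, 0.05, 1000).astype(float)
d2 = rng.binomial(2, 0.05, 1000).astype(float)
sp1 = {i: v for i, v in enumerate(d1) if v}
sp2 = {i: v for i, v in enumerate(d2) if v}
print(np.isclose(ld_r2_sparse(sp1, sp2, 1000), np.corrcoef(d1, d2)[0, 1] ** 2))
```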

    Genetic Association Testing of Copy Number Variation

    Copy-number variation (CNV) has been implicated in many complex diseases. It is of great interest to detect and locate such regions through genetic association testing. However, association testing is complicated by the fact that CNVs usually span multiple markers, so that these markers are correlated with one another. To overcome this difficulty, it is desirable to pool information across the markers. In this thesis, we propose a kernel-based method for aggregation of marker-level tests, in which we first obtain a set of p-values through association tests at every marker and then base the CNV association test on a statistic that combines these p-values. In addition, we explore several aspects of its implementation. Since p-values among markers are correlated, obtaining the null distribution of the test statistic for kernel-based aggregation of marker-level tests is complicated. To solve this problem, we develop two methods, a permutation-based and a correlation-based approach, both of which are demonstrated to preserve the family-wise error rate of the testing procedure. Many implementation aspects of the kernel-based method are compared through empirical power studies in a number of simulations constructed from real data involving a pharmacogenomic study of gemcitabine, and further performance comparisons are made between the permutation-based and correlation-based approaches. We also apply the two approaches to the real data. The main contribution of the dissertation is the development of marker-level test aggregation, a powerful approach to detecting phenotype-associated CNVs. Furthermore, the approach is extended to high-dimensional settings with high efficiency.
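A sketch of what a correlation-based null might look like: instead of permuting, simulate marker-level z-scores from a multivariate normal with the observed inter-marker correlation and compare the combined statistic against that simulated null. Fisher's combination stands in here for the thesis's kernel aggregate (an assumption; the exact statistic and procedure are not specified in the abstract).

```python
import numpy as np
from math import erfc, sqrt, log

def two_sided_p(z):
    """Two-sided normal p-value for a z-score."""
    return erfc(abs(z) / sqrt(2.0))

def correlation_based_pvalue(z_obs, corr, n_sim=5000, seed=0):
    """Correlation-based null: draw null z-vectors from N(0, corr) via the
    Cholesky factor, combine each with Fisher's -2 * sum(log p), and
    compare the observed combined statistic against that distribution."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)

    def combine(z):
        return -2.0 * sum(log(max(two_sided_p(zi), 1e-300)) for zi in z)

    obs = combine(z_obs)
    null = [combine(L @ rng.normal(size=len(z_obs))) for _ in range(n_sim)]
    return (1 + sum(t >= obs for t in null)) / (1 + n_sim)

# Toy inter-marker correlation matrix and observed marker z-scores:
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.6],
                 [0.3, 0.6, 1.0]])
print(correlation_based_pvalue(np.array([2.5, 2.2, 1.9]), corr))
```

Relative to permutation, this trades re-running the marker-level tests for an estimate of the correlation matrix, which is typically much cheaper at scale.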

    Concept, Design and Implementation of a Cardiovascular Gene-Centric 50 K SNP Array for Large-Scale Genomic Association Studies

    A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a "cosmopolitan" tagging approach to capture the genetic diversity across ∼2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array, with a significant portion of the generated data being released into the academic domain, facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.
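Pairwise tagging of the kind underlying such arrays is often implemented greedily: repeatedly pick the SNP that covers the most not-yet-tagged SNPs at r² above a threshold. The sketch below is a generic greedy tagger, not the IBC array's actual pipeline; the "cosmopolitan" approach additionally runs such selection across multiple reference populations and merges the results.

```python
import numpy as np

def greedy_tags(genotypes, r2_threshold=0.8):
    """Greedy pairwise tag-SNP selection: each round, choose the SNP that
    tags (r^2 >= threshold) the largest number of remaining SNPs, then
    drop everything it tags. genotypes is samples x SNPs dosages."""
    r2 = np.corrcoef(genotypes.T) ** 2
    m = genotypes.shape[1]
    untagged, tags = set(range(m)), []
    while untagged:
        best = max(untagged,
                   key=lambda j: sum(r2[j, k] >= r2_threshold for k in untagged))
        tags.append(best)
        untagged -= {k for k in untagged if r2[best, k] >= r2_threshold}
    return sorted(tags)

# Toy data: 5 independent SNPs, each duplicated, so pairs are perfectly
# correlated and 5 tags should cover all 10 columns.
rng = np.random.default_rng(4)
base = rng.binomial(2, 0.4, size=(200, 5)).astype(float)
geno = np.repeat(base, 2, axis=1)
print(len(greedy_tags(geno)))
```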