Enhanced methods for local ancestry assignment in sequenced admixed individuals.
Inferring the ancestry at each locus in the genome of recently admixed individuals (e.g., Latino Americans) plays a major role in medical and population genetic inferences, ranging from finding disease-risk loci, to inferring recombination rates, to mapping missing contigs in the human genome. Although many methods for local ancestry inference have been proposed, most are designed for use with genotyping arrays and fail to make use of the full spectrum of data available from sequencing. In addition, current haplotype-based approaches are computationally demanding, requiring substantial computation time even for moderately large sample sizes. Here we present new methods for local ancestry inference that leverage continent-specific variants (CSVs) to attain increased performance over existing approaches in sequenced admixed genomes. A key feature of our approach is that it incorporates the admixed genomes themselves jointly with public datasets, such as 1000 Genomes, to improve the accuracy of CSV calling. We use simulations to show that our approach attains accuracy similar to that of widely used, computationally intensive haplotype-based approaches with large decreases in runtime. Most importantly, we show that our method recovers local ancestries comparable to the 1000 Genomes consensus local ancestry calls in the real admixed individuals from the 1000 Genomes Project. We extend our approach to account for low-coverage sequencing and show that accurate local ancestry inference can be attained at low sequencing coverage. Finally, we generalize CSVs to sub-continental population-specific variants (sCSVs) and show that in some cases it is possible to determine the sub-continental ancestry for short chromosomal segments on the basis of sCSVs.
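The idea of reading local ancestry off continent-specific variants can be illustrated with a toy sketch. This is not the authors' method, which calls CSVs jointly with reference panels and handles low coverage. It assumes CSVs have already been called, that each haplotype has been cut into windows, and that each window is simply assigned to the continent contributing the most CSV alleles. The function names and the continent labels are hypothetical.

```python
from collections import Counter


def call_window_ancestry(carried_csvs):
    """Pick the continent with the most CSV alleles in one window.

    carried_csvs: list of continent labels, one per CSV allele that the
    haplotype carries in this window (hypothetical input format).
    Returns None when the window has no CSVs or the top two labels tie.
    """
    counts = Counter(carried_csvs)
    if not counts:
        return None  # no informative variant in this window
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous window: leave it uncalled
    return ranked[0][0]


def call_haplotype(windows):
    """Apply the majority rule independently to every window."""
    return [call_window_ancestry(w) for w in windows]


# Four windows along one haplotype (made-up labels for illustration).
calls = call_haplotype([["AFR", "AFR", "EUR"], [], ["EUR"], ["AFR", "EUR"]])
```

Uncalled windows (`None`) would in practice be filled in from neighbouring windows, for example by smoothing, which this sketch leaves out.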
Identification of causal genes for complex traits.
Motivation: Although genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations. Results: In this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also has an average of 10% higher recall rate compared with the existing approaches. We validate our method using real mouse high-density lipoprotein (HDL) data and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2. Availability and implementation: Software is freely available for download at genetics.cs.ucla.edu/caviar
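The notion of a minimally sized gene set that captures the causal genes with probability ρ can be sketched under a strong simplifying assumption: exactly one gene is causal, and a posterior probability for each gene is already available and sums to one. CAVIAR-Gene itself does not make this single-gene assumption, so the code below only illustrates the ρ-set idea. The function name `rho_gene_set` and all posterior values are hypothetical; only the gene name Apoa2 comes from the abstract.

```python
def rho_gene_set(posteriors, rho=0.95):
    """Greedy smallest set of genes whose posterior mass reaches rho.

    posteriors: dict mapping gene name -> posterior probability of being
    the (single) causal gene. Assumed to sum to 1; this is a toy input.
    Returns the selected genes and the probability mass they cover.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    selected, mass = [], 0.0
    for gene, prob in ranked:
        if mass >= rho:
            break  # the set already captures the causal gene w.p. >= rho
        selected.append(gene)
        mass += prob
    return selected, mass


# Made-up posteriors for four genes in one associated region.
toy = {"Apoa2": 0.70, "GeneB": 0.20, "GeneC": 0.06, "GeneD": 0.04}
genes, covered = rho_gene_set(toy, rho=0.85)
```

Under the single-causal-gene assumption, taking genes in decreasing order of posterior probability yields the smallest set that reaches the requested mass, which is why a greedy pass is enough here.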
Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture.
SNP-heritability is a fundamental quantity in the study of complex traits. Recent studies have shown that existing methods to estimate genome-wide SNP-heritability can yield biases when their assumptions are violated. While various approaches have been proposed to account for frequency- and linkage disequilibrium (LD)-dependent genetic architectures, it remains unclear which estimates reported in the literature are reliable. Here we show that genome-wide SNP-heritability can be accurately estimated from biobank-scale data irrespective of genetic architecture, without specifying a heritability model or partitioning SNPs by allele frequency and/or LD. We show analytically and through extensive simulations starting from real genotypes (UK Biobank, N = 337 K) that, unlike existing methods, our closed-form estimator is robust across a wide range of architectures. We provide estimates of SNP-heritability for 22 complex traits in the UK Biobank and show that, consistent with our results in simulations, existing biobank-scale methods yield estimates up to 30% different from our theoretically justified approach.
Fast and accurate imputation of summary statistics enhances evidence of functional enrichment
Imputation using external reference panels is a widely used approach for
increasing power in GWAS and meta-analysis. Existing HMM-based imputation
approaches require individual-level genotypes. Here, we develop a new method
for Gaussian imputation from summary association statistics, a type of data
that is becoming widely available. In simulations using 1000 Genomes (1000G)
data, this method recovers 84% (54%) of the effective sample size for common
(>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary
LD information is available from target samples) versus 89% (67%) for HMM-based
imputation, which cannot be applied to summary statistics. Our approach
accounts for the limited sample size of the reference panel, a crucial step to
eliminate false-positive associations, and is computationally very fast. As an
empirical demonstration, we apply our method to 7 case-control phenotypes from
the WTCCC data and a study of height in the British 1958 birth cohort (1958BC).
Gaussian imputation from summary statistics recovers 95% (105%) of the
effective sample size (as quantified by the ratio of association
statistics) compared to HMM-based imputation from individual-level genotypes at
the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition,
for publicly available summary statistics from large meta-analyses of 4 lipid
traits, we publicly release imputed summary statistics at 1000G SNPs, which
could not have been obtained using previously published methods, and
demonstrate their accuracy by masking subsets of the data. We show that 1000G
imputation using our approach increases the magnitude and statistical evidence
of enrichment at genic vs. non-genic loci for these traits, as compared to an
analysis without 1000G imputation. Thus, imputation of summary statistics will
be a valuable tool in future functional enrichment analyses.
Comment: 32 pages, 4 figures
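The core of Gaussian imputation from summary statistics is the conditional mean of a multivariate normal: if z-scores at typed and untyped SNPs are jointly Gaussian with LD as their correlation, the z-score at an untyped SNP is predicted as Σ_ut Σ_tt⁻¹ z_t. The sketch below implements only that formula with the standard library. The ridge term added to the diagonal is a stand-in assumption for the paper's adjustment for finite reference-panel size, and the names `solve`, `impute_z` and the default `ridge=0.1` are illustrative rather than taken from the released software.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a list of rows, b a list)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def impute_z(ld_tt, ld_ut, z_typed, ridge=0.1):
    """Impute one untyped z-score from typed z-scores.

    ld_tt:   LD (correlation) matrix among typed SNPs, from a reference panel
    ld_ut:   LD between the untyped SNP and each typed SNP
    z_typed: observed association z-scores at the typed SNPs
    ridge:   diagonal regularisation (illustrative value, an assumption)
    """
    reg = [[v + (ridge if i == j else 0.0) for j, v in enumerate(row)]
           for i, row in enumerate(ld_tt)]
    # ld_tt is symmetric, so solving reg w = ld_ut gives the imputation weights.
    weights = solve(reg, ld_ut)
    z_hat = sum(w * z for w, z in zip(weights, z_typed))
    return z_hat, weights


# Two uncorrelated typed SNPs; the untyped SNP has r = 0.5 with the first.
z_hat, w = impute_z([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.0], [4.0, 1.0], ridge=0.0)
```

With `ridge=0.0` the function reduces to the plain conditional mean; a positive ridge shrinks the imputed z-score toward zero, which is one simple way to guard against noisy LD estimates from a small reference panel.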
Genotyping common and rare variation using overlapping pool sequencing
Background: Recent advances in sequencing technologies set the stage for large, population-based studies, in which the DNA or RNA of thousands of individuals will be sequenced. Currently, however, such studies are still infeasible using a straightforward sequencing approach; as a result, a few multiplexing schemes have recently been suggested, in which a small number of DNA pools are sequenced, and the results are then deconvoluted using compressed sensing or similar approaches. These methods, however, are limited to the detection of rare variants. Results: In this paper we provide a new algorithm for the deconvolution of DNA pool multiplexing schemes. The presented algorithm utilizes a likelihood model and linear programming. The approach allows for the addition of external data, particularly imputation data, resulting in a flexible environment that is suitable for different applications. Conclusions: In particular, we demonstrate that both low and high allele frequency SNPs can be accurately genotyped when the DNA pooling scheme is performed in conjunction with microarray genotyping and imputation. Additionally, we demonstrate the use of our framework for the detection of cancer fusion genes from RNA sequences.
Twas_SIM, a Python-Based Tool for Simulation and Power Analysis of Transcriptome-Wide Association Analysis
Genome-wide association studies (GWASs) have identified numerous genetic variants associated with complex disease risk; however, most of these associations are non-coding, which complicates the identification of their proximal target genes. Transcriptome-wide association studies (TWASs) have been proposed to mitigate this gap by integrating expression quantitative trait loci (eQTL) data with GWAS data. Numerous methodological advancements have been made for TWAS, yet each approach requires ad hoc simulations to demonstrate feasibility. Here, we present twas_sim, a computationally scalable and easily extendable tool for simplified performance evaluation and power analysis for TWAS methods.
Software and documentation are available at https://github.com/mancusolab/twas_sim
Extending Admixture Mapping to Nuclear Pedigrees: Application to Sarcoidosis
We describe statistical methods that extend the application of admixture mapping from unrelated individuals to nuclear pedigrees, allowing existing pedigree-based collections to be fully exploited. Computational challenges have been overcome by developing a fast algorithm that exploits the factorial structure of the underlying model of ancestry transitions. This has been implemented as an extension of the program ADMIXMAP. We demonstrate the application of the method to a study of sarcoidosis in African Americans that has previously been analyzed only as an admixture mapping study restricted to unrelated individuals. Although the ancestry signals detected in this pedigree analysis are generally similar to those detected in the earlier analysis of unrelated cases, we are able to extract more information, and this yields a much sharper exclusion map: using the classical criterion of an LOD score of minus 2, the pedigree analysis is able to exclude a risk ratio of 2 or more associated with African ancestry over 96% of the genome, compared with only 83% in the earlier analysis restricted to unrelated individuals. Although the pedigree extension of ADMIXMAP can use ancestry-informative markers only at relatively low density, it can use imputed ancestry states from programs such as WINPOP or HAPMIX that use dense SNP marker genotypes for admixture mapping. This extends both the efficiency and the range of application of this powerful gene mapping method.