RNA-Seq optimization with eQTL gold standards.
Background: RNA-Sequencing (RNA-Seq) experiments have been optimized for library preparation, mapping, and gene expression estimation. These methods, however, have revealed weaknesses in the next stages of analysis of differential expression, with results sensitive to systematic sample stratification or, in more extreme cases, to outliers. Further, a method to assess normalization and adjustment measures imposed on the data is lacking. Results: To address these issues, we utilize previously published eQTLs as a novel gold standard at the center of a framework that integrates DNA genotypes and RNA-Seq data to optimize analysis and aid in the understanding of genetic variation and gene expression. After detecting sample contamination and sequencing outliers in RNA-Seq data, a set of previously published brain eQTLs was used to determine if sample outlier removal was appropriate. Improved replication of known eQTLs supported removal of these samples in downstream analyses. eQTL replication was further employed to assess normalization methods, covariate inclusion, and gene annotation. This method was validated in an independent RNA-Seq blood data set from the GTEx project and a tissue-appropriate set of eQTLs. eQTL replication in both data sets highlights the necessity of accounting for unknown covariates in RNA-Seq data analysis. Conclusion: As each RNA-Seq experiment is unique with its own experiment-specific limitations, we offer an easily implementable method that uses the replication of known eQTLs to guide each step in one's data analysis pipeline. In the two data sets presented herein, we highlight not only the necessity of careful outlier detection but also the need to account for unknown covariates in RNA-Seq experiments.
Robust identification of local adaptation from allele frequencies
Comparing allele frequencies among populations that differ in environment has
long been a tool for detecting loci involved in local adaptation. However, such
analyses are complicated by an imperfect knowledge of population allele
frequencies and neutral correlations of allele frequencies among populations
due to shared population history and gene flow. Here we develop a set of
methods to robustly test for unusual allele frequency patterns, and
correlations between environmental variables and allele frequencies while
accounting for these complications based on a Bayesian model previously
implemented in the software Bayenv. Using this model, we calculate a set of
`standardized allele frequencies' that allows investigators to apply tests of
their choice to multiple populations, while accounting for sampling and
covariance due to population history. We illustrate this first by showing that
these standardized frequencies can be used to calculate powerful tests to
detect non-parametric correlations with environmental variables, which are also
less prone to spurious results due to outlier populations. We then demonstrate
how these standardized allele frequencies can be used to construct a test to
detect SNPs that deviate strongly from neutral population structure. This test
is conceptually related to FST but should be more powerful as we account for
population history. We also extend the model to next-generation sequencing of
population pools, which is a cost-efficient way to estimate population allele
frequencies, but it implies an additional level of sampling noise. The utility
of these methods is demonstrated in simulations and by re-analyzing human SNP
data from the HGDP populations. An implementation of our method will be
available from http://gcbias.org.
Comment: 27 pages, 7 figures
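The idea of standardized allele frequencies can be illustrated as a whitening step. The sketch below is not the Bayenv implementation: it assumes a known population covariance matrix `omega`, a known ancestral frequency `ancestral`, and point estimates of the population frequencies, whereas the abstract describes a Bayesian model that also accounts for sampling. The function names `standardize` and `xtx` are hypothetical.

```python
import numpy as np


def standardize(freqs, ancestral, omega):
    """Whiten population allele frequencies (illustrative sketch only).

    freqs     : observed allele frequency in each population
    ancestral : assumed ancestral allele frequency (scalar, strictly between 0 and 1)
    omega     : assumed covariance matrix of frequencies among populations
    """
    # Cholesky factor L with omega = L @ L.T
    chol = np.linalg.cholesky(omega)
    # Centre on the ancestral frequency and scale by the binomial standard deviation
    centered = (np.asarray(freqs, dtype=float) - ancestral) / np.sqrt(
        ancestral * (1.0 - ancestral)
    )
    # Solving L x = centered removes the covariance due to shared history
    return np.linalg.solve(chol, centered)


def xtx(standardized):
    """Sum of squared standardized frequencies; large values suggest a SNP
    that deviates from the assumed neutral structure."""
    return float(np.dot(standardized, standardized))


if __name__ == "__main__":
    omega = np.array([[0.020, 0.010, 0.000],
                      [0.010, 0.030, 0.005],
                      [0.000, 0.005, 0.025]])
    x = standardize([0.55, 0.60, 0.20], ancestral=0.5, omega=omega)
    print("standardized frequencies:", x)
    print("deviation statistic:", xtx(x))
```

Once the frequencies are standardized, any test of the investigator's choice (for example a rank correlation with an environmental variable) can be applied to `x` as if the populations were independent, under the assumptions stated above.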
Exome resequencing and GWAS for growth, ecophysiology, and chemical and metabolomic composition of wood of Populus trichocarpa.
Background: Populus trichocarpa is an important forest tree species for the generation of lignocellulosic ethanol. Understanding the genomic basis of biomass production and chemical composition of wood is fundamental in supporting genetic improvement programs. Considerable variation has been observed in this species for complex traits related to growth, phenology, ecophysiology and wood chemistry. Those traits are influenced by both polygenic control and environmental effects, and their genome architecture and regulation are only partially understood. Genome-wide association studies (GWAS) represent an approach to advance that aim using thousands of single nucleotide polymorphisms (SNPs). Genotyping using exome capture methodologies represents an efficient approach to identify specific functional regions of genomes underlying phenotypic variation. Results: We identified 813 K SNPs, which were utilized for genotyping 461 P. trichocarpa clones, representing 101 provenances collected from Oregon and Washington, and established in California. A GWAS performed on 20 traits using single SNP-marker tests identified a variable number of significant SNPs (p-value < 6.1479E-8) in association with diameter, height, leaf carbon and nitrogen contents, and δ15N. The number of significant SNPs ranged from 2 to 220 per trait. Additionally, multiple-marker analyses by sliding-window tests detected between 6 and 192 significant windows for the analyzed traits. The significant SNPs resided within genes that encode proteins belonging to different functional classes such as protein synthesis, energy/metabolism and DNA/RNA metabolism, among others. Conclusions: SNP markers within genes associated with traits of importance for biomass production were detected. They help characterize the genomic architecture of P. trichocarpa biomass, which is required to support the development and application of marker-based breeding technologies.
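The reported significance threshold is numerically consistent with a Bonferroni correction over roughly 813 K tests. The abstract does not name the correction method or the family-wise error rate, so the value `alpha = 0.05` and the Bonferroni interpretation in this short check are assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


alpha = 0.05                     # assumed; not stated in the abstract
reported_threshold = 6.1479e-8   # threshold quoted in the abstract

# Number of tests implied by the reported threshold under the Bonferroni assumption
implied_tests = alpha / reported_threshold

if __name__ == "__main__":
    print("implied number of tests:", round(implied_tests))
    print("threshold for 813,000 tests:", bonferroni_threshold(alpha, 813_000))
```

The implied number of tests falls between 813,000 and 814,000, which matches the "813 K SNPs" figure to the precision given.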
Evaluation of polygenic determinants of non-alcoholic fatty liver disease (NAFLD) by a candidate genes resequencing strategy
NAFLD is a polygenic condition, but the individual and cumulative contributions of the identified genes remain to be established. To gain additional insight into the genetic architecture of NAFLD, the GWAS-identified genes GCKR, PPP1R3B, NCAN, LYPLAL1 and TM6SF2 were resequenced by next-generation sequencing in a cohort of 218 NAFLD subjects and 227 controls, in whom PNPLA3 rs738409 and MBOAT7 rs641738 genotypes were also obtained. A total of 168 sequence variants were detected and 47 were annotated as functional. When all functional variants within each gene were considered, only those in TM6SF2 accumulated in NAFLD subjects compared to controls (P = 0.04). Among individual variants, rs1260326 in GCKR and rs641738 in MBOAT7 (recessive), rs58542926 in TM6SF2 and rs738409 in PNPLA3 (dominant) emerged as associated with NAFLD, with PNPLA3 rs738409 being the strongest predictor (OR 3.12, 95% CI 1.8-5.5, P […] 0.28 was associated with a 3-fold increased risk of NAFLD. Interestingly, rs61756425 in PPP1R3B and rs641738 in MBOAT7 were predictors of NAFLD severity. Overall, TM6SF2, GCKR, PNPLA3 and MBOAT7 were confirmed to be associated with NAFLD, and a score based on these genes was highly predictive of this condition. In addition, PPP1R3B and MBOAT7 might influence NAFLD severity.
Genome-wide association study in two cohorts from a multi-generational mouse advanced intercross line highlights the difficulty of replication due to study-specific heterogeneity
There has been extensive discussion of the Replication Crisis in many fields, including genome-wide association studies
The Decay of Disease Association with Declining Linkage Disequilibrium: A Fine Mapping Theorem
Several important and fundamental aspects of disease genetics models have yet to be described. One such property is the relationship of disease association statistics at a marker site closely linked to a disease-causing site. A complete description of this two-locus system is of particular importance to experimental efforts to fine map association signals for complex diseases. Here, we present a simple relationship between disease association statistics and the decline of linkage disequilibrium from a causal site. Specifically, the ratio of chi-square disease association statistics at a marker site and causal site is equivalent to the standard measure of pairwise linkage disequilibrium, r². A complete derivation of this relationship from a general disease model is shown. Quite interestingly, this relationship holds across all modes of inheritance. Extensive Monte Carlo simulations using a disease genetics model applied to chromosomes subjected to a standard model of recombination are employed to better understand the variation around this fine mapping theorem due to sampling effects. We also use this relationship to provide a framework for estimating properties of a non-interrogated causal site using data at closely linked markers. Lastly, we apply this way of examining association data from high-density genotyping in a large, publicly available data set investigating extreme BMI. We anticipate that understanding the patterns of disease association decay with declining linkage disequilibrium from a causal site will enable more powerful fine mapping methods and provide new avenues for identifying causal sites/genes from fine-mapping studies.
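The stated relationship can be checked numerically on expected counts. This is a minimal sketch, not the authors' derivation or simulation. It assumes an allelic (2x2) Pearson chi-square test, that the marker allele depends on case status only through the causal allele carried on the same haplotype, and that r² is measured in the pooled case-control sample. All function names and parameter values are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def marker_vs_causal(p_case, p_ctrl, m_given_risk, m_given_other,
                     n_case=2000.0, n_ctrl=3000.0):
    """Return (chi-square ratio marker/causal, pooled r^2) from expected counts.

    p_case, p_ctrl : causal-allele frequency among case and control chromosomes
    m_given_risk   : P(marker allele | causal allele on the same haplotype)
    m_given_other  : P(marker allele | non-causal allele on the same haplotype)
    """
    def haplotypes(p, n):
        # Keys are (causal allele, marker allele); 1 marks the allele of interest
        return {
            (1, 1): n * p * m_given_risk,
            (1, 0): n * p * (1 - m_given_risk),
            (0, 1): n * (1 - p) * m_given_other,
            (0, 0): n * (1 - p) * (1 - m_given_other),
        }

    def allele_count(hap, locus):
        return sum(v for key, v in hap.items() if key[locus] == 1)

    case, ctrl = haplotypes(p_case, n_case), haplotypes(p_ctrl, n_ctrl)

    # Allelic chi-square at the causal site (locus 0) and the marker (locus 1)
    chi2 = []
    for locus in (0, 1):
        a = allele_count(case, locus)
        c = allele_count(ctrl, locus)
        chi2.append(chi_square_2x2(a, n_case - a, c, n_ctrl - c))

    # Pairwise linkage disequilibrium r^2 in the pooled sample
    n = n_case + n_ctrl
    pooled = {key: case[key] + ctrl[key] for key in case}
    p = allele_count(pooled, 0) / n
    q = allele_count(pooled, 1) / n
    d = pooled[(1, 1)] / n - p * q
    r2 = d * d / (p * (1 - p) * q * (1 - q))
    return chi2[1] / chi2[0], r2


if __name__ == "__main__":
    ratio, r2 = marker_vs_causal(0.30, 0.20, m_given_risk=0.9, m_given_other=0.1)
    print("chi-square ratio:", ratio)
    print("pooled r^2:      ", r2)
```

Because the sketch uses expected rather than sampled counts, the two printed values agree up to floating-point error; sampling variation around the relationship is what the abstract's Monte Carlo simulations address.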
Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants.
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy
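The core idea of finding high-abundance k-mers unique to the individual of interest can be sketched in a few lines. This is not Kevlar's code or API: the function names, the exact-match k-mer counting, and the simple abundance cutoff are assumptions for illustration, and the sketch ignores reverse complements, sequencing errors, and the memory-efficient counting a real tool would need.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def novel_kmers(proband_reads, relative_read_sets, k, min_abundance):
    """Return k-mers seen at least min_abundance times in the proband
    and never seen in any relative's reads."""
    proband = kmer_counts(proband_reads, k)
    seen_in_relatives = set()
    for reads in relative_read_sets:
        seen_in_relatives.update(kmer_counts(reads, k))
    return {
        kmer
        for kmer, n in proband.items()
        if n >= min_abundance and kmer not in seen_in_relatives
    }


if __name__ == "__main__":
    mother = ["ACGTACGT"]
    father = ["ACGTACGT"]
    child = ["ACGTTCGT", "ACGTTCGT"]  # carries a single-base change
    print(sorted(novel_kmers(child, [mother, father], k=4, min_abundance=2)))
```

In this toy input, the reads containing the returned k-mers are the ones that would be passed on for partitioning and variant calling, as the abstract describes.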
The Genomic HyperBrowser: inferential genomics at the sequence level
The immense increase in the generation of genomic scale data poses an unmet
analytical challenge, due to a lack of established methodology with the
required flexibility and power. We propose a first principled approach to
statistical analysis of sequence-level genomic information. We provide a
growing collection of generic biological investigations that query pairwise
relations between tracks, represented as mathematical objects, along the
genome. The Genomic HyperBrowser implements the approach and is available at
http://hyperbrowser.uio.no