
    Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation

    Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods, and in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data.
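An MI analysis usually fits the association model once per imputed dataset and then pools the results. The abstract does not spell out the pooling step, so the following is only a minimal sketch of the standard approach (Rubin's rules), not necessarily the procedure used in the paper. The function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, squared_ses):
    """Pool per-imputation effect estimates with Rubin's rules.

    estimates   -- one effect estimate per imputed dataset
    squared_ses -- the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(squared_ses):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)          # average effect across imputations
    within = mean(squared_ses)        # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical per-imputation results for a single SNP.
effect, total_var = pool_rubin([0.10, 0.14, 0.12], [0.0004, 0.0005, 0.0006])
```

The between-imputation term is what carries genotype uncertainty into the final standard error; an analysis of a single best-guess genotype has no equivalent term.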

    The variance of identity-by-descent sharing in the Wright-Fisher model

    Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of populations that have experienced recent genetic drift. Detection of these IBD segments has recently become feasible, enabling a wide range of applications from phasing and imputation to demographic inference. Here, we study the distribution of IBD sharing in the Wright-Fisher model. Specifically, using coalescent theory, we calculate the variance of the total sharing between random pairs of individuals. We then investigate the cohort-averaged sharing: the average total sharing between one individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing is distributed approximately normally. Surprisingly, the variance of this distribution does not vanish even for large cohorts, implying the existence of "hyper-sharing" individuals. The presence of such individuals has consequences for the design of sequencing studies, since, if they are selected for whole-genome sequencing, a larger fraction of the cohort can be subsequently imputed. We calculate the expected gain in power of imputation by IBD, and subsequently, in power to detect an association, when individuals are either randomly selected or specifically chosen to be the hyper-sharing individuals. Using our framework, we also compute the variance of an estimator of the population size that is based on the mean IBD sharing and the variance in the sharing between inbred siblings. Finally, we study IBD sharing in an admixture pulse model, and show that in the Ashkenazi Jewish population the admixture fraction is correlated with the cohort-averaged sharing. Comment: Includes Supplementary Material.

    Length Distributions of Identity by Descent Reveal Fine-Scale Demographic History

    Data-driven studies of identity by descent (IBD) were recently enabled by high-resolution genomic data from large cohorts and scalable algorithms for IBD detection. Yet, haplotype sharing currently represents an underutilized source of information for population-genetics research. We present analytical results on the relationship between haplotype sharing across purportedly unrelated individuals and a population's demographic history. We express the distribution of IBD sharing across pairs of individuals for segments of arbitrary length as a function of the population's demography, and we derive an inference procedure to reconstruct such demographic history. The accuracy of the proposed reconstruction methodology was extensively tested on simulated data. We applied this methodology to two densely typed data sets: 500 Ashkenazi Jewish (AJ) individuals and 56 Kenyan Maasai (MKK) individuals (HapMap 3 data set). Reconstructing the demographic history of the AJ cohort, we recovered two subsequent population expansions, separated by a severe founder event, consistent with previous analysis of lower-throughput genetic data and historical accounts of AJ history. In the MKK cohort, high levels of cryptic relatedness were detected. The spectrum of IBD sharing is consistent with a demographic model in which several small-sized demes intermix through high migration rates and result in enrichment of shared long-range haplotypes. This scenario of historically structured demographies might explain the unexpected abundance of runs of homozygosity within several populations.

    Integrative eQTL-Based Analyses Reveal the Biology of Breast Cancer Risk Loci

    This work was completed in the laboratory of Professor Matthew Freedman, the corresponding author of the paper, at the Dana-Farber Cancer Institute, Harvard Medical School, USA. Germline determinants of gene expression in tumors are infrequently studied due to the complexity of transcript regulation caused by somatically acquired alterations. We performed expression quantitative trait locus (eQTL)-based analyses using the multi-level information provided in The Cancer Genome Atlas (TCGA). Of the factors we measured, cis-acting eQTLs accounted for 1.2% of the total variation of tumor gene expression, while somatic copy-number alteration and CpG methylation accounted for 7.3% and 3.3%, respectively. eQTL analyses of 15 previously reported breast cancer risk loci resulted in the discovery of three variants that are significantly associated with transcript levels (false discovery rate [FDR] < 0.1). Our trans-based analysis identified an additional three risk loci that act through ESR1, MYC, and KLF4. These findings provide a more comprehensive picture of gene expression determinants in breast cancer as well as insights into the underlying biology of breast cancer risk loci.

    Elevated GM3 plasma concentration in idiopathic Parkinson's disease: A lipidomic analysis

    Parkinson's disease (PD) is a common neurodegenerative disease whose pathological hallmark is the accumulation of intracellular α-synuclein aggregates in Lewy bodies. Lipid metabolism dysregulation may play a significant role in PD pathogenesis; however, large plasma lipidomic studies in PD are lacking. In the current study, we analyzed the lipidomic profile of plasma obtained from 150 idiopathic PD patients and 100 controls, taken from the 'Spot' study at Columbia University Medical Center in New York. Our mass spectrometry based analytical panel consisted of 520 lipid species from 39 lipid subclasses including all major classes of glycerophospholipids, sphingolipids, glycerolipids and sterols. Each lipid species was analyzed using a logistic regression model. The plasma concentrations of two lipid subclasses, triglycerides and monosialodihexosylganglioside (GM3), were different between PD and control participants. GM3 ganglioside concentration had the most significant difference between PD and controls (1.531±0.037 pmol/μl versus 1.337±0.040 pmol/μl respectively; p-value = 5.96E-04; q-value = 0.048; when normalized to total lipid: p-value = 2.890E-05; q-value = 2.933E-03). Next, we used a collection of 20 GM3 and glucosylceramide (GlcCer) species concentrations normalized to total lipid to perform a ROC curve analysis, and found that these lipids compare favorably with biomarkers reported in previous studies (AUC = 0.742 for males, AUC = 0.644 for females). Our results suggest that higher plasma GM3 levels are associated with PD. GM3 lies in the same glycosphingolipid metabolic pathway as GlcCer, a substrate of the enzyme glucocerebrosidase, which has been associated with PD. These findings are consistent with previous reports linking lower glucocerebrosidase activity with PD risk.
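The AUC values quoted above summarize how well a score separates patients from controls. As a rough illustration only, the area under the ROC curve equals the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the scores below are hypothetical and are not data from the study.

```python
def roc_auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (case, control) pairs in which the case has
    the higher score; tied pairs contribute one half.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical normalized lipid scores for two cases and two controls.
auc = roc_auc([3.0, 2.0], [1.0, 2.0])
```

An AUC of 0.5 corresponds to no separation and 1.0 to perfect separation, which is the scale on which the reported 0.742 and 0.644 should be read.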

    A Hidden Markov Model for Copy Number Variant prediction from whole genome resequencing data

    Motivation: Copy Number Variants (CNVs) are important genetic factors for studying human diseases. While high-throughput whole genome re-sequencing provides multiple lines of evidence for detecting CNVs, computational algorithms need to be tailored for different types or sizes of CNVs under different experimental designs. Results: To achieve optimal power and resolution of detecting CNVs at low depth of coverage, we implemented a Hidden Markov Model that integrates both depth of coverage and mate-pair relationship. The novelty of our algorithm is that we infer the likelihood of carrying a deletion jointly from multiple mate pairs in a region, without requiring any single mate pair to be an obvious outlier. By integrating all useful information in a comprehensive model, our method is able to detect medium-size deletions (200-2000bp) at low depth (<10× per sample). We applied the method to simulated data and demonstrate that the power of detecting medium-size deletions is close to theoretical values. Availability: A program implemented in Java, Zinfandel, is available at http://www.cs.columbia.edu/~itsik/zinfandel
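To make the HMM idea concrete, here is a deliberately simplified sketch: a two-state model (normal copy number versus heterozygous deletion) decoded with the Viterbi algorithm over read counts per window. It uses depth of coverage only and omits the mate-pair evidence that the abstract describes, so it is not the Zinfandel model. The Poisson emissions, the `mean_depth` and `stay` parameters, and the example counts are all assumptions made for illustration.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k reads under a Poisson rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_deletions(depths, mean_depth=8.0, stay=0.99):
    """Label each window 0 (normal) or 1 (heterozygous deletion).

    Assumes read counts are Poisson with rate mean_depth in normal
    regions and half that rate where one copy is deleted.
    """
    if not depths:
        return []
    rates = (mean_depth, mean_depth / 2.0)
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    # Equal prior over the two states for the first window.
    score = [math.log(0.5) + poisson_logpmf(depths[0], r) for r in rates]
    back = []
    for depth in depths[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            # Best previous state for arriving in `state`.
            cands = [score[prev] + (log_stay if prev == state else log_switch)
                     for prev in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + poisson_logpmf(depth, rates[state]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical read counts per window with a dip in the middle.
calls = viterbi_deletions([8, 9, 7, 8, 2, 1, 2, 3, 2, 1, 8, 9, 8])
```

The high `stay` probability is what lets several mildly low windows jointly support a deletion call while a single low window does not, which mirrors the abstract's point about pooling weak evidence across a region.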

    WGS-based telomere length analysis in Dutch family trios implicates stronger maternal inheritance and a role for RRM1 gene

    Telomere length (TL) regulation is an important factor in ageing, reproduction and cancer development. Genetic, hereditary and environmental factors regulating TL are currently widely investigated; however, their relative contribution to TL variability is still understudied. We have used whole genome sequencing data of 250 family trios from the Genome of the Netherlands project to perform computational measurement of TL and a series of regression and genome-wide association analyses to reveal TL inheritance patterns and associated genetic factors. Our results confirm that TL is a largely heritable trait, with the mother's TL and, to a lesser extent, the father's TL having the strongest influence on the offspring. In this cohort, the mother's, but not the father's, age at conception was positively linked to offspring TL. Age-related TL attrition of 40 bp/year had relatively small influence on TL variability. Finally, we have identified TL-associated variations in ribonucleotide reductase catalytic subunit M1 (RRM1 gene), which is known to regulate telomere maintenance in yeast. We also highlight the importance of a multivariate approach and the limitations of existing tools for the analysis of TL as a polygenic heritable quantitative trait.

    Distribution of papers across journals, for journals that had at least one article with sufficient information for analysis.

    The full distribution of all journals analyzed in the study, including those with all papers excluded, is in Table A in S1 File (http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1006916#pgen.1006916.s012).