Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data
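The abstract above does not spell out how MI results are combined. A common choice is Rubin's rules: run the same association test on each imputed data set, then pool the per-data-set effect estimates and variances. The sketch below is a minimal illustration of that pooling step under this assumption. The function name `pool_rubin` and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules.

    estimates: effect estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                               # pooled point estimate
    within = sum(variances) / m                              # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                   # total variance
    return q_bar, total, math.sqrt(total)


# Hypothetical effect sizes from three imputed copies of one SNP.
betas = [0.10, 0.14, 0.12]
std_errors = [0.05, 0.05, 0.05]
beta, total_var, se = pool_rubin(betas, [s ** 2 for s in std_errors])
print(beta, se)
```

The pooled standard error is larger than any single-imputation standard error whenever the estimates disagree, which is how the between-imputation term carries genotype uncertainty into the test.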
The variance of identity-by-descent sharing in the Wright-Fisher model
Widespread sharing of long, identical-by-descent (IBD) genetic segments is a
hallmark of populations that have experienced recent genetic drift. Detection
of these IBD segments has recently become feasible, enabling a wide range of
applications from phasing and imputation to demographic inference. Here, we
study the distribution of IBD sharing in the Wright-Fisher model. Specifically,
using coalescent theory, we calculate the variance of the total sharing between
random pairs of individuals. We then investigate the cohort-averaged sharing:
the average total sharing between one individual and the rest of the cohort. We
find that for large cohorts, the cohort-averaged sharing is distributed
approximately normally. Surprisingly, the variance of this distribution does
not vanish even for large cohorts, implying the existence of "hyper-sharing"
individuals. The presence of such individuals has consequences for the design
of sequencing studies, since, if they are selected for whole-genome sequencing,
a larger fraction of the cohort can be subsequently imputed. We calculate the
expected gain in power of imputation by IBD, and subsequently, in power to
detect an association, when individuals are either randomly selected or
specifically chosen to be the hyper-sharing individuals. Using our framework,
we also compute the variance of an estimator of the population size that is
based on the mean IBD sharing and the variance in the sharing between inbred
siblings. Finally, we study IBD sharing in an admixture pulse model, and show
that in the Ashkenazi Jewish population the admixture fraction is correlated
with the cohort-averaged sharing.
Comment: Includes Supplementary Material
Length Distributions of Identity by Descent Reveal Fine-Scale Demographic History
Data-driven studies of identity by descent (IBD) were recently enabled by high-resolution genomic data from large cohorts and scalable algorithms for IBD detection. Yet, haplotype sharing currently represents an underutilized source of information for population-genetics research. We present analytical results on the relationship between haplotype sharing across purportedly unrelated individuals and a population's demographic history. We express the distribution of IBD sharing across pairs of individuals for segments of arbitrary length as a function of the population's demography, and we derive an inference procedure to reconstruct such demographic history. The accuracy of the proposed reconstruction methodology was extensively tested on simulated data. We applied this methodology to two densely typed data sets: 500 Ashkenazi Jewish (AJ) individuals and 56 Kenyan Maasai (MKK) individuals (HapMap 3 data set). Reconstructing the demographic history of the AJ cohort, we recovered two subsequent population expansions, separated by a severe founder event, consistent with previous analysis of lower-throughput genetic data and historical accounts of AJ history. In the MKK cohort, high levels of cryptic relatedness were detected. The spectrum of IBD sharing is consistent with a demographic model in which several small-sized demes intermix through high migration rates and result in enrichment of shared long-range haplotypes. This scenario of historically structured demographies might explain the unexpected abundance of runs of homozygosity within several populations.
Current status of artificial intelligence methods for skin cancer survival analysis: a scoping review
Skin cancer mortality rates continue to rise, and survival analysis is increasingly needed to understand who is at risk and what interventions improve outcomes. However, current statistical methods are limited by an inability to synthesize multiple data types, such as patient genetics, clinical history, demographics, and pathology, and to reveal significant multimodal relationships through predictive algorithms. Advances in computing power and data science enabled the rise of artificial intelligence (AI), which synthesizes vast amounts of data and applies algorithms that enable personalized diagnostic approaches. Here, we analyze AI methods used in skin cancer survival analysis, focusing on supervised learning, unsupervised learning, deep learning, and natural language processing. We illustrate strengths and weaknesses of these approaches with examples. Our PubMed search yielded 14 publications meeting inclusion criteria for this scoping review. Most publications focused on melanoma, particularly histopathologic interpretation with deep learning. Such concentration on a single type of skin cancer amid increasing focus on deep learning highlights growing areas for innovation; however, it also demonstrates opportunity for additional analysis that addresses other types of cutaneous malignancies and expands the scope of prognostication to combine genetic, histopathologic, and clinical data. Moreover, researchers may leverage multiple AI methods for enhanced benefit in analyses. Expanding AI to this arena may enable improved survival analysis, targeted treatments, and outcomes.
Integrative eQTL-Based Analyses Reveal the Biology of Breast Cancer Risk Loci
This work was completed in the laboratory of the corresponding author, Professor Matthew Freedman, at the Dana-Farber Cancer Institute, Harvard Medical School, USA.
Germline determinants of gene expression in tumors are infrequently studied due to the complexity of transcript regulation caused by somatically acquired alterations. We performed expression quantitative trait locus (eQTL)-based analyses using the multi-level information provided in The Cancer Genome Atlas (TCGA). Of the factors we measured, cis-acting eQTLs accounted for 1.2% of the total variation of tumor gene expression, while somatic copy-number alteration and CpG methylation accounted for 7.3% and 3.3%, respectively. eQTL analyses of 15 previously reported breast cancer risk loci resulted in the discovery of three variants that are significantly associated with transcript levels (false discovery rate [FDR] < 0.1). Our trans-based analysis identified an additional three risk loci to act through ESR1, MYC, and KLF4. These findings provide a more comprehensive picture of gene expression determinants in breast cancer as well as insights into the underlying biology of breast cancer risk loci.
Elevated GM3 plasma concentration in idiopathic Parkinson's disease: A lipidomic analysis
Parkinson's disease (PD) is a common neurodegenerative disease whose pathological hallmark is the accumulation of intracellular α-synuclein aggregates in Lewy bodies. Lipid metabolism dysregulation may play a significant role in PD pathogenesis; however, large plasma lipidomic studies in PD are lacking. In the current study, we analyzed the lipidomic profile of plasma obtained from 150 idiopathic PD patients and 100 controls, taken from the “Spot” study at Columbia University Medical Center in New York. Our mass spectrometry based analytical panel consisted of 520 lipid species from 39 lipid subclasses including all major classes of glycerophospholipids, sphingolipids, glycerolipids and sterols. Each lipid species was analyzed using a logistic regression model. The plasma concentrations of two lipid subclasses, triglycerides and monosialodihexosylganglioside (GM3), were different between PD and control participants. GM3 ganglioside concentration had the most significant difference between PD and controls (1.531±0.037 pmol/μl versus 1.337±0.040 pmol/μl respectively; p-value = 5.96E-04; q-value = 0.048; when normalized to total lipid: p-value = 2.890E-05; q-value = 2.933E-03). Next, we used a collection of 20 GM3 and glucosylceramide (GlcCer) species concentrations normalized to total lipid to perform a ROC curve analysis, and found that these lipids compare favorably with biomarkers reported in previous studies (AUC = 0.742 for males, AUC = 0.644 for females). Our results suggest that higher plasma GM3 levels are associated with PD. GM3 lies in the same glycosphingolipid metabolic pathway as GlcCer, a substrate of the enzyme glucocerebrosidase, which has been associated with PD. These findings are consistent with previous reports implicating lower glucocerebrosidase activity in PD risk.
A Hidden Markov Model for Copy Number Variant prediction from whole genome resequencing data
Motivation: Copy Number Variants (CNVs) are important genetic factors for studying human diseases. While high-throughput whole genome re-sequencing provides multiple lines of evidence for detecting CNVs, computational algorithms need to be tailored for different types or sizes of CNVs under different experimental designs. Results: To achieve optimal power and resolution of detecting CNVs at low depth of coverage, we implemented a Hidden Markov Model that integrates both depth of coverage and mate-pair relationship. The novelty of our algorithm is that we infer the likelihood of carrying a deletion jointly from multiple mate pairs in a region, without requiring any single mate pair to be an obvious outlier. By integrating all useful information in a comprehensive model, our method is able to detect medium-size deletions (200-2000bp) at low depth (<10× per sample). We applied the method to simulated data and demonstrate the power of detecting medium-size deletions is close to theoretical values. Availability: A program implemented in Java, Zinfandel, is available at http://www.cs.columbia.edu/~itsik/zinfandel
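To make the Hidden Markov Model idea concrete, the sketch below decodes a two-state HMM (normal versus deletion) from per-window read depth with the Viterbi algorithm. It is a simplified illustration and not the Zinfandel implementation: it uses depth of coverage only and omits the mate-pair evidence described in the abstract. The Poisson emission model, the parameters `mean_depth`, `del_factor` and `switch_prob`, and the example depths are all assumptions made for this illustration.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k reads under a Poisson(lam) model."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_deletions(depths, mean_depth=8.0, del_factor=0.5, switch_prob=0.01):
    """Return the most likely state ("normal" or "deletion") for each window."""
    if not depths:
        return []
    states = ("normal", "deletion")
    # Assumed emission model: a heterozygous deletion halves the expected depth.
    lam = {"normal": mean_depth, "deletion": mean_depth * del_factor}

    def log_trans(prev, cur):
        return math.log(1 - switch_prob) if prev == cur else math.log(switch_prob)

    # Initialisation with a uniform prior over the two states.
    score = {s: math.log(0.5) + poisson_logpmf(depths[0], lam[s]) for s in states}
    back = []
    for d in depths[1:]:
        new_score, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda p: score[p] + log_trans(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + poisson_logpmf(d, lam[cur]))
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical depths: four normal windows, ten low-depth windows, four normal windows.
depths = [8, 9, 7, 8, 4, 3, 4, 5, 3, 4, 2, 4, 3, 5, 8, 9, 8, 7]
path = viterbi_deletions(depths)
print(path)
```

No single low-depth window is decisive here; the deletion call comes from the accumulated evidence across the run, which mirrors the abstract's point about inferring a deletion jointly from many weak observations.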
Extended haplotype association study in Crohn's disease identifies a novel, Ashkenazi Jewish-specific missense mutation in the NF-κB pathway gene, HEATR3
The Ashkenazi Jewish population has a several-fold higher prevalence of Crohn's disease compared to non-Jewish European ancestry populations and has a unique genetic history. Haplotype association is critical to Crohn's disease etiology in this population, most notably at NOD2, in which three causal, uncommon, and conditionally independent NOD2 variants reside on a shared background haplotype. We present an analysis of extended haplotypes which showed significantly greater association to Crohn's disease in the Ashkenazi Jewish population compared to a non-Jewish population (145 haplotypes and no haplotypes with P-value < 10^-3, respectively). Two haplotype regions, one each on chromosomes 16 and 21, conferred increased disease risk within established Crohn's disease loci. We performed exome sequencing of 55 Ashkenazi Jewish individuals and follow-up genotyping focused on variants in these two regions. We observed Ashkenazi Jewish-specific nominal association at R755C in TRPM2 on chromosome 21. Within the chromosome 16 region, R642S of HEATR3 and rs9922362 of BRD7 showed genome-wide significance. Expression studies of HEATR3 demonstrated a positive role in NOD2-mediated NF-κB signaling. The BRD7 signal showed conditional dependence with only the downstream rare Crohn's disease-causal variants in NOD2, but not with the background haplotype; this elaborates NOD2 as a key illustration of synthetic association.
WGS-based telomere length analysis in Dutch family trios implicates stronger maternal inheritance and a role for RRM1 gene
Telomere length (TL) regulation is an important factor in ageing, reproduction and cancer development. Genetic, hereditary and environmental factors regulating TL are currently widely investigated; however, their relative contribution to TL variability is still understudied. We have used whole genome sequencing data of 250 family trios from the Genome of the Netherlands project to perform computational measurement of TL and a series of regression and genome-wide association analyses to reveal TL inheritance patterns and associated genetic factors. Our results confirm that TL is a largely heritable trait, with mother's TL and, to a lesser extent, father's TL having the strongest influence on the offspring. In this cohort, mother's, but not father's, age at conception was positively linked to offspring TL. Age-related TL attrition of 40 bp/year had a relatively small influence on TL variability. Finally, we have identified TL-associated variations in ribonucleotide reductase catalytic subunit M1 (RRM1 gene), which is known to regulate telomere maintenance in yeast. We also highlight the importance of a multivariate approach and the limitations of existing tools for the analysis of TL as a polygenic heritable quantitative trait.