
    A new mixture model approach to analyzing allelic-loss data using Bayes factors

    BACKGROUND: Allelic-loss studies record data on the loss of genetic material in tumor tissue relative to normal tissue at various loci along the genome. As the deletion of a tumor suppressor gene can lead to tumor development, one objective of these studies is to determine which, if any, chromosome arms harbor tumor suppressor genes. RESULTS: We propose a large class of mixture models for describing the data, and we suggest using Bayes factors to select a reasonable model from the class in order to classify the chromosome arms. Bayes factors are especially useful when testing whether the number of components in a mixture model is n_0 versus n_1. In these cases, frequentist test statistics based on the likelihood ratio statistic have unknown distributions and are therefore not applicable. Our simulation study shows that Bayes factors favor the right model most of the time when tumor suppressor genes are present. When no tumor suppressor genes are present and background allelic loss varies, the Bayes factors are often inconclusive, although this results in a markedly reduced false-positive rate compared to that of standard frequentist approaches. Application of our methods to three data sets of esophageal adenocarcinomas yields interesting differences from previously published results. CONCLUSIONS: Our results indicate that Bayes factors are useful for analyzing allelic-loss data.
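
    To make the idea of a Bayes factor concrete, here is a minimal sketch for a single chromosome arm. It is a deliberately simplified stand-in, not the mixture-model class proposed in the abstract: it compares a fixed background loss rate (model 0) against a loss rate with a Beta prior (model 1). The function names, the background rate theta0, and the Beta(1, 1) prior are all illustrative assumptions.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_binomial(x: int, n: int, theta: float) -> float:
    """Log-likelihood of x losses in n informative cases at a fixed rate theta."""
    return (math.log(math.comb(n, x))
            + x * math.log(theta)
            + (n - x) * math.log(1.0 - theta))


def log_beta_binomial(x: int, n: int, a: float, b: float) -> float:
    """Log marginal likelihood when the loss rate has a Beta(a, b) prior."""
    return math.log(math.comb(n, x)) + log_beta(x + a, n - x + b) - log_beta(a, b)


def bayes_factor(x: int, n: int, theta0: float, a: float = 1.0, b: float = 1.0) -> float:
    """Bayes factor of model 1 (Beta prior on the rate) over model 0 (rate fixed at theta0).

    theta0, a, and b are hypothetical choices made for this illustration only.
    """
    return math.exp(log_beta_binomial(x, n, a, b) - log_binomial(x, n, theta0))


if __name__ == "__main__":
    # Hypothetical counts: losses observed out of informative cases on one arm.
    for losses, cases in [(3, 20), (12, 20)]:
        bf = bayes_factor(losses, cases, theta0=0.2)
        print(f"{losses}/{cases} losses: BF(model 1 vs model 0) = {bf:.3f}")
```

    A value well above 1 favors an elevated or uncertain loss rate over the fixed background rate, and a value well below 1 favors the background model. Values near 1 are inconclusive, the situation the abstract describes when no tumor suppressor genes are present.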

    Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach

    Estimation of the allele frequency at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data becomes increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than that of either pooling or not pooling, and it sometimes improves substantially over both extremes. The bias introduced is small, as is shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/07-AOAS121 by the Institute of Mathematical Statistics (http://www.imstat.org
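
    The adaptive pooling described in this abstract can be illustrated with a simple beta-binomial shrinkage sketch. This is an assumed, simplified model and not the estimator from the paper: each marker's primary-sample frequency gets a Beta prior centered on the supplemental-sample frequency, and a single prior strength m (a hypothetical parameter) is chosen by maximizing the marginal likelihood over a small grid. All function names and the grid values are illustrative.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(x: int, n: int, p: float, m: float) -> float:
    """Beta-binomial log marginal likelihood, up to a constant that does not depend on m.

    x: allele count in the primary sample, n: chromosomes in the primary sample,
    p: supplemental-sample frequency (must lie strictly between 0 and 1),
    m: prior strength, i.e. how many pseudo-observations the supplement is worth.
    """
    a, b = m * p, m * (1.0 - p)
    return log_beta(x + a, n - x + b) - log_beta(a, b)


def estimate_strength(markers, grid=(1.0, 10.0, 100.0, 1000.0)) -> float:
    """Empirical Bayes step: pick the prior strength that best explains all markers."""
    return max(grid, key=lambda m: sum(log_marginal(x, n, p, m) for x, n, p in markers))


def shrunken_frequency(x: int, n: int, p: float, m: float) -> float:
    """Posterior mean: a compromise between the raw estimate x/n and the supplement p."""
    return (x + m * p) / (n + m)


if __name__ == "__main__":
    # Hypothetical markers: (primary allele count, primary chromosomes, supplemental frequency).
    markers = [(12, 40, 0.25), (30, 40, 0.70), (5, 40, 0.15)]
    m_hat = estimate_strength(markers)
    for x, n, p in markers:
        est = shrunken_frequency(x, n, p, m_hat)
        print(f"raw={x / n:.3f} supplemental={p:.3f} shrunken={est:.3f}")
```

    When the primary and supplemental samples agree across markers, the estimated strength is large and the estimates lean on the pooled information. When they disagree, the strength is small and the estimates stay close to the unpooled values.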

    Statistical Methods For Genomic And Transcriptomic Sequencing

    Part 1: High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but CNV profiling from whole-exome sequencing (WES) is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for WES data. CODEX includes a Poisson latent factor model, which includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based segmentation procedure that explicitly models the count-based WES data. CODEX is compared to existing methods on germline CNV detection in HapMap samples using microarray-based gold standard and is further evaluated on 222 neuroblastoma samples with matched normal, with focus on somatic CNVs within the ATRX gene.
    Part 2: Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. We propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy, and compare against existing methods.
    Part 3: Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average expression across cells. Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression distribution between the two alleles of a diploid organism and thus the characterization of allele-specific bursting. We propose SCALE to analyze genome-wide allele-specific bursting, with adjustment of technical variability. SCALE detects genes exhibiting allelic differences in bursting parameters, and genes whose alleles burst non-independently. We apply SCALE to mouse blastocyst and human fibroblast cells and find that, globally, cis control in gene expression overwhelmingly manifests as differences in burst frequency.

    Statistical Methods for Modeling Heterogeneous Effects in Genetic Association Studies

    Effect-size heterogeneity is a commonly observed phenomenon when aggregating studies from different ancestries to conduct trans-ethnic meta-analysis. Irrespective of the sources of heterogeneity, traditional meta-analysis approaches cannot appropriately account for the expected between-study heterogeneity. Therefore, to bridge the methodological gap, in the first two projects, I develop statistical methods for modeling the heterogeneous effects in trans-ethnic meta-analysis for genome-wide association studies (GWAS). In the third project, I extend the methods in trans-ethnic GWAS meta-analysis to a general statistical framework for modeling heterogeneity in biomedical studies.
    In the first project, I develop a score test for common variant GWAS trans-ethnic meta-analysis. To account for the expected genetic effect heterogeneity across diverse populations, I adopt a modified random effects model from the kernel regression framework, and use the adaptive variance component test to achieve robust power regardless of the degree of genetic effect heterogeneity. From extensive simulation studies, I demonstrate that the proposed method has well-calibrated type I error rates at very stringent significance levels and can improve power over traditional meta-analysis methods.
    In the second project, I extend the common variant meta-analysis approach to gene-based rare variant trans-ethnic meta-analysis. I develop a unified score test which is capable of incorporating different levels of heterogeneous genetic effects across multiple ancestry groups. I employ a resampling-based copula method to estimate the asymptotic distribution of the proposed test, which enables efficient estimation of p-values. I conduct simulation studies to demonstrate that the proposed approach is well-calibrated at stringent significance levels and improves power over current approaches in the presence of genetic effect heterogeneity. As a real data application, I further apply the proposed method to the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in multi-Ethnic Samples (T2D-GENES) consortium data to explore rare variant associations with several traits.
    In the third project, I develop a supremum score test for jointly testing the fixed and random effects in a generalized linear mixed model (GLMM). The joint testing framework has many applications in biomedical studies. One example is to use such tests for ascertaining associations in the presence of heterogeneity in GWAS meta-analysis; another example is the nonparametric test of spline curves. The supremum score test first re-parameterizes the fixed effects terms as a product of a scale parameter and a vector of nuisance parameters. With such re-parameterization, the joint test is equivalent to testing whether the scale parameter is zero. Since the nuisance parameters are unidentifiable under the null hypothesis, I propose using the supremum of score test statistics over the nuisance parameters. I employ a resampling-based copula method to approximate the asymptotic null distribution of the proposed score test statistic. I first investigate the performance of the method through simulation studies. Using the Michigan Genomics Initiative (MGI) data, I then demonstrate its application by assessing whether the genetic effects on Low Density Lipoprotein Cholesterol (LDL-C) can be modified by age.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/146029/1/shijingc_1.pd
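
    For context on the "traditional meta-analysis approaches" this abstract contrasts itself with, the following is a minimal sketch of standard inverse-variance fixed-effect meta-analysis together with Cochran's Q and the I^2 heterogeneity statistic. It does not implement the score tests developed in the thesis. The function name, the result fields, and the example effect sizes are illustrative assumptions.

```python
import math
from typing import NamedTuple, Sequence


class MetaResult(NamedTuple):
    beta: float       # pooled effect estimate
    se: float         # standard error of the pooled estimate
    q: float          # Cochran's Q heterogeneity statistic
    i_squared: float  # share of variation attributed to heterogeneity, in [0, 1]


def fixed_effect_meta(betas: Sequence[float], ses: Sequence[float]) -> MetaResult:
    """Inverse-variance fixed-effect meta-analysis with Cochran's Q and I^2."""
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need at least two studies with matching effect sizes and SEs")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    # Q sums the weighted squared deviations of each study from the pooled effect.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return MetaResult(beta, se, q, i_squared)


if __name__ == "__main__":
    # Hypothetical per-ancestry effect sizes and standard errors for one variant.
    result = fixed_effect_meta([0.02, 0.15, 0.30], [0.05, 0.06, 0.08])
    print(f"beta={result.beta:.4f} se={result.se:.4f} "
          f"Q={result.q:.2f} I2={result.i_squared:.2f}")
```

    The fixed-effect model assumes a single shared effect across studies. A large Q or I^2 signals that this assumption is strained, which is the setting the thesis targets.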

    A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data

    We describe a statistical method for the characterization of genomic aberrations in single nucleotide polymorphism microarray data acquired from cancer genomes. Our approach allows us to model the joint effect of polyploidy, normal DNA contamination and intra-tumour heterogeneity within a single unified Bayesian framework. We demonstrate the efficacy of our method on numerous datasets including laboratory-generated mixtures of normal and cancer cell lines and real primary tumours.

    Multiple locus linkage analysis of genomewide expression in yeast.

    With the ability to measure thousands of related phenotypes from a single biological sample, it is now feasible to genetically dissect systems-level biological phenomena. The genetics of transcriptional regulation and protein abundance are likely to be complex, meaning that genetic variation at multiple loci will influence these phenotypes. Several recent studies have investigated the role of genetic variation in transcription by applying traditional linkage analysis methods to genomewide expression data, where each gene expression level was treated as a quantitative trait and analyzed separately from one another. Here, we develop a new, computationally efficient method for simultaneously mapping multiple gene expression quantitative trait loci that directly uses all of the available data. Information shared across gene expression traits is captured in a way that makes minimal assumptions about the statistical properties of the data. The method produces easy-to-interpret measures of statistical significance for both individual loci and the overall joint significance of multiple loci selected for a given expression trait. We apply the new method to a cross between two strains of the budding yeast Saccharomyces cerevisiae, and estimate that at least 37% of all gene expression traits show two simultaneous linkages, where we have allowed for epistatic interactions. Pairs of jointly linking quantitative trait loci are identified with high confidence for 170 gene expression traits, where it is expected that both loci are true positives for at least 153 traits. In addition, we are able to show that epistatic interactions contribute to gene expression variation for at least 14% of all traits. We compare the proposed approach to an exhaustive two-dimensional scan over all pairs of loci. Surprisingly, we demonstrate that an exhaustive two-dimensional scan is less powerful than the sequential search used here. 
    In addition, we show that a two-dimensional scan does not truly allow one to test for simultaneous linkage, and the statistical significance measured from this existing method cannot be interpreted among many traits.

    Integrated study of copy number states and genotype calls using high-density SNP arrays

    We propose a statistical framework, named genoCN, to simultaneously dissect copy number states and genotypes using high-density SNP (single nucleotide polymorphism) arrays. There are at least two types of genomic DNA copy number differences: copy number variations (CNVs) and copy number aberrations (CNAs). While CNVs are naturally occurring and inheritable, CNAs are acquired somatic alterations most often observed in tumor tissues only. CNVs tend to be short and more sparsely located in the genome compared with CNAs. GenoCN consists of two components, genoCNV and genoCNA, designed for CNV and CNA studies, respectively. In contrast to most existing methods, genoCN is more flexible in that the model parameters are estimated from the data instead of being decided a priori. GenoCNA also incorporates two important strategies for CNA studies. First, the effects of tissue contamination are explicitly modeled. Second, if SNP arrays are run for both tumor and normal tissues of one individual, the genotype calls from normal tissue are used to study CNAs in tumor tissue. We evaluated genoCN by applications to 162 HapMap individuals and a brain tumor (glioblastoma) dataset and showed that our method can successfully identify both types of copy number differences and produce high-quality genotype calls.