1,569 research outputs found

    Haplotype frequency inference from pooled genetic data with a latent multinomial model

    In genetic studies, haplotype data provide more refined information than data about separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may only provide results of pooled data, where only the total allele counts of each marker in each pool are reported. Methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases where the approximation breaks down because the normal covariance matrix is near-singular. As an alternative to approximate methods, in this paper we propose exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, where the observed allele counts are considered integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference than existing approximate methods for synthetic data and for data based on haplotype information from the 1000 Genomes Project. We also demonstrate how our methods can be applied to time series of pooled genetic data, as a proof of concept of how our methods are relevant to more complex hierarchical settings, such as spatiotemporal models. Comment: 35 pages, 16 figures, 3 algorithms, submitted to Biometrics journal
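    The data-generating side of the latent multinomial model described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' code: the function names, the three-marker setup, and the haplotype frequencies are all hypothetical, and it covers only how pooled allele counts arise from latent haplotype counts, not the inference methods (such as sampling via Markov bases) that the paper proposes.

```python
import random
from collections import Counter
from itertools import product


def haplotypes(num_markers):
    """Enumerate all binary haplotypes over `num_markers` biallelic markers."""
    return list(product((0, 1), repeat=num_markers))


def sample_latent_counts(pool_size, freqs, rng):
    """Draw latent haplotype counts z ~ Multinomial(pool_size, freqs)."""
    draws = rng.choices(range(len(freqs)), weights=freqs, k=pool_size)
    counts = Counter(draws)
    return [counts.get(h, 0) for h in range(len(freqs))]


def pooled_allele_counts(latent_counts, haps):
    """Observed data y = A z, where A[m][h] is the allele of haplotype h at marker m."""
    num_markers = len(haps[0])
    return [
        sum(z * hap[m] for z, hap in zip(latent_counts, haps))
        for m in range(num_markers)
    ]


# Hypothetical example: 3 markers (8 possible haplotypes), pool of 100 haplotypes.
rng = random.Random(0)
haps = haplotypes(3)
freqs = [0.30, 0.05, 0.10, 0.05, 0.20, 0.10, 0.05, 0.15]  # assumed values
z = sample_latent_counts(100, freqs, rng)  # latent, unobserved in a real study
y = pooled_allele_counts(z, haps)          # what a pooled study would report
print("latent haplotype counts:", z)
print("pooled allele counts:", y)
```

    Only `y` would be available in a pooled study. Many different latent vectors `z` map to the same `y`, which is why inference has to account for the unobserved haplotype counts.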

    A whole genome association study of neuroticism using DNA pooling.

    We describe a multistage approach to identify single nucleotide polymorphisms (SNPs) associated with neuroticism, a personality trait that shares genetic determinants with major depression and anxiety disorders. Whole genome association with 452 574 SNPs was performed on DNA pools from approximately 2000 individuals selected on extremes of neuroticism scores from a cohort of 88 142 people from southwest England. The most significant SNPs were then genotyped on independent samples to replicate findings. We were able to replicate association of one SNP within the PDE4D gene in a second sample collected by our laboratory and in a family-based test in an independent sample; however, the SNP was not significantly associated with neuroticism in two other independent samples. We also observed an enrichment of low P-values in known regions of copy number variations. Simulation indicates that our study had approximately 80% power to identify neuroticism loci in the genome with odds ratio (OR)>2, and approximately 50% power to identify small effects (OR=1.5). Since we failed to find any loci accounting for more than 1% of the variance, the heritability of neuroticism probably arises from many loci each explaining much less than 1%. Our findings argue for the need for much larger samples than anticipated in genetic association studies and suggest that the biological basis of emotional disorders is extremely complex.

    Application of next generation sequencing in genetic and genomic studies

    Genetic variants spread along the human genome play vital roles in determining our traits, affecting development and potentially causing disorders. Most common disorders have complex underlying mechanisms involving genetic or environmental factors and the interaction between them. Over the past decade, genome-wide association studies (GWAS) have identified thousands of common variants that contribute to complex disorders and partially explain the heritability. However, a large portion remains unexplained, and the missing heritability may be caused by several factors, such as rare or low-frequency variants with high effect that are not covered by GWAS and linkage analysis. With the development of next generation sequencing (NGS), it is possible to rapidly detect large numbers of novel rare and low-frequency variants simultaneously at a low cost. This new technology provides vast information for studying the association between genetic variations and complex disorders. Once a susceptibility gene is mapped, model organisms such as zebrafish (Danio rerio) are popular for further investigating the possible function of the disease-associated gene in determining the phenotype. However, the genome annotation of zebrafish is not complete, which affects the characterization of gene functions. Accordingly, high-throughput RNA sequencing can be employed for identifying new transcripts. In our studies, pooled DNA samples were used for whole genome sequencing (WGS) and exome sequencing. In Paper I, we evaluated minor allele frequency (MAF) estimates using three variant detection tools with two sets of pooled exome sequencing and one set of pooled WGS data. The MAFs from the pooled sequencing data demonstrated high concordance (r = 0.88-0.94) with those from the individual genotyping data. In Paper II, exome sequencing implementing a pooling strategy was performed on 100 idiopathic scoliosis (IS) patients for mapping susceptibility genes. After validating 20 candidate single nucleotide variants (SNVs), we did not find associations between them and IS. However, the previously reported common variant rs11190870 near LBX1 was validated in a large Scandinavian cohort. In Paper III, we analyzed WGS of pooled DNA samples performed on 19 affected individuals who shared a phenotype-linked haplotype in a dyslexic Finnish family. Two of the individuals were also whole-genome sequenced individually. The screen for causative variants was narrowed down to a rare SNV, which might affect the binding affinity of LHX2, which regulates the dyslexia-associated gene ROBO1. In Paper IV, RNA sequencing (RNA-seq) data were analyzed for identifying novel transcripts in zebrafish early development using an in-house pipeline. We discovered 152 novel transcribed regions (NTRs), validated more than 10 NTRs and quantified their expression in early developmental stages. In our studies, we evaluated and applied a pooling approach for identifying variants conferring susceptibility to disease using high-throughput DNA sequencing. Based on RNA sequencing data, we provided new information for genome annotation of the model organism zebrafish, which is valuable for studying the function of disease-causative genes. In summary, the whole series of studies demonstrates how NGS can be applied in studying the genetic basis of complex disorders and assisting in follow-up functional studies in model organisms.

    Exploiting natural selection to study adaptive behavior

    The research presented in this dissertation explores different computational and modeling techniques that, combined with predictions from evolution by natural selection, enable the analysis of the adaptive behavior of populations under selective pressure. For this thesis three computational methods were developed: EXPLoRA, EVORhA and SSA-ME. EXPLoRA finds genomic regions associated with a trait of interest (QTL) by explicitly modeling the expected linkage disequilibrium of a population of segregants under selection. Data from BSA experiments were analyzed to find genomic loci associated with ethanol tolerance. EVORhA explores the interplay between driving and hitchhiking mutations during evolution to reconstruct the subpopulation structure of clonal bacterial populations based on deep sequencing data. Data from mixed infections and evolution experiments of E. coli were used and their population structure reconstructed. SSA-ME uses mutual exclusivity in cancer to prioritize cancer driver genes. TCGA data of breast cancer tumor samples were analyzed. Status: published