1,569 research outputs found
An EM algorithm based on an internal list for estimating haplotype distributions of rare variants from pooled genotype data
DOI: 10.1186/1471-2156-14-82, BMC Genetics 14
Recommended from our members
Haplotype Inference through Sequential Monte Carlo
Technological advances in the last decade have given rise to large genome-wide studies, which have helped researchers gain better insight into the genetic basis of many common diseases. As the number of samples and the genome coverage have increased dramatically, individuals are now typically genotyped on high-throughput platforms at more than 500,000 single nucleotide polymorphisms (SNPs). At the same time, theoretical and empirical arguments have been made for the use of haplotypes, i.e. combinations of alleles at multiple loci on individual chromosomes, as opposed to genotypes, so the problem of haplotype inference is particularly relevant. Existing haplotyping methods include population-based methods, methods for pooled DNA samples, and methods for family and pedigree data. Furthermore, the vast amount of available data poses new challenges for haplotyping algorithms. Candidate methods should scale well with the size of the datasets, as the number of loci and the number of individuals run well into the thousands. In addition, as genotyping can be performed routinely, researchers encounter a number of specific new scenarios, which can be seen as hybrids between the population and pedigree inference scenarios and require special care to incorporate the maximum amount of information. In this thesis we present a Sequential Monte Carlo framework (TDS) and tailor it to address instances of haplotype inference and frequency estimation problems. Specifically, we first adjust our framework to perform haplotype inference in trio families, resulting in a methodology that demonstrates an excellent tradeoff between speed and accuracy. Subsequently, we extend our method to handle general nuclear families and demonstrate the gain from using our approach as opposed to alternative scenarios. We further address the problem of haplotype inference in pooled data, where we show that our method achieves improved performance over existing approaches in datasets with a large number of markers. We finally present a framework to handle the haplotype inference problem in regions of CNV/SNP data. Using our approach we can phase datasets where the ploidy of an individual can vary along the region and each individual can have different breakpoints.
Haplotype frequency inference from pooled genetic data with a latent multinomial model
In genetic studies, haplotype data provide more refined information than data
about separate genetic markers. However, large-scale studies that genotype
hundreds to thousands of individuals may only provide results of pooled data,
where only the total allele counts of each marker in each pool are reported.
Methods for inferring haplotype frequencies from pooled genetic data that scale
well with pool size rely on a normal approximation, which we observe to produce
unreliable inference when applied to real data. We illustrate cases where the
approximation breaks down, due to the normal covariance matrix being
near-singular. As an alternative to approximate methods, in this paper we
propose exact methods to infer haplotype frequencies from pooled genetic data
based on a latent multinomial model, where the observed allele counts are
considered integer combinations of latent, unobserved haplotype counts. One of
our methods, latent count sampling via Markov bases, achieves approximately
linear runtime with respect to pool size. Our exact methods produce more
accurate inference than existing approximate methods for synthetic data and for
data based on haplotype information from the 1000 Genomes Project. We also
demonstrate how our methods can be applied to time series of pooled genetic
data, as a proof of concept of how our methods are relevant to more complex
hierarchical settings, such as spatiotemporal models. Comment: 35 pages, 16 figures, 3 algorithms, submitted to Biometrics journal
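The latent-count idea in the abstract above can be illustrated with a toy enumeration. The following is a minimal sketch, not the authors' method: it assumes biallelic markers coded 0/1, treats the pool as a fixed number of haplotypes, and uses brute-force enumeration (feasible only for tiny pools) in place of the Markov-basis sampler the abstract mentions. All function names are hypothetical.

```python
from itertools import product


def haplotype_list(n_markers):
    """All 2**n_markers haplotypes as 0/1 tuples (1 = the counted allele)."""
    return list(product((0, 1), repeat=n_markers))


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def allele_counts(hap_counts, haplotypes):
    """Per-marker allele totals implied by a vector of latent haplotype counts."""
    n_markers = len(haplotypes[0])
    return tuple(
        sum(count * hap[m] for count, hap in zip(hap_counts, haplotypes))
        for m in range(n_markers)
    )


def consistent_latent_counts(observed, n_haplotypes_in_pool, haplotypes):
    """Enumerate latent count vectors whose allele totals equal `observed`."""
    return [
        z
        for z in compositions(n_haplotypes_in_pool, len(haplotypes))
        if allele_counts(z, haplotypes) == tuple(observed)
    ]


# Toy case (assumed values): 2 markers, a pool holding 2 haplotypes,
# and one copy of the counted allele observed at each marker.
haps = haplotype_list(2)  # (0,0), (0,1), (1,0), (1,1)
solutions = consistent_latent_counts((1, 1), 2, haps)
print(solutions)
```

In this toy case two different latent configurations, one copy each of (0,1) and (1,0) or one copy each of (0,0) and (1,1), give identical allele counts, which is why the haplotype counts have to be treated as latent.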
A whole genome association study of neuroticism using DNA pooling.
We describe a multistage approach to identify single nucleotide polymorphisms (SNPs) associated with neuroticism, a personality trait that shares genetic determinants with major depression and anxiety disorders. Whole genome association with 452 574 SNPs was performed on DNA pools from approximately 2000 individuals selected on extremes of neuroticism scores from a cohort of 88 142 people from southwest England. The most significant SNPs were then genotyped on independent samples to replicate findings. We were able to replicate association of one SNP within the PDE4D gene in a second sample collected by our laboratory and in a family-based test in an independent sample; however, the SNP was not significantly associated with neuroticism in two other independent samples. We also observed an enrichment of low P-values in known regions of copy number variations. Simulation indicates that our study had approximately 80% power to identify neuroticism loci in the genome with odds ratio (OR)>2, and approximately 50% power to identify small effects (OR=1.5). Since we failed to find any loci accounting for more than 1% of the variance, the heritability of neuroticism probably arises from many loci each explaining much less than 1%. Our findings argue for the need for much larger samples than anticipated in genetic association studies and suggest that the biological basis of emotional disorders is extremely complex.
Application of next generation sequencing in genetic and genomic studies
Genetic variants that spread along the human genome play vital roles in determining our
traits, affecting development and potentially causing disorders. Most common disorders have
complex underlying mechanisms involving genetic or environmental factors and the
interaction between them. Over the past decade, genome-wide association studies (GWAS)
have identified thousands of common variants that contribute to complex disorders and
partially explain the heritability. However, there is still a large portion that is unexplained and
the missing heritability may be caused by several factors, such as rare or low-frequency
variants with high effect that are not covered by GWAS and linkage analysis. With the
development of next generation sequencing (NGS), it is possible to rapidly detect large
numbers of novel rare and low-frequency variants simultaneously at a low cost. This new
technology provides vast information for studying the association of genetic variations and
complex disorders. Once the susceptibility gene is mapped, model organisms such as
zebrafish (Danio rerio) are popular for further investigating the possible function of the disease-associated gene in determining the phenotype. However, the genome annotation of zebrafish
is not complete, which affects the characterization of gene functions. Accordingly, high-throughput RNA sequencing can be employed for identifying new transcripts.
In our studies, pooled DNA samples were used for whole genome sequencing (WGS) and
exome sequencing. In Paper I, we evaluated minor allele frequency (MAF) estimates using
three variant detection tools with two sets of pooled exome sequencing and one set of pooled
WGS data. The MAFs from the pooled sequencing data demonstrated high concordance (r =
0.88-0.94) with those from the individual genotyping data. In Paper II, exome sequencing
implementing pooling strategy was performed on 100 idiopathic scoliosis (IS) patients for
mapping susceptibility genes. After validating 20 candidate single nucleotide variants
(SNVs), we did not find associations between them and IS. However, the previously reported
common variant rs11190870 near LBX1 was validated in a large Scandinavian cohort. In
Paper III, we analyzed WGS of pooled DNA samples performed on 19 affected individuals
who shared a phenotype-linked haplotype in a dyslexic Finnish family. Two of the individuals
were also sequenced individually for the whole genome. The screen for causative variants
was narrowed down to a rare SNV, which might affect the binding affinity of LHX2, which
regulates the dyslexia-associated gene ROBO1. In Paper IV, RNA sequencing (RNA-seq) data
were analyzed for identifying novel transcripts in zebrafish early development using an in-house pipeline. We discovered 152 novel transcribed regions (NTRs), validated more than 10
NTRs and quantified their expression in early developmental stages.
In our studies, we evaluated and applied a pooling approach for identifying variants
susceptible to disease using high-throughput DNA sequencing. Based on RNA sequencing
data, we provided new information for genome annotation on model organism zebrafish,
which is valuable for studying the function of disease causative genes. In summary, the whole
series of studies demonstrates how NGS can be applied in studying the genetic basis of
complex disorders and assisting in follow-up functional studies in model organisms.
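The concordance check described for Paper I (pooled MAF estimates compared with individual genotyping) can be sketched as follows. This is an illustration under stated assumptions, not the thesis pipeline: it assumes the pooled allele frequency is taken as the fraction of alternate-allele reads, folded to the minor allele, and that the reported r is a Pearson correlation. The function names and the read counts are hypothetical toy values.

```python
from math import sqrt


def pooled_maf(alt_reads, total_reads):
    """Allele frequency from pooled reads, folded so it is the minor allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    freq = alt_reads / total_reads
    return min(freq, 1.0 - freq)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: (alt reads, total reads) per variant in the pool,
# and MAFs for the same variants from individual genotyping.
pooled_reads = [(12, 200), (45, 180), (160, 210), (30, 150)]
genotyped_maf = [0.05, 0.27, 0.22, 0.18]

pooled = [pooled_maf(alt, total) for alt, total in pooled_reads]
print(pearson_r(pooled, genotyped_maf))
```

Folding to the minor allele keeps both sources on the same scale when the alternate allele happens to be the major one, as in the third toy variant.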
Exploiting natural selection to study adaptive behavior
The research presented in this dissertation explores different computational and modeling techniques that, combined with predictions from evolution by natural selection, lead to the analysis of the adaptive behavior of populations under selective pressure.
For this thesis three computational methods were developed: EXPLoRA, EVORhA and SSA-ME. EXPLoRA finds genomic regions associated with a trait of interest (QTL) by explicitly modeling the expected linkage disequilibrium of a population of segregants under selection. Data from BSA experiments were analyzed to find genomic loci associated with ethanol tolerance. EVORhA explores the interplay between driving and hitchhiking mutations during evolution to reconstruct the subpopulation structure of clonal bacterial populations based on deep sequencing data. Data from mixed infections and evolution experiments of E. coli were used and their population structure reconstructed. SSA-ME uses mutual exclusivity in cancer to prioritize cancer driver genes. TCGA data of breast cancer tumor samples were analyzed. Status: published
Fast and accurate haplotype frequency estimation for large haplotype vectors from pooled DNA data
Background: Typically, the first phase of a genome wide association study (GWAS) includes genotyping across hundreds of individuals and validation of the most significant SNPs. Allelotyping of pooled genomic DNA is a common approach to reduce the overall cost of the study. Knowledge of haplotype structure can provide additional information to single locus analyses. Several methods have been proposed for estimating haplotype frequencies in a population from pooled DNA data. Results: We introduce a technique for haplotype frequency estimation in a population from pooled DNA samples, focusing on datasets containing a small number of individuals per pool (2 or 3 individuals) and a large number of markers. We compare our method with the publicly available state-of-the-art algorithms HIPPO and HAPLOPOOL on datasets with varying numbers of pools and markers. We demonstrate that our algorithm provides improvements in terms of accuracy and computational time over competing methods for large numbers of markers while demonstrating comparable performance for smaller numbers of markers. Our method is implemented in the "Tree-Based Deterministic Sampling Pool" (TDSPool) package, which is available for download at http://www.ee.columbia.edu/~anastas/tdspool. Conclusions: Using a tree-based deterministic sampling technique, we present an algorithm for haplotype frequency estimation from pooled data. Our method demonstrates superior performance in datasets with a large number of markers and could be the method of choice for haplotype frequency estimation in such datasets.
Maximum-parsimony haplotype frequencies inference based on a joint constrained sparse representation of pooled DNA
Background: DNA pooling constitutes a cost-effective alternative in genome wide association studies. In DNA pooling, equimolar amounts of DNA from different individuals are mixed into one sample and the frequency of each allele at each position is observed in a single genotyping experiment. The identification of haplotype frequencies from pooled data, in addition to single locus analysis, is of separate interest within these studies, as haplotypes could increase statistical power and provide additional insight. Results: We developed a method for maximum-parsimony haplotype frequency estimation from pooled DNA data based on the sparse representation of the DNA pools in a dictionary of haplotypes. Extensions to scenarios where data are noisy or even missing are also presented. The resulting method is first applied to simulated data based on the haplotypes and their associated frequencies of the AGT gene. We further evaluate our methodology on datasets consisting of SNPs from the first 7Mb of the HapMap CEU population. Noise and missing data were further introduced in the datasets in order to test the extensions of the proposed method. Both HIPPO and HAPLOPOOL were also applied to these datasets to compare performance. Conclusions: We evaluate our methodology on scenarios where pooling is more efficient relative to individual genotyping; that is, in datasets that contain pools with a small number of individuals. We show that in such scenarios our methodology outperforms state-of-the-art methods such as HIPPO and HAPLOPOOL.
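The maximum-parsimony objective in the abstract above can be made concrete with a toy search. The sketch below is an assumption-laden illustration rather than the published method: it replaces the sparse-representation optimization with an exhaustive search, assumes noise-free allele counts at biallelic markers coded 0/1, and only scales to a handful of markers. Function names and the example pools are hypothetical.

```python
from itertools import combinations, combinations_with_replacement, product


def explains(pool_counts, haplotypes, n_chromosomes):
    """True if some multiset of n_chromosomes haplotypes sums to pool_counts."""
    target = tuple(pool_counts)
    for combo in combinations_with_replacement(haplotypes, n_chromosomes):
        # Column-wise sum gives the allele count at each marker.
        if tuple(sum(column) for column in zip(*combo)) == target:
            return True
    return False


def most_parsimonious_set(pools, n_markers, n_chromosomes):
    """Smallest haplotype set that explains every pool (exhaustive search)."""
    dictionary = list(product((0, 1), repeat=n_markers))
    for size in range(1, len(dictionary) + 1):
        for subset in combinations(dictionary, size):
            if all(explains(p, subset, n_chromosomes) for p in pools):
                return subset
    return None


# Toy input (assumed values): two pools of 2 diploid individuals each,
# i.e. 4 chromosomes per pool, typed at 3 markers.
pools = [(4, 0, 4), (2, 2, 2)]
result = most_parsimonious_set(pools, n_markers=3, n_chromosomes=4)
print(result)
```

The first pool can only be four copies of (1,0,1), so any explanation must contain that haplotype; adding (0,1,0) is then enough to explain the second pool, giving a two-haplotype solution.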
- …