3,239 research outputs found
Recommended from our members
Haplotype Inference through Sequential Monte Carlo
Technological advances in the last decade have given rise to large Genome Wide Studies which have helped researchers get better insights in the genetic basis of many common diseases. As the number of samples and genome coverage has increased dramatically it is currently typical that individuals are genotyped using high throughput platforms to more than 500,000 Single Nucleotide Polymorphisms. At the same time theoretical and empirical arguments have been made for the use of haplotypes, i.e. combinations of alleles at multiple loci in individual chromosomes, as opposed to genotypes so the problem of haplotype inference is particularly relevant. Existing haplotyping methods include population based methods, methods for pooled DNA samples and methods for family and pedigree data. Furthermore, the vast amount of available data pose new challenges for haplotyping algorithms. Candidate methods should scale well to the size of the datasets as the number of loci and the number of individuals are well to the thousands. In addition, as genotyping can be performed routinely, researchers encounter a number of specific new scenarios, which can be seen as hybrid between the population and pedigree inference scenarios and require special care to incorporate the maximum amount of information. In this thesis we present a Sequential Monte Carlo framework (TDS) and tailor it to address instances of haplotype inference and frequency estimation problems. Specifically, we first adjust our framework to perform haplotype inference in trio families resulting in a methodology that demonstrates an excellent tradeoff between speed and accuracy. Consequently, we extend our method to handle general nuclear families and demonstrate the gain using our approach as opposed to alternative scenarios. We further address the problem of haplotype inference in pooling data in which we show that our method achieves improved performance over existing approaches in datasets with large number of markers. We finally present a framework to handle the haplotype inference problem in regions of CNV/SNP data. Using our approach we can phase datasets where the ploidy of an individual can vary along the region and each individual can have different breakpoints
Parsimony-based genetic algorithm for haplotype resolution and block partitioning
This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on genetic algorithm approach and the parsimonious principle. The multiloculs LD measure (Normalized Entropy Difference) is used as a block identification criterion. The proposed algorithm incorporates missing data is a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries which represent measures of strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets including the HapMap data and comparing results to those of the existing state-of-the-art algorithms. The results show that the proposed genetic algorithm provides the accuracy of haplotype decomposition within the range of the same indicators shown by the other algorithms. The block structure output by our algorithm in general agrees with the block structure for the same data provided by the other algorithms. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some new valuable features like scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in development of the new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds haplotype resolution and block partitioning for each cluster
SNP haplotype tagging from DNA pools of two individuals
BACKGROUND: DNA pooling is a technique to reduce genotyping effort while incurring only minor losses in accuracy of allele frequency estimates for single nucleotide polymorphism (SNP) markers. RESULTS: We present an algorithm for reconstructing haplotypes (alleles for multiple SNPs on same chromosome) from pools of two individual DNAs, in which Hardy-Weinberg equilibrium conditions or other assumptions are not required. The program outputs, in addition to inferred haplotypes, a minimal number of haplotype-tagging SNPs that are identified after an exhaustive search procedure. CONCLUSION: Our method and algorithms lead to a significant reduction in genotyping effort, for example, in case-control disease association studies while maintaining the possibility of reconstructing haplotypes under very general conditions
Using DNA pools for genotyping trios
The genotyping of mother–father–child trios is a very useful tool in disease association studies, as trios eliminate population stratification effects and increase the accuracy of haplotype inference. Unfortunately, the use of trios for association studies may reduce power, since it requires the genotyping of three individuals where only four independent haplotypes are involved. We describe here a method for genotyping a trio using two DNA pools, thus reducing the cost of genotyping trios to that of genotyping two individuals. Furthermore, we present extensions to the method that exploit the linkage disequilibrium structure to compensate for missing data and genotyping errors. We evaluated our method on trios from CEPH pedigree 66 of the Coriell Institute. We demonstrate that the error rates in the genotype calls of the proposed protocol are comparable to those of standard genotyping techniques, although the cost is reduced considerably. The approach described is generic and it can be applied to any genotyping platform that achieves a reasonable precision of allele frequency estimates from pools of two individuals. Using this approach, future trio-based association studies may be able to increase the sample size by 50% for the same cost and thereby increase the power to detect associations
A 32 kb Critical Region Excluding Y402H in CFH Mediates Risk for Age-Related Macular Degeneration
Complement factor H shows very strong association with Age-related Macular Degeneration (AMD), and recent data suggest that multiple causal variants are associated with disease. To refine the location of the disease associated variants, we characterized in detail the structural variation at CFH and its paralogs, including two copy number polymorphisms (CNP), CNP147 and CNP148, and several rare deletions and duplications. Examination of 34 AMD-enriched extended families (N = 293) and AMD cases (White N = 4210 Indian = 134; Malay = 140) and controls (White N = 3229; Indian = 117; Malay = 2390) demonstrated that deletion CNP148 was protective against AMD, independent of SNPs at CFH. Regression analysis of seven common haplotypes showed three haplotypes, H1, H6 and H7, as conferring risk for AMD development. Being the most common haplotype H1 confers the greatest risk by increasing the odds of AMD by 2.75-fold (95% CI = [2.51, 3.01]; p = 8.31×10−109); Caucasian (H6) and Indian-specific (H7) recombinant haplotypes increase the odds of AMD by 1.85-fold (p = 3.52×10−9) and by 15.57-fold (P = 0.007), respectively. We identified a 32-kb region downstream of Y402H (rs1061170), shared by all three risk haplotypes, suggesting that this region may be critical for AMD development. Further analysis showed that two SNPs within the 32 kb block, rs1329428 and rs203687, optimally explain disease association. rs1329428 resides in 20 kb unique sequence block, but rs203687 resides in a 12 kb block that is 89% similar to a noncoding region contained in ΔCNP148. We conclude that causal variation in this region potentially encompasses both regulatory effects at single markers and copy number
The maintenance of standing genetic variation: Gene flow vs. selective neutrality in Atlantic stickleback fish
Adaptation to derived habitats often occurs from standing genetic variation. The maintenance within ancestral populations of genetic variants favourable in derived habitats is commonly ascribed to long-term antagonism between purifying selection and gene flow resulting from hybridization across habitats. A largely unexplored alternative idea based on quantitative genetic models of polygenic adaptation is that variants favoured in derived habitats are neutral in ancestral populations when their frequency is relatively low. To explore the latter, we first identify genetic variants important to the adaptation of threespine stickleback fish (Gasterosteus aculeatus) to a rare derived habitat-nutrient-depleted acidic lakes-based on whole-genome sequence data. Sequencing marine stickleback from six locations across the Atlantic Ocean then allows us to infer that the frequency of these derived variants in the ancestral habitat is unrelated to the likely opportunity for gene flow of these variants from acidic-adapted populations. This result is consistent with the selective neutrality of derived variants within the ancestor. Our study thus supports an underappreciated explanation for the maintenance of standing genetic variation, and calls for a better understanding of the fitness consequences of adaptive variation across habitats and genomic backgrounds
- …