6 research outputs found
Linkage disequilibrium based genotype calling from low-coverage shotgun sequencing reads
Background: Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping. Results: In this paper we introduce a new multi-locus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as HapMap or the 1000 Genomes Project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integrating LD information yields genotype calling accuracy from low-coverage sequencing data comparable to that of microarray platforms. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/. Conclusions: Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, making low-coverage sequencing a viable alternative to microarrays for conducting large-scale genome-wide association studies.
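As a rough illustration of why a population prior helps at low coverage, the following minimal sketch computes single-site genotype posteriors from read counts. It is not the multi-locus model described in the abstract: it assumes a binomial read-error model and replaces LD information with a Hardy-Weinberg prior built from one panel allele frequency. The function name and the parameters `error_rate` and `alt_freq` are hypothetical.

```python
from math import comb


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01, alt_freq=0.5):
    """Posterior over genotypes 0, 1, 2 (count of alt alleles) at one site.

    Assumptions (not from the source): reads are independent, each read is
    miscalled with probability error_rate, and the prior is Hardy-Weinberg
    with panel alt-allele frequency alt_freq.
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alt allele, given the genotype.
    p_alt = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}
    # Hardy-Weinberg genotype prior derived from the panel allele frequency.
    prior = {
        0: (1 - alt_freq) ** 2,
        1: 2 * alt_freq * (1 - alt_freq),
        2: alt_freq ** 2,
    }
    # Prior times binomial likelihood of the observed read counts.
    joint = {
        g: prior[g]
        * comb(n, alt_reads)
        * p_alt[g] ** alt_reads
        * (1 - p_alt[g]) ** ref_reads
        for g in (0, 1, 2)
    }
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


if __name__ == "__main__":
    # Three reads, all showing the alt allele.
    print(genotype_posteriors(ref_reads=0, alt_reads=3))
```

With no reads at all the posterior simply equals the prior, which is the sense in which panel information carries the call when coverage is low.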
Informative SNP Selection and Validation
The search for genetic regions associated with complex diseases, such as cancer or Alzheimer's disease, is an important challenge that may lead to better diagnosis and treatment. The existence of millions of DNA variations, primarily single nucleotide polymorphisms (SNPs), may allow the fine dissection of such associations. However, studies seeking disease association are limited by the cost of genotyping SNPs. Therefore, it is essential to find a small subset of informative SNPs (tag SNPs) that may be used as good representatives of the rest of the SNPs. Several informative SNP selection methods have been developed. In our experiments, our method compares favorably with all the prediction and statistical methods, selecting the fewest informative SNPs. We also proposed algorithms for faster prediction, which yielded an acceptable trade-off. We validated our results using the k-fold test and its many variations.
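The idea of tag SNPs can be sketched with a generic greedy selection based on pairwise r². This is a common textbook-style approach, not the method proposed in the abstract, which does not describe its algorithm. The function names, the 0/1 haplotype encoding, and the `threshold` parameter are assumptions made for illustration.

```python
def r_squared(x, y):
    """Squared correlation between two biallelic SNPs coded 0/1 per haplotype."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    # A monomorphic SNP carries no correlation information.
    if px in (0, 1) or py in (0, 1):
        return 0.0
    pxy = sum(a * b for a, b in zip(x, y)) / n
    d = pxy - px * py
    return d * d / (px * (1 - px) * py * (1 - py))


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick SNP indices so every SNP has r^2 >= threshold with some chosen tag.

    haplotypes: list of equal-length lists of 0/1 alleles (assumed phased).
    """
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    # covers[i] = SNPs that SNP i can represent; every SNP covers itself.
    covers = [
        {
            j
            for j in range(n_snps)
            if j == i or r_squared(columns[i], columns[j]) >= threshold
        }
        for i in range(n_snps)
    ]
    uncovered = set(range(n_snps))
    tags = []
    while uncovered:
        # Take the SNP covering the most uncovered SNPs; ties go to the lowest index.
        best = max(range(n_snps), key=lambda i: (len(covers[i] & uncovered), -i))
        tags.append(best)
        uncovered -= covers[best]
    return tags


if __name__ == "__main__":
    haps = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]]
    print(greedy_tag_snps(haps))
```

In the toy input the first two SNPs are perfectly correlated, so one of them tags the other, while the third SNP is independent and must be selected on its own.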
Linkage, association, and haplotype analysis: A spectrum of approaches to elucidate the genetic influences of complex human traits
The goal of human genetics is to identify genetic variants that influence a certain trait with the intent to provide a better understanding of the biology behind that trait. As technologies and statistical methods towards this goal have developed, there has been a change in the approaches to identify trait-causing variants. The three projects reported here cover a range of approaches. Early studies focused on family-based data, using linkage analysis to find regions of the genome shared by members with similar trait values. This approach was used to confirm the involvement of CYP2E1 with the level of response to alcohol in sibling pairs with an alcoholic parent. With the advent of high-throughput genotyping panels, the field of human genetics has shifted to population-based association studies that seek to find variants that correlate with a trait. This approach was used to search for regions of the genome that confer risk for Pick's disease, a spectrum of heterogeneous dementia diseases, and to reproduce the association with MAPT, a gene with known disease-causing mutations. Haplotype-based analysis approaches have emerged to improve the analysis of genomic data. A novel algorithm for haplotype-based analysis was developed to identify long haplotypes shared in a population based on genotypes from genome-wide association data and was found to be very accurate when predicting haplotypes within the shared regions. Together, these three projects represent the past, present, and future of the study of human genetics.
Quantifying recent variation and relatedness in human populations
Advances in the genetic analysis of humans have revealed a surprising abundance of local relatedness between purportedly unrelated individuals. Where common mutations classically inform us of ancient relationships, such segments of pairwise identical-by-descent (IBD) sharing from a common ancestor are the observable traces of recent inter-mating. Combining these two distinct sources of information can help disentangle the complex genetic structure and flux in human populations. When considered together with a heritable trait, the segments can also be used to interrogate unascertained rare variation and help in locating trait-affecting loci. This work presents methods for comprehensive analysis of population-wide IBD and explores applications to disease and the understanding of recent genetic variation. We propose several strategies for efficient detection of IBD segments in population genotype data. Our novel seed-based algorithm, GERMLINE, can reduce the computational burden of finding pairwise segments from quadratic to nearly linear time in a general population. We demonstrate that this approach is several orders of magnitude faster than the available all-pairs methods while maintaining higher accuracy. Next, we extended the GERMLINE technique to process cohorts of unlimited size by adaptively adjusting the search mechanism to meet resource restrictions. We confirm its effectiveness with an analysis of 50,000 individuals where contemporary methods can only process a few thousand. One drawback of these two algorithms is the dependence on phased haplotype data as input, a requirement that becomes harder to satisfy with large populations. We propose a solution to this problem with an algorithm that analyzes genotype data directly by exploring all potential haplotypes and scoring each putative segment based on linkage disequilibrium.
This solution significantly outperforms available methods when applied to full sequence data and is computationally efficient enough to analyze thousands of sequenced genomes where current methods can only determine haplotypes for several hundred. Second, we outline two algorithms for analyzing available IBD segments to increase our understanding of rare variation and complex disease. Motivated by whole-genome sequencing, we present the INFOSTIP algorithm, which uses IBD segments to optimize the selection of individuals for complete population ascertainment. In simulations, we show that INFOSTIP selection can significantly increase variant inference accuracy over random sampling and posit inference of 60% of an isolated population from 1% optimally selected individuals. Seeking to move beyond pairwise IBD segment analysis, we describe the DASH algorithm, which groups shared segments into IBD "clusters" that are likely to be commonly co-inherited and uses them as proxies for untyped variation. In simulated disease studies, we show this reference-free approach to be much more powerful for detecting rare causal variants than either traditional single-marker analysis or imputation from a general reference panel. Applying the DASH algorithm to disease traits from different populations, we identify multiple novel loci of association. Together, these novel techniques integrate the power of population and disease genetics.
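The seed-based idea mentioned for GERMLINE can be illustrated with a minimal sketch: phased haplotypes are cut into fixed windows, each window is hashed, and only haplotypes that land in the same bucket are paired. This is a simplified illustration under stated assumptions, not the GERMLINE implementation: it requires exact window matches and omits the extension of seeds into full segments. The function name, the string encoding of haplotypes, and the `window` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, window):
    """Find pairs of haplotypes that are identical over a fixed window.

    haplotypes: dict mapping a haplotype name to a string of alleles
    (assumed phased and aligned to the same markers).
    Returns a set of (name_a, name_b, window_start) tuples.
    """
    length = min(len(h) for h in haplotypes.values())
    matches = set()
    # Non-overlapping windows; a trailing partial window is ignored.
    for start in range(0, length - window + 1, window):
        # Hash each haplotype's window so identical windows share a bucket.
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        # Only haplotypes within the same bucket are paired, so pairs that
        # share nothing in this window are never compared directly.
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches.add((a, b, start))
    return matches


if __name__ == "__main__":
    haps = {"a": "00110101", "b": "00111111", "c": "11110101"}
    print(sorted(seed_matches(haps, window=4)))
```

The cost of the hashing pass grows with the number of haplotypes times the number of windows; pairwise work is done only inside buckets, which is the intuition behind avoiding an all-pairs comparison when most individuals share little.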