133 research outputs found
SNPFile – A software library and file format for large scale association mapping and population genetics studies
BACKGROUND: High-throughput genotyping technology has enabled cost-effective typing of thousands of individuals at hundreds of thousands of markers for use in genome-wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to store and manipulate the data efficiently. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions. RESULTS: We describe a new binary file format for SNP data, together with a software library for file manipulation. The file format stores genotype data together with any kind of additional data, using a flexible serialisation mechanism. The format is designed to be IO efficient for the access patterns of most multi-locus analysis methods. CONCLUSION: The new file format has been very useful for our own studies, where it has significantly reduced the informatics burden of keeping track of various secondary data, and where the memory and IO efficiency has greatly simplified analysis runs. A main limitation of the file format is that it is only supported by the very limited set of analysis tools developed in our own lab. This is somewhat alleviated by a scripting interface that makes it easy to write converters to and from the format.
Local Genealogies in a Linear Mixed Model for Genome-Wide Association Mapping in Complex Pedigreed Populations
INTRODUCTION: The state of the art for dealing with multiple levels of relationship among the samples in genome-wide association studies (GWAS) is unified mixed model analysis (MMA). This approach is very flexible, can be applied to both family-based and population-based samples, and can be extended to incorporate other effects in a straightforward and rigorous fashion. Here, we present a complementary approach, called 'GENMIX (genealogy based mixed model)', which combines advantages from two powerful GWAS methods: genealogy-based haplotype grouping and MMA. SUBJECTS AND METHODS: We validated GENMIX using genotyping data of Danish Jersey cattle and simulated phenotypes, and compared it to MMA. We simulated scenarios for three levels of heritability (0.21, 0.34, and 0.64), seven levels of MAF (0.05, 0.10, 0.15, 0.20, 0.25, 0.35, and 0.45) and five levels of QTL effect (0.1, 0.2, 0.5, 0.7 and 1.0 in phenotypic standard deviation units). Each of these 105 possible combinations (3 h² x 7 MAF x 5 effects) of scenarios was replicated 25 times. RESULTS: GENMIX provides a better ranking of markers close to the causative locus' location. GENMIX outperformed MMA when the QTL effect was small and the MAF at the QTL was low. In scenarios where MAF was high or the QTL affecting the trait had a large effect, GENMIX and MMA performed similarly. CONCLUSION: In discovery studies, where high-ranking markers are identified and later examined in validation studies, we therefore expect GENMIX to enrich candidates brought to follow-up studies with true positives over false positives more than the MMA would.
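The simulation design described above can be enumerated in a few lines. This is only an illustrative sketch of the scenario grid, using the levels quoted in the abstract; the variable names are assumptions and nothing here reproduces the GENMIX code itself.

```python
from itertools import product

# Levels quoted in the abstract.
heritabilities = [0.21, 0.34, 0.64]
mafs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45]
qtl_effects = [0.1, 0.2, 0.5, 0.7, 1.0]  # in phenotypic standard deviation units
replicates = 25

# One scenario per (heritability, MAF, QTL effect) combination.
scenarios = list(product(heritabilities, mafs, qtl_effects))
n_scenarios = len(scenarios)        # 3 x 7 x 5
n_runs = n_scenarios * replicates   # total simulated data sets

print(n_scenarios, n_runs)
```

With 25 replicates per scenario, the full design comes to 2625 simulated data sets.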
Extreme selective sweeps independently targeted the X chromosomes of the great apes
The unique inheritance pattern of the X chromosome exposes it to natural selection in a way that is different from that of the autosomes, potentially resulting in accelerated evolution. We perform a comparative analysis of X chromosome polymorphism in 10 great ape species, including humans. In most species, we identify striking megabase-wide regions, where nucleotide diversity is less than 20% of the chromosomal average. Such regions are found exclusively on the X chromosome. The regions overlap partially among species, suggesting that the underlying targets are partly shared among species. The regions have higher proportions of singleton SNPs, higher levels of population differentiation, and a higher nonsynonymous-to-synonymous substitution ratio than the rest of the X chromosome. We show that the extent to which diversity is reduced is incompatible with direct selection or the action of background selection and soft selective sweeps alone, and therefore, we suggest that very strong selective sweeps have independently targeted these specific regions in several species. The only genomic feature that we can identify as strongly associated with loss of diversity is the location of testis-expressed ampliconic genes, which also have reduced diversity around them. We hypothesize that these genes may be responsible for selective sweeps in the form of meiotic drive caused by an intragenomic conflict in male meiosis.
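The "less than 20% of the chromosomal average" criterion can be illustrated with a small sketch. It assumes nucleotide diversity is the mean number of pairwise differences per site, and that the chromosomal average is the plain mean over equally sized windows; the function names and the example values are hypothetical, and the abstract does not state how the authors computed either quantity.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Mean number of pairwise differences per site among aligned sequences."""
    n_sites = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs) / n_sites

def low_diversity_windows(pi_by_window, fraction=0.2):
    """Indices of windows whose diversity is below `fraction` of the mean."""
    mean_pi = sum(pi_by_window) / len(pi_by_window)
    return [i for i, pi in enumerate(pi_by_window) if pi < fraction * mean_pi]

# Hypothetical per-window diversity values along one chromosome.
example_pi = [0.0010, 0.0012, 0.0001, 0.0011, 0.0009]
flagged = low_diversity_windows(example_pi)
print(flagged)
```

In this toy input only the third window falls below the threshold.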
Whole genome association mapping by incompatibilities and local perfect phylogenies
BACKGROUND: With current technology, vast amounts of data can be cheaply and efficiently produced in association studies. To prevent data analysis from becoming the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed. RESULTS: We present a fast method for accurate localisation of disease-causing variants in high-density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease-affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatible markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease-causing mutation and a constant recombination rate. However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as found in the HapMap data, our method outperforms SMA as well as other fast data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis) and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this data set the highest association score is about 60 kb from the CYP2D6 gene. CONCLUSION: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome-wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
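The notion of a region "compatible with a single phylogenetic tree" can be sketched for binary markers with the standard four-gamete test: two sites are compatible when fewer than four of the gametes 00, 01, 10 and 11 are observed. The code below is a minimal sketch under that assumption, with a simple greedy extension to the left and right of a marker; it is not taken from Blossoc, the function names are hypothetical, and the greedy order is not guaranteed to find the largest possible region.

```python
def compatible(haplotypes, i, j):
    """Four-gamete test: True if sites i and j show fewer than 4 gametes."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4

def compatible_region(haplotypes, centre):
    """Greedily grow [left, right] around `centre` while all sites stay pairwise compatible."""
    n_sites = len(haplotypes[0])
    left = right = centre
    grow_left = grow_right = True
    while grow_left or grow_right:
        if grow_left:
            cand = left - 1
            if cand >= 0 and all(compatible(haplotypes, cand, k)
                                 for k in range(left, right + 1)):
                left = cand
            else:
                grow_left = False
        if grow_right:
            cand = right + 1
            if cand < n_sites and all(compatible(haplotypes, cand, k)
                                      for k in range(left, right + 1)):
                right = cand
            else:
                grow_right = False
    return left, right

# Hypothetical phased haplotypes over four binary markers.
example_haplotypes = ["0000", "1000", "1101", "0010", "0011"]
region = compatible_region(example_haplotypes, 1)
print(region)
```

For binary markers, pairwise compatibility of all sites in the region is enough for a perfect phylogeny to exist, which is why the sketch only checks pairs.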
A fast algorithm for genome-wide haplotype pattern mining
BACKGROUND: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a disease. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach that does exactly this. RESULTS: We present a new, faster algorithm for the HPM method. The new approach uses patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method by a factor of 2 on a genome-wide data set with 5009 individuals typed at 491208 markers using default parameters, and by more if the pattern length is increased. CONCLUSION: The new algorithm speeds up the HPM method, and we show that it is feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
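The observation that the algorithm relies on, that far fewer haplotypes are observed locally than are possible, can be illustrated as follows. This is a toy sketch with hypothetical data and a hypothetical helper name, not part of the HPM implementation.

```python
def observed_haplotypes(haplotypes, start, length):
    """Distinct haplotypes seen in the window [start, start + length)."""
    return {h[start:start + length] for h in haplotypes}

# Hypothetical phased haplotypes over five binary markers.
example_haplotypes = ["00110", "00111", "11000", "00110"]
window = observed_haplotypes(example_haplotypes, 0, 4)

n_observed = len(window)   # distinct haplotypes actually present
n_possible = 2 ** 4        # all binary haplotypes of length 4
print(n_observed, n_possible)
```

An algorithm that enumerates only the observed haplotypes in each window therefore has far fewer candidates to score than one that considers every possible pattern.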
phenosim - A software to simulate phenotypes for testing in genome-wide association studies
BACKGROUND: There is great interest in understanding the genetic architecture of complex traits in natural populations. Genome-wide association studies (GWAS) are becoming routine in human, animal and plant genetics to understand the connection between naturally occurring genotypic and phenotypic variation. Coalescent simulations are commonly used in population genetics to simulate genotypes under different parameters and demographic models. RESULTS: Here, we present phenosim, a software tool that adds a phenotype to genotypes generated in time-efficient coalescent simulations. Both qualitative and quantitative phenotypes can be generated, and it is possible to partition phenotypic variation between additive effects and epistatic interactions between causal variants. The output formats of phenosim are directly usable as input for different GWAS tools. The applicability of phenosim is shown by simulating a genome-wide association study in Arabidopsis thaliana. CONCLUSIONS: By using the coalescent approach to generate genotypes and phenosim to add phenotypes, the data sets can be used to assess the influence of various factors such as demography, genetic architecture or selection on the statistical power of association methods to detect causal genetic variants under a wide variety of population genetic scenarios. phenosim is freely available from the authors' website http://evoplant.uni-hohenheim.de
Accelerating Haplotype-Based Genome-Wide Association Study Using Perfect Phylogeny and Phase-Known Reference Data
The genome-wide association study (GWAS) has become a routine approach for mapping disease risk loci with the advent of large-scale genotyping technologies. Multi-allelic haplotype markers can provide superior power compared with single-SNP markers in mapping disease loci. However, the application of haplotype-based analysis to GWAS is usually bottlenecked by the prohibitive time cost of haplotype inference, also known as phasing. In this study, we developed an efficient approach to haplotype-based analysis in GWAS. By using a reference panel, our method accelerated the phasing process and reduced the potential bias generated by unrealistic assumptions in the phasing process. The haplotype-based approach delivers great power and no type I error inflation for association studies. With only a medium-size reference panel, phasing error in our method is comparable to the genotyping error afforded by commercial genotyping solutions.
Haplotype frequencies in a sub-region of chromosome 19q13.3, related to risk and prognosis of cancer, differ dramatically between ethnic groups
BACKGROUND: A small region of about 70 kb on human chromosome 19q13.3 encompasses 4 genes, of which 3, ERCC1, ERCC2, and PPP1R13L (aka RAI), are related to DNA repair and cell survival, and one, CD3EAP (aka ASE1), may be related to cell proliferation. The whole region seems related to the cellular response to external damaging agents, and markers in it are associated with risk of several cancers. METHODS: We downloaded the genotypes of all markers typed in the 19q13.3 region in the HapMap populations of European, Asian and African descent and inferred haplotypes. We combined the European HapMap individuals with a Danish breast cancer case-control data set and inferred the association between HapMap haplotypes and disease risk. RESULTS: We found that the susceptibility haplotype in our European sample had increased from 2 to 50 percent very recently in the European population, and to almost the same extent in the Asian population. The cause of this increase is unknown. The maximal proportion of overall genetic variation due to differences between groups for Europeans versus Africans and Europeans versus Asians (the F_ST value) closely matched the putative location of the susceptibility variant as judged from haplotype-based association mapping. CONCLUSION: The combination of a common haplotype causing an increased risk of cancer in Europeans and high differentiation between human populations is highly unusual. It suggests a causal relationship, with a recent increase in Europeans caused either by genetic drift overruling selection against the susceptibility variant or by positive selection for the same haplotype. The data do not allow us to distinguish between these two scenarios. The analysis suggests that the region is not involved in cancer risk in Africans and that the susceptibility variants may be more finely mapped in Asian populations.
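The F_ST comparison described above can be sketched for a single biallelic marker and two populations. The abstract does not say which estimator was used, so this sketch assumes the simple heterozygosity-based definition F_ST = (H_T - H_S) / H_T with equal weighting of the two populations; the function names and allele frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """F_ST = (H_T - H_S) / H_T for one biallelic marker, equal population weights."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    if h_total == 0:
        return 0.0  # marker is monomorphic in both populations
    return (h_total - h_within) / h_total

def max_fst_marker(freqs_pop1, freqs_pop2):
    """Index of the marker with the highest F_ST between the two populations."""
    values = [fst_two_populations(a, b) for a, b in zip(freqs_pop1, freqs_pop2)]
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical allele frequencies at three markers in two populations.
pop1 = [0.30, 0.50, 0.20]
pop2 = [0.35, 0.10, 0.25]
example_fst = fst_two_populations(0.5, 0.1)
peak = max_fst_marker(pop1, pop2)
print(round(example_fst, 4), peak)
```

Scanning such values along the region and comparing the position of the peak with the association signal mirrors the comparison reported in the results.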
Evolutionary distances in the twilight zone -- a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequence
alignments (MSAs) and heavily depends on the validity of this information
bottleneck. With increasing sequence divergence, the quality of MSAs decays
quickly. Alignment-free methods, on the other hand, are based on abstract
string comparisons and avoid potential alignment problems. However, in general
they are not biologically motivated and ignore our knowledge about the
evolution of sequences. Thus, it is still a major open question how to define
an evolutionary distance metric between divergent sequences that makes use of
indel information and known substitution models without the need for a multiple
alignment. Here we propose a new evolutionary distance metric to close this
gap. It uses finite-state transducers to create a biologically motivated
similarity score which models substitutions and indels, and does not depend on
a multiple sequence alignment. The sequence similarity score is defined in
analogy to pairwise alignments and additionally has the positive semi-definite
property. We describe its derivation and show in simulation studies and
real-world examples that it is more accurate in reconstructing phylogenies than
competing methods. The result is a new and accurate way of determining
evolutionary distances in and beyond the twilight zone of sequence alignments
that is suitable for large datasets. Comment: to appear in PLoS ONE
Insights into bear evolution from a Pleistocene polar bear genome
The polar bear (Ursus maritimus) has become a symbol of the threat to biodiversity from climate change. Understanding polar bear evolutionary history may provide insights into apex carnivore responses and prospects during periods of extreme environmental perturbations. In recent years, genomic studies have examined bear speciation and population history, including evidence for ancient admixture between polar bears and brown bears (Ursus arctos). Here, we extend our earlier studies of a 130,000- to 115,000-y-old polar bear from the Svalbard Archipelago using a 10x coverage genome sequence and 10 new genomes of polar and brown bears from contemporary zones of overlap in northern Alaska. We demonstrate a dramatic decline in effective population size for this ancient polar bear's lineage, followed by a modest increase just before its demise. A slightly higher genetic diversity in the ancient polar bear suggests a severe genetic erosion over a prolonged bottleneck in modern polar bears. Statistical fitting of data to alternative admixture graph scenarios favors at least one ancient introgression event from brown bears into the ancestor of polar bears, possibly dating back over 150,000 y. Gene flow was likely bidirectional, but allelic transfer from brown into polar bear is the strongest detected signal, which contrasts with other published work. These findings may have implications for our understanding of climate change impacts: Polar bears, a specialist Arctic lineage, may not only have undergone severe genetic bottlenecks but also been the recipient of generalist, boreal genetic variants from brown bears during critical phases of Northern Hemisphere glacial oscillations.