4,460 research outputs found

    Whole genome association mapping by incompatibilities and local perfect phylogenies

    BACKGROUND: With current technology, vast amounts of data can be produced cheaply and efficiently in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to these data set sizes must be developed. RESULTS: We present a fast method for accurate localisation of disease-causing variants in high-density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale is that the perfect phylogeny near a disease-affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatible markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease-causing mutation and a constant recombination rate. However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as found in the HapMap data, our method outperforms SMA as well as other fast data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis) and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene. CONCLUSION: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome-wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
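The notion of a region "compatible with a single phylogenetic tree" can be illustrated with the standard four-gamete test for binary markers. The following is a minimal sketch, not the Blossoc implementation: the function names, the string encoding of haplotypes, and the greedy left/right extension are assumptions made for illustration, and the decision-tree scoring step is not shown. Greedy extension yields a maximal compatible region, which need not be the largest one in every case.

```python
def compatible(haps, i, j):
    """Four-gamete test: sites i and j are compatible with a perfect
    phylogeny if fewer than four distinct allele pairs are observed."""
    return len({(h[i], h[j]) for h in haps}) < 4


def compatible_region(haps, center):
    """Greedily grow the interval [lo, hi] around `center` (hypothetical
    helper) while every pair of sites inside it passes the four-gamete test."""
    n_sites = len(haps[0])
    lo = hi = center
    grew = True
    while grew:
        grew = False
        # Try to add the site immediately to the left.
        if lo > 0 and all(compatible(haps, lo - 1, k) for k in range(lo, hi + 1)):
            lo -= 1
            grew = True
        # Try to add the site immediately to the right.
        if hi < n_sites - 1 and all(compatible(haps, hi + 1, k) for k in range(lo, hi + 1)):
            hi += 1
            grew = True
    return lo, hi


# Toy haplotypes over four binary markers; sites 2 and 3 show all four gametes.
haps = ["0000", "0101", "0110", "1000", "0111"]
region = compatible_region(haps, 1)
```

Here `region` covers sites 0 to 2, because adding site 3 would introduce an incompatibility with site 2.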

    A fast algorithm for genome-wide haplotype pattern mining

    BACKGROUND: Identifying the genetic components of common diseases has long been an important area of research. Recently, genotyping technology has reached the level where it is cost-effective to genotype single nucleotide polymorphism (SNP) markers covering the entire genome, in thousands of individuals, and analyse such data for markers associated with a disease. The statistical power to detect association, however, is limited when markers are analysed one at a time. This can be alleviated by considering multiple markers simultaneously. The Haplotype Pattern Mining (HPM) method is a machine learning approach to do exactly this. RESULTS: We present a new, faster algorithm for the HPM method. The new approach uses patterns of haplotype diversity in the genome: locally in the genome, the number of observed haplotypes is much smaller than the total number of possible haplotypes. We show that the new approach speeds up the HPM method by a factor of 2 on a genome-wide dataset with 5009 individuals typed in 491208 markers using default parameters, and by more if the pattern length is increased. CONCLUSION: The new algorithm speeds up the HPM method, and we show that it is feasible to apply HPM to whole genome association mapping with thousands of individuals and hundreds of thousands of markers.
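The observation about local haplotype diversity can be made concrete with a small sketch. This is not the HPM algorithm itself, only an illustration of the counting idea it relies on: within a short window, the distinct haplotypes actually observed are typically far fewer than the 2^w possible binary patterns. The function name and toy data are assumptions.

```python
from collections import Counter


def window_haplotypes(haps, start, width):
    """Count each distinct haplotype observed in the window
    [start, start + width) across all chromosomes (hypothetical helper)."""
    return Counter(h[start:start + width] for h in haps)


# Toy data: four chromosomes typed at four binary markers.
haps = ["0101", "0101", "0111", "1101"]
counts = window_haplotypes(haps, 0, 3)
n_observed = len(counts)   # distinct haplotypes seen in the window
n_possible = 2 ** 3        # all binary patterns of length 3
```

Any per-pattern work can then be done once per distinct observed haplotype and weighted by its count, rather than once per chromosome.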

    Population genomics and haplotype analysis in bread wheat identify a gene regulating glume pubescence

    Glume hairiness, or pubescence, is an important, highly heritable morphological trait used to distinguish and characterize wheat, and it is related to resistance to biotic and abiotic stresses. Hg1 (formerly named Hg) on chromosome arm 1AS controls glume hairiness in wheat. Its genetic analysis and mapping have been widely studied, yet more useful and accurate information for fine mapping of Hg1 and identification of its candidate gene is lacking. The cloning of this gene has not yet been reported for the large, complex wheat genome. Here, we performed a GWAS between SNP markers and glume pubescence (Gp) in a wheat population of 352 lines and further demonstrated a gene expression and haplotype analysis approach for isolating the Hg1 gene. One gene, TraesCSU02G143200 (TaELD1-1A), encoding glycosyltransferase-like ELD1/KOBITO 1, was identified as the most promising candidate gene for Hg1. The gene annotation, expression pattern, functional SNP variation, haplotype analysis, and co-expression analysis in floral organ (spike) development indicated that it is likely to be involved in the regulation of glume pubescence. Our study demonstrates the importance of high-quality reference genomes and annotation information, as well as bioinformatics analysis, for gene cloning in wheat.

    Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

    Recent advances in information technology in biomedical sciences and other applied areas have created numerous large, diverse data sets with a high-dimensional feature space, which provide a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, the continuous arrival of new data also creates great challenges, requiring researchers to convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become increasingly popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high-dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high-dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Molecular approaches for characterization and use of natural disease resistance in wheat

    Wheat production is threatened by a constantly changing population of pathogen species and races. Given the ability of many pathogens to rapidly overcome genetic resistance, the identification and practical implementation of new sources of resistance is essential. Landraces and wild relatives of wheat have played an important role as genetic resources for the improvement of disease resistance. The use of molecular approaches, particularly molecular markers, has allowed better characterization of the genetic diversity in wheat germplasm. In addition, the molecular cloning of major resistance (R) genes has recently been achieved in the large, polyploid wheat genome. For the first time, this allows the genetic variability of wheat R loci to be studied at the molecular level and, therefore, the gene pool to be screened for allelic variation at such loci. Thus, strategies such as allele mining and ecotilling are now possible for characterization of wheat disease resistance. Here, we discuss the approaches, resources and potential tools to characterize and utilize the naturally occurring resistance diversity in wheat. We also report a first step in allele mining, where we characterize the occurrence of known resistance alleles at the wheat Pm3 powdery mildew resistance locus in a set of 1,320 landraces assembled on the basis of eco-geographical criteria. Of the known Pm3 R alleles, only Pm3b was frequently identified (3% of the tested accessions). In the same set of landraces, we found a high frequency of a Pm3 haplotype carrying a susceptible allele of Pm3. This analysis allowed the identification of a set of resistant lines where new potentially functional alleles would be present. Newly identified resistance alleles will enrich the genetic basis of resistance in breeding programmes and contribute to wheat improvement.

    LD-Spline: Mapping SNPs on genotyping platforms to genomic regions using patterns of linkage disequilibrium

    BACKGROUND: Gene-centric analysis tools for genome-wide association study data are being developed both to annotate single locus statistics and to prioritize or group single nucleotide polymorphisms (SNPs) prior to analysis. These approaches require knowledge about the relationships between SNPs on a genotyping platform and genes in the human genome. SNPs in the genome can represent broader genomic regions via linkage disequilibrium (LD), and population-specific patterns of LD can be exploited to generate a data-driven map of SNPs to genes. METHODS: In this study, we implemented LD-Spline, a database routine that defines the genomic boundaries a particular SNP represents using linkage disequilibrium statistics from the International HapMap Project. We compared the LD-Spline haplotype block partitioning approach to that of the four-gamete rule and the Gabriel et al. approach using simulated data; in addition, we processed two commonly used genome-wide association study platforms. RESULTS: We illustrate that LD-Spline performs comparably to the four-gamete rule and the Gabriel et al. approach; however, as a SNP-centric approach, LD-Spline has the added benefit of systematically identifying a genomic boundary for each SNP, where the global block partitioning approaches may falter due to sampling variation in LD statistics. CONCLUSION: LD-Spline is an integrated database routine that quickly and effectively defines the genomic region marked by a SNP using linkage disequilibrium, with a SNP-centric block definition algorithm.
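The idea of a SNP-centric boundary defined by LD can be sketched as follows. This is not the LD-Spline routine: the abstract does not state which LD statistic or threshold is used, so the choice of r² (a standard pairwise LD measure), the 0.8 cutoff, and the function names are all assumptions made for illustration, and the input is taken to be phased binary haplotypes rather than HapMap summary statistics.

```python
def r_squared(haps, i, j):
    """Squared correlation (r^2) between the '1' alleles at sites i and j,
    estimated from phased haplotypes given as strings of '0'/'1'."""
    n = len(haps)
    p_a = sum(h[i] == "1" for h in haps) / n
    p_b = sum(h[j] == "1" for h in haps) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    return d * d / denom


def ld_boundary(haps, positions, snp, threshold=0.8):
    """Extend outward from index `snp` while neighbouring sites stay in
    LD with it (r^2 >= threshold); return the (start, end) positions.
    Hypothetical helper, assuming `positions` is sorted by coordinate."""
    lo = hi = snp
    while lo > 0 and r_squared(haps, snp, lo - 1) >= threshold:
        lo -= 1
    while hi < len(positions) - 1 and r_squared(haps, snp, hi + 1) >= threshold:
        hi += 1
    return positions[lo], positions[hi]


# Toy data: sites 0 and 1 are perfectly correlated, site 2 is not.
haps = ["110", "110", "000", "001"]
positions = [100, 200, 300]
boundary = ld_boundary(haps, positions, 0)
```

Because each SNP gets its own interval, every SNP is assigned a boundary even where a global block partition would leave it outside any block.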