5,656 research outputs found

    Comparative analysis of haplotype association mapping algorithms

    BACKGROUND: Finding the genetic causes of quantitative traits is a complex and difficult task. Classical methods for mapping quantitative trait loci (QTL) in mice use an F2 cross between two strains with substantially different phenotypes and an interval mapping method to compute confidence intervals at each position in the genome. This process requires significant resources for breeding and genotyping, and the data generated are usually only applicable to one phenotype of interest. Recently, we reported the application of a haplotype association mapping method which utilizes dense genotyping data across a diverse panel of inbred mouse strains and a marker association algorithm that is independent of any specific phenotype. As the availability of genotyping data grows in size and density, analysis of these haplotype association mapping methods should be of increasing value to the statistical genetics community. RESULTS: We describe a detailed comparative analysis of variations on our marker association method. In particular, we describe the use of inferred haplotypes from adjacent SNPs, parametric and nonparametric statistics, and control of multiple testing error. These results show that nonparametric methods are slightly better in the test cases we study, although the choice of test statistic may often be dependent on the specific phenotype and haplotype structure being studied. The use of multi-SNP windows to infer local haplotype structure is critical to the use of a diverse panel of inbred strains for QTL mapping. Finally, because the marginal effect of any single gene in a complex disease is often relatively small, these methods require the use of sensitive methods for controlling family-wise error. We also report our initial application of this method to phenotypes cataloged in the Mouse Phenome Database. CONCLUSION: The use of inbred strains of mice for QTL mapping has many advantages over traditional methods. However, there are also limitations in comparison to the traditional linkage analysis from F2 and RI lines. Application of these methods requires careful consideration of algorithmic choices based on both theoretical and practical factors. Our findings suggest general guidelines, though a complete evaluation of these methods can only be performed as more genetic data in complex diseases become available.
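The three ingredients named in this abstract (multi-SNP windows, a nonparametric statistic, family-wise error control) can be illustrated with a minimal sketch. It is not the authors' implementation. The strain names, genotypes, phenotype values, window width and function names are all made up for illustration. The Kruskal-Wallis statistic stands in for "a nonparametric statistic", and a Bonferroni adjustment stands in for the more sensitive family-wise methods the abstract calls for.

```python
import random
from collections import defaultdict

# Hypothetical toy panel: allele calls (0/1) at five SNPs for eight inbred strains.
GENOTYPES = {
    "A": [0, 0, 0, 1, 0], "B": [0, 0, 0, 1, 1], "C": [1, 0, 0, 1, 0], "D": [1, 1, 1, 0, 1],
    "E": [0, 1, 1, 0, 0], "F": [1, 1, 1, 0, 1], "G": [0, 0, 0, 1, 1], "H": [1, 1, 1, 0, 0],
}
# Hypothetical strain-mean phenotypes.
PHENOTYPES = {"A": 1.0, "B": 1.2, "C": 0.9, "D": 3.1, "E": 2.8, "F": 3.3, "G": 1.1, "H": 3.0}


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of value lists."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank_of = {}
    i = 0
    while i < n:  # tied values share the mean of the ranks they occupy
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 * total / (n * (n + 1)) - 3.0 * (n + 1)


def haplotype_groups(genotypes, start, width):
    """Group strains by their local haplotype across SNPs [start, start + width)."""
    groups = defaultdict(list)
    for strain, alleles in genotypes.items():
        groups[tuple(alleles[start:start + width])].append(strain)
    return list(groups.values())


def window_pvalue(strain_groups, phenotypes, n_perm, rng):
    """Permutation p-value of the Kruskal-Wallis statistic for one window."""
    sizes = [len(g) for g in strain_groups]
    values = [phenotypes[s] for g in strain_groups for s in g]

    def stat(vals):
        parts, pos = [], 0
        for size in sizes:
            parts.append(vals[pos:pos + size])
            pos += size
        return kruskal_wallis_h(parts)

    observed = stat(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # break any strain-phenotype link
        if stat(values) >= observed - 1e-12:
            hits += 1
    return (1 + hits) / (1 + n_perm)


def scan(genotypes, phenotypes, width=3, n_perm=999, seed=0):
    """Return one Bonferroni-adjusted permutation p-value per sliding SNP window."""
    rng = random.Random(seed)
    n_windows = len(next(iter(genotypes.values()))) - width + 1
    adjusted = []
    for start in range(n_windows):
        groups = haplotype_groups(genotypes, start, width)
        if len(groups) < 2:  # a single shared haplotype carries no signal
            adjusted.append(1.0)
            continue
        p = window_pvalue(groups, phenotypes, n_perm, rng)
        adjusted.append(min(1.0, p * n_windows))
    return adjusted


print(scan(GENOTYPES, PHENOTYPES))
```

Per-window permutation p-values are used rather than raw H values because windows with more distinct haplotypes have more degrees of freedom, so their raw statistics are not directly comparable.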

    The EM Algorithm and the Rise of Computational Biology

    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
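As a small illustration of the kind of population-genetics problem the EM algorithm is used for, the sketch below estimates haplotype frequencies at two biallelic SNPs from unphased genotypes. It is an illustrative example chosen here, not code from the surveyed article. The genotype coding (count of allele "1" at each SNP), the function name and the sample data are assumptions.

```python
HAPLOTYPES = ("00", "01", "10", "11")


def em_haplotype_frequencies(genotypes, max_iter=200, tol=1e-10):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes.

    Each genotype is a pair (g1, g2), where g is the number of copies (0, 1 or 2)
    of allele "1" at that SNP. Only the double heterozygote (1, 1) has ambiguous
    phase: it is either 00/11 or 01/10.
    """
    n_chrom = 2 * len(genotypes)
    fixed = dict.fromkeys(HAPLOTYPES, 0.0)  # counts determined without ambiguity
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
            continue
        fixed[f"{g1 // 2}{g2 // 2}"] += 1
        fixed[f"{(g1 + 1) // 2}{(g2 + 1) // 2}"] += 1

    freq = dict.fromkeys(HAPLOTYPES, 0.25)  # uniform starting point
    for _ in range(max_iter):
        # E-step: posterior probability that a double heterozygote is 00/11.
        cis = freq["00"] * freq["11"]
        trans = freq["01"] * freq["10"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = dict(fixed)
        expected["00"] += n_double_het * w
        expected["11"] += n_double_het * w
        expected["01"] += n_double_het * (1 - w)
        expected["10"] += n_double_het * (1 - w)
        # M-step: frequencies are expected counts divided by chromosome count.
        new_freq = {h: expected[h] / n_chrom for h in HAPLOTYPES}
        converged = max(abs(new_freq[h] - freq[h]) for h in HAPLOTYPES) < tol
        freq = new_freq
        if converged:
            break
    return freq


# Hypothetical sample: eight double homozygotes and two double heterozygotes.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_frequencies(sample))
```

In this sample the unambiguous individuals carry only the 00 and 11 haplotypes, so each iteration assigns more of the double heterozygotes to the 00/11 phase.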

    Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases

    Copy number variants (CNVs) account for more polymorphic base pairs in the human genome than do single nucleotide polymorphisms (SNPs). CNVs encompass genes as well as noncoding DNA, making these polymorphisms good candidates for functional variation. Consequently, most modern genome-wide association studies test CNVs along with SNPs, after inferring copy number status from the data generated by high-throughput genotyping platforms. Here we give an overview of CNV genomics in humans, highlighting patterns that inform methods for identifying CNVs. We describe how genotyping signals are used to identify CNVs and provide an overview of existing statistical models and methods used to infer location and carrier status from such data, especially the most commonly used methods exploring hybridization intensity. We compare the power of such methods with the alternative method of using tag SNPs to identify CNV carriers. As such methods are only powerful when applied to common CNVs, we describe two alternative approaches that can be informative for identifying rare CNVs contributing to disease risk. We focus particularly on methods identifying de novo CNVs and show that such methods can be more powerful than case-control designs. Finally we present some recommendations for identifying CNVs contributing to common complex disorders. Comment: Published at http://dx.doi.org/10.1214/09-STS304 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    An Ultra-High-Density, Transcript-Based, Genetic Map of Lettuce.

    We have generated an ultra-high-density genetic map for lettuce, an economically important member of the Compositae, consisting of 12,842 unigenes (13,943 markers) mapped in 3696 genetic bins distributed over nine chromosomal linkage groups. Genomic DNA was hybridized to a custom Affymetrix oligonucleotide array containing 6.4 million features representing 35,628 unigenes of Lactuca spp. Segregation of single-position polymorphisms was analyzed using 213 F7:8 recombinant inbred lines that had been generated by crossing cultivated Lactuca sativa cv. Salinas and L. serriola acc. US96UC23, the wild progenitor species of L. sativa. The high level of replication of each allele in the recombinant inbred lines was exploited to identify single-position polymorphisms that were assigned to parental haplotypes. Marker information has been made available using GBrowse to facilitate access to the map. This map has been anchored to the previously published integrated map of lettuce, providing candidate genes for multiple phenotypes. The high density of markers achieved in this ultradense map allowed syntenic studies between lettuce and Vitis vinifera as well as other plant species.

    Uncovering Hidden Diversity in Plants

    One of the greatest challenges to human civilization in the 21st century will be to provide global food security to a growing population while reducing the environmental footprint of agriculture. Despite increasing demand, limited genetic diversity in domesticated crops creates openings for emerging pandemics and leaves modern crops poorly equipped to respond to a changing global environment. The wild relatives of crop plants, with large reservoirs of untapped genetic diversity, offer great potential to improve the resilience of elite cultivars. Utilizing this diversity requires advanced technologies to comprehensively identify genetic diversity and understand the genetic architecture of beneficial traits. The primary focus of the dissertation is developing computational tools to facilitate variant discovery and trait mapping for plant genomics. In Chapter 1, I benchmarked the performance of variant discovery algorithms based on simulated and diverse plant datasets. The comparison of sequence aligners found that BWA-MEM consistently aligned the most plant reads with high accuracy, whereas Bowtie2 had a slightly higher overall accuracy. Variant callers, such as GATK HaplotypeCaller and SAMtools mpileup, were shown to differ significantly in their ability to minimize the frequency of false negatives and maximize the discovery of true positives. A cross-reference experiment of Solanum lycopersicum and Solanum pennellii reference genomes revealed significant limitations of using a single reference genome for variant discovery. Next, I demonstrated that a machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff filtering strategy, resulting in significantly more true-positive and fewer false-positive variants. Finally, I developed a 2-step imputation method that resulted in up to 60% higher accuracy than direct LD-based imputation methods.
    In Chapter 2, I focused on developing a trait mapping algorithm tailored for plants, considering the high levels of diversity found in plant datasets. This novel trait mapping framework, HapFM, can incorporate biological priors into the mapping model to identify causal haplotypes for traits of interest. Compared to conventional GWAS analyses, the haplotype-based approach significantly reduced the number of variables while aggregating small-effect SNPs to increase mapping power. HapFM could account for LD between haplotype segments to infer the causal haplotypes directly. Furthermore, HapFM could systematically incorporate biological priors into the probability function during the mapping process, resulting in greater mapping resolution. Overall, HapFM achieves a balance between power, interpretability, and verifiability. In Chapter 3, I developed a computational algorithm to select a pan-genome cohort to maximize the haplotype representativeness of the cohort. Increasing evidence suggests that a single reference genome is often inadequate for plant diversity studies due to extensive sequence and structural rearrangements found in many plant genomes. HapPS was developed to utilize local haplotype information to select the reference cohort. HapPS has three steps: genome-wide block partition, representative haplotype identification, and a genetic algorithm for reference cohort selection. The comparison of HapPS with global-distance-based selection showed that HapPS resulted in significantly higher block coverage in the highly diverse genic regions. The GO-term enrichment analysis of the highly diverse genic regions identified by HapPS showed enrichment for genes involved in defense pathways and abiotic stress, which might identify genomic regions involved in local adaptation. In summary, HapPS provides a systematic and objective solution to pan-genome cohort selection.
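The cohort-selection step described for HapPS can be pictured with a minimal sketch of a genetic algorithm that picks k accessions to maximize haplotype block coverage. It is not the HapPS implementation. The abstract does not define block coverage, so the definition used here (fraction of distinct block/haplotype pairs carried by at least one cohort member) is an assumption. The accession names, haplotype labels, operators and parameter values are likewise hypothetical.

```python
import random

# Hypothetical input: for each accession, the representative haplotype label it
# carries in each of three genome blocks (i.e. the output of the first two steps).
HAPS = {
    "acc1": ("a", "x", "m"),
    "acc2": ("a", "y", "m"),
    "acc3": ("b", "x", "n"),
    "acc4": ("b", "y", "n"),
    "acc5": ("a", "x", "n"),
}


def block_coverage(cohort, haplotypes):
    """Fraction of distinct (block, haplotype) pairs carried by some cohort member."""
    n_blocks = len(next(iter(haplotypes.values())))
    all_pairs = {(b, h[b]) for h in haplotypes.values() for b in range(n_blocks)}
    covered = {(b, haplotypes[a][b]) for a in cohort for b in range(n_blocks)}
    return len(covered) / len(all_pairs)


def select_cohort(haplotypes, k, pop_size=30, generations=60, seed=0):
    """Toy genetic algorithm: evolve k-member cohorts toward higher block coverage."""
    rng = random.Random(seed)
    accessions = sorted(haplotypes)

    def fitness(cohort):
        return block_coverage(cohort, haplotypes)

    population = [frozenset(rng.sample(accessions, k)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            pool = sorted(p1 | p2)  # crossover: draw k members from both parents
            child = set(rng.sample(pool, k))
            if rng.random() < 0.3:  # mutation: swap one member for an outsider
                outsiders = [a for a in accessions if a not in child]
                if outsiders:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outsiders))
            children.append(frozenset(child))
        population = survivors + children
    return max(population, key=fitness)


best = select_cohort(HAPS, k=2)
print(sorted(best), block_coverage(best, HAPS))
```

Because the fittest cohorts always survive to the next generation, the best coverage in the population never decreases. On real data the search space is far larger, which is where a stochastic search of this kind becomes useful.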

    Modelling dependencies in genetic-marker data and its application to haplotype analysis

    The objective of this thesis is to develop new methods to reconstruct haplotypes from phase-unknown genotypes. The need for new methodologies is motivated by the increasing availability of high-resolution marker data for many species. Such markers typically exhibit correlations, a phenomenon known as Linkage Disequilibrium (LD). It is believed that reconstructed haplotypes for markers in high LD can be valuable for a variety of application areas in population genetics, including reconstructing population history and identifying genetic disease variants. Traditionally, haplotype reconstruction methods can be categorized according to whether they operate on a single pedigree or a collection of unrelated individuals. The thesis begins with a critical assessment of the limitations of existing methods, and then presents a unified statistical framework that can accommodate pedigree data, unrelated individuals and tightly linked markers. The framework makes use of graphical models, where inference entails representing the relevant joint probability distribution as a graph and then using associated algorithms to facilitate computation. The graphical model formalism provides invaluable tools to facilitate model specification, visualization, and inference. Once the unified framework is developed, a broad range of simulation studies are conducted using previously published haplotype data. Important contributions include demonstrating the different ways in which the haplotype frequency distribution can impact the accuracy of both the phase assignments and haplotype frequency estimates; evaluating the effectiveness of using family data to improve accuracy for different frequency profiles; and assessing the dangers of treating related individuals as unrelated in an association study.