21 research outputs found

    Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers

    The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to some efficient algorithms, its applicability to real datasets has been limited by the absence of some important characteristics of these data in its formulation, such as mutations, genotyping errors, and missing data. In this work, we propose the Haplotype Configuration with Recombinations and Errors problem (HCRE), which generalizes the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). Although HCRE is computationally hard, we propose an exact algorithm for the problem based on a reduction to the well-known Satisfiability problem. Our reduction exploits recent progress in the constraint programming literature and, combined with the use of state-of-the-art SAT solvers, provides a practical solution for the HCRE problem. Biological soundness of the phasing model and effectiveness (on both accuracy and performance) of the algorithm are experimentally demonstrated under several simulated scenarios and on a real dairy cattle population. Comment: 14 pages, 1 figure, 4 tables; the associated software reHCstar is available at http://www.algolab.eu/reHCsta

    Most parsimonious haplotype allele sharing determination

    Background: The "common disease – common variant" hypothesis and genome-wide association studies have achieved numerous successes in the last three years, particularly in genetic mapping of human diseases. Nevertheless, the power of association study methods is still low, in particular for quantitative traits, and a description of the full allelic spectrum is deemed still far from reach. Given the increasing density of available single nucleotide polymorphisms (SNPs), and as suggested by the block-like structure of the human genome, a popular and fruitful strategy is to use haplotypes to capture the correlation structure of SNPs in regions of little recombination. The key to the success of this strategy is thus the ability to unambiguously determine the haplotype allele sharing status among the pedigree members. Association studies based on haplotype sharing status would have significantly reduced degrees of freedom and would be able to capture the combined effects of tightly linked causal variants. Results: For pedigree genotype datasets with a medium density of SNPs, we present two methods for determining haplotype allele sharing status among the pedigree members. An extensive simulation study showed that both methods performed nearly perfectly on breakpoint discovery, mutation haplotype allele discovery, and shared chromosomal region discovery. Conclusion: For pedigree genotype datasets, the haplotype allele sharing status among the members can be deterministically, efficiently, and accurately determined, even for very small pedigrees. Given their excellent performance, the presented haplotype allele sharing status determination programs can be useful in many downstream applications, including haplotype-based association studies.

    Haplotypes versus genotypes on pedigrees

    Abstract. Genome sequencing will soon produce haplotype data for individuals. For pedigrees of related individuals, sequencing appears to be an attractive alternative to genotyping. However, methods for pedigree analysis with haplotype data have not yet been developed, and the computational complexity of such problems has been an open question. Furthermore, it is not clear in which scenarios haplotype data would provide better estimates than genotype data for quantities such as recombination rates. To answer these questions, a reduction is given from genotype problem instances to haplotype problem instances, and it is shown that solving the haplotype problem yields the solution to the genotype problem, up to constant factors or coefficients. The pedigree analysis problems we will consider are the likelihood, maximum probability haplotype, and minimum recombination haplotype problems. Two algorithms are introduced: an exponential-time hidden Markov model (HMM) for haplotype data where some individuals are untyped, and a linear-time algorithm for pedigrees having haplotype data for all individuals. Recombination estimates from the general haplotype HMM algorithm are compared to recombination estimates produced by a genotype HMM. Having haplotype data on all individuals produces better estimates. However, having several untyped individuals can drastically reduce the utility of haplotype data. Pedigree analysis, both linkage and association studies, has a long history of important contributions to genetics, including disease-gene finding and some of the first genetic maps for humans. Recent contributions include fine-scale recombination maps in humans [4], regions linked to Schizophrenia that might be missed by genome-wide association studies [11], and insights into the relationship between cystic fibrosis and fertility [13]. 
Algorithms for pedigree problems are of great interest to the computer science community, in part because of connections to machine learning algorithms, optimization methods, and combinatorics [7, 16

    Algorithms for Computational Genetics Epidemiology

    The most intriguing problems in genetics epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle to ambiguity resolution is that the physical methods for separating the two haplotypes of an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to degrade association accuracy. Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs. Recent successes in high-throughput genotyping technologies have drastically increased the length of available SNP sequences. This elevates the importance of informative SNP selection for compacting huge genetic datasets so that fine genotype analysis becomes feasible. Finally, even if complete and accurate data are available, it is unclear whether common statistical methods can determine susceptibility to complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) a significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility.

    Selected Works in Bioinformatics

    This book consists of nine chapters covering a variety of bioinformatics subjects, ranging from database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, and functional characterization of inherently unfolded proteins/regions, to protein interaction networks and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.

    Genomic Selection, Quantitative Trait Loci and Genome-Wide Association Mapping for Spring Bread Wheat (Triticum aestivum L.) Improvement

    Molecular breeding involves the use of molecular markers to identify and characterize genes that control quantitative traits. Two of the most commonly used methods to dissect complex traits in plants are linkage analysis and association mapping. These methods are used to identify markers associated with quantitative trait loci (QTL) that underlie trait variation, which are used for marker assisted selection (MAS). Marker assisted selection has been successful in improving traits controlled by moderate to large effect QTL; however, it has limited application for traits controlled by many QTL with small effects. Genomic selection (GS) is suggested to overcome the limitation of MAS and improve genetic gain of quantitative traits. GS is a type of MAS that estimates the effects of genome-wide markers to calculate genomic estimated breeding values (GEBVs) for individuals without phenotypic records. In recent years, GS has been gaining momentum in crop breeding programs, but there is limited empirical evidence for its practical application. The objectives of this study were to: i) evaluate the performance of various statistical approaches and models to predict agronomic and end-use quality traits using empirical data in spring bread wheat, ii) determine the effects of training population (TP) size, marker density, and population structure on genomic prediction accuracy, iii) examine GS prediction accuracy when modelling genotype-by-environment interaction (G × E) using different approaches, iv) detect marker-trait associations for agronomic and end-use quality traits in spring bread wheat, v) evaluate the effects of TP composition, cross-validation technique, and genetic relationship between the TP and SC on GS accuracy, and vi) compare genomic and phenotypic prediction accuracy.
Six studies were conducted to meet these objectives using two populations of 231 and 304 spring bread wheat lines that were genotyped with the wheat 90K SNP array and phenotyped for nine agronomic and end-use quality traits. The main finding across these studies is that GS can accurately predict GEBVs for wheat traits and can be used to make predictions in different environments; thus, GS should be applied in wheat breeding programs. Each study provides specific insights into some of the advantages and limitations of different GS approaches, and gives recommendations for the application of GS in future breeding programs. Specific recommendations include using the GS model BayesB (especially for large effect QTL) for genomic prediction in a single environment, using the reaction norm model for across-year genomic prediction, using a large TP size for making accurate genomic predictions, and not making across-population genomic predictions except for highly related populations.

    Genetic mapping in polyploids

    Many of our most important crop species are polyploid – an unusual phenomenon whereby each chromosome is present in multiple copies (more than the usual two copies). The most common such arrangement is tetraploidy, where each chromosome is present four times. Plant species can tolerate this condition quite well (the same cannot be said of animals or humans). In fact, polyploidy can confer certain advantages such as larger fruits and flowers, seedless fruits (useful for fruit growers) or improved tolerance to environmental stresses. However, carrying multiple copies of each chromosome complicates things, particularly when crop breeders would like to use DNA information to help inform selection decisions. This PhD project looked at how DNA information of polyploids should be best analysed, developing methods and new software tools to achieve this. We analysed DNA information from polyploid crops such as potato, rose and chrysanthemum, yielding many novel insights and important results.