259 research outputs found

    Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers

    The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to efficient algorithms, its applicability to real datasets has been limited because its formulation omits some important characteristics of these data, such as mutations, genotyping errors, and missing data. In this work, we propose the Haplotype Configuration with Recombinations and Errors problem (HCRE), which generalizes the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). Although HCRE is computationally hard, we propose an exact algorithm for the problem based on a reduction to the well-known Satisfiability problem. Our reduction exploits recent advances in the constraint programming literature and, combined with the use of state-of-the-art SAT solvers, provides a practical solution for the HCRE problem. The biological soundness of the phasing model and the effectiveness (in both accuracy and performance) of the algorithm are experimentally demonstrated under several simulated scenarios and on a real dairy cattle population. Comment: 14 pages, 1 figure, 4 tables, the associated software reHCstar is available at http://www.algolab.eu/reHCsta
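
    As a rough illustration of what a reduction to Satisfiability can look like, the sketch below encodes only the genotype and Mendelian inheritance constraints of a father-mother-child trio at a single biallelic locus as CNF clauses, with a missing genotype simply contributing no clauses. It is not the HCRE encoding from the paper: recombination and error variables and their bounds are left out, the variable names (p_f, m_f, s_f, and so on) and function names are hypothetical, and an exhaustive search stands in for a real SAT solver.

```python
from itertools import product

# Hypothetical DIMACS-style numbering: p_x / m_x are the paternal / maternal
# alleles of individual x (f = father, m = mother, c = child); s_f and s_m say
# which copy each parent transmitted (False = its paternal copy).
NAMES = ["p_f", "m_f", "p_m", "m_m", "p_c", "m_c", "s_f", "s_m"]
VAR = {name: i + 1 for i, name in enumerate(NAMES)}


def genotype_clauses(ind, genotype):
    """Clauses tying the two alleles of `ind` to its genotype (None = missing)."""
    if genotype is None:
        return []  # a missing genotype leaves both alleles unconstrained
    p, m = VAR["p_" + ind], VAR["m_" + ind]
    alleles = tuple(sorted(genotype))
    if alleles == (0, 0):
        return [[-p], [-m]]
    if alleles == (1, 1):
        return [[p], [m]]
    return [[p, m], [-p, -m]]  # heterozygous: exactly one allele is 1


def inheritance_clauses(child_allele, source, parent_p, parent_m):
    """Child allele equals the parent's paternal copy if source is False,
    and the parent's maternal copy if source is True."""
    c, s = child_allele, source
    return [[s, -c, parent_p], [s, c, -parent_p],
            [-s, -c, parent_m], [-s, c, -parent_m]]


def encode_trio(father, mother, child):
    """CNF for one locus of a trio; genotypes are allele pairs or None."""
    clauses = (genotype_clauses("f", father) + genotype_clauses("m", mother)
               + genotype_clauses("c", child))
    clauses += inheritance_clauses(VAR["p_c"], VAR["s_f"], VAR["p_f"], VAR["m_f"])
    clauses += inheritance_clauses(VAR["m_c"], VAR["s_m"], VAR["p_m"], VAR["m_m"])
    return clauses


def solve(clauses):
    """Exhaustive search over all assignments (a stand-in for a SAT solver).
    Returns a dict mapping variable name to bool, or None if unsatisfiable."""
    for bits in product((False, True), repeat=len(NAMES)):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return dict(zip(NAMES, bits))
    return None


# Untyped mother, heterozygous father, child homozygous for allele 1.
model = solve(encode_trio((0, 1), None, (1, 1)))
```

    With an untyped mother, any satisfying assignment also fixes a consistent pair of maternal alleles, which is the sense in which missing genotypes are inferred rather than required as input.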

    Efficient genome ancestry inference in complex pedigrees with inbreeding

    Motivation: High-density SNP data of model animal resources provides opportunities for fine-resolution genetic variation studies. These genetic resources are generated through a variety of breeding schemes that involve multiple generations of matings derived from a set of founder animals. In this article, we investigate the problem of inferring the most probable ancestry of resulting genotypes, given a set of founder genotypes. Due to computational difficulty, existing methods either handle only small pedigree data or disregard the pedigree structure. However, large pedigrees of model animal resources often contain repetitive substructures that can be utilized in accelerating computation

    Rapid haplotype inference for nuclear families

    Hapi is a new dynamic programming algorithm that ignores uninformative states and state transitions in order to efficiently compute minimum-recombinant and maximum likelihood haplotypes. When applied to a dataset containing 103 families, Hapi performs 3.8 and 320 times faster than state-of-the-art algorithms. Because Hapi infers both minimum-recombinant and maximum likelihood haplotypes and applies to related individuals, the haplotypes it infers are highly accurate over extended genomic distances. National Institutes of Health (U.S.) (NIH grant 5-T90-DK070069); National Institutes of Health (U.S.) (Grant 5-P01-NS055923); National Science Foundation (U.S.) (Graduate Research Fellowship
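
    To make the idea of a minimum-recombinant dynamic program concrete, here is a minimal sketch for a single child, under the simplifying assumption that both parents are already phased. This is not Hapi itself, which works on unphased nuclear families and prunes uninformative states; the function name and input layout are hypothetical. Each state records which haplotype the child received from each parent at a marker, and moving between states costs one recombination per parent whose source changes.

```python
from itertools import product


def min_recombinations(father, mother, child):
    """Minimum recombinations explaining a child's genotypes.

    father, mother: pairs of phased haplotypes, each a list of 0/1 alleles.
    child: list of unordered genotypes, one (allele, allele) pair per marker.
    Returns an int, or None if no inheritance pattern fits the data.
    """
    if not child:
        return 0
    inf = float("inf")
    # State = (index of the paternal haplotype used, index of the maternal one).
    states = list(product((0, 1), repeat=2))
    cost = None
    for locus, genotype in enumerate(child):
        target = tuple(sorted(genotype))
        # A state is feasible if the transmitted alleles match the genotype.
        feasible = {
            s: tuple(sorted((father[s[0]][locus], mother[s[1]][locus]))) == target
            for s in states
        }
        if cost is None:
            cost = {s: 0 if feasible[s] else inf for s in states}
            continue
        new_cost = {}
        for s in states:
            if not feasible[s]:
                new_cost[s] = inf
                continue
            # One recombination per parent whose source haplotype switches.
            new_cost[s] = min(
                cost[t] + (s[0] != t[0]) + (s[1] != t[1]) for t in states
            )
        cost = new_cost
    best = min(cost.values())
    return None if best == inf else best


father = ([0, 0, 0], [1, 1, 1])
mother = ([0, 0, 0], [0, 0, 0])
switches = min_recombinations(father, mother, [(0, 0), (0, 1), (0, 0)])
```

    In this example the child must take the middle allele from the father's second haplotype and the outer alleles from his first, so any explanation needs two paternal recombinations.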

    Haplotype inference in general pedigrees with two sites

    Background: Genetic disease studies investigate relationships between changes in chromosomes and genetic diseases. Single haplotypes provide useful information for these studies but extracting single haplotypes directly by biochemical methods is expensive. A computational method to infer haplotypes from genotype data is therefore important. We investigate the problem of computing the minimum number of recombination events for general pedigrees with two sites for all members. Results: We show that this NP-hard problem can be parametrically reduced to the Bipartization by Edge Removal problem and therefore can be solved by an O(2^k · n^2) exact algorithm, where n is the number of members and k is the number of recombination events. Conclusions: Our work can therefore be useful for genetic disease studies to track down how changes in haplotypes such as recombinations relate to genetic disease.
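
    The target problem of that reduction, Bipartization by Edge Removal, asks for the fewest edges whose deletion leaves a bipartite graph. The sketch below only illustrates the problem statement with a brute-force search over edge subsets of increasing size; it is not the parameterized algorithm cited in the abstract and does not achieve its running time. The function names and the graph representation (vertices 0..n-1, edges as pairs) are assumptions made for the example.

```python
from collections import deque
from itertools import combinations


def is_bipartite(n, edges):
    """Return True if the graph on vertices 0..n-1 can be 2-colored."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # an odd cycle was found
    return True


def min_edge_bipartization(n, edges):
    """Smallest list of edges whose removal makes the graph bipartite.

    Brute force: tries every subset of k edges for k = 0, 1, 2, ...
    """
    for k in range(len(edges) + 1):
        for removed in combinations(range(len(edges)), k):
            kept = [e for i, e in enumerate(edges) if i not in removed]
            if is_bipartite(n, kept):
                return [edges[i] for i in removed]


# A triangle with a pendant vertex: one edge of the triangle must go.
removed = min_edge_bipartization(4, [(0, 1), (1, 2), (2, 0), (2, 3)])
```

    The search always terminates because removing every edge leaves an edgeless graph, which is trivially bipartite.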

    Haplotypes versus genotypes on pedigrees

    Abstract. Genome sequencing will soon produce haplotype data for individuals. For pedigrees of related individuals, sequencing appears to be an attractive alternative to genotyping. However, methods for pedigree analysis with haplotype data have not yet been developed, and the computational complexity of such problems has been an open question. Furthermore, it is not clear in which scenarios haplotype data would provide better estimates than genotype data for quantities such as recombination rates. To answer these questions, a reduction is given from genotype problem instances to haplotype problem instances, and it is shown that solving the haplotype problem yields the solution to the genotype problem, up to constant factors or coefficients. The pedigree analysis problems we will consider are the likelihood, maximum probability haplotype, and minimum recombination haplotype problems. Two algorithms are introduced: an exponential-time hidden Markov model (HMM) for haplotype data where some individuals are untyped, and a linear-time algorithm for pedigrees having haplotype data for all individuals. Recombination estimates from the general haplotype HMM algorithm are compared to recombination estimates produced by a genotype HMM. Having haplotype data on all individuals produces better estimates. However, having several untyped individuals can drastically reduce the utility of haplotype data. Pedigree analysis, both linkage and association studies, has a long history of important contributions to genetics, including disease-gene finding and some of the first genetic maps for humans. Recent contributions include fine-scale recombination maps in humans [4], regions linked to Schizophrenia that might be missed by genome-wide association studies [11], and insights into the relationship between cystic fibrosis and fertility [13]. 
Algorithms for pedigree problems are of great interest to the computer science community, in part because of connections to machine learning algorithms, optimization methods, and combinatorics [7, 16

    A genetic algorithm based method for stringent haplotyping of family data

    Background: The linkage phase, or haplotype, is an extra level of information that in addition to genotype and pedigree can be useful for reconstructing the inheritance pattern of the alleles in a pedigree, and computing for example Identity By Descent (IBD) probabilities. If a haplotype is provided, the precision of estimated IBD probabilities increases, as long as the haplotype is estimated without errors. It is therefore important to only use haplotypes that are strongly supported by the available data for IBD estimation, to avoid introducing new errors due to erroneous linkage phases. Results: We propose a genetic algorithm based method for haplotype estimation in family data that includes a stringency parameter. This allows the user to decide the error tolerance level when inferring parental origin of the alleles. This is a novel feature compared to existing methods for haplotype estimation. We show that using a high stringency produces haplotype data with few errors, whereas a low stringency provides haplotype estimates in most situations, but with an increased number of errors. Conclusion: By including a stringency criterion in our haplotyping method, the user is able to maintain the error rate at a suitable level for the particular study; one can select anything from haplotyped data with a very small proportion of errors and a higher proportion of non-inferred haplotypes, to data with phase estimates for every marker, when haplotype errors are tolerable. Giving this choice makes the method more flexible and useful in a wide range of applications as it is able to fulfil different requirements regarding the tolerance for haplotype errors, or uncertain marker-phases.

    Bayesian Inference for Retrospective Population Genetics Models Using Markov Chain Monte Carlo Methods

    Genetics, the science of heredity and variation in living organisms, has a central role in medicine, in breeding crops and livestock, and in studying fundamental topics of biological sciences such as evolution and cell functioning. Currently the field of genetics is undergoing rapid development because of recent advances in the technologies by which molecular data can be obtained from living organisms. To extract the most information from such data, the analyses need to be carried out using statistical models that are tailored to take account of the particular genetic processes. In this thesis we formulate and analyze Bayesian models for genetic marker data of contemporary individuals. The major focus is on the modeling of the unobserved recent ancestry of the sampled individuals (say, for tens of generations or so), which is carried out by using explicit probabilistic reconstructions of the pedigree structures accompanied by the gene flows at the marker loci. For such a recent history, the recombination process is the major genetic force that shapes the genomes of the individuals, and it is included in the model by assuming that the recombination fractions between the adjacent markers are known. The posterior distribution of the unobserved history of the individuals is studied conditionally on the observed marker data by using a Markov chain Monte Carlo (MCMC) algorithm. The example analyses consider estimation of the population structure, relatedness structure (both at the level of whole genomes and at each marker separately), and haplotype configurations. For situations where the pedigree structure is partially known, an algorithm to create an initial state for the MCMC algorithm is given. Furthermore, the thesis includes an extension of the model for the recent genetic history to situations where a quantitative phenotype has also been measured from the contemporary individuals. 
In that case the goal is to identify positions on the genome that affect the observed phenotypic values. This task is carried out within the Bayesian framework, where the number and the relative effects of the quantitative trait loci are treated as random variables whose posterior distribution is studied conditionally on the observed genetic and phenotypic data. In addition, the thesis contains an extension of a widely used haplotyping method, the PHASE algorithm, to settings where genetic material from several individuals has been pooled together, and the allele frequencies of each pool are determined in a single genotyping. Genetics, the science of heredity, studies the structure, function, and variation of hereditary material, as well as other factors that affect variation between individuals in the living world. Current laboratory methods make it possible to collect increasingly precise and extensive molecular-level data from organisms. Handling such data requires statistical models that exploit, as precisely as possible, the available knowledge of the biological processes that gave rise to the collected data. This thesis develops Bayesian statistical models for certain genetic processes and applies the models to example data sets. The main emphasis is on modeling the shared recent history of individuals. In the simplest case, the starting point is a set of present-day individuals whose hereditary material is assumed known at certain marker loci on the basis of genotype measurements made in the laboratory. The statistical model is used to estimate the probabilities of different recent histories connecting the individuals, which are described by pedigree structures and the inheritance paths of the marker genes. The time spans considered are at most tens of generations. 
The thesis also uses the recent-history model in a gene mapping application whose aim is to locate positions in the genome that affect a particular trait measured or observed in the individuals. Other applications include estimating population structure and estimating degrees of relatedness between individuals