74 research outputs found

    A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on a pedigree

    BACKGROUND: When studying genetic diseases in which genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use genotype data collected from individuals in the form of a pedigree. A haplotype is regarded as a hereditary unit, and therefore input pedigrees are preferred that are free of mutational events and have a minimum number of genetic recombination events. These ideas motivated the zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, both without any mutation. So far no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant.
RESULTS: Given a pedigree with n individuals, m marker loci, and k mating loops, we propose an algorithm that provides a general solution to the zero-recombinant haplotype configuration problem in O(kmn + k^2 m) time. In addition, this algorithm can be modified to detect inconsistencies within the genotype data without loss of efficiency. The proposed algorithm was subjected to 12,000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show that execution time grows linearly with input size when both n and m are larger than 100. For those experiments where n or m is less than 100, the proposed algorithm runs very fast, in thousandths to hundredths of a second, on a personal desktop computer.
CONCLUSIONS: We have developed the first deterministic linear-time algorithm for the zero-recombinant haplotype configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the input size. The proposed algorithm can be modified to detect inconsistency within the genotype data without loss of efficiency and is expected to be able to handle recombinant and missing data with further extension.
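
The zero-recombinant constraint described in this abstract can be illustrated for a single father-mother-child trio. The sketch below is not the linear-time algorithm from the paper; it is a brute-force check under assumed conventions (haplotypes as tuples of alleles, genotypes as unordered allele pairs per locus), and the function names are hypothetical.

```python
from itertools import product


def genotype_of(hap_a, hap_b):
    """Unordered genotype at each locus implied by two haplotypes."""
    return [tuple(sorted(pair)) for pair in zip(hap_a, hap_b)]


def consistent_transmissions(father, mother, child_genotype):
    """Return every (paternal, maternal) haplotype pair that explains the
    child genotype with no recombination and no mutation.

    father, mother: each a pair of haplotypes (tuples of alleles).
    child_genotype: one unordered allele pair per locus.
    """
    target = [tuple(sorted(g)) for g in child_genotype]
    return [
        (p, m)
        for p, m in product(father, mother)
        if genotype_of(p, m) == target
    ]


# Illustrative data only (three biallelic loci, alleles coded 0/1).
father = ((0, 1, 0), (1, 0, 0))
mother = ((0, 0, 1), (1, 1, 1))

child = [(0, 0), (0, 1), (0, 1)]
options = consistent_transmissions(father, mother, child)

# A child genotype that no whole-haplotype transmission can explain.
bad_child = [(1, 1), (1, 1), (0, 0)]
no_options = consistent_transmissions(father, mother, bad_child)
```

An empty result, as for `bad_child`, signals that the trio is inconsistent with zero-recombinant Mendelian inheritance for the given parental haplotypes. The check tries all four parental haplotype combinations, so it only covers one trio with known parental phases; the abstract's algorithm solves the harder problem of inferring phases across a whole pedigree.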

    Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers

    The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to some efficient algorithms, its applicability to real datasets has been limited by the absence of some important characteristics of these data in its formulation, such as mutations, genotyping errors, and missing data. In this work, we propose the Haplotype Configuration with Recombinations and Errors problem (HCRE), which generalizes the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). Although HCRE is computationally hard, we propose an exact algorithm for the problem based on a reduction to the well-known Satisfiability problem. Our reduction exploits recent progress in the constraint programming literature and, combined with the use of state-of-the-art SAT solvers, provides a practical solution for the HCRE problem. The biological soundness of the phasing model and the effectiveness (in both accuracy and performance) of the algorithm are experimentally demonstrated under several simulated scenarios and on a real dairy cattle population. Comment: 14 pages, 1 figure, 4 tables, the associated software reHCstar is available at http://www.algolab.eu/reHCsta

    Most parsimonious haplotype allele sharing determination

    BACKGROUND: The "common disease – common variant" hypothesis and genome-wide association studies have achieved numerous successes in the last three years, particularly in genetic mapping in human diseases. Nevertheless, the power of the association study methods is still low, in particular on quantitative traits, and the description of the full allelic spectrum is deemed still far from reach. Given the increasing density of single nucleotide polymorphisms available, and as suggested by the block-like structure of the human genome, a popular and prosperous strategy is to use haplotypes to try to capture the correlation structure of SNPs in regions of little recombination. The key to the success of this strategy is thus the ability to unambiguously determine the haplotype allele sharing status among the members. Association studies based on haplotype sharing status would have significantly reduced degrees of freedom and be able to capture the combined effects of tightly linked causal variants.
RESULTS: For pedigree genotype datasets of medium density of SNPs, we present two methods for haplotype allele sharing status determination among the pedigree members. An extensive simulation study showed that both methods performed nearly perfectly on breakpoint discovery, mutation haplotype allele discovery, and shared chromosomal region discovery.
CONCLUSION: For pedigree genotype datasets, the haplotype allele sharing status among the members can be deterministically, efficiently, and accurately determined, even for very small pedigrees. Given their excellent performance, the presented haplotype allele sharing status determination programs can be useful in many downstream applications including haplotype-based association studies.

    Population genetics of identity by descent

    Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species. Comment: Ph.D. thesis

    Genomic insights into fine-scale recombination variation in adaptively diverging threespine stickleback fish (Gasterosteus aculeatus)

    Meiotic recombination is one of the major molecular mechanisms generating genetic diversity and influencing genome evolution. By shuffling allelic combinations, it can directly influence the patterns and efficacy of natural selection. Studies in various organisms have shown that the rate and placement of recombination vary substantially within the genome, among individuals, between sexes and among different species. It is hypothesized that this variation plays an important role in genome evolution. In this PhD thesis, I investigated the extent and molecular basis of recombination variation in adaptively diverging threespine stickleback fish (Gasterosteus aculeatus) to further understand its evolutionary implications. I used both ChIP-sequencing and whole genome sequencing of pedigrees to empirically identify and quantify double strand breaks (DSBs) and meiotic crossovers (COs). Whole genome sequencing of large nuclear families was performed to identify meiotic crossovers in 36 individuals of diverging marine and freshwater ecotypes and their hybrids. This produced the first genome-wide high-resolution sex-specific and ecotype-specific map of contemporary recombination events in sticklebacks. The results show striking differences in crossover number and placement between sexes. Females recombine nearly 1.76 times more than males, and their COs are distributed all over the chromosome, while male COs predominantly occur near the chromosomal periphery. When compared among ecotypes, a significant reduction in overall recombination rate was observed in hybrid females compared to pure forms. Even though the known loci underlying marine-freshwater adaptive divergence tend to fall in regions of low recombination, considerable female recombination is observed in the regions between adaptive loci. This suggests that the sexual dimorphism in recombination phenotype may have important evolutionary implications.
At the fine scale, COs and male DSBs are nonrandomly distributed, involving ‘semi-hot’ hotspots and coldspots of recombination. I report a significant association of male DSBs and COs with functionally active open chromatin regions like gene promoters, whereas female COs did not show an association greater than expected by chance. However, a considerable number of COs and DSBs away from any of the tested open chromatin marks suggests the possibility of additional novel mechanisms of recombination regulation in sticklebacks. In addition, we developed a novel method for constructing individualized recombination maps from pooled gamete DNA using linked read sequencing technology by 10X Genomics®. We tested the method by contrasting recombination profiles of gametic and somatic tissue from a hybrid mouse and stickleback fish. Our pipeline faithfully detects previously described recombination hotspots in mice at high resolution and identifies many novel hotspots across the genome in both species, thereby demonstrating the efficiency of the novel method. This method could be employed for large scale QTL mapping studies to further understand the genetic basis of recombination variation reported in this thesis. By bridging the gap between natural populations and lab organisms with large clutch sizes and tractable genetic tools, this work shows the utility of the stickleback system and provides important groundwork for further studies of heterochiasmy and divergence in recombination during adaptation to differing environments.

    Fast and Accurate Haplotype Inference with Hidden Markov Model

    The genomes of humans and other diploid organisms consist of paired chromosomes. The haplotype information (DNA constellation on one single chromosome), which is crucial for disease association analysis and population genetic inference among many others, is however hidden in the data generated for diploid organisms (including humans) by modern high-throughput technologies, which cannot distinguish information from two homologous chromosomes. Here, I consider the haplotype inference problem in two common scenarios of genetic studies: 1. Model organisms (such as laboratory mice): Individuals are bred through prescribed pedigree design. 2. Out-bred organisms (such as humans): Individuals (mostly unrelated) are drawn from one or more populations or continental groups. In the two scenarios, one individual may share short blocks of chromosomes with other individual(s) or with founder(s) if available. I have developed and implemented methods that, by identifying the shared blocks statistically, reconstruct the haplotypes for individuals under study more accurately and more rapidly and solve important related problems including genotype imputation and ancestry inference. My methods, based on a hidden Markov model, can scale up to tens of thousands of individuals. Analysis based on my method leads to a new genetic map in a mouse population which reveals important biological properties of the recombination process. I have also explored the study design and empirical quality control for imputation tasks with large scale datasets from admixed populations. Doctor of Philosophy

    Haplotype phasing after joint estimation of recombination and linkage disequilibrium in breeding populations

    A novel method for haplotype phasing in families after joint estimation of recombination fraction and linkage disequilibrium is developed. Results from Monte Carlo computer simulations show that the newly developed EM algorithm is accurate if the true recombination fraction is 0, even for single families of relatively small sizes. Estimates of recombination fraction and linkage disequilibrium were 0.00 (SD 0.00) and 0.19 (SD 0.03) for simulated recombination fraction and linkage disequilibrium of 0.00 and 0.20, respectively. A genome fragmentation phasing strategy was developed and used for phasing haplotypes in a sire and 36 progeny using the 50k Illumina BeadChip by: a) estimation of the recombination fraction and LD in consecutive SNPs using family information, b) linkage analyses between fragments, c) phasing of haplotypes in parents and progeny and in following generations. Homozygous SNPs in progeny allowed determination of paternal fragment inheritance, and deduction of SNP sequence information of haplotypes from dams. The strategy also allowed detection of genotyping errors. A total of 613 recombination events were detected after linkage analysis was carried out between fragments. Hot and cold spots were identified at the individual (sire) level. SNPs for which the sire and calf were heterozygotes became informative (over 90%) after the phasing of haplotypes. The average of regions of identity between half-sibs, comparing their maternally inherited haplotypes (with at least 20 SNPs in common), was 0.11, with a maximum of 0.29 and a minimum of 0.05. A Monte Carlo simulation of BTA1 with the same linkage disequilibrium structure and genetic linkage as the cattle family yielded 99.98% and 99.94% correct phases for informative SNPs in sire and calves, respectively.
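
The abstract's remark that homozygous SNPs in progeny reveal paternal inheritance can be sketched as follows. This is a minimal illustration, not the authors' EM-based method: it assumes the sire's two haplotypes are already phased, that SNPs are biallelic and free of genotyping errors, and all names are hypothetical.

```python
def paternal_origin(sire_hap_a, sire_hap_b, progeny_genotype):
    """Label each SNP 'a' or 'b' for the sire haplotype that was transmitted.

    A SNP is informative only when the progeny is homozygous (so the paternal
    allele is known) and the sire is heterozygous (so the allele identifies
    one of his haplotypes). Uninformative SNPs are labeled None.
    """
    origins = []
    for a, b, (g1, g2) in zip(sire_hap_a, sire_hap_b, progeny_genotype):
        if g1 == g2 and a != b:
            origins.append("a" if g1 == a else "b")
        else:
            origins.append(None)
    return origins


def count_switches(origins):
    """Count changes of transmitted haplotype between consecutive informative
    SNPs; each change marks a candidate recombination event."""
    informative = [o for o in origins if o is not None]
    return sum(1 for prev, cur in zip(informative, informative[1:]) if prev != cur)


# Illustrative data only (five SNPs, alleles coded 0/1).
sire_a = [0, 0, 1, 1, 0]
sire_b = [1, 1, 0, 0, 1]
calf = [(0, 0), (0, 1), (1, 1), (0, 0), (1, 1)]

origins = paternal_origin(sire_a, sire_b, calf)
switches = count_switches(origins)
```

Here the calf carries sire haplotype `a` at the first and third SNPs and haplotype `b` at the last two, which implies a single candidate crossover between the third and fourth SNPs. A real pipeline would also need to account for genotyping errors, which can mimic such switches.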