74 research outputs found
A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on a pedigree
BACKGROUND: When studying genetic diseases in which genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use genotype data collected from individuals in the form of a pedigree. A haplotype is regarded as a hereditary unit, and therefore input pedigrees that are free of mutational events and have a minimum number of genetic recombination events are preferred. These ideas motivated the zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, both without any mutation. So far no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant. RESULTS: Given a pedigree with n individuals, m marker loci, and k mating loops, we propose an algorithm that provides a general solution to the zero-recombinant haplotype configuration problem in O(kmn + k^2 m) time. In addition, this algorithm can be modified to detect inconsistencies within the genotype data without loss of efficiency. The proposed algorithm was subjected to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show that execution time grows linearly with input size when both n and m are larger than 100. For those experiments where n or m is less than 100, the proposed algorithm runs very fast, in thousandths to hundredths of a second, on a personal desktop computer.
CONCLUSIONS: We have developed the first deterministic linear-time algorithm for the zero-recombinant haplotype configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the input size. The proposed algorithm can be modified to detect inconsistencies within the genotype data without loss of efficiency and is expected to be able to handle recombinant and missing data with further extension.
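The Mendelian constraint that the abstract builds on can be illustrated for a single father-mother-child trio. The sketch below is not the ZRHC algorithm from the paper: it only checks, locus by locus, whether the child's unordered genotype can be formed from one paternal and one maternal allele, and it ignores the additional requirement that whole haplotypes are transmitted without recombination. The function names and the tuple-based genotype encoding are assumptions made for illustration.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """Return True if the child's unordered genotype can be formed by taking
    one allele from the father and one from the mother (no mutation)."""
    child_sorted = tuple(sorted(child))
    return any(tuple(sorted((f, m))) == child_sorted
               for f, m in product(father, mother))


def inconsistent_loci(father, mother, child):
    """Return the indices of marker loci where the trio violates
    Mendelian inheritance. Each argument is a list of 2-allele tuples,
    one tuple per marker locus (hypothetical encoding)."""
    return [i for i, (f, m, c) in enumerate(zip(father, mother, child))
            if not mendelian_consistent(f, m, c)]


if __name__ == "__main__":
    father = [(1, 1), (1, 2), (2, 2)]
    mother = [(1, 2), (2, 2), (2, 2)]
    child = [(1, 2), (1, 2), (1, 2)]
    # Only the third locus is inconsistent: both parents are 2/2 there,
    # so the child cannot carry allele 1.
    print(inconsistent_loci(father, mother, child))
```

A per-locus check of this kind takes time proportional to the number of loci for each trio. The paper's algorithm additionally has to keep haplotype assignments consistent across the whole pedigree, including its mating loops.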
Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers
The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been
highly successful in providing a sound combinatorial formulation for the
important problem of genotype phasing on pedigrees. Despite several algorithmic
advances and refinements that led to some efficient algorithms, its
applicability to real datasets has been limited by the absence of some
important characteristics of these data in its formulation, such as mutations,
genotyping errors, and missing data.
In this work, we propose the Haplotype Configuration with Recombinations and
Errors problem (HCRE), which generalizes the original MRHC formulation by
incorporating the two most common characteristics of real data: errors and
missing genotypes (including untyped individuals). Although HCRE is
computationally hard, we propose an exact algorithm for the problem based on a
reduction to the well-known Satisfiability problem. Our reduction exploits
recent advances in the constraint programming literature and, combined with
the use of state-of-the-art SAT solvers, provides a practical solution for the
HCRE problem. Biological soundness of the phasing model and effectiveness (on
both accuracy and performance) of the algorithm are experimentally demonstrated
under several simulated scenarios and on a real dairy cattle population. Comment: 14 pages, 1 figure, 4 tables; the associated software reHCstar is
available at http://www.algolab.eu/reHCsta
Most parsimonious haplotype allele sharing determination
Background: The "common disease – common variant" hypothesis and genome-wide association studies have achieved numerous successes in the last three years, particularly in genetic mapping in human diseases. Nevertheless, the power of association study methods is still low, in particular for quantitative traits, and a description of the full allelic spectrum is deemed still far out of reach. Given the increasing density of available single nucleotide polymorphisms, and as suggested by the block-like structure of the human genome, a popular and successful strategy is to use haplotypes to try to capture the correlation structure of SNPs in regions of little recombination. The key to the success of this strategy is thus the ability to unambiguously determine the haplotype allele sharing status among the members. Association studies based on haplotype sharing status would have significantly reduced degrees of freedom and be able to capture the combined effects of tightly linked causal variants. Results: For pedigree genotype datasets with a medium density of SNPs, we present two methods for determining haplotype allele sharing status among the pedigree members. An extensive simulation study showed that both methods performed nearly perfectly on breakpoint discovery, mutation haplotype allele discovery, and shared chromosomal region discovery. Conclusion: For pedigree genotype datasets, the haplotype allele sharing status among the members can be deterministically, efficiently, and accurately determined, even for very small pedigrees. Given their excellent performance, the presented haplotype allele sharing status determination programs can be useful in many downstream applications, including haplotype-based association studies.
Population genetics of identity by descent
Recent improvements in high-throughput genotyping and sequencing technologies
have afforded the collection of massive, genome-wide datasets of DNA
information from hundreds of thousands of individuals. These datasets, in turn,
provide unprecedented opportunities to reconstruct the history of human
populations and detect genotype-phenotype association. Recently developed
computational methods can identify long-range chromosomal segments that are
identical across samples, and have been transmitted from common ancestors that
lived tens to hundreds of generations in the past. These segments reveal
genealogical relationships that are typically unknown to the carrying
individuals. In this work, we demonstrate that such identical-by-descent (IBD)
segments are informative about a number of relevant population genetics
features: they enable the inference of details about past population size
fluctuations, migration events, and they carry the genomic signature of natural
selection. We derive a mathematical model, based on coalescent theory, that
allows for a quantitative description of IBD sharing across purportedly
unrelated individuals, and develop inference procedures for the reconstruction
of recent demographic events, where classical methodologies are statistically
underpowered. We analyze IBD sharing in several contemporary human populations,
including representative communities of the Jewish Diaspora, Kenyan Maasai
samples, and individuals from several Dutch provinces, in all cases retrieving
evidence of fine-scale demographic events from recent history. Finally, we
expand the presented model to describe distributions for those sites in IBD
shared segments that harbor mutation events, showing how these may be used for
the inference of mutation rates in humans and other species. Comment: Ph.D. thesis
Genomic insights into fine-scale recombination variation in adaptively diverging threespine stickleback fish (Gasterosteus aculeatus)
Meiotic recombination is one of the major molecular mechanisms generating
genetic diversity and influencing genome evolution. By shuffling allelic
combinations, it can directly influence the patterns and efficacy of natural
selection. Studies in various organisms have shown that the rate and placement of
recombination vary substantially within the genome, among individuals,
between sexes and among different species. It is hypothesized that this variation
plays an important role in genome evolution. In this PhD thesis, I investigated the
extent and molecular basis of recombination variation in adaptively diverging
threespine stickleback fish (Gasterosteus aculeatus) to further understand its
evolutionary implications. I used both ChIP-sequencing and whole genome
sequencing of pedigrees to empirically identify and quantify double strand breaks
(DSBs) and meiotic crossovers (COs). Whole genome sequencing of large nuclear
families was performed to identify meiotic crossovers in 36 individuals of
diverging marine and freshwater ecotypes and their hybrids. This produced the
first genome-wide high-resolution sex-specific and ecotype-specific map of
contemporary recombination events in sticklebacks. The results show striking
differences in crossover number and placement between sexes. Females recombine
nearly 1.76 times more than males and their COs are distributed all over the
chromosome while male COs predominantly occur near the chromosomal
periphery. When compared among ecotypes a significant reduction in overall
recombination rate was observed in hybrid females compared to pure forms. Even
though the known loci underlying marine-freshwater adaptive divergence tend to
fall in regions of low recombination, considerable female recombination is
observed in the regions between adaptive loci. This suggests that the sexual
dimorphism in recombination phenotype may have important evolutionary
implications.
At the fine-scale, COs and male DSBs are nonrandomly distributed
involving ‘semi-hot’ hotspots and coldspots of recombination. I report a significant
association of male DSBs and COs with functionally active open chromatin regions
like gene promoters, whereas female COs did not show an association more than
expected by chance. However, a considerable number of COs and DSBs away from
any of the tested open chromatin marks suggests the possibility of additional novel
mechanisms of recombination regulation in sticklebacks.
In addition, we developed a novel method for constructing individualized
recombination maps from pooled gamete DNA using linked read sequencing
technology by 10X Genomics®. We tested the method by contrasting recombination
profiles of gametic and somatic tissue from a hybrid mouse and stickleback fish.
Our pipeline faithfully detects previously described recombination hotspots in
mice at high resolution and identifies many novel hotspots across the genome in
both species, thereby demonstrating the efficiency of the novel method. This
method could be employed for large scale QTL mapping studies to further
understand the genetic basis of recombination variation reported in this thesis.
By bridging the gap between natural populations and lab organisms with
large clutch sizes and tractable genetic tools, this work shows the utility of the
stickleback system and provides important groundwork for further studies of
heterochiasmy and divergence in recombination during adaptation to differing
environments.
Fast and Accurate Haplotype Inference with Hidden Markov Model
The genomes of humans and other diploid organisms consist of paired chromosomes. Haplotype information (the DNA constellation on a single chromosome), which is crucial for disease association analysis and population genetic inference among many other applications, is however hidden in the data generated for diploid organisms (including humans) by modern high-throughput technologies, which cannot distinguish information from the two homologous chromosomes. Here, I consider the haplotype inference problem in two common scenarios of genetic studies: (1) model organisms (such as laboratory mice), where individuals are bred through a prescribed pedigree design; and (2) outbred organisms (such as humans), where individuals (mostly unrelated) are drawn from one or more populations or continental groups. In both scenarios, one individual may share short blocks of chromosomes with other individual(s) or with founder(s) if available. I have developed and implemented methods that, by identifying the shared blocks statistically, reconstruct the haplotypes of the individuals under study accurately and more rapidly, and that solve important related problems including genotype imputation and ancestry inference. My methods, based on hidden Markov models, can scale up to tens of thousands of individuals. Analysis based on my method leads to a new genetic map for the mouse population, which reveals important biological properties of the recombination process. I have also explored study design and empirical quality control for imputation tasks with large-scale datasets from admixed populations. Doctor of Philosophy
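As a rough illustration of how a hidden Markov model can identify shared blocks, the sketch below runs the Viterbi algorithm over a toy "copying" model: the hidden state at each site is the founder haplotype being copied, a state change represents a recombination, and a mismatch represents a genotyping error or mutation. This is a generic textbook construction, not the thesis's method; the function name, the parameters `switch_prob` and `error_prob`, and their default values are assumptions chosen for illustration. The target is assumed to be a non-empty, already phased haplotype.

```python
import math


def viterbi_copy_path(target, founders, switch_prob=0.01, error_prob=0.01):
    """Return the most likely sequence of founder indices copied by `target`.

    target   : non-empty sequence of alleles (one haplotype)
    founders : list of founder haplotypes, each as long as `target`
    """
    n_states = len(founders)
    stay = math.log(1.0 - switch_prob)
    # Log-probability of switching to one specific other founder.
    move = (math.log(switch_prob / (n_states - 1))
            if n_states > 1 else float("-inf"))

    def emit(state, site):
        match = founders[state][site] == target[site]
        return math.log(1.0 - error_prob if match else error_prob)

    # Uniform prior over founders at the first site.
    score = [math.log(1.0 / n_states) + emit(s, 0) for s in range(n_states)]
    back = []  # back[i][s] = best predecessor of state s at site i + 1
    for site in range(1, len(target)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + (stay if p == s else move))
            trans = stay if best_prev == s else move
            new_score.append(score[best_prev] + trans + emit(s, site))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    founders = ["00000000", "11111111"]
    # The target copies founder 0 for four sites, then founder 1.
    print(viterbi_copy_path("00001111", founders))
```

In this toy example the decoded path switches founders once, at the fifth site, which is where a single recombination explains the data better than four consecutive mismatches would.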
Haplotype phasing after joint estimation of recombination and linkage disequilibrium in breeding populations
A novel method for haplotype phasing in families after joint estimation of recombination fraction and linkage disequilibrium (LD) is developed. Results from Monte Carlo computer simulations show that the newly developed EM algorithm is accurate when the true recombination fraction is 0, even for single families of relatively small size. Estimates of recombination fraction and linkage disequilibrium were 0.00 (SD 0.00) and 0.19 (SD 0.03) for simulated recombination fraction and linkage disequilibrium of 0.00 and 0.20, respectively. A genome fragmentation phasing strategy was developed and used for phasing haplotypes in a sire and 36 progeny using the 50k Illumina BeadChip by: a) estimation of the recombination fraction and LD in consecutive SNPs using family information, b) linkage analyses between fragments, c) phasing of haplotypes in parents and progeny and in following generations. Homozygous SNPs in progeny allowed determination of paternal fragment inheritance and deduction of the SNP sequence information of haplotypes from dams. The strategy also allowed detection of genotyping errors. A total of 613 recombination events were detected after linkage analysis was carried out between fragments. Hot and cold spots were identified at the individual (sire) level. Over 90% of SNPs for which both the sire and calf were heterozygous became informative after the phasing of haplotypes. The average of regions of identity between half-sibs, when comparing their maternally inherited haplotypes (with at least 20 SNPs in common), was 0.11, with a maximum of 0.29 and a minimum of 0.05. A Monte Carlo simulation of BTA1 with the same linkage disequilibrium structure and genetic linkage as the cattle family yielded 99.98% and 99.94% correct phases for informative SNPs in sire and calves, respectively.
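The observation that homozygous SNPs in progeny reveal which paternal allele was inherited can be sketched as follows. This is a simplified illustration, not the fragment-based strategy described in the abstract: it assumes the sire's two haplotypes are already phased, treats every SNP independently, skips SNPs where the calf is heterozygous, and counts a recombination event wherever the inferred paternal origin changes between consecutive informative SNPs. All function names and the string encoding of haplotypes are hypothetical.

```python
def sire_allele_origin(sire_hap0, sire_hap1, calf_genotypes):
    """For each SNP, return 0 or 1 for the sire haplotype the calf inherited,
    or None when the SNP is uninformative in this simplified scheme."""
    origin = []
    for a0, a1, (g1, g2) in zip(sire_hap0, sire_hap1, calf_genotypes):
        if a0 == a1 or g1 != g2:
            # Sire homozygous, or calf heterozygous: not resolved here.
            origin.append(None)
        elif g1 == a0:
            origin.append(0)
        elif g1 == a1:
            origin.append(1)
        else:
            # Calf allele absent from the sire: possible genotyping error.
            origin.append(None)
    return origin


def count_crossovers(origin):
    """Count changes of paternal origin between consecutive informative SNPs."""
    informative = [o for o in origin if o is not None]
    return sum(1 for a, b in zip(informative, informative[1:]) if a != b)


if __name__ == "__main__":
    sire_hap0 = "AAAAAA"
    sire_hap1 = "BBBBBB"
    calf = [("A", "A"), ("A", "B"), ("A", "A"),
            ("B", "B"), ("B", "B"), ("A", "A")]
    origin = sire_allele_origin(sire_hap0, sire_hap1, calf)
    print(origin)                    # [0, None, 0, 1, 1, 0]
    print(count_crossovers(origin))  # 2
```

On real data a single genotyping error would look like two closely spaced crossovers in this scheme, which is one reason a practical method works on fragments and linkage between them rather than on isolated SNPs.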