On the Complexity of the Single Individual SNP Haplotyping Problem
We present several new results pertaining to haplotyping. These results
concern the combinatorial problem of reconstructing haplotypes from incomplete
and/or imperfectly sequenced haplotype fragments. We consider the complexity of
the problems Minimum Error Correction (MEC) and Longest Haplotype
Reconstruction (LHR) for different restrictions on the input data.
Specifically, we look at the gapless case, where every row of the input
corresponds to a gapless haplotype-fragment, and the 1-gap case, where at most
one gap per fragment is allowed. We prove that MEC is APX-hard in the 1-gap
case and still NP-hard in the gapless case. In addition, we question earlier
claims that MEC is NP-hard even when the input matrix is restricted to being
completely binary. Concerning LHR, we show that this problem is NP-hard and
APX-hard in the 1-gap case (and thus also in the general case), but is
polynomial time solvable in the gapless case.
Comment: 26 pages. Related to the WABI2005 paper, "On the Complexity of
Several Haplotyping Problems", but with more/different results. This paper
has just been submitted to the IEEE/ACM Transactions on Computational Biology
and Bioinformatics and we are awaiting a decision on acceptance. It differs
from the mid-August version of this paper because here we prove that 1-gap
LHR is APX-hard. (In the earlier version of the paper we could prove only
that it was NP-hard.)
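To make the MEC objective and the gapless/1-gap restriction in the abstract above concrete, here is a minimal sketch. It is not the paper's construction. The string encoding ('0', '1', and '-' for a site the fragment does not cover), the function names and the toy fragments are all assumptions made for illustration.

```python
def mismatches(fragment, haplotype):
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != '-' and f != h)


def mec_score(fragments, h1, h2):
    """Total corrections needed when each fragment is assigned to its closer haplotype."""
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


def gap_count(fragment):
    """Number of maximal runs of '-' lying between two covered sites."""
    core = fragment.strip('-')  # leading/trailing holes are not gaps
    return sum(1 for run in core.replace('0', '1').split('1') if run)


# Hypothetical toy instance: five fragments over four SNP sites.
fragments = ["0011", "0-11", "1100", "1-01", "00-0"]
print(mec_score(fragments, "0011", "1100"))    # 2
print(max(gap_count(f) for f in fragments))    # 1, so this is a 1-gap instance
```

Scoring a given haplotype pair is easy; the hardness results concern finding the pair that minimises this score.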
Pure Parsimony Xor Haplotyping
The haplotype resolution from xor-genotype data has been recently formulated
as a new model for genetic studies. The xor-genotype data is a cheaply
obtainable type of data distinguishing heterozygous from homozygous sites
without identifying the homozygous alleles. In this paper we propose a
formulation based on a well-known model used in haplotype inference: pure
parsimony. We exhibit exact solutions of the problem by providing polynomial
time algorithms for some restricted cases and a fixed-parameter algorithm for
the general case. These results are based on some interesting combinatorial
properties of a graph representation of the solutions. Furthermore, we show
that the problem has a polynomial time k-approximation, where k is the maximum
number of xor-genotypes containing a given SNP. Finally, we propose a heuristic
and produce an experimental analysis showing that it scales to real-world large
instances taken from the HapMap project.
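The xor-genotype described above can be illustrated with a short sketch. It assumes haplotypes are encoded as equal-length 0/1 strings and represents an xor-genotype as the set of heterozygous site indices. The function names and example data are hypothetical, and the check below only verifies a candidate solution; it is not one of the paper's algorithms.

```python
from itertools import combinations_with_replacement


def xor_genotype(h1, h2):
    """Set of site indices where the two haplotypes differ (heterozygous sites)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return {i for i, (a, b) in enumerate(zip(h1, h2)) if a != b}


def resolves(haplotypes, genotypes):
    """True if every xor-genotype is produced by some pair drawn from `haplotypes`."""
    produced = {
        frozenset(xor_genotype(a, b))
        for a, b in combinations_with_replacement(haplotypes, 2)
    }
    return all(frozenset(g) in produced for g in genotypes)


# Two different haplotype pairs give the same xor-genotype:
# the homozygous alleles are not identified.
print(xor_genotype("0101", "0011") == xor_genotype("1100", "1010"))  # True

# Three haplotypes are enough to explain these three xor-genotypes.
genotypes = [{0, 1}, {1, 2}, {0, 2}]
print(resolves(["000", "110", "011"], genotypes))  # True
```

Under pure parsimony the goal is the smallest haplotype set for which such a check succeeds.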
Estimation of N-acetyltransferase 2 haplotypes
N-Acetyltransferase 2 (NAT2) genotyping may yield ambiguous allele combinations in a considerable percentage of cases. PHASE 2.1 is a statistical program designed to estimate the probability of different allele combinations. We investigated the haplotypes of 2088 subjects genotyped for NAT2 by standard PCR/RFLP methods. In 856 of the 2088 cases the genotype was clearly defined by PCR/RFLP alone. In many of the remaining cases the program clearly identified the most probable allele combination: for the pair *5A/*6C, *5B/*6A, the probability of *5B/*6A is 99%, whereas the alternative combination *5A/*6C can be neglected. Other combinations cannot be assigned with comparably high probability. For example, for the pair *5A/*5C, *5B/*5D, the estimated probability of *5A/*5C is 69%, whereas that of *5B/*5D is only 31%. For the two most frequently observed constellations in our data [(*12A/*5B, *12C/*5C); (*12A/*6A, *12B/*6B, *4/*6C)], the probabilities of the allele combinations were estimated as follows: *12A/*5B, 98%; *12C/*5C, 1.4%; and *12A/*6A, 82%; *4/*6C, 17%; *12B/*6B, 0%. Estimating the NAT2 haplotype is important because the assignment of the NAT2 alleles *12A, *12B or *13 to a rapid or slow genotype has been a matter of controversy. In addition, the classification of alleles in subjects without a clearly assigned genotype can result in either a rapid or a slow acetylation state. This assignment plays an important role in surveys of bladder cancer cases in the context of occupational exposure to aromatic amines.
Keywords: PHASE 2.1, NAT2 genotyping, single nucleotide polymorphism
Joint Haplotype Assembly and Genotype Calling via Sequential Monte Carlo Algorithm
Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate erroneous data, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data and simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed.
Conclusions: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes.
National Science Foundation CCF-1320273
Electrical and Computer Engineering
Minimum error correction-based haplotype assembly: considerations for long read data
The single nucleotide polymorphism (SNP) is the most widely studied type of
genetic variation. A haplotype is defined as the sequence of alleles at SNP
sites on each haploid chromosome. Haplotype information is essential in
unravelling the genome-phenotype association. Haplotype assembly is a
well-known approach for reconstructing haplotypes, exploiting reads generated
by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often
used for reconstruction of haplotypes from reads. However, problems with the
MEC metric have been reported. Here, we investigate the MEC approach to
demonstrate that it may result in incorrectly reconstructed haplotypes for
devices that produce error-prone long reads. Specifically, we evaluate this
approach for devices developed by Illumina, Pacific BioSciences and Oxford
Nanopore Technologies. We show that imprecise haplotypes may be reconstructed
with a lower MEC than that of the exact haplotype. The performance of MEC is
explored for different coverage levels and error rates of data. Our simulation
results reveal that in order to avoid incorrect MEC-based haplotypes, a
coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.
Comment: 17 pages, 6 figures
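The claim that an imprecise haplotype can have a lower MEC than the exact one can be reproduced on a toy instance. The sketch below is only an illustration under strong assumptions: every site is heterozygous (so the second haplotype is the complement of the first), coverage is very low, and the reads and their errors are hand-picked rather than simulated. It does not reproduce the paper's experiments, and all names are hypothetical.

```python
from itertools import product


def mec(reads, h):
    """MEC of haplotype h and its complement; '-' marks an uncovered site."""
    comp = ''.join('1' if c == '0' else '0' for c in h)

    def dist(read, hap):
        return sum(1 for a, b in zip(read, hap) if a != '-' and a != b)

    return sum(min(dist(r, h), dist(r, comp)) for r in reads)


def best_haplotype(reads, n):
    """Brute-force MEC minimiser over all 2**n haplotypes (toy sizes only)."""
    candidates = (''.join(bits) for bits in product('01', repeat=n))
    return min(candidates, key=lambda h: mec(reads, h))


true_h = "000"            # true pair is 000 / 111
reads = ["001", "110"]    # one read per haplotype, both with an error at the last site

print(mec(reads, true_h))              # 2
best = best_haplotype(reads, 3)
print(best, mec(reads, best))          # 001 0
```

With both reads carrying an error at the same site, the MEC optimum absorbs the errors into the haplotype. Higher coverage makes such coincidences less likely.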
PWHATSHAP: efficient haplotyping for future generation sequencing
Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WhatsHap is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e. coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments. Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WhatsHap, which increases with coverage. Conclusions: Due to its structure and management of the large datasets, the parallelisation of WhatsHap posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework.
The result, pWhatsHap, is a freely available toolkit that improves the efficiency of the analysis of genomics information.
Algorithmic approaches for the single individual haplotyping problem
Since its introduction in 2001, the Single Individual Haplotyping problem has received ever-increasing attention from the scientific community. In this paper we survey, in the form of an annotated bibliography, the developments in the study of the problem from its origin to the present day.