On the fixed parameter tractability and approximability of the minimum error correction problem
Haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum Error Correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections.
By using novel combinatorial properties of MEC instances, we are able to provide new results on the fixed-parameter tractability and approximability of MEC. In particular, we show that MEC is in FPT when parameterized by the number of corrections, and, on “gapless” instances, it is in FPT also when parameterized by the length of the fragments, whereas the result known in the literature forces the reconstruction of complementary haplotypes. Then, we show that MEC cannot be approximated within any constant factor while it is approximable within factor O(log nm) where nm is the size of the input. Finally, we provide a practical 2-approximation algorithm for the Binary MEC, a variant of MEC that has been applied in the framework of clustering binary data.
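To make the MEC objective concrete, here is a minimal sketch in Python. It assumes fragments are equal-length strings over `0`, `1` and `-` (where `-` marks a position the fragment does not cover); the function names `mec_cost` and `brute_force_mec` are illustrative and are not taken from the paper. The exhaustive search is exponential in the number of positions and is only meant to show what is being minimized, not the fixed-parameter or approximation algorithms the abstract refers to.

```python
from itertools import product


def mec_cost(fragments, h1, h2, hole="-"):
    """Number of base corrections needed so every fragment agrees with h1 or h2.

    Each fragment is charged its mismatch count against the closer haplotype;
    positions marked with `hole` are not covered and never count.
    """
    def mismatches(fragment, haplotype):
        return sum(1 for f, h in zip(fragment, haplotype) if f != hole and f != h)

    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


def brute_force_mec(fragments):
    """Try every pair of binary haplotypes (tiny instances only).

    Returns (cost, h1, h2) for the first optimal pair found.
    """
    m = len(fragments[0])
    best = None
    for h1 in product("01", repeat=m):
        for h2 in product("01", repeat=m):
            cost = mec_cost(fragments, h1, h2)
            if best is None or cost < best[0]:
                best = (cost, "".join(h1), "".join(h2))
    return best
```

For example, `brute_force_mec(["0011", "0011", "1100", "1100", "0111"])` needs one correction: the last fragment is one flip away from `0011`.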
On the Complexity of the Single Individual SNP Haplotyping Problem
We present several new results pertaining to haplotyping. These results
concern the combinatorial problem of reconstructing haplotypes from incomplete
and/or imperfectly sequenced haplotype fragments. We consider the complexity of
the problems Minimum Error Correction (MEC) and Longest Haplotype
Reconstruction (LHR) for different restrictions on the input data.
Specifically, we look at the gapless case, where every row of the input
corresponds to a gapless haplotype-fragment, and the 1-gap case, where at most
one gap per fragment is allowed. We prove that MEC is APX-hard in the 1-gap
case and still NP-hard in the gapless case. In addition, we question earlier
claims that MEC is NP-hard even when the input matrix is restricted to being
completely binary. Concerning LHR, we show that this problem is NP-hard and
APX-hard in the 1-gap case (and thus also in the general case), but is
polynomial time solvable in the gapless case.
Comment: 26 pages. Related to the WABI2005 paper, "On the Complexity of
Several Haplotyping Problems", but with more/different results. This paper
has just been submitted to the IEEE/ACM Transactions on Computational Biology
and Bioinformatics and we are awaiting a decision on acceptance. It differs
from the mid-August version of this paper because here we prove that 1-gap
LHR is APX-hard. (In the earlier version of the paper we could prove only
that it was NP-hard.)
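The gapless and 1-gap restrictions can be illustrated with a short Python sketch. It assumes a fragment is a string over `0`, `1` and `-`, and that a gap is a maximal run of `-` lying strictly between two covered positions, so leading and trailing holes do not count. The helper name `count_gaps` is hypothetical and the encoding is an assumption made for illustration.

```python
def count_gaps(fragment, hole="-"):
    """Count maximal runs of holes that lie between two covered positions."""
    core = fragment.strip(hole)  # leading/trailing holes are not gaps
    gaps = 0
    in_gap = False
    for symbol in core:
        if symbol == hole:
            if not in_gap:
                gaps += 1  # a new run of holes starts here
                in_gap = True
        else:
            in_gap = False
    return gaps


def is_gapless(fragment):
    """True when the covered positions form one contiguous block."""
    return count_gaps(fragment) == 0
```

Under this convention, `--0110--` is gapless, `01--10` has one gap, and `0-1-0` has two.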
Haplotype Assembly: An Information Theoretic View
This paper studies the haplotype assembly problem from an information
theoretic perspective. A haplotype is a sequence of nucleotide bases on a
chromosome, often conveniently represented by a binary string, that differ from
the bases in the corresponding positions on the other chromosome in a
homologous pair. Information about the order of bases in a genome is readily
inferred using short reads provided by high-throughput DNA sequencing
technologies. In this paper, the recovery of the target pair of haplotype
sequences using short reads is rephrased as a joint source-channel coding
problem. Two messages, representing haplotypes and chromosome memberships of
reads, are encoded and transmitted over a channel with erasures and errors,
where the channel model reflects salient features of high-throughput
sequencing. The focus of this paper is on the required number of reads for
reliable haplotype reconstruction, and both the necessary and sufficient
conditions are presented with order-wise optimal bounds.
Comment: 30 pages, 5 figures, 1 table, journa
Viral population estimation using pyrosequencing
The diversity of virus populations within single infected hosts presents a
major difficulty for the natural immune response as well as for vaccine design
and antiviral drug therapy. Recently developed pyrophosphate based sequencing
technologies (pyrosequencing) can be used for quantifying this diversity by
ultra-deep sequencing of virus samples. We present computational methods for
the analysis of such sequence data and apply these techniques to pyrosequencing
data obtained from HIV populations within patients harboring drug resistant
virus strains. Our main result is the estimation of the population structure of
the sample from the pyrosequencing reads. This inference is based on a
statistical approach to error correction, followed by a combinatorial algorithm
for constructing a minimal set of haplotypes that explain the data. Using this
set of explaining haplotypes, we apply a statistical model to infer the
frequencies of the haplotypes in the population via an EM algorithm. We
demonstrate that pyrosequencing reads allow for effective population
reconstruction by extensive simulations and by comparison to 165 sequences
obtained directly from clonal sequencing of four independent, diverse HIV
populations. Thus, pyrosequencing can be used for cost-effective estimation of
the structure of virus populations, promising new insights into viral
evolutionary dynamics and disease control strategies.
Comment: 23 pages, 13 figures
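The frequency-estimation step can be sketched with a generic EM iteration in Python. This is a simplified illustration, not the statistical model from the paper: it assumes each read is either compatible or incompatible with each candidate haplotype (a 0/1 matrix `compat`, one row per read) and that reads are sampled in proportion to haplotype frequency. The function name `em_frequencies` and its parameters are hypothetical.

```python
def em_frequencies(compat, n_iter=200):
    """Estimate haplotype frequencies from a read/haplotype compatibility matrix.

    compat[i][j] is 1 if read i could have come from haplotype j, else 0.
    """
    n_haps = len(compat[0])
    p = [1.0 / n_haps] * n_haps  # start from uniform frequencies
    for _ in range(n_iter):
        counts = [0.0] * n_haps
        for row in compat:
            # E-step: split each read among compatible haplotypes,
            # weighted by the current frequency estimates.
            weights = [p[j] * row[j] for j in range(n_haps)]
            total = sum(weights)
            if total == 0:
                continue  # read explained by no candidate haplotype
            for j in range(n_haps):
                counts[j] += weights[j] / total
        assigned = sum(counts)
        if assigned == 0:
            raise ValueError("no read is compatible with any haplotype")
        # M-step: new frequencies are the normalized expected read counts.
        p = [c / assigned for c in counts]
    return p
```

When every read is compatible with exactly one haplotype, the estimate reduces to the observed read proportions; ambiguous reads are shared out according to the current frequencies.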
Read-based Phasing of Related Individuals
Motivation: Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information—reads and pedigree—has the potential to deliver results better than each individually. Results: We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual. Availability and Implementation: https://bitbucket.org/whatshap/whatshap Contact: [email protected]
Recent advances in inferring viral diversity from high-throughput sequencing data
Rapidly evolving RNA viruses prevail within a host as a collection of closely related variants, referred to as viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have facilitated the assessment of the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging due to short, error-prone reads. In order to account for uncertainties originating from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus populations. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. Challenges posed by aligning reads, as well as the impact of reference biases on diversity estimates, are also discussed. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue being complemented by technological developments.
ISSN:0168-170