
57 research outputs found

Minimum error correction-based haplotype assembly: considerations for long read data

Author: de Ridder Dick
Kahaei Mohammad Hossein
Majidian Sina
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2020

The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems. Comment: 17 pages, 6 figures

arXiv.org e-Print Archive

Directory of Open Access Journals
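
The MEC metric named in the abstract above can be illustrated with a minimal sketch. It assumes a diploid individual with all sites heterozygous, so the second haplotype is the bitwise complement of the first, and it encodes reads as strings over `0`, `1` and `-` (uncovered site). The function names and the toy reads are hypothetical and are not taken from any of the listed papers; the exhaustive search is only feasible for a handful of SNP sites.

```python
from itertools import product

GAP = "-"  # assumed marker for a SNP site the read does not cover


def distance(read, hap):
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != GAP and r != h)


def complement(hap):
    """Second haplotype under the all-heterozygous assumption."""
    return "".join("1" if c == "0" else "0" for c in hap)


def mec_score(reads, hap):
    """Minimum number of allele corrections needed so that every read
    matches either `hap` or its complement."""
    other = complement(hap)
    return sum(min(distance(r, hap), distance(r, other)) for r in reads)


def best_mec_haplotype(reads):
    """Exhaustive search over haplotypes (toy sizes only).

    The first allele is fixed to '0' because a haplotype and its
    complement always have the same MEC score.
    """
    n = len(reads[0])
    candidates = ("0" + "".join(bits) for bits in product("01", repeat=n - 1))
    return min(candidates, key=lambda h: mec_score(reads, h))


# Hypothetical reads over four SNP sites; two of them contain one error
# relative to the haplotype pair 0101 / 1010.
reads = ["010-", "-010", "0111", "10-0", "1110"]
true_hap = "0101"

if __name__ == "__main__":
    best = best_mec_haplotype(reads)
    print("MEC of true haplotype:", mec_score(reads, true_hap))
    print("MEC-optimal haplotype:", best, "with MEC", mec_score(reads, best))
```

Comparing `mec_score(reads, true_hap)` with the score of the MEC-optimal haplotype is one simple way to check, on simulated data, whether the lowest-MEC solution coincides with the true haplotype.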

On the Complexity of the Single Individual SNP Haplotyping Problem

Author: Cilibrasi Rudi
Kelk Steven
Tromp John
van Iersel Leo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

We present several new results pertaining to haplotyping. These results concern the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype fragments. We consider the complexity of the problems Minimum Error Correction (MEC) and Longest Haplotype Reconstruction (LHR) for different restrictions on the input data. Specifically, we look at the gapless case, where every row of the input corresponds to a gapless haplotype-fragment, and the 1-gap case, where at most one gap per fragment is allowed. We prove that MEC is APX-hard in the 1-gap case and still NP-hard in the gapless case. In addition, we question earlier claims that MEC is NP-hard even when the input matrix is restricted to being completely binary. Concerning LHR, we show that this problem is NP-hard and APX-hard in the 1-gap case (and thus also in the general case), but is polynomial time solvable in the gapless case. Comment: 26 pages. Related to the WABI2005 paper, "On the Complexity of Several Haplotyping Problems", but with more/different results. This paper has just been submitted to the IEEE/ACM Transactions on Computational Biology and Bioinformatics and we are awaiting a decision on acceptance. It differs from the mid-August version of this paper because here we prove that 1-gap LHR is APX-hard. (In the earlier version of the paper we could prove only that it was NP-hard.)

arXiv.org e-Print Archive

CiteSeerX

Maastricht University Research Portal

Repository TU/e

Crossref

CWI's Institutional Repository

Pure OAI Repository

International Migration, Integration and Social Cohesion online publications

Joint Haplotype Assembly and Genotype Calling via Sequential Monte Carlo Algorithm

Author: Ahn Soyeon
Vikalo Haris
Publication date: 01/07/2015

Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate data that is erroneous, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of the existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both the 1000 Genomes Project experimental data as well as simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed.
Conclusions: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes. National Science Foundation CCF-1320273. Electrical and Computer Engineering

Springer - Publisher Connector

PubMed Central

Texas ScholarWorks

Haplotype Assembly: An Information Theoretic View

Author: Si Hongbo
Vikalo Haris
Vishwanath Sriram
Publication date: 11/05/2014

This paper studies the haplotype assembly problem from an information theoretic perspective. A haplotype is a sequence of nucleotide bases on a chromosome, often conveniently represented by a binary string, that differ from the bases in the corresponding positions on the other chromosome in a homologous pair. Information about the order of bases in a genome is readily inferred using short reads provided by high-throughput DNA sequencing technologies. In this paper, the recovery of the target pair of haplotype sequences using short reads is rephrased as a joint source-channel coding problem. Two messages, representing haplotypes and chromosome memberships of reads, are encoded and transmitted over a channel with erasures and errors, where the channel model reflects salient features of high-throughput sequencing. The focus of this paper is on the required number of reads for reliable haplotype reconstruction, and both the necessary and sufficient conditions are presented with order-wise optimal bounds. Comment: 30 pages, 5 figures, 1 table, journal

arXiv.org e-Print Archive

Crossref

A model of higher accuracy for the individual haplotyping problem based on weighted SNP fragments and genotype with errors

Author: Adkins
Akey
Altshuler
Greenberg
J. Chen
J. Wang
Kang
Lander
Levy
M. Xie
Venter
Xie
Zhao
Publication venue: Oxford University Press

Motivation: In genetic studies of complex diseases, haplotypes provide more information than genotypes. However, haplotyping is much more difficult than genotyping using biological techniques. Therefore effective computational techniques have been in demand. The individual haplotyping problem is the computational problem of inducing a pair of haplotypes from an individual's aligned SNP fragments. Based on various optimal criteria and including different extra information, many models for the problem have been proposed. Higher accuracy of the models has been an important issue in the study of haplotype reconstruction.

Crossref

PubMed Central

Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques

Author: Duitama Jorge
Hoehe Margret R.
Huebsch Thomas
McEwen Gayle K.
Palczewski Stefanie
Schulz Sabrina
Suk Eun-Kyung
Verstrepen Kevin
Publication venue: Oxford University Press
Publication date: 01/01/2011

Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.

Algorithms, haplotypes and phylogenetic networks

Author: Iersel L.J.J. (Leo) van
Publication date: 01/01/2009

Preface. Before I started my PhD in computational biology in 2005, I had never even heard of this term. Now, almost four years later, I think I have some idea of what is meant by it. One of the goals of my PhD was to explore different topics within computational biology and to see where the biggest opportunities for discrete/combinatorial mathematicians could be found. Roughly speaking, the first two years of my PhD I focussed mainly on problems related to haplotyping and genome rearrangements and the last two years on phylogenetic networks. I must say I really enjoyed learning so much about both mathematics and biology. It was especially amazing to learn how exact, theoretical mathematics can be used to solve complex, practical problems from biology. The topics I studied clearly show how extremely useful mathematics can be for biology. But I also learned that there are many more interesting topics in computational biology than the ones that I could study so far. The number of opportunities for discrete mathematicians is absolutely immense. I did not include my studies on genome rearrangements in this thesis, because my most interesting results [Hur07a; Hur07b] are not directly related to biology. This work is nevertheless interesting to mathematicians and I recommend them to read it. I can certainly conclude that also in this field there is a vast number of opportunities for mathematicians and that the topic genome rearrangements provides numerous beautiful mathematical problems. I could never have written this thesis without a great amount of help from many different people. I want to thank my supervisors Leen Stougie and Judith Keijsper for guiding me, for helping me, for correcting my mistakes, for supplying ideas and for the enjoyable time I had while working with them. I also want to thank the Dutch BSIK/BRICKS project for funding my research and Gerhard Woeginger for giving me the opportunity to work in his group and being my second promotor. 
I want to thank Jens Stoye and Julia Zakotnik for the work we did together and for the great time I had in Bielefeld. I want to thank Ferry Hagen and Teun Boekhout for helping me to make my work relevant for "real" biology. I also want to thank John Tromp, Rudi Cilibrasi, Cor Hurkens and all others I worked with during my PhD. I want to thank Erik de Vink and Mike Steel for reading and commenting on my thesis. I want to thank my colleagues from the Combinatorial Optimisation group at the Technische Universiteit Eindhoven for the pleasant working conditions and the fun things we did besides work. I especially want to thank Matthias Mnich, not only a great colleague but also a good friend, for all his ideas, his humour and our good and fruitful cooperation. I also want to thank Steven Kelk. I must say that I was very lucky to work with Steven during my PhD. He introduced me to problems, had an enormous amount of ideas, found the critical mistakes in my proofs and made my PhD a success both in terms of results and in terms of enjoying work. Finally, I want to thank Conno Hendriksen and Bas Heideveld for assisting me during my PhD defence and I want to thank them and all my other friends and family for helping me with everything in my life but research.

CWI's Institutional Repository

Pure OAI Repository

Haplotype estimation in polyploids using DNA sequence data

Author: Motazedi Ehsan
Publication venue: Wageningen University
Publication date: 01/01/2019

Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, among others in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage over the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.

Wageningen University & Research Publications

Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm

Author: A Panconesi
D Aguiar
D He
F Deng
F Geraci
G Lancia
GR Abecasis
H Matsumoto
Haris Vikalo
J Duitama
JH Kim
K-C Liang
KC Liang
LM Li
M Xie
MR Hoehe
MS Arulampalam
MS Bayzid
R Cilibrasi
R Lippert
R Nielsen
R Schwartz
RA Gibbs
RS Wang
S Levy
Soyeon Ahn
V Bansal
Y Wang
YY Zhao
Z Chen
ZZ Chen
Publication venue: 'Springer Science and Business Media LLC'

Crossref

HMEC: A Heuristic Algorithm for Individual Haplotyping with Minimum Error Correction

Publication venue: 'Hindawi Limited'

Crossref