NGS Based Haplotype Assembly Using Matrix Completion
We apply matrix completion methods to haplotype assembly from NGS reads,
developing three new algorithms: HapSVT, HapNuc, and HapOPT. A mathematical
model converts the reads into an incomplete matrix whose unknown entries are
estimated; the completed matrix is then quantized and decoded to recover the
haplotypes. These algorithms
are compared with state-of-the-art algorithms on simulated data as well as
real fosmid data. It is shown that the SNP missing rate and the haplotype
block length of the proposed HapOPT are better than those of HapCUT2 with
comparable accuracy in terms of reconstruction rate and switch error rate. A
program implementing the proposed algorithms in MATLAB is freely available at
https://github.com/smajidian/HapMC
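The pipeline described above (read matrix construction, completion, quantization and decoding) can be sketched in a few lines. This is a minimal illustrative sketch of the general matrix-completion idea, using iterative rank-1 SVD imputation rather than the authors' HapSVT/HapNuc/HapOPT routines; all names here are hypothetical:

```python
# Illustrative sketch, not the HapMC code: reads become rows of a partially
# observed {-1, +1} matrix; completing it and taking signs recovers the
# haplotype (up to the global complement ambiguity).
import numpy as np

def complete_and_phase(R, mask, iters=50):
    """R: read-by-SNP matrix with +/-1 where observed, 0 elsewhere.
    mask: boolean matrix marking observed entries.
    Returns the estimated haplotype as a +/-1 vector."""
    X = R.astype(float).copy()
    for _ in range(iters):
        # Rank-1 truncated SVD: for a diploid with error-free reads the
        # ideal matrix is rank one (each row is +/- the haplotype vector).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = s[0] * np.outer(U[:, 0], Vt[0])
        X[mask] = R[mask]          # keep observed entries fixed
    # Quantize: the sign pattern of the leading right singular vector
    # gives one haplotype; its elementwise negation gives the other.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return np.sign(Vt[0])

# Toy example: true haplotype over 4 SNPs, three noiseless partial reads.
h_true = np.array([1, -1, 1, 1])
R = np.array([[1, -1, 0, 0],
              [0, -1, 1, 0],
              [0, 0, 1, 1]])
mask = R != 0
h = complete_and_phase(R, mask)
```

Because the overlapping reads connect all four SNP sites, the rank-1 completion is unique up to scale here, so the recovered sign pattern matches the true haplotype or its complement.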
Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping
Interpretation of allelic copy measurements at polymorphic markers in cancer samples presents distinctive challenges and opportunities. Due to frequent gross chromosomal alterations occurring in cancer (aneuploidy), many genomic regions are present at homologous-allele imbalance. Within such regions, the unequal contribution of alleles at heterozygous markers allows for direct phasing of the haplotype derived from each individual parent. In addition, genome-wide estimates of homologue-specific copy-ratios (HSCRs) are important for interpretation of the cancer genome in terms of fixed integral copy-numbers. We describe HAPSEG, a probabilistic method to interpret bi-allelic marker data in cancer samples. HAPSEG operates by partitioning the genome into segments of distinct copy number and modeling the four distinct genotypes in each segment. We describe general methods for fitting these models to data which are suitable for both SNP microarrays and massively parallel sequencing data. In addition, we demonstrate a specially tailored error model for interpretation of systematic variations arising in microarray platforms. The ability to directly determine haplotypes from cancer samples represents an opportunity to expand reference panels of phased chromosomes, which may be of general interest in various population genetic applications. In addition, this property may be exploited to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible using unphased genotypes. Finally, we exploit the statistical dependency of phased genotypes to enable the fitting of more elaborate sample-level error-model parameters, allowing more accurate estimation of HSCRs in cancer samples.
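The phasing principle this abstract exploits can be illustrated with a toy sketch (hypothetical code, not the HAPSEG implementation): within a segment at homologous-allele imbalance, the allele with the larger read support at each heterozygous marker is assigned to the over-represented parental homologue:

```python
# Hypothetical sketch of imbalance-based phasing, not HAPSEG itself.
def phase_by_imbalance(markers):
    """markers: list of (allele_a, allele_b, count_a, count_b) tuples at
    heterozygous sites inside one imbalanced segment.
    Returns (over-represented homologue, under-represented homologue)."""
    high, low = [], []
    for a, b, count_a, count_b in markers:
        # The better-supported allele is placed on the amplified homologue.
        if count_a >= count_b:
            high.append(a); low.append(b)
        else:
            high.append(b); low.append(a)
    return "".join(high), "".join(low)

# Toy segment with a clear ~3:1 allelic imbalance at every marker.
segment = [("A", "G", 30, 10), ("C", "T", 12, 31), ("T", "G", 28, 9)]
hi, lo = phase_by_imbalance(segment)  # hi == "ATT", lo == "GCG"
```

In the real setting the counts are noisy, which is why HAPSEG models the four genotypes probabilistically per segment instead of making hard per-marker calls as this sketch does.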
Joint Haplotype Assembly and Genotype Calling via Sequential Monte Carlo Algorithm
Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of the data needed for the assembly of single-individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate erroneous data, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of existing techniques for haplotype assembly assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both 1000 Genomes Project experimental data and simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed.
Conclusions: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes. Funding: National Science Foundation CCF-1320273.
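The sequential extension idea can be illustrated with a much-simplified sketch: a deterministic beam search that grows partial haplotypes one SNP at a time and keeps only the best few "particles". The real ParticleHap scores extensions with genotype likelihoods; the simple read-agreement score and all names below are hypothetical:

```python
# Simplified beam-search sketch of sequential haplotype extension
# (illustrative only, not the ParticleHap algorithm).
def extend_haplotypes(reads, n_snps, beam=4):
    """reads: list of {snp_index: allele in {0, 1}} fragments.
    Returns one haplotype as a 0/1 list (its complement is the other)."""
    def score(hap):
        s = 0
        for r in reads:
            agree = sum(hap[i] == a for i, a in r.items() if i < len(hap))
            total = sum(1 for i in r if i < len(hap))
            s += max(agree, total - agree)  # a read may come from either copy
        return s

    particles = [[0]]  # pin the first SNP to break the complement symmetry
    for _ in range(1, n_snps):
        # Exhaustively extend every surviving particle with both alleles,
        # then keep only the highest-scoring partial haplotypes.
        candidates = [p + [b] for p in particles for b in (0, 1)]
        candidates.sort(key=score, reverse=True)
        particles = candidates[:beam]
    return particles[0]

# Toy fragments consistent with haplotype 0-1-0-0 (complement 1-0-1-1).
reads = [{0: 0, 1: 1}, {1: 1, 2: 0}, {1: 0, 2: 1, 3: 1}]
h = extend_haplotypes(reads, 4)  # h == [0, 1, 0, 0]
```

Keeping a small beam rather than a single greedy choice is what lets the sequential procedure recover from locally ambiguous extensions.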
Algorithmic approaches for the single individual haplotyping problem
Since its introduction in 2001, the Single Individual Haplotyping problem has received ever-increasing attention from the scientific community. In this paper we survey, in the form of an annotated bibliography, the developments in the study of the problem from its origins to the present day.
Haplotype estimation in polyploids using DNA sequence data
Polyploid organisms possess more than two copies of their core genome and therefore contain k>2 haplotypes for each set of ordered genomic variants. Polyploidy occurs often within the plant kingdom, notably in important crops such as potato (k=4) and wheat (k=6). Current sequencing technologies enable us to read the DNA and detect genomic variants, but cannot distinguish between the copies of the genome, each inherited from one of the parents. To detect inheritance patterns in populations, it is necessary to know the haplotypes, as alleles that are in linkage on the same chromosome tend to be inherited together. In this work, we develop mathematical optimisation algorithms to indirectly estimate haplotypes by looking into overlaps between the sequence reads of an individual, as well as into the expected inheritance of the alleles in a population. These algorithms deal with sequencing errors and random variations in the counts of reads observed from each haplotype. These methods are therefore of high importance for studying the genetics of polyploid crops.
Minimum error correction-based haplotype assembly: considerations for long read data
The single nucleotide polymorphism (SNP) is the most widely studied type of
genetic variation. A haplotype is defined as the sequence of alleles at SNP
sites on each haploid chromosome. Haplotype information is essential in
unravelling the genome-phenotype association. Haplotype assembly is a
well-known approach for reconstructing haplotypes, exploiting reads generated
by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often
used for reconstruction of haplotypes from reads. However, problems with the
MEC metric have been reported. Here, we investigate the MEC approach to
demonstrate that it may result in incorrectly reconstructed haplotypes for
devices that produce error-prone long reads. Specifically, we evaluate this
approach for devices developed by Illumina, Pacific BioSciences and Oxford
Nanopore Technologies. We show that imprecise haplotypes may be reconstructed
with a lower MEC than that of the exact haplotype. The performance of MEC is
explored for different coverage levels and error rates of data. Our simulation
results reveal that in order to avoid incorrect MEC-based haplotypes, a
coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.
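The MEC criterion discussed in this abstract is easy to state in code. The sketch below (illustrative only, not the paper's software) computes the MEC score of a candidate haplotype over all-heterozygous sites and shows how recurrent sequencing errors can give an incorrect phasing a lower MEC than the true one:

```python
# Illustrative MEC scorer; all names and the toy data are hypothetical.
def mec(hap, reads):
    """hap: list of 0/1 alleles at heterozygous sites; reads: list of
    {snp_index: allele} fragments. The second haplotype is the
    elementwise complement of `hap` (all sites heterozygous)."""
    cost = 0
    for r in reads:
        mismatch = sum(hap[i] != a for i, a in r.items())
        # Assign the read to whichever haplotype needs fewer corrections.
        cost += min(mismatch, len(r) - mismatch)
    return cost

true_hap = [0, 1, 0]
# Three reads; two carry the same sequencing error at SNP 1.
reads = [{0: 0, 1: 0, 2: 0},   # error at SNP 1
         {0: 0, 1: 0, 2: 0},   # the same error again
         {0: 0, 1: 1, 2: 0}]   # error-free
wrong_hap = [0, 0, 0]
# mec(true_hap, reads) == 2 but mec(wrong_hap, reads) == 1:
# the incorrect haplotype attains the lower MEC score.
```

This is the failure mode the paper investigates: once correlated errors outnumber correct observations at a site, the MEC optimum no longer coincides with the true haplotype, and higher coverage is needed to restore the correspondence.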
High performance computing for haplotyping: Models and platforms
The reconstruction of the haplotype pair for each chromosome is a hot topic in Bioinformatics and Genome Analysis. In Haplotype Assembly (HA), all heterozygous Single Nucleotide Polymorphisms (SNPs) have to be assigned to exactly one of the two chromosomes. In this work, we outline the state of the art of HA approaches and present an in-depth analysis of the computational performance of GenHap, a recent method based on Genetic Algorithms. GenHap was designed to tackle the computational complexity of the HA problem by means of a divide-et-impera strategy that effectively leverages multi-core architectures. In order to evaluate GenHap’s performance, we generated different instances of synthetic (yet realistic) data exploiting empirical error models of four different sequencing platforms (namely, Illumina NovaSeq, Roche/454, PacBio RS II and Oxford Nanopore Technologies MinION). Our results show that the processing time generally decreases as the read length increases, since longer reads yield a lower number of sub-problems to be distributed across multiple cores.
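The divide-et-impera layout described above can be sketched as follows. This is a hypothetical simplification: the per-block solver is a trivial per-column majority call standing in for GenHap's genetic algorithm, and worker threads stand in for its multi-core distribution:

```python
# Hypothetical divide-et-impera sketch, not the GenHap implementation.
from concurrent.futures import ThreadPoolExecutor

def solve_block(block):
    # Placeholder sub-solver: a per-column majority call stands in for the
    # genetic algorithm GenHap runs on each sub-instance.
    return [max(set(col), key=col.count) for col in block]

def solve_parallel(columns, block_size=2, workers=2):
    # Divide: split the SNP columns into fixed-size sub-problems.
    blocks = [columns[i:i + block_size]
              for i in range(0, len(columns), block_size)]
    # Conquer: solve sub-problems concurrently, then concatenate the parts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(solve_block, blocks))
    return [allele for part in parts for allele in part]

# Each column holds the allele calls of the reads covering that SNP.
columns = [[0, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 0]]
hap = solve_parallel(columns)  # hap == [0, 1, 1, 0]
```

The sketch also makes the paper's timing observation concrete: longer reads produce fewer, larger blocks, so fewer sub-problems have to be scheduled across the workers.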