737 research outputs found
NGS Based Haplotype Assembly Using Matrix Completion
We apply matrix completion methods for haplotype assembly from NGS reads to
develop the new HapSVT, HapNuc, and HapOPT algorithms. This is performed by
applying a mathematical model to convert the reads to an incomplete matrix and
estimating unknown components. This process is followed by quantizing and
decoding the completed matrix in order to estimate haplotypes. These algorithms
are compared to state-of-the-art algorithms using simulated data as well as
real fosmid data. It is shown that the SNP missing rate and the haplotype
block length of the proposed HapOPT are better than those of HapCUT2 with
comparable accuracy in terms of reconstruction rate and switch error rate. A
program implementing the proposed algorithms in MATLAB is freely available at
https://github.com/smajidian/HapMC
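As a rough illustration of the pipeline described in the abstract above (reads converted to an incomplete matrix, unknown entries estimated, then quantizing and decoding), the following Python sketch may help. It is not HapSVT, HapNuc, or HapOPT: it swaps in a plain rank-one alternating least squares fit. It assumes a diploid sample, with reads encoded as +1/-1 at heterozygous SNPs and 0 where a read does not cover a SNP. The function names and the toy read matrix are hypothetical.

```python
def complete_rank_one(reads, n_iters=20):
    """Fit reads[i][j] ~ u[i] * v[j] on the observed (non-zero) entries."""
    m, n = len(reads), len(reads[0])
    # Start v[j] from the first observed entry in column j (1.0 if the column is empty).
    v = [next((float(row[j]) for row in reads if row[j] != 0), 1.0) for j in range(n)]
    u = [0.0] * m
    for _ in range(n_iters):
        # Least-squares update of each read factor with v held fixed.
        for i in range(m):
            cols = [j for j in range(n) if reads[i][j] != 0]
            den = sum(v[j] ** 2 for j in cols)
            if den > 0:
                u[i] = sum(reads[i][j] * v[j] for j in cols) / den
        # Least-squares update of each SNP factor with u held fixed.
        for j in range(n):
            rows = [i for i in range(m) if reads[i][j] != 0]
            den = sum(u[i] ** 2 for i in rows)
            if den > 0:
                v[j] = sum(reads[i][j] * u[i] for i in rows) / den
    return u, v


def sign(x):
    """Quantize a real value to +1 or -1."""
    return 1 if x >= 0 else -1


def assemble(reads):
    """Complete the read matrix, quantize it, and decode one haplotype."""
    u, v = complete_rank_one(reads)
    completed = [[sign(ui * vj) for vj in v] for ui in u]  # quantized completion
    haplotype = [sign(vj) for vj in v]  # the other haplotype is its negation
    return haplotype, completed


# Toy input (hypothetical): 4 error-free reads over 4 SNPs, 0 = not covered.
reads = [
    [1, -1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, -1, 1],
    [-1, 1, 1, 0],
]
haplotype, completed = assemble(reads)
print(haplotype)
```

The haplotype is recovered only up to a global sign, since the two homologous copies are complementary under this encoding. With sequencing errors or weakly connected reads, an alternating fit like this one can stall, which is one reason more careful completion methods are of interest.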
Haplotype Assembly: An Information Theoretic View
This paper studies the haplotype assembly problem from an information
theoretic perspective. A haplotype is a sequence of nucleotide bases on a
chromosome, often conveniently represented by a binary string, that differ from
the bases in the corresponding positions on the other chromosome in a
homologous pair. Information about the order of bases in a genome is readily
inferred using short reads provided by high-throughput DNA sequencing
technologies. In this paper, the recovery of the target pair of haplotype
sequences using short reads is rephrased as a joint source-channel coding
problem. Two messages, representing haplotypes and chromosome memberships of
reads, are encoded and transmitted over a channel with erasures and errors,
where the channel model reflects salient features of high-throughput
sequencing. The focus of this paper is on the required number of reads for
reliable haplotype reconstruction, and both the necessary and sufficient
conditions are presented with order-wise optimal bounds.
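To make the channel picture above more concrete, here is a small Python simulation sketch. The abstract does not spell out the channel, so the details are assumptions: each read picks a chromosome membership at random, each SNP is independently erased (not covered) with probability 1 - coverage_prob, and each observed allele is independently flipped with probability error_rate. All names are hypothetical.

```python
import random


def sequencing_channel(haplotype, n_reads, coverage_prob, error_rate, rng):
    """Pass a +1/-1 haplotype through an assumed erasure-and-error channel.

    Returns (memberships, reads); erased positions are stored as None.
    """
    memberships, reads = [], []
    for _ in range(n_reads):
        c = rng.choice((1, -1))  # chromosome membership of this read
        read = []
        for allele in haplotype:
            if rng.random() >= coverage_prob:
                read.append(None)  # erasure: the read does not cover this SNP
                continue
            observed = c * allele  # the second chromosome carries the complement
            if rng.random() < error_rate:
                observed = -observed  # sequencing error flips the allele
            read.append(observed)
        memberships.append(c)
        reads.append(read)
    return memberships, reads


rng = random.Random(0)
true_haplotype = [1, -1, 1, 1, -1]
memberships, reads = sequencing_channel(true_haplotype, 6, 0.5, 0.0, rng)
```

In this framing the decoder sees only the reads and must recover both messages, the haplotype and the memberships. Varying n_reads in such a simulation is one informal way to explore how many reads reliable reconstruction needs.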
ParaHaplo 3.0: A program package for imputation and a haplotype-based whole-genome association study using hybrid parallel computing
Background: The use of missing-genotype imputation and haplotype reconstruction is valuable in genome-wide association studies (GWASs). By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results: We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan, and Han Chinese in Beijing, China, samples in the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions: ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/
PRIMAL: Fast and Accurate Pedigree-based Imputation from Sequence Data in a Founder Population
Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.
Complete haplotype phasing of the MHC and KIR loci with targeted HaploSeq
Background: The MHC and KIR loci are clinically relevant regions of the genome. Typing the sequence of these loci has a wide range of applications including organ transplantation, drug discovery, pharmacogenomics and furthering fundamental research in immune genetics. Rapid advances in biochemical and next-generation sequencing (NGS) technologies have enabled several strategies for precise genotyping and phasing of candidate HLA alleles. Nonetheless, as typing of candidate HLA alleles alone reveals limited aspects of the genetics of the MHC region, it is insufficient to fully serve the aforementioned applications. For this reason, we believe phasing the entire MHC and KIR loci onto single locus-spanning haplotypes can be a critical improvement for better understanding transplantation biology. Results: Generating long-range (>1 Mb) phase information is traditionally very challenging. As proximity-ligation-based methods of DNA sequencing preserve chromosome-span phase information, we have applied this principle to generate full-length phasing of the MHC and KIR loci in human samples. We accurately (~99%) reconstruct the complete haplotypes for over 90% of sequence variants (coding and non-coding) within these two loci, which collectively span 4 megabases. Conclusions: By haplotyping a majority of coding and non-coding alleles at the MHC and KIR loci in a single assay, this method has the potential to assist transplantation matching and facilitate investigation of the genetic basis of human immunity and disease.
A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction
Reconstructing components of a genomic mixture from data obtained by means of
DNA sequencing is a challenging problem encountered in a variety of
applications including single individual haplotyping and studies of viral
communities. High-throughput DNA sequencing platforms oversample mixture
components to provide massive amounts of reads whose relative positions can be
determined by mapping the reads to a known reference genome; assembly of the
components, however, requires discovery of the reads' origin -- an NP-hard
problem that the existing methods struggle to solve with the required level of
accuracy. In this paper, we present a learning framework based on a graph
auto-encoder designed to exploit structural properties of sequencing data. The
algorithm is a neural network which essentially trains to ignore sequencing
errors and infers the posterior probabilities of the origin of sequencing
reads. Mixture components are then reconstructed by finding consensus of the
reads determined to originate from the same genomic component. Results on
realistic synthetic as well as experimental data demonstrate that the proposed
framework reliably assembles haplotypes and reconstructs viral communities,
often significantly outperforming state-of-the-art techniques.
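The final step mentioned above, finding a consensus of the reads attributed to the same genomic component, can be sketched in Python as a per-position majority vote. This is only an illustration under assumptions: the read-to-component assignments are taken as given (in the paper they come from the graph auto-encoder, which is not reproduced here), reads are already aligned to equal length with "-" marking uncovered positions, and the function name and toy reads are hypothetical.

```python
from collections import Counter


def consensus(reads, assignments, n_components, gap="-"):
    """Per-position majority vote over the reads assigned to each component."""
    length = len(reads[0])
    result = []
    for k in range(n_components):
        members = [r for r, a in zip(reads, assignments) if a == k]
        seq = []
        for pos in range(length):
            # Count only the bases actually observed at this position.
            counts = Counter(r[pos] for r in members if r[pos] != gap)
            seq.append(counts.most_common(1)[0][0] if counts else gap)
        result.append("".join(seq))
    return result


# Toy aligned reads (hypothetical); the fourth read has an error in its last base.
reads = ["AC-T", "ACG-", "-CGT", "A-GA", "TG-A", "TGC-", "-GCA"]
assignments = [0, 0, 0, 0, 1, 1, 1]
components = consensus(reads, assignments, n_components=2)
print(components)  # ['ACGT', 'TGCA']
```

Majority voting corrects isolated sequencing errors as long as each position is covered by enough correctly assigned reads. Ties are broken arbitrarily in this sketch.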
Characterizing the admixed African ancestry of African Americans
Genome-wide SNP analyses reveal the admixed African genetic ancestry of African Americans
Variation rs2235503 C > A Within the Promoter of MSLN Affects Transcriptional Rate of Mesothelin and Plasmatic Levels of the Soluble Mesothelin-Related Peptide.
Soluble mesothelin-related peptide (SMRP) is a promising biomarker for malignant pleural mesothelioma (MPM), but several confounding factors can reduce the accuracy of SMRP-based tests. The identification of these confounders could improve the diagnostic performance of SMRP. In this study, we evaluated the sequence of 1,000 base pairs encompassing the minimal promoter region of the MSLN gene to identify expression quantitative trait loci (eQTL) that can affect SMRP. We assessed the association between four MSLN promoter variants and SMRP levels in a cohort of 72 MPM and 677 non-MPM subjects, and we carried out in vitro assays to investigate their functional role. Our results show that rs2235503 is an eQTL for MSLN associated with increased levels of SMRP in non-MPM subjects. Furthermore, we show that this polymorphic site affects the accuracy of SMRP, highlighting the importance of evaluating the individual's genetic background and giving novel insights to refine SMRP specificity as a diagnostic biomarker.