Hardness of Covering Alignment: Phase Transition in Post-Sequence Genomics

Author: Cairo Massimo
Mäkinen Veli
Rizzi Romeo
Tomescu Alexandru I.
Valenzuela Daniel
Publication date: 01/01/2018

Covering alignment problems arise from recent developments in genomics: so-called pan-genome graphs are replacing reference genomes, and advances in haplotyping enable the full content of diploid genomes to be used as the basis of sequence analysis. In this paper, we show that the computational complexity will change for natural extensions of alignments to pan-genome representations and to diploid genomes. More broadly, our approach can also be seen as a minimal extension of sequence alignment to labeled directed acyclic graphs (labeled DAGs). Namely, we show that finding a covering alignment of two labeled DAGs is NP-hard even on binary alphabets. A covering alignment asks for two paths R_1 (red) and G_1 (green) in DAG D_1 and two paths R_2 (red) and G_2 (green) in DAG D_2 that cover the nodes of the graphs and maximize the sum of the global alignment scores: as(sp(R_1), sp(R_2)) + as(sp(G_1), sp(G_2)), where sp(P) is the concatenation of labels on the path P. Pair-wise alignment of haplotype sequences forming a diploid chromosome can be converted to a two-path coverable labeled DAG, and then the covering alignment models the similarity of two diploids over arbitrary recombinations. We also give a reduction in the other direction, to show that such a recombination-oblivious diploid alignment is NP-hard on alphabets of size 3.

Peer reviewed
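
As a rough illustration of the objective described in this abstract, the sketch below evaluates as(sp(R_1), sp(R_2)) + as(sp(G_1), sp(G_2)) for paths that are already given. It does not search for the optimal paths (the problem the abstract shows to be NP-hard), and it does not check that each path follows edges of the DAG. The function names, the dictionary representation of node labels, and the unit scoring scheme (match +1, mismatch -1, gap -1) are assumptions made for the example, not taken from the paper.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap cost.

    The scoring values are illustrative assumptions.
    """
    # prev[j] holds the best score of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def spell(labels, path):
    """sp(P): concatenate the node labels along a path."""
    return "".join(labels[node] for node in path)


def covering_alignment_score(labels1, red1, green1, labels2, red2, green2):
    """Objective value for given red/green paths in two labeled DAGs.

    labels1 and labels2 map node -> label (hypothetical representation).
    Only node coverage is verified, not edge validity of the paths.
    """
    for labels, red, green in ((labels1, red1, green1), (labels2, red2, green2)):
        if set(red) | set(green) != set(labels):
            raise ValueError("the red and green paths must cover every node")
    red_score = global_alignment_score(spell(labels1, red1), spell(labels2, red2))
    green_score = global_alignment_score(spell(labels1, green1), spell(labels2, green2))
    return red_score + green_score


# Two tiny DAGs, each with a shared first node and two alternative second nodes.
d1_labels = {0: "A", 1: "C", 2: "G"}
d2_labels = {0: "A", 1: "C", 2: "T"}
example_score = covering_alignment_score(
    d1_labels, [0, 1], [0, 2], d2_labels, [0, 1], [0, 2]
)
```

Evaluating a fixed choice of paths takes polynomial time; the hardness lies in choosing the four paths jointly.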

arXiv.org e-Print Archive

Catalogo dei prodotti della ricerca

Helsingin yliopiston digitaalinen arkisto

Genomic resource development for a diploid mint: Mentha longifolia

Author: Hadadian Zahra
Publication venue: University of New Hampshire Scholars' Repository
Publication date: 01/01/2010

This research project aimed to develop genomic resources needed to enable construction of a genetic linkage map of the diploid mint species Mentha longifolia. Such a map would facilitate identification of plant genes involved in resistance to Verticillium fungal infection. For this purpose, a small genomic library was constructed from germplasm accession CMEN 585; 279 genomic inserts were sequenced and annotated; and 19 PCR primer pairs were designed and tested on two resistant and two susceptible accessions. The Cleaved Amplified Polymorphic Sequence (CAPS) method of molecular marker genotyping was found to detect little variation between crossing parents CMEN 585 (resistant) and CMEN 584 (susceptible). Comparative sequencing of PCR products from two European and two South African accessions revealed greater diversity between than within geographic locations. Future efforts should focus on assessing more sensitive genotyping methods, and developing a mapping population from a cross between European and South African accessions.

UNH Scholars' Repository

Learning Character Strings via Mastermind Queries, with a Case Study Involving mtDNA

Author: Goodrich Michael T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 13/04/2010

We study the degree to which a character string, Q, leaks details about itself any time it engages in comparison protocols with strings provided by a querier, Bob, even if those protocols are cryptographically guaranteed to produce no additional information other than the scores that assess the degree to which Q matches strings offered by Bob. We show that such scenarios allow Bob to play variants of the game of Mastermind with Q so as to learn the complete identity of Q. We show that there are a number of efficient implementations for Bob to employ in these Mastermind attacks, depending on knowledge he has about the structure of Q, which show how quickly he can determine Q. Indeed, we show that Bob can discover Q using a number of rounds of test comparisons that is much smaller than the length of Q, under reasonable assumptions regarding the types of scores that are returned by the cryptographic protocols and whether he can use knowledge about the distribution that Q comes from. We also provide the results of a case study we performed on a database of mitochondrial DNA, showing the vulnerability of existing real-world DNA data to the Mastermind attack.

Comment: Full version of related paper appearing in IEEE Symposium on Security and Privacy 2009, "The Mastermind Attack on Genomic Data." This version corrects the proofs of what are now Theorems 2 and 4.
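
To make the leak described in this abstract concrete, here is a minimal sketch of a naive Mastermind-style attack. It assumes the comparison protocol returns only the number of positions at which a guess matches Q, and that Q is drawn from a known alphabet. This is a simple baseline that uses more queries than the length of Q, not one of the efficient algorithms the abstract refers to. The names `make_oracle` and `naive_mastermind_attack` are hypothetical.

```python
def make_oracle(secret):
    """Return a scoring function that reveals only the number of matching positions."""
    def score(guess):
        return sum(x == y for x, y in zip(secret, guess))
    return score


def naive_mastermind_attack(score, length, alphabet="ACGT"):
    """Recover the hidden string position by position from match counts alone.

    Assumes every character of the hidden string belongs to `alphabet`.
    Returns the recovered string and the number of queries used.
    """
    base_char = alphabet[0]
    baseline = [base_char] * length
    base_score = score("".join(baseline))
    queries = 1
    recovered = []
    for i in range(length):
        found = base_char
        for candidate in alphabet[1:]:
            probe = baseline.copy()
            probe[i] = candidate
            queries += 1
            probe_score = score("".join(probe))
            if probe_score > base_score:
                # The substituted letter matches the hidden string here.
                found = candidate
                break
            if probe_score < base_score:
                # The baseline letter was already correct at this position.
                found = base_char
                break
        recovered.append(found)
    return "".join(recovered), queries


hidden = "GATTACA"
guess, used = naive_mastermind_attack(make_oracle(hidden), len(hidden))
```

Even this baseline recovers the whole string from scores alone, using at most 1 + (|alphabet| - 1) * length queries; the abstract's claim is that far fewer rounds suffice under suitable assumptions.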

arXiv.org e-Print Archive

Crossref

Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples

Author: Li Heng
Publication venue: 'Oxford University Press (OUP)'
Publication date: 22/07/2015

Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear despite great efforts in the evaluation of variant calling methods. Results: We made ten SNP and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significant compromise on the sensitivity. Availability: BWA-MEM alignment: http://bit.ly/1g8XqRt; Scripts: https://github.com/lh3/varcmp; Additional data: https://figshare.com/articles/Towards_better_understanding_of_artifacts_in_variating_calling_from_high_coverage_samples/981073

Comment: Published version

arXiv.org e-Print Archive

CiteSeerX

Next Generation Cluster Editing

Author: Bellitto Thomas
Klau Gunnar W.
Marschall Tobias
Schönhuth Alexander
Publication date: 01/01/2013

This work aims to improve the quality of structural variant prediction from the mapped reads of a sequenced genome. We suggest a new model based on cluster editing in weighted graphs and introduce a new heuristic algorithm that solves this problem quickly and with a good approximation on the huge graphs that arise from biological datasets.

arXiv.org e-Print Archive

CWI's Institutional Repository

Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery

Author: Alkan Can
Dao Phuong
Eichler Evan E.
Hach Faraz
Hajirasouliha Iman
Hormozdiari Fereydoun
Sahinalp S. Cenk
Yorukoglu Deniz
Publication venue: Oxford University Press
Publication date: 01/01/2010

Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state-of-the-art algorithms (such as the original VariationHunter, BreakDancer and MoDIL) when tested on the genome of the Yoruba African individual (NA18507).

CiteSeerX

PubMed Central

Minimum error correction-based haplotype assembly: considerations for long read data

Author: de Ridder Dick
Kahaei Mohammad Hossein
Majidian Sina
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2020

The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.

Comment: 17 pages, 6 figures
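
For readers unfamiliar with the MEC metric mentioned in this abstract, the sketch below computes it for a candidate haplotype pair under the commonly used definition: each read is assigned to whichever haplotype it disagrees with least, and the MEC is the total number of disagreements. The encoding (strings over "0" and "1" with "-" for SNP sites a read does not cover) and the function names are assumptions for illustration, not the paper's implementation.

```python
def mismatches(read, haplotype):
    """Count covered SNP sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, haplotype) if r != "-" and r != h)


def mec_score(reads, hap1, hap2):
    """Minimum Error Correction score of a haplotype pair for a set of reads.

    Each read contributes its mismatch count against the closer haplotype.
    """
    return sum(min(mismatches(r, hap1), mismatches(r, hap2)) for r in reads)


# Toy example: four reads fit one of the haplotypes exactly, and the
# last read carries one sequencing error relative to the first haplotype.
toy_reads = ["0101", "01-1", "1010", "10-0", "0111"]
toy_mec = mec_score(toy_reads, "0101", "1010")
```

Comparing `mec_score` for the true haplotypes against an alternative pair is one way to observe the effect the abstract reports, where an incorrect pair can score lower when reads are error-prone.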

arXiv.org e-Print Archive

Directory of Open Access Journals

Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem

Author: Kingsford Carl
Ma Cong
Qiu Yutong
Xie Han
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)
Publication date: 01/01/2019

Transcriptomic structural variants (TSVs), large-scale transcriptome sequence changes due to structural variation, are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome rearrangements to maximize the number of reads that are concordant with at least one rearrangement. This directly models the situation of a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide a 1/4-approximation algorithm for k=1 and a 3/4-approximation algorithm for the diploid case (k=2) assuming an oracle for k=1. Combining these, we obtain a 3/16-approximation algorithm for MCAP when k=2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the graph structures that require k>1 to satisfy all edges and show such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance compared to the state-of-the-art TSV-calling tool, SQUID.

Dagstuhl Research Online Publication Server