Subset Wavelet Trees
Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0], X[1], …, X[n-1] of n subsets of Σ. Since their introduction in the mid-1970s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT. Peer reviewed.
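To make the two queries concrete, here is a minimal baseline sketch in Python. It is not a subset wavelet tree: it only stores one sorted position list per symbol, which answers the same queries without any of the space savings the abstract refers to. The class name NaiveSubsetIndex and the indexing convention are assumptions made for illustration: subset_rank(i, c) counts the subsets among X[0..i-1] that contain c, and subset_select(i, c) takes a 1-based i and returns a 0-based index.

```python
from bisect import bisect_left
from collections import defaultdict


class NaiveSubsetIndex:
    """Baseline subset-rank/select using one sorted position list per symbol.

    Illustrative only; this is not the SubsetWT data structure.
    """

    def __init__(self, subsets):
        self.n = len(subsets)
        self.positions = defaultdict(list)
        # Indices are appended in increasing order, so each list stays sorted.
        for idx, subset in enumerate(subsets):
            for symbol in set(subset):
                self.positions[symbol].append(idx)

    def subset_rank(self, i, c):
        """Number of subsets among X[0..i-1] that contain symbol c."""
        return bisect_left(self.positions.get(c, []), i)

    def subset_select(self, i, c):
        """0-based index of the i-th (1-based) subset containing c, or None."""
        occurrences = self.positions.get(c, [])
        if 1 <= i <= len(occurrences):
            return occurrences[i - 1]
        return None


# A degenerate string over {A, C, G, T}; the third position is the empty subset.
X = [{"A", "C"}, {"G"}, set(), {"A"}, {"A", "T"}]
index = NaiveSubsetIndex(X)
print(index.subset_rank(4, "A"))    # subsets X[0..3] containing A
print(index.subset_select(3, "A"))  # index of the third subset containing A
```

Rank costs one binary search and select one array lookup here, but the position lists take space proportional to the total size of all subsets.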
Syotti : scalable bait design for DNA enrichment
Motivation: Bait enrichment is a protocol that is becoming increasingly ubiquitous as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') is designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. Results: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprising 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 17% of the data in 72 h. Peer reviewed.
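The abstract does not spell out the heuristic, so the following Python sketch only illustrates what a greedy approach to a bait cover problem can look like; it is not presented as Syotti's algorithm. The assumptions are all hypothetical: baits have a fixed length bait_len, a bait covers a window if their Hamming distance is at most max_mismatches, reverse complements are ignored, and matching is done by brute force rather than with succinct data structures.

```python
def hamming_within(a, b, max_mismatches):
    """True if equal-length strings a and b differ in at most max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def greedy_bait_cover(sequences, bait_len, max_mismatches):
    """Greedy sketch: whenever a position is uncovered, cut a new bait there
    and mark every window that the bait matches as covered."""
    covered = [[False] * len(seq) for seq in sequences]
    baits = []
    for si, seq in enumerate(sequences):
        for pos in range(len(seq)):
            if covered[si][pos]:
                continue
            # Shift the window left near the end so the bait fits in the sequence.
            start = min(pos, len(seq) - bait_len)
            if start < 0:
                continue  # sequence shorter than a bait: position stays uncovered
            bait = seq[start:start + bait_len]
            baits.append(bait)
            # Brute-force scan of all windows in all sequences (quadratic time).
            for sj, other in enumerate(sequences):
                for w in range(len(other) - bait_len + 1):
                    if hamming_within(bait, other[w:w + bait_len], max_mismatches):
                        for k in range(w, w + bait_len):
                            covered[sj][k] = True
    return baits, covered


baits, covered = greedy_bait_cover(["ACGTACGT", "ACGTACGA"], bait_len=4, max_mismatches=1)
print(baits)
print(all(all(row) for row in covered))
```

The inner scan is what makes this toy version impractical at scale; replacing it with an index over the sequences is the kind of step where succinct data structures would come in.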
HaploBlocks : Efficient Detection of Positive Selection in Large Population Genomic Datasets
Genomic regions under positive selection harbor variation linked, for example, to adaptation. Most tools for detecting positively selected variants have computational resource requirements that render them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form. Peer reviewed.
Themisto : a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes
Publisher Copyright: © 2023 The Author(s). Published by Oxford University Press. Motivation: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount. Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. Availability and implementation: Themisto is available and documented as a C++ package under the GPLv2 license. Peer reviewed.
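As a rough illustration of what a colored k-mer index and pseudoalignment involve, the Python sketch below maps each k-mer to the set of reference genomes ("colors") that contain it, then pseudoaligns a read by intersecting the color sets of its k-mers. This is a toy dictionary-based version under stated assumptions, not Themisto's index or its exact pseudoalignment rule: the function names are hypothetical, k-mers absent from every reference are skipped, and reverse complements are ignored.

```python
from collections import defaultdict


def build_colored_index(genomes, k):
    """Map every k-mer to the set of genome ids (colors) in which it occurs."""
    index = defaultdict(set)
    for color, seq in enumerate(genomes):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(color)
    return index


def pseudoalign(read, index, k):
    """Intersect the color sets of the read's k-mers that occur in the index.

    Returns the set of genome ids compatible with every indexed k-mer of the
    read, or an empty set if none of the read's k-mers is indexed.
    """
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if not colors:
            continue  # k-mer absent from every reference: skip it
        result = set(colors) if result is None else result & colors
    return result if result is not None else set()


genomes = ["ACGTAC", "ACGTTT", "GGGGGG"]
index = build_colored_index(genomes, k=3)
print(pseudoalign("ACGTA", index, k=3))  # compatible only with genome 0
print(pseudoalign("ACGT", index, k=3))   # compatible with genomes 0 and 1
```

A plain dictionary like this stores every k-mer and color set explicitly, which is exactly the cost a scalable index has to avoid on collections of hundreds of thousands of genomes.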
FGI-OSNMA: An Open Source Implementation of Galileo's Open Service Navigation Message Authentication
The European Global Navigation Satellite System (GNSS) Galileo is launching the Open Service Navigation Message Authentication (OSNMA) to enable navigation message authentication for all users, thereby increasing resilience against spoofing. The Finnish Geospatial Research Institute (FGI) has developed an open source implementation of Galileo's OSNMA, henceforth known as FGI-OSNMA. FGI-OSNMA is a Python library functioning as an OSNMA computation engine, with special emphasis on modularity, usability in real time, and integrability as a library in third-party applications. In particular, the library is being integrated into the software receiver FGI-GSRx and the GNSS situational awareness service GNSS-Finland. In addition, our software package contains useful tools, such as scripts to compute and visualize key performance indicators (KPIs) related to authentication, and a filter to remove unauthenticated messages from RINEX navigation and observables files. This paper presents an overview of the features of FGI-OSNMA, followed by a description of the architecture and the rationale behind the design. Finally, the paper concludes by demonstrating practical examples and real-world applications of the library.
An Experimental Performance Assessment of Galileo OSNMA
We present observed operational information and key performance indicators (KPIs) for Galileo Open Service Navigation Message Authentication (OSNMA), based on the analysis of a ten-day-long dataset collected in static open-sky conditions in southern Finland using our in-house-developed OSNMA implementation. In particular, we present a timeline with authentication-related events, such as authentication status and type, dropped navigation pages, and failed cyclic redundancy checks. We also report other KPIs, such as the number of simultaneously authenticated satellites over time, time to first authenticated fix, and percentage of authenticated fixes, and we evaluate the accuracy of the authenticated position solution. We also study how satellite visibility affects these figures. Finally, we analyze situations where it was not possible to reach an authenticated fix, and offer our findings on the observed patterns.