Search CORE

1,163 research outputs found

Practical Target-Based Synchronization Strategies for Immutable Time-Series Data Tables

Author: Meares Bennett Andrew
Publication venue: Clemson University Libraries
Publication date: 01/12/2021
Field of study

As the Internet of Things and industrial monitoring of utilities grow, efficiently synchronizing immutable time-series data streams between databases becomes a pressing issue. Extracting data from critical production databases demands careful consideration of the stress imposed on the machines, so synchronization strategies are required to minimize the transfer of duplicate data and the load imposed on remote sources. Literature on the synchronization problem is generalized to arbitrary tables and does not consider the characteristics of time-series data streams, so research was required to investigate methods to quickly synchronize source and target time-series data tables. This thesis examines immutable time-series scenarios and synchronization strategies to answer the following question: given several scenarios, which target-based immutable time-series synchronization strategies best optimize run-time, bandwidth, and accuracy? The strategies explored in this research are implemented into the Meerschaum system, a project intended to leverage these time-series concepts for production deployments. As a practical demonstration, these strategies are used to continuously cache Clemson University’s utilities data

Clemson University: TigerPrints

Efficient Reconciliation of Genomic Datasets of High Similarity

Author: Belazzougui Djamal
Kucherov Gregory
Shibuya Yoshihiro
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Publication date: 01/01/2022
Field of study

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originated from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that involved data structures require space proportional to the difference of k-mer sets and are independent of the size of sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus Pneumoniae genomes)

Dagstuhl Research Online Publication Server

HAL-Ecole des Ponts ParisTech