5 research outputs found

    Round-Hashing for Data Storage: Distributed Servers and External-Memory Tables

    This paper proposes round-hashing, which is suitable for data storage on distributed servers and for implementing external-memory tables in which each lookup retrieves at most one block of external memory, using a stash. For data storage, round-hashing resembles consistent hashing in that it avoids a full rehashing of the keys when new servers are added. Experiments show that requests are served ten or more times faster than with the state of the art. In distributed data storage, this guarantees better throughput for serving requests and, moreover, greatly reduces the time needed to decide which data should move to new servers, since rescanning the data is much faster.
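
    To make the consistent-hashing comparison concrete, here is a minimal sketch of the classic consistent-hashing ring in Python. It is not the round-hashing scheme from the paper; it only illustrates the property the abstract refers to, namely that adding a server moves only a small fraction of the keys. The server names, key set, and virtual-node count are arbitrary choices for the demo.

        # Minimal consistent-hashing ring (the classic construction, NOT the
        # paper's round-hashing scheme). Adding a server moves only a small
        # fraction of the keys instead of forcing a full rehash.
        import bisect
        import hashlib

        def _h(s: str) -> int:
            return int(hashlib.sha256(s.encode()).hexdigest(), 16)

        class ConsistentHashRing:
            def __init__(self, servers, vnodes=64):
                self._ring = []                    # sorted list of (hash point, server)
                for s in servers:
                    self.add_server(s, vnodes)

            def add_server(self, server, vnodes=64):
                for i in range(vnodes):            # several virtual nodes smooth the load
                    bisect.insort(self._ring, (_h(f"{server}#{i}"), server))

            def lookup(self, key: str) -> str:
                points = [p for p, _ in self._ring]          # rebuilt per call for brevity
                i = bisect.bisect_right(points, _h(key)) % len(self._ring)
                return self._ring[i][1]                      # first server clockwise from the key

        if __name__ == "__main__":
            ring = ConsistentHashRing([f"server{i}" for i in range(4)])
            keys = [f"key{i}" for i in range(10000)]
            before = {k: ring.lookup(k) for k in keys}
            ring.add_server("server4")                       # one new server joins
            moved = sum(before[k] != ring.lookup(k) for k in keys)
            print(f"{moved / len(keys):.1%} of keys moved")  # roughly 1/5 of the keys, not all of them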

    Iceberg Hashing: Optimizing Many Hash-Table Criteria at Once

    Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different trade-offs have been proposed. This paper introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to 1 - o(1) while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are Θ(log n) bits, the space guarantee that Iceberg hashing offers, namely that it uses at most log (|U| choose n) + O(n log log n) bits to store n items from a universe U, matches a lower bound by Demaine et al. that applies to any stable hash table. Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high-probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash table design.
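
    As a quick numeric reading of the space bound quoted above (not an implementation of Iceberg hashing), the sketch below evaluates the information-theoretic minimum log (|U| choose n) and the allowed O(n log log n) slack for one illustrative choice of n and |U|; the concrete values and the constant hidden in the O(·) term are assumptions made here purely for illustration.

        # Numeric illustration of the space bound: log2 C(|U|, n) is the
        # information-theoretic minimum number of bits needed to store an
        # n-element set from a universe U; the abstract states Iceberg hashing
        # stays within an additive O(n log log n) of it.
        import math

        n = 1_000_000          # number of stored keys (illustrative choice)
        U = 2 ** 64            # universe size, i.e. 64-bit keys (illustrative choice)

        # log2 of the binomial coefficient C(U, n), computed as a sum of logs.
        min_bits = sum(math.log2((U - i) / (i + 1)) for i in range(n))

        slack_bits = n * math.log2(math.log2(n))   # O(n log log n) term, constant taken as 1
        naive_bits = 64 * n                        # storing every 64-bit key verbatim

        print(f"information-theoretic minimum: {min_bits / n:6.1f} bits per key")
        print(f"allowed additive slack:        {slack_bits / n:6.1f} bits per key")
        print(f"naive 64-bit-per-key storage:  {naive_bits / n:6.1f} bits per key")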

    Indices and Applications in High-Throughput Sequencing

    Recent advances in sequencing technology make it possible to produce billions of base pairs per day in the form of reads of length 100 bp and longer, and current developments promise the personal $1,000 genome within a couple of years. The analysis of these unprecedented amounts of data demands efficient data structures and algorithms. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis we propose three substring indices, which we extend to be applicable to millions of sequences. We devise internal- and external-memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally, we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either losslessly or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless trade-off between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and comparable performance on various real-world datasets. Finally, we propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. To this end, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm, we compare it to recent approaches in real-world experiments over various string domains, e.g. natural language, DNA, or protein sequences.
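
    The toy sketch below (not RazerS, and far simpler than its filtration-based algorithm) only illustrates the two distance measures the thesis mentions, Hamming distance and edit distance, by scoring a short read against every window of a tiny made-up reference; the sequences are invented for the demo.

        # Toy read-vs-reference scoring with the two distances named in the
        # abstract: Hamming (mismatches only) and edit/Levenshtein (indels too).

        def hamming(read: str, window: str) -> int:
            assert len(read) == len(window)
            return sum(a != b for a, b in zip(read, window))

        def edit_distance(read: str, window: str) -> int:
            # Standard dynamic-programming edit distance, O(len(read) * len(window)).
            prev = list(range(len(window) + 1))
            for i, a in enumerate(read, 1):
                cur = [i]
                for j, b in enumerate(window, 1):
                    cur.append(min(prev[j] + 1,              # deletion from the read
                                   cur[j - 1] + 1,           # insertion into the read
                                   prev[j - 1] + (a != b)))  # match / mismatch
                prev = cur
            return prev[-1]

        if __name__ == "__main__":
            reference = "ACGTACGTTAGCCGATTACA"          # made-up reference sequence
            read = "TAGCCGTTTAC"                        # simulated read with one mismatch
            best = min(range(len(reference) - len(read) + 1),
                       key=lambda i: edit_distance(read, reference[i:i + len(read)]))
            window = reference[best:best + len(read)]
            print("best window at", best,
                  "| Hamming:", hamming(read, window),
                  "| edit:", edit_distance(read, window))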

    Optimality in External Memory Hashing

    Hash tables on external memory are commonly used for indexing in database management systems. In this paper we present an algorithm that, in an asymptotic sense, achieves the best possible I/O and space complexities. Let B denote the number of records that fit in a block, and let N denote the total number of records. Our hash table uses 1 + O(1/√B) I/Os, expected, to look up a record (whether or not it is present). Inserting, deleting, or changing a record that has just been looked up requires 1 + O(1/√B) I/Os, amortized expected, including the I/Os for reorganizing the hash table when the size of the database changes. The expected external space usage is 1 + O(1/√B) times the optimum of N/B blocks, and only O(1) blocks of internal memory are needed.
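
    To get a feel for how close the 1 + O(1/√B) bound is to a single disk access per operation, the snippet below evaluates the overhead term for a few block sizes; the hidden constant in the O(·) is taken to be 1, which is an assumption made here, since the abstract does not state it.

        # Overhead of the 1 + O(1/sqrt(B)) bound for a few block sizes
        # (constant in the O-term assumed to be 1 for illustration).
        import math

        for B in (64, 256, 1024, 4096):          # records per block
            overhead = 1 / math.sqrt(B)          # the O(1/sqrt(B)) term
            print(f"B = {B:4d}: expected I/Os per lookup ≈ 1 + {overhead:.3f}")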