
    Cache-oblivious index for approximate string matching

    This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((n log^k n)/B) disk pages and finds all k-error matches with O((|P| + occ)/B + log^k n log log_B n) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω(|P| + occ + poly(log n)) I/Os. The second index reduces the space to O((n log n)/B) disk pages, and the I/O complexity is O((|P| + occ)/B + log^{k(k+1)} n log log n). © 2011 Elsevier B.V. All rights reserved.
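
    For intuition about the query being answered: a k-error match of P is a substring of T whose distance to P (edit distance, in the usual formulation) is at most k. The sketch below is a plain dynamic-programming scan in the spirit of Sellers' algorithm; it answers a single query in O(|P| * n) time with no index at all and is included only to make the problem concrete. The function name is ours, not the paper's; the indexes in the paper answer the same queries within the I/O bounds quoted above after preprocessing T.

```python
def k_error_match_ends(T: str, P: str, k: int) -> list[int]:
    """Return end positions j in T (0-indexed) such that some substring
    of T ending at j is within edit distance k of P.

    Classic O(|T| * |P|) dynamic programming; no index is built.
    """
    m = len(P)
    # prev[i] = edit distance between P[:i] and the best-matching suffix
    # of the text prefix processed so far.
    prev = list(range(m + 1))            # column for the empty text prefix
    ends = []
    for j, c in enumerate(T):
        curr = [0] * (m + 1)             # 0: a match may start anywhere in T
        for i in range(1, m + 1):
            cost = 0 if P[i - 1] == c else 1
            curr[i] = min(prev[i - 1] + cost,   # match or substitution
                          prev[i] + 1,          # extra character in the text
                          curr[i - 1] + 1)      # extra character in the pattern
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends
```

    For example, k_error_match_ends("abcdefg", "bxd", 1) returns [3]: the substring "bcd" ending at position 3 is within one edit of "bxd".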

    04091 Abstracts Collection -- Data Structures

    From 22.02.2004 to 27.02.2004, the Dagstuhl Seminar 04091 "Data Structures" was held at the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. The abstracts of the presentations given during the seminar are put together in this paper. The first section describes the seminar topics and goals in general.

    Anti-Persistence on Persistent Storage: History-Independent Sparse Tables and Dictionaries

    We present history-independent alternatives to a B-tree, the primary indexing data structure used in databases. A data structure is history independent (HI) if it is impossible to deduce any information by examining the bit representation of the data structure that is not already available through the API. We show how to build a history-independent cache-oblivious B-tree and a history-independent external-memory skip list. One of the main contributions is a data structure we build along the way: a history-independent packed-memory array (PMA). The PMA supports efficient range queries, one of the most important operations for answering database queries. Our HI PMA matches the asymptotic bounds of prior non-HI packed-memory arrays and sparse tables. Specifically, a PMA maintains a dynamic set of elements in sorted order in a linear-sized array. Inserts and deletes take an amortized O(log^2 N) element moves with high probability. Simple experiments with our implementation of HI PMAs corroborate our theoretical analysis. Comparisons to regular PMAs give preliminary indications that the practical cost of adding history independence is not too large. Our HI cache-oblivious B-tree bounds match those of prior non-HI cache-oblivious B-trees. Searches take O(log_B N) I/Os; inserts and deletes take O((log^2 N)/B + log_B N) amortized I/Os with high probability; and range queries returning k elements take O(log_B N + k/B) I/Os. Our HI external-memory skip list achieves optimal bounds with high probability, analogous to in-memory skip lists: O(log_B N) I/Os for point queries and amortized O(log_B N) I/Os for inserts/deletes. Range queries returning k elements run in O(log_B N + k/B) I/Os. In contrast, the best possible high-probability bound for inserting into the folklore B-skip list, which promotes elements with probability 1/B, is just Θ(log N) I/Os. This is no better than the bound one gets from running an in-memory skip list in external memory.
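
    The packed-memory array at the heart of these constructions keeps a dynamic set sorted in a linear-sized array interspersed with gaps, so that logically adjacent elements stay physically close and range queries scan contiguous memory. As a rough illustration of that idea only, here is a minimal sparse-table sketch in Python; it is not history independent, it rebalances by spreading the whole array rather than with the per-window density thresholds of a real PMA (so it does not achieve the O(log^2 N) amortized move bound), and the class and method names are ours.

```python
class ToySparseTable:
    """Keys kept in sorted order in an array with gaps (None slots).

    Simplified, NON-history-independent sketch: it shifts into the
    nearest gap on insert and spreads elements out evenly whenever the
    array gets too full, instead of using the per-window density
    thresholds that give a real packed-memory array its amortized
    O(log^2 N) element moves.
    """

    def __init__(self):
        self.slots = [None] * 4          # physical array with gaps
        self.n = 0                       # number of stored keys

    def _spread(self, capacity):
        """Rebuild: distribute the keys evenly across `capacity` slots."""
        keys = [x for x in self.slots if x is not None]
        self.slots = [None] * capacity
        for i, key in enumerate(keys):
            self.slots[int(i * capacity / len(keys))] = key

    def insert(self, key):
        if 2 * (self.n + 1) > len(self.slots):       # too dense:
            self._spread(2 * len(self.slots))        # double and spread
        # Target slot: just after the last stored key that is <= key.
        pos = 0
        for i, x in enumerate(self.slots):
            if x is not None and x <= key:
                pos = i + 1
        if pos < len(self.slots) and self.slots[pos] is None:
            self.slots[pos] = key                    # free slot, no moves
        else:
            gap = pos                                # nearest gap to the right
            while gap < len(self.slots) and self.slots[gap] is not None:
                gap += 1
            if gap < len(self.slots):
                for i in range(gap, pos, -1):        # shift right into the gap
                    self.slots[i] = self.slots[i - 1]
                self.slots[pos] = key
            else:
                gap = pos - 1                        # nearest gap to the left
                while self.slots[gap] is not None:
                    gap -= 1
                for i in range(gap, pos - 1):        # shift left into the gap
                    self.slots[i] = self.slots[i + 1]
                self.slots[pos - 1] = key
        self.n += 1

    def range_query(self, lo, hi):
        """Report stored keys in [lo, hi] (a toy scan of the whole array)."""
        return [x for x in self.slots if x is not None and lo <= x <= hi]
```

    A real PMA bounds the work per insert by rebalancing only a window around the insertion point whose size depends on how unevenly elements are packed; making that structure (and the cache-oblivious B-tree built on top of it) history independent is the contribution described in the abstract.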

    Optimal Hashing in External Memory

    Hash tables are a ubiquitous class of dictionary data structures. However, standard hash table implementations do not translate well into the external memory model, because they do not incorporate locality for insertions. Iacono and Pătraşcu established an update/query tradeoff curve for external-memory hash tables: a hash table that performs insertions in O(lambda/B) amortized I/Os requires Omega(log_lambda N) expected I/Os for queries, where N is the number of items that can be stored in the data structure, B is the size of a memory transfer, M is the size of memory, and lambda is a tuning parameter. They provide a complicated hashing data structure, which we call the IP hash table, that meets this curve for lambda that is Omega(log log M + log_M N). In this paper, we present a simpler external-memory hash table, the Bundle of Arrays Hash Table (BOA), that is optimal for a narrower range of lambda. The simplicity of BOAs allows them to be readily modified to achieve the following results:
    - A new external-memory data structure, the Bundle of Trees Hash Table (BOT), that matches the performance of the IP hash table, while retaining some of the simplicity of the BOAs.
    - The Cache-Oblivious Bundle of Trees Hash Table (COBOT), the first cache-oblivious hash table. This data structure matches the optimality of BOTs and IP hash tables over the same range of lambda.
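
    The "locality for insertions" that the abstract highlights can be illustrated with a toy that batches updates: insertions accumulate in a small in-memory buffer and are written out in bulk as sorted runs, so each flush is one sequential write instead of many random ones. The sketch below is only that generic idea (in the spirit of log-structured designs); it is not the paper's Bundle of Arrays Hash Table, and all names in it are ours.

```python
from bisect import bisect_left
from hashlib import blake2b

def _h(key) -> int:
    """64-bit hash used to order keys inside each run."""
    return int.from_bytes(blake2b(repr(key).encode(), digest_size=8).digest(), "big")

class ToyBufferedHashTable:
    """Toy dictionary that batches insertions for I/O locality.

    Inserts go to an in-memory buffer; when the buffer fills, its contents
    are sorted by hash and appended as one immutable run. Lookups check the
    buffer and then binary-search each run, newest first. Unlike the BOA/BOT
    structures in the paper, nothing here bounds how many runs a query may
    have to probe.
    """

    def __init__(self, buffer_size=1024):
        self.buffer = {}        # recent, unflushed key -> value
        self.runs = []          # each run: list of (hash, key, value), sorted
        self.buffer_size = buffer_size

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_size:
            run = sorted((_h(k), k, v) for k, v in self.buffer.items())
            self.runs.append(run)        # one sequential "write" per flush
            self.buffer.clear()

    def get(self, key, default=None):
        if key in self.buffer:
            return self.buffer[key]
        h = _h(key)
        for run in reversed(self.runs):  # newer runs shadow older ones
            i = bisect_left(run, (h,))   # first entry with hash >= h
            while i < len(run) and run[i][0] == h:
                if run[i][1] == key:
                    return run[i][2]
                i += 1
        return default
```

    Roughly speaking, the tuning parameter lambda in the abstract governs this insert/query trade-off; the structures in the paper sit on the optimal curve, whereas the toy above makes no such guarantee.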

    Non-Overlapping Indexing - Cache Obliviously

    The non-overlapping indexing problem is defined as follows: pre-process a given text T[1,n] of length n into a data structure such that whenever a pattern P[1,p] comes as an input, we can efficiently report the largest set of non-overlapping occurrences of P in T. The best known solution is by Cohen and Porat [ISAAC, 2009]. Their index size is O(n) words and their query time is the optimal O(p + nocc), where nocc is the output size. We study this problem in the cache-oblivious model and present a new data structure of size O(n log n) words. It can answer queries in optimal O(p/B + log_B n + nocc/B) I/Os, where B is the block size.
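
    Given the set of occurrences, the largest non-overlapping subset itself is easy to characterize: since every occurrence has the same length p, a left-to-right greedy scan that keeps an occurrence whenever it starts at or after the end of the last kept one is optimal (the classic interval-scheduling argument with equal-length intervals). The sketch below computes that answer by brute force, with no index; the point of the paper is to report the same set in O(p/B + log_B n + nocc/B) I/Os. The function name is ours.

```python
def max_nonoverlapping_occurrences(T: str, P: str) -> list[int]:
    """Starting positions of a largest set of non-overlapping occurrences
    of P in T, found by a brute-force scan plus the greedy rule
    'keep an occurrence if it starts after the last kept one ends'."""
    p = len(P)
    picked = []
    next_free = 0                       # first text position not yet covered
    i = T.find(P, 0)
    while i != -1:
        if i >= next_free:
            picked.append(i)
            next_free = i + p
        i = T.find(P, i + 1)
    return picked
```

    For example, max_nonoverlapping_occurrences("ababa", "aba") returns [0]: the occurrence at position 2 overlaps the one already kept.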

    Compressing dictionaries of strings

    The aim of this work is to develop a data structure capable of storing a set of strings in compressed form while providing the ability to access any string in the set and to search the set by prefix. The notion of a string will be formally defined in this work, but it is enough to think of a string as a stream of characters, that is, a variable-length piece of data. We will prove that the data structure devised in this work can search prefixes of the stored strings very efficiently, hence giving a performant solution to one of the most discussed problems of our age. In the discussion of our data structure, particular emphasis will be given to both space and time efficiency, and a tradeoff between the two will be constantly sought. To understand how important string-based data structures are, consider modern search engines and social networks: they must continuously store and process immense streams of data, which are mainly strings, while the results must be available within a few milliseconds so as not to try the patience of the user. Space efficiency is one of the main concerns in this kind of problem. In order to satisfy real-time latency bounds, the largest possible amount of data must be kept in the highest levels of the memory hierarchy. Moreover, data compression saves money because it reduces the amount of physical memory needed to store abstract data, which is particularly important since storage is a main source of expenditure in modern systems.
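
    One classic way to get both compression and prefix search over a static set of strings is front coding: sort the strings, store every few strings verbatim as "headers", and encode each remaining string as the length of the prefix it shares with its predecessor plus the differing suffix. The sketch below is a generic illustration of that textbook technique, not the specific data structure developed in this work; all names are ours.

```python
from bisect import bisect_left

class FrontCodedDictionary:
    """Toy compressed string dictionary supporting prefix search.

    Strings are stored in sorted order; every `bucket`-th string is kept
    verbatim as a header, and each other string is stored as
    (shared-prefix length with its predecessor, remaining suffix).
    Prefix queries binary-search the headers and decode only the buckets
    that can contain matches.
    """

    def __init__(self, strings, bucket=16):
        self.bucket = bucket
        self.headers = []               # every bucket-th string, verbatim
        self.deltas = []                # per bucket: list of (lcp, suffix)
        prev = None
        for i, s in enumerate(sorted(set(strings))):
            if i % bucket == 0:
                self.headers.append(s)
                self.deltas.append([])
            else:
                lcp = 0
                while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
                    lcp += 1
                self.deltas[-1].append((lcp, s[lcp:]))
            prev = s

    def _decode_bucket(self, b):
        """Yield the strings of bucket b in sorted order."""
        s = self.headers[b]
        yield s
        for lcp, suffix in self.deltas[b]:
            s = s[:lcp] + suffix
            yield s

    def prefix_search(self, prefix):
        """Return every stored string that starts with `prefix`."""
        out = []
        b = max(0, bisect_left(self.headers, prefix) - 1)
        while b < len(self.headers):
            h = self.headers[b]
            if h > prefix and not h.startswith(prefix):
                break                   # all later strings sort above the matches
            out.extend(s for s in self._decode_bucket(b) if s.startswith(prefix))
            b += 1
        return out
```

    Usage: FrontCodedDictionary(["car", "card", "care", "dog"]).prefix_search("car") returns ['car', 'card', 'care']. The bucket size trades decoding time against how much of the sorted order must be kept verbatim, a small instance of the space/time trade-off the work studies.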