A Robust Scheme for Multilevel Extendible Hashing
Dynamic hashing, while surpassing other access methods for uniformly distributed data, usually performs badly for non-uniformly distributed data. We propose a robust scheme for multi-level extendible hashing allowing efficient processing of skewed data as well as uniformly distributed data. In order to test our access method we implemented it and compared it to several existing hashing schemes. The results of the experimental evaluation demonstrate the superiority of our approach in both index size and performance.
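For orientation, the following is a minimal Python sketch of classic single-level extendible hashing, the baseline that the abstract's multilevel scheme builds on. It is not the authors' method. The class names, the default bucket capacity, and the use of Python's built-in `hash` are assumptions made for illustration.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # number of hash bits this bucket distinguishes
        self.items = {}


class ExtendibleHash:
    """Classic single-level extendible hashing (illustrative sketch only)."""

    def __init__(self, bucket_capacity=4):
        self.capacity = bucket_capacity
        self.global_depth = 1
        # The directory has 2**global_depth slots; several slots may share a bucket.
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # The low-order global_depth bits of the hash select the directory slot.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value):
        # Assumption: no more than bucket_capacity keys share an identical hash,
        # otherwise splitting could never separate them.
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory: the new upper half mirrors the lower half.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Entries whose newly significant hash bit is set move to the sibling.
        for k in [k for k in bucket.items if hash(k) & high_bit]:
            sibling.items[k] = bucket.items.pop(k)
        # Repoint the directory slots of the old bucket that have that bit set.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
```

With this sketch, five integer keys that are multiples of 1024 and a bucket capacity of 4 already force the directory to double until it distinguishes bit 10 of the hash. This is the kind of skew sensitivity the abstract refers to.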
Taking the Shortcut: Actively Incorporating the Virtual Memory Index of the OS to Hardware-Accelerate Database Indexing
Index structures often materialize one or multiple levels of explicit
indirections (aka pointers) to allow for a quick traversal to the data of
interest. Unfortunately, dereferencing a pointer to go from one level to the
other is costly since, in addition to following the address, it involves two
address translations from virtual memory to physical memory under the hood. In
the worst case, such an address translation is resolved by an index access
itself, namely by a lookup into the page table, a central hardware-accelerated
index structure of the OS. However, if the page table is constantly
queried, it raises the question whether we can actively incorporate it into our
database indexes and make it work for us. Precisely, instead of materializing
indirections in the form of pointers, we propose to express these indirections
directly in the page table wherever possible. By introducing such shortcuts, we
(a) effectively reduce the height of traversal during lookups and (b) exploit
the hardware-acceleration of lookups in the page table. In this work, we
analyze the strengths and considerations of this approach and showcase its
effectiveness using the real-world indexing scheme extendible hashing as a case study.
Node.js based Document Store for Web Crawling
WARC files are central to internet preservation projects. They contain the raw resources of web crawled data and can be used to create windows into the past of web pages at the time they were accessed. Yet there are few tools that manipulate WARC files beyond basic parsing. Our tool WARC-KIT gives users in the Node.js JavaScript environment a toolkit to interact with and manipulate WARC files.
Included with WARC-KIT is a WARC parsing tool known as WARCFilter that can be used as a standalone tool to parse, filter, and create new WARC files. WARCFilter can also create CDX index files on the WARC files, parse existing CDX files, or even generate webgraph datasets for graph analysis algorithms. Aside from WARCFilter, WARC-KIT includes a custom on-disk database system implemented with an underlying Linear Hash Table data structure. The database system is the first of its kind as a JavaScript-only on-disk document store. The main application of WARC-KIT is that it allows users to create custom indices upon collections of WARC files. After creating an index on a WARC collection, users can then query their collection using the GraphQL query language to retrieve desired WARC records.
Experiments with WARCFilter on a WARC dataset composed of 238,000 WARC records demonstrate that utilizing CDX index files makes WARC record filtering around ten to twenty times faster than raw WARC parsing. Database timing tests with the JavaScript Linear Hash Table database system showed insertion and retrieval operations twice as fast as those of a similar Linear Hash Table database implemented in Rust. Experiments with the overall WARC-KIT application on the same 238,000 WARC record dataset exhibited consistent query times for different complex queries.
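To illustrate the Linear Hash Table data structure named in this abstract, here is a minimal in-memory Python sketch of linear hashing. It is not WARC-KIT's on-disk JavaScript implementation. The class name, the load-factor threshold, and the initial bucket count are assumptions made for illustration.

```python
class LinearHashTable:
    """In-memory linear hashing sketch (illustrative only)."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets  # bucket count at level 0
        self.level = 0             # number of completed doubling rounds
        self.split_ptr = 0         # next bucket to split in the current round
        self.max_load = max_load   # average entries per bucket that triggers a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        h = hash(key)
        idx = h % (self.n0 << self.level)
        if idx < self.split_ptr:
            # This bucket was already split in the current round,
            # so the next-level hash function applies.
            idx = h % (self.n0 << (self.level + 1))
        return idx

    def get(self, key, default=None):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        # Split exactly one bucket, in fixed order, regardless of which one is full.
        old = self.buckets[self.split_ptr]
        self.buckets[self.split_ptr] = []
        self.buckets.append([])
        self.split_ptr += 1
        if self.split_ptr == self.n0 << self.level:
            # Every bucket of this round has been split: start the next round.
            self.level += 1
            self.split_ptr = 0
        for k, v in old:
            self.buckets[self._address(k)].append((k, v))
```

The table grows by one bucket per split rather than doubling a directory, which is one reason linear hashing is commonly considered for on-disk stores.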
-storage : a self organizing multi-attribute storage technique for very large main memories
Main memory is continuously improving in both price and capacity. With this come new storage problems as well as new directions of usage. Just before the millennium, several main memory database systems are becoming commercially available. The hot areas include boosting the performance of web-enabled systems, such as search engines and auctioning systems. We present a novel data storage structure, the -storage structure, a high-performance data structure allowing automatically indexed storage of very large amounts of multi-attribute data. The experiments show excellent performance for point retrieval and highly efficient pruning for pattern searches. It provides the balanced storage previously achieved by random kd-trees but avoids their increased pattern match search times through an effective assignment of bits of attributes. Moreover, it avoids the sensitivity of the kd-tree to insert orders.
Advance of the Access Methods
The goal of this paper is to outline the advances in access methods over the last ten years and to review all methods available in the accessible literature.
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online.
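As a point of reference for the exact brute-force k-NN search that the abstract describes as slow, the following Python sketch scores every record against the query and keeps the top k. Cosine similarity over dense vectors is a stand-in assumption here, as are the function names; the abstract does not specify the similarity function, and this is not the authors' approximate algorithm.

```python
import heapq
import math


def cosine(a, b):
    # Stand-in similarity (assumption); any function of two records could be used.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_brute_force(query, records, k, similarity=cosine):
    """Return the k (score, record_id) pairs most similar to the query.

    Every record is scored, so the cost per query grows linearly
    with the size of the collection.
    """
    scored = ((similarity(query, vec), rid) for rid, vec in records.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy collection of record vectors.
records = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top2 = knn_brute_force([1.0, 0.1], records, k=2)
```

An approximate method trades a small amount of accuracy for speed by avoiding the full scan that this baseline performs.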