    Optimal Hashing in External Memory

    Hash tables are a ubiquitous class of dictionary data structures. However, standard hash table implementations do not translate well into the external memory model, because they do not incorporate locality for insertions. Iacono and Patrascu established an update/query tradeoff curve for external hash tables: a hash table that performs insertions in O(lambda/B) amortized IOs requires Omega(log_lambda N) expected IOs for queries, where N is the number of items that can be stored in the data structure, B is the size of a memory transfer, M is the size of memory, and lambda is a tuning parameter. They provide a complicated hashing data structure, which we call the IP hash table, that meets this curve for lambda that is Omega(log log M + log_M N). In this paper, we present a simpler external-memory hash table, the Bundle of Arrays Hash Table (BOA), that is optimal for a narrower range of lambda. The simplicity of BOAs allows them to be readily modified to achieve the following results:

    - A new external-memory data structure, the Bundle of Trees Hash Table (BOT), that matches the performance of the IP hash table, while retaining some of the simplicity of the BOAs.
    - The Cache-Oblivious Bundle of Trees Hash Table (COBOT), the first cache-oblivious hash table. This data structure matches the optimality of BOTs and IP hash tables over the same range of lambda.
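
    To make the tradeoff curve above concrete, the following sketch evaluates both sides of it for a few values of lambda. It is only arithmetic on the stated bounds with all constant factors ignored, not an implementation of any of the hash tables. The function names, the base-2 logarithm in the log log M term, and the values of N, B, and M are assumptions chosen for illustration.

```python
import math


def insert_cost_ios(lam: float, B: float) -> float:
    """Amortized insertion cost lambda / B, ignoring constant factors."""
    return lam / B


def query_lower_bound_ios(lam: float, N: float) -> float:
    """Expected query cost log_lambda N, ignoring constant factors (lam > 1)."""
    return math.log(N) / math.log(lam)


def ip_lambda_threshold(M: float, N: float) -> float:
    """Order of log log M + log_M N; base-2 logs are an assumption here."""
    return math.log2(math.log2(M)) + math.log(N) / math.log(M)


if __name__ == "__main__":
    # Hypothetical sizes: items, transfer size, and memory size.
    N, B, M = 2**40, 2**10, 2**30
    print("lambda threshold (order of):", ip_lambda_threshold(M, N))
    for lam in (8, 64, 1024):
        print(
            f"lambda={lam}: insert ~ {insert_cost_ios(lam, B):.4f} IOs, "
            f"query >= ~ {query_lower_bound_ios(lam, N):.2f} IOs"
        )
```

    Raising lambda makes insertions more expensive and lowers the query bound, which is the tension the curve describes.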

    Assert(!Defined(Sequential I/O))

    The term sequential I/O is widely used in systems research with the intuitive understanding that it means consecutive access. From a survey of the literature, though, this intuitive understanding has translated into numerous, inconsistent definitions. Since sequential I/O is such a fundamental concept in systems research, we believe that a sequentiality metric should allow us to compare access patterns in a meaningful way. We explore access properties that could be incorporated into potential metrics for sequential I/O, including access size, gaps between accesses, multi-stream, and inter-arrival time. We then analyze hundreds of large-scale storage traces and discuss how potential metrics compare. Interestingly, we find that I/O traces considered highly sequential by one metric can be highly random by another metric. We further demonstrate that many plausible metrics are weakly correlated, though metrics weighted by size have more consistency. While there may not be a single metric for sequential I/O that is best in all cases, we believe systems researchers should more carefully consider, and state, which definition they use.
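
    The disagreement between metrics can be seen on a toy trace. The sketch below compares two plausible definitions: the fraction of requests that start exactly where the previous request ended, and the same quantity weighted by request size. Both definitions, the function names, and the trace itself are assumptions made for illustration; they are not the metrics or traces evaluated in the paper.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (start offset in bytes, size in bytes)


def consecutive_flags(trace: List[Request]) -> List[bool]:
    """Mark each request that begins where its predecessor ended."""
    if not trace:
        return []
    flags = [False]  # the first request has no predecessor
    for (prev_off, prev_size), (off, _) in zip(trace, trace[1:]):
        flags.append(off == prev_off + prev_size)
    return flags


def fraction_consecutive_requests(trace: List[Request]) -> float:
    """Count-based metric: share of requests that are consecutive."""
    if not trace:
        return 0.0
    return sum(consecutive_flags(trace)) / len(trace)


def fraction_consecutive_bytes(trace: List[Request]) -> float:
    """Size-weighted metric: share of bytes moved by consecutive requests."""
    total = sum(size for _, size in trace)
    if total == 0:
        return 0.0
    flags = consecutive_flags(trace)
    seq = sum(size for (_, size), flag in zip(trace, flags) if flag)
    return seq / total


# Hypothetical trace: three large back-to-back requests, then small scattered ones.
trace = [
    (0, 1_000_000), (1_000_000, 1_000_000), (2_000_000, 1_000_000),
    (50, 512), (9_000, 512), (300, 512), (77_000, 512), (4_000, 512),
]

if __name__ == "__main__":
    print("by request count:", fraction_consecutive_requests(trace))
    print("by bytes:", fraction_consecutive_bytes(trace))
```

    On this trace the count-based metric calls a quarter of the accesses sequential, while the size-weighted metric attributes roughly two thirds of the bytes to sequential access.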

    A survey and classification of storage deduplication systems

    The automatic elimination of duplicate data in a storage system, commonly known as deduplication, is increasingly accepted as an effective technique to reduce storage costs. Thus, it has been applied to different storage types, including archives and backups, primary storage, within solid state disks, and even to random access memory. Although the general approach to deduplication is shared by all storage types, each poses specific challenges and leads to different trade-offs and solutions. This diversity is often misunderstood, thus underestimating the relevance of new research and development. The first contribution of this paper is a classification of deduplication systems according to six criteria that correspond to key design decisions: granularity, locality, timing, indexing, technique, and scope. This classification identifies and describes the different approaches used for each of them. As a second contribution, we describe which combinations of these design decisions have been proposed and found more useful for challenges in each storage type. Finally, outstanding research challenges and unexplored design points are identified and discussed.

    This work is funded by the European Regional Development Fund (ERDF) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the Fundacao para a Ciencia e a Tecnologia (FCT; Portuguese Foundation for Science and Technology) within project RED FCOMP-01-0124-FEDER-010156 and the FCT by PhD scholarship SFRH-BD-71372-2010.
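
    As a rough illustration of the general approach shared by these systems, the sketch below deduplicates data at fixed-size chunk granularity, checks for duplicates at write time, and keeps a complete in-memory index keyed by a SHA-256 digest. This is one hypothetical point in the design space, not a system described in the survey; the `ChunkStore` class, its methods, and the chunk size are assumptions.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    """Toy deduplicating store: fixed-size chunks, one digest-keyed index."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.index: Dict[bytes, bytes] = {}  # digest -> unique chunk contents
        self.logical_bytes = 0  # bytes written by clients, before deduplication

    def write(self, data: bytes) -> List[bytes]:
        """Store data and return its recipe (the ordered list of chunk digests)."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            # Only the first copy of each distinct chunk is kept.
            self.index.setdefault(digest, chunk)
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe: List[bytes]) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.index[d] for d in recipe)

    def stored_bytes(self) -> int:
        """Physical bytes held after deduplication."""
        return sum(len(chunk) for chunk in self.index.values())


if __name__ == "__main__":
    store = ChunkStore(chunk_size=4)
    recipe = store.write(b"aaaabbbbaaaa")
    print(store.logical_bytes, "logical bytes,", store.stored_bytes(), "stored")
    print(store.read(recipe))
```

    Changing any single choice here, such as variable-size chunks or deferring duplicate detection to a background pass, gives a different trade-off along the criteria listed above.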

    Distinctive regions of 3D surfaces

    Selecting the most important regions of a surface is useful for shape matching and a variety of applications in computer graphics and geometric modeling. While previous research has analyzed geometric properties of meshes in isolation, we select regions that distinguish a shape from objects of a different type. Our approach to analyzing distinctive regions is based on performing a shape-based search using each region as a query into a database. Distinctive regions of a surface have shape consistent with objects of the same type and different from objects of other types. We demonstrate the utility of detecting distinctive surface regions for shape matching and other graphics applications, including mesh visualization, icon generation, and mesh simplification.
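
    The idea of scoring a region by how well it works as a database query can be sketched as follows. Each region is assumed to be summarized by a numeric descriptor, the search is assumed to be a nearest-neighbor lookup under Euclidean distance, and the score is taken to be the fraction of the top k matches that share the query's object type. The descriptor, the distance, the score, and all names here are illustrative assumptions, not the paper's actual measure.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[Sequence[float], str]  # (region descriptor, object type)


def distinctiveness(
    query_desc: Sequence[float],
    query_type: str,
    database: List[Entry],
    k: int = 3,
) -> float:
    """Fraction of the k nearest database regions with the query's type.

    The database is assumed to exclude regions of the query object itself.
    """
    ranked = sorted(database, key=lambda entry: math.dist(query_desc, entry[0]))
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for _, obj_type in top if obj_type == query_type) / len(top)


# Hypothetical 2D descriptors for regions of other objects in the database.
database: List[Entry] = [
    ((0.0, 0.0), "chair"),
    ((0.1, 0.0), "chair"),
    ((0.2, 0.1), "table"),
    ((5.0, 5.0), "table"),
]

if __name__ == "__main__":
    # A region whose nearest matches are all chairs scores high for a chair.
    print(distinctiveness((0.0, 0.05), "chair", database, k=2))
    # A region that retrieves a mix of types is less distinctive.
    print(distinctiveness((0.15, 0.05), "table", database, k=2))
```

    Regions with high scores retrieve objects of their own type, which matches the description of distinctive regions given above.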