9 research outputs found

    Order Preserving Minimal Perfect Hash Functions and Information Retrieval

    Rapid access to information is essential for a wide variety of retrieval systems and applications. Hashing has long been used when the fastest possible direct search is desired, but is generally not appropriate when sequential or range searches are also required. This paper describes a hashing method, developed for collections that are relatively static, that supports both direct and sequential access. The algorithms described give hash functions that are optimal in terms of time and hash table space utilization, and that preserve any a priori ordering desired. Furthermore, the resulting order preserving minimal perfect hash functions (OPMPHFs) can be found using time and space that are linear in the number of keys involved, and so are close to optimal.
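To make the idea of an order preserving minimal perfect hash function concrete, here is a minimal Python sketch. It is not the algorithm from the paper above: it swaps in the well-known acyclic-graph construction (often attributed to Czech, Havas and Majewski) because it is short to write down. The names `build_opmphf` and `opmphf_lookup`, the table size `3n + 1`, and the use of salted BLAKE2b as the seeded hash are all assumptions made for illustration. The keys must be distinct.

```python
import hashlib
from collections import defaultdict


def _hash(key: str, seed: int, m: int) -> int:
    """Seeded hash of `key` into range(m); the BLAKE2b salt acts as the seed."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % m


def _assign(adj, m, n):
    """Label vertices so that g[u] + g[v] == i (mod n) for every edge i.

    Returns None if the graph contains a cycle (including self-loops and
    parallel edges), in which case the caller retries with new seeds.
    """
    g = [-1] * m
    for root in adj:
        if g[root] != -1:
            continue
        g[root] = 0
        stack = [(root, -1)]  # (vertex, id of the edge used to reach it)
        while stack:
            u, via = stack.pop()
            for v, i in adj[u]:
                if i == via:
                    continue  # do not walk back along the edge we came from
                if g[v] != -1:
                    return None  # reached an already labelled vertex: cycle
                g[v] = (i - g[u]) % n
                stack.append((v, i))
    return [x if x != -1 else 0 for x in g]


def build_opmphf(keys, max_tries=1000):
    """Build a table mapping keys[i] -> i for distinct keys (assumed API)."""
    n = len(keys)
    if n == 0:
        raise ValueError("need at least one key")
    m = 3 * n + 1  # assumed vertex count; more than 2n keeps retries rare
    for attempt in range(max_tries):
        s1, s2 = 2 * attempt, 2 * attempt + 1
        adj = defaultdict(list)
        for i, key in enumerate(keys):
            u, v = _hash(key, s1, m), _hash(key, s2, m)
            adj[u].append((v, i))
            adj[v].append((u, i))
        g = _assign(adj, m, n)
        if g is not None:
            return s1, s2, m, n, g
    raise RuntimeError("no acyclic graph found; are the keys distinct?")


def opmphf_lookup(key, table):
    """Evaluate the hash: two table reads and one addition modulo n."""
    s1, s2, m, n, g = table
    return (g[_hash(key, s1, m)] + g[_hash(key, s2, m)]) % n


keys = ["pear", "apple", "fig", "cherry", "banana"]  # any desired order
table = build_opmphf(keys)
print([opmphf_lookup(k, table) for k in keys])
```

In this sketch each key hashes to its position in the input list, so direct lookup and sequential access in the chosen order use the same table. A key outside the original set still returns some index, so a real system would verify the stored key at that position.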

    Improving Cuckoo Hashing with Perfect Hashing

    Title from PDF of title page, viewed January 8, 2018. Thesis advisor: Yijie Han. Vita. Includes bibliographical references (pages 33-34). Thesis (M.S.)--School of Computing and Engineering. University of Missouri--Kansas City, 2017.
    In computer science, a data structure is a systematic way of organizing data so that it can be used efficiently. Many hashing techniques aim to store keys in memory in a way that makes key access efficient. One option to increase throughput is to use algorithms based on hashing. Cuckoo Hashing is one such technique, providing high memory usage with constant access time. Cuckoo Hashing, in turn, has many implementations, among which Parallel-d-Pipeline is more efficient. Perfect Hashing maps distinct elements to a set of integers without any collisions. Perfect Hashing is fast and has a high hit ratio. There are many other hashing techniques, but we choose Perfect Hashing because it does not require a collision resolution mechanism. Cuckoo Hashing has high memory usage when allocating keys to its memory. We therefore combine Cuckoo Hashing and Perfect Hashing to increase the key hit ratio and memory utilization.
    Introduction -- Cuckoo hashing and its implementation -- Perfect hashing -- Our approach -- Implementation -- Analysis -- Conclusio
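As background for the abstract above, the following is a minimal sketch of plain two-table cuckoo hashing in Python. It does not implement Parallel-d-Pipeline or the thesis's combination with perfect hashing. The class name `CuckooTable`, the default capacity, the eviction limit `max_kicks`, and the salted BLAKE2b hash are all assumptions chosen for illustration.

```python
import hashlib


class CuckooTable:
    """Set of string keys stored in two tables, one candidate slot in each."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks  # evictions allowed before rehashing
        self.seed = 0
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, key, which):
        """Slot of `key` in table `which` (0 or 1) under the current seed."""
        salt = (2 * self.seed + which).to_bytes(8, "little")
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=salt
        ).digest()
        return int.from_bytes(digest, "little") % self.capacity

    def contains(self, key):
        """Lookup probes exactly two slots, so it takes constant time."""
        return any(self.tables[w][self._slot(key, w)] == key for w in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            slot = self._slot(key, which)
            # Place the key and pick up whatever was in the slot before.
            key, self.tables[which][slot] = self.tables[which][slot], key
            if key is None:
                return  # the slot was empty, nothing left to place
            which = 1 - which  # the evicted key tries its other table
        self._rehash(key)

    def _rehash(self, pending):
        """Grow both tables, change the hash seed, and reinsert every key."""
        keys = [k for t in self.tables for k in t if k is not None]
        keys.append(pending)
        self.capacity = 2 * self.capacity + 1
        self.seed += 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k in keys:
            self.insert(k)


table = CuckooTable()
for word in ["alpha", "beta", "gamma", "delta"]:
    table.insert(word)
print(table.contains("gamma"), table.contains("omega"))
```

Collisions here are resolved by evicting and relocating keys, and a long eviction chain forces a rehash. That is the cost a perfect hash function avoids for a static key set, since it needs no collision resolution.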

    Hadoop Perfect File: A fast and memory-efficient metadata access archive file to face small files problem in HDFS

    HDFS faces several issues when it comes to handling a large number of small files. These issues are well addressed by archive systems, which combine small files into larger ones. They use index files to hold the information needed to retrieve a small file's content from the big archive file. However, existing archive-based solutions incur significant overheads when retrieving file content, since additional processing and I/Os are needed to acquire the retrieval information before accessing the actual file content, thereby deteriorating access efficiency. This paper presents a new archive file named Hadoop Perfect File (HPF). HPF minimizes access overheads by directly accessing metadata from the part of the index file containing the information. It consequently reduces the additional processing and I/Os needed and improves the access efficiency of archive files. Our index system uses two hash functions. Metadata records are distributed across index files using a dynamic hash function. We further build an order-preserving perfect hash function that memorizes the position of a small file's metadata record within the index file.
    The authors thank the anonymous reviewers for their insightful suggestions. This work is supported by the National Natural Science Foundation of China (Grant No. 61602037).

    Enabling Internet-Scale Publish/Subscribe In Overlay Networks

    As the amount of data in today's Internet grows, users are exposed to too much information, which becomes increasingly difficult to comprehend. Publish/subscribe systems alleviate this problem by providing loosely coupled communication between producers and consumers of data in a network. Data consumers, i.e., subscribers, are provided with a subscription mechanism to express their interest in a subset of the data, so that they are notified only when data matching their subscription is generated by the producers, i.e., publishers. Most publish/subscribe systems today are based on the client/server architectural model. However, to provide the publish/subscribe service at large scale, companies either have to invest a huge amount of money in over-provisioning resources or are prone to frequent service failures. Peer-to-peer overlay networks are an attractive alternative for building Internet-scale publish/subscribe systems. However, scalability comes at a cost: a published message often needs to traverse a large number of uninterested (unsubscribed) nodes before reaching all its subscribers. We refer to this undesirable traffic as relay overhead. Without careful consideration, the relay overhead might sharply increase resource consumption for the relay nodes (in terms of bandwidth transmission cost, CPU, etc.) and could ultimately lead to rapid deterioration of the system's performance once the relay nodes start dropping messages or choose to permanently abandon the system. To mitigate this problem, some solutions use an unbounded number of connections per node, while others limit the expressiveness of the subscription scheme. In this thesis work, we introduce two systems, called Vitis and Vinifera, for topic-based and content-based publish/subscribe models, respectively. Both systems are gossip-based and significantly decrease the relay overhead.
We utilize novel techniques to cluster together nodes that exhibit similar subscriptions. In the topic-based model, distinct clusters for each topic are constructed, while clusters in the content-based model are fuzzy and do not have explicit boundaries. We augment these clustered overlays with links that facilitate routing in the network. We construct a hybrid system by injecting structure into an otherwise unstructured network. The resulting structures resemble navigable small-world networks that span clusters of nodes with similar subscriptions. The properties of such overlays make them an ideal platform for efficient data dissemination in large-scale systems. The systems require only a bounded node degree and, as we show through simulations, they scale well with the number of nodes and subscriptions and remain efficient under highly complex subscription patterns, high publication rates, and even in the presence of failures in the network. We also compare both systems against some state-of-the-art publish/subscribe systems. Our measurements show that both Vitis and Vinifera significantly outperform their counterparts in various subscription and churn scenarios, under both synthetic workloads and real-world traces.

    Order preserving minimal perfect hash functions and information retrieval

    Rapid access to information is essential for a wide variety of retrieval systems and applications. Hashing has long been used when the fastest possible direct search is desired, but is generally not appropriate when sequential or range searches are also required. This paper describes a hashing method, developed for collections that are relatively static, that supports both direct and sequential access. Indeed, the algorithm described gives hash functions that are optimal in terms of time and hash table space utilization, and that preserve any a priori ordering desired. Furthermore, the resulting order preserving minimal perfect hash functions (OPMPHFs) can b

    A Simplified Faceted Approach To Information Retrieval for Reusable Software Classification

    Software reuse is widely recognized as the most promising technique presently available for reducing the cost of software production. It is the adaptation or incorporation of previously developed software components, designs, or other software-related artifacts (e.g., test plans) into new software or software development regimes. Researchers and vendors are redoubling their efforts on the topic of software reuse. Most have focused on mechanisms to construct reusable software, but few have focused on the problem of discovering components or designs to meet specific needs. For software reuse to be successful, it must be perceived to be less costly to discover a software component or related artifact to satisfy a given need than to discover one anew. Accordingly, this study describes a method to classify software components that meet a specified need. Specifically, the purpose of the present research study is to provide a flexible system, comprising a classification scheme and a searcher system, entitled Guides-Search, in which processes can be retrieved by carrying out a structured dialogue with the user. The classification scheme provides both the structure of the questions to be posed to the user and the set of possible answers to each question. The model is not an attempt to replace current structures; rather, it seeks to provide a conceptual and structural method to support the improvement of software reuse methodology. The investigation focuses on the following goals and objectives for the classification scheme and searcher system: the classification will be flexible and extensible, but usable by the Searcher; the user will not be presented with a large number of questions; the user will never be required to answer a question not known to be germane to the query.