257 research outputs found

    GPUs as Storage System Accelerators

    Full text link
    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.Comment: IEEE Transactions on Parallel and Distributed Systems, 201

    A Survey of Hashing Techniques for High Performance Computing

    Get PDF
    Hashing is a well-known and widely used technique for providing O(1) access to large files on secondary storage and tables in memory. Hashing techniques were introduced in the early 60s. The term hash function historically is used to denote a function that compresses a string of arbitrary input to a string of fixed length. Hashing finds applications in other fields such as fuzzy matching, error checking, authentication, cryptography, and networking. Hashing techniques have found application to provide faster access in routing tables, with the increase in the size of the routing tables. More recently, hashing has found applications in transactional memory in hardware. Motivated by these newly emerged applications of hashing, in this paper we present a survey of hashing techniques starting from traditional hashing methods with greater emphasis on the recent developments. We provide a brief explanation on hardware hashing and a brief introduction to transactional memory

    BlueDBM: An Appliance for Big Data Analytics

    Get PDF
    Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data and daily twitter feeds where the datasets of interest are 5TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GBs of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a ram-cloud system falls sharply even if only 5%~10% of the references are to the secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.Quanta Computer (Firm)Samsung (Firm)Lincoln Laboratory (PO7000261350)Intel Corporatio

    Advanced Techniques for Improving the Efficacy of Digital Forensics Investigations

    Get PDF
    Digital forensics is the science concerned with discovering, preserving, and analyzing evidence on digital devices. The intent is to be able to determine what events have taken place, when they occurred, who performed them, and how they were performed. In order for an investigation to be effective, it must exhibit several characteristics. The results produced must be reliable, or else the theory of events based on the results will be flawed. The investigation must be comprehensive, meaning that it must analyze all targets which may contain evidence of forensic interest. Since any investigation must be performed within the constraints of available time, storage, manpower, and computation, investigative techniques must be efficient. Finally, an investigation must provide a coherent view of the events under question using the evidence gathered. Unfortunately the set of currently available tools and techniques used in digital forensic investigations does a poor job of supporting these characteristics. Many tools used contain bugs which generate inaccurate results; there are many types of devices and data for which no analysis techniques exist; most existing tools are woefully inefficient, failing to take advantage of modern hardware; and the task of aggregating data into a coherent picture of events is largely left to the investigator to perform manually. To remedy this situation, we developed a set of techniques to facilitate more effective investigations. To improve reliability, we developed the Forensic Discovery Auditing Module, a mechanism for auditing and enforcing controls on accesses to evidence. To improve comprehensiveness, we developed ramparser, a tool for deep parsing of Linux RAM images, which provides previously inaccessible data on the live state of a machine. To improve efficiency, we developed a set of performance optimizations, and applied them to the Scalpel file carver, creating order of magnitude improvements to processing speed and storage requirements. Last, to facilitate more coherent investigations, we developed the Forensic Automated Coherence Engine, which generates a high-level view of a system from the data generated by low-level forensics tools. Together, these techniques significantly improve the effectiveness of digital forensic investigations conducted using them

    High-Performance Computing Algorithms for Constructing Inverted Files on Emerging Multicore Processors

    Get PDF
    Current trends in processor architectures increasingly include more cores on a single chip and more complex memory hierarchies, and such a trend is likely to continue in the foreseeable future. These processors offer unprecedented opportunities for speeding up demanding computations if the available resources can be effectively utilized. Simultaneously, parallel programming languages such as OpenMP and MPI have been commonly used on clusters of multicore CPUs while newer programming languages such as OpenCL and CUDA have been widely adopted on recent heterogeneous systems and GPUs respectively. The main goal of this dissertation is to develop techniques and methodologies for exploiting these emerging parallel architectures and parallel programming languages to solve large scale irregular applications such as the construction of inverted files. The extraction of inverted files from large collections of documents forms a critical component of all information retrieval systems including web search engines. In this problem, the disk I/O throughput is the major performance bottleneck especially when intermediate results are written onto disks. In addition to the I/O bottleneck, a number of synchronization and consistency issues must be resolved in order to build the dictionary and postings lists efficiently. To address these issues, we introduce a dictionary data structure using a hybrid of trie and B-trees and a high-throughput pipeline strategy that completely avoids the use of disks as temporary storage for intermediate results, while ensuring the consumption of the input data at a high rate. The high-throughput pipelined strategy produces parallel parsed streams that are consumed at the same rate by parallel indexers. The pipelined strategy is implemented on a single multicore CPU as well as on a cluster of such nodes. We were able to achieve a throughput of more than 262MB/s on the ClueWeb09 dataset on a single node. On a cluster of 32 nodes, our experimental results show scalable performance using different metrics, significantly improving on prior published results. On the other hand, we develop a new approach for handling time-evolving documents using additional small temporal indexing structures. The lifetime of the collection is partitioned into multiple time windows, which guarantees a very fast temporal query response time at a small space overhead relative to the non-temporal case. Extensive experimental results indicate that the overhead in both indexing and querying is small in this more complicated case, and the query performance can indeed be improved using finer temporal partitioning of the collection. Finally, we employ GPUs to accelerate the indexing process for building inverted files and to develop a very fast algorithm for the highly irregular list ranking problem. For the indexing problem, the workload is split between CPUs and GPUs in such a way that the strengths of both architectures are exploited. For the list ranking problem involved in the decompression of inverted files, an optimized GPU algorithm is introduced by reducing the problem to a large number of fine grain computations in such a way that the processing cost per element is shown to be close to the best possible

    Efficient Computation of K-Nearest Neighbor Graphs for Large High-Dimensional Data Sets on GPU Clusters

    Get PDF
    The k-Nearest Neighbor Graph (k-NNG) and the related k-Nearest Neighbor (k-NN) methods have a wide variety of applications in areas such as bioinformatics, machine learning, data mining, clustering analysis, and pattern recognition. Our application of interest is manifold embedding. Due to the large dimensionality of the input data (\u3c15k), spatial subdivision based techniques such OBBs, k-d tree, BSP etc., are not viable. The only alternative is the brute-force search, which has two distinct parts. The first finds distances between individual vectors in the corpus based on a pre-defined metric. Given the distance matrix, the second step selects k nearest neighbors for each member of the query data set. This thesis presents the development and implementation of a distributed exact k-Nearest Neighbor Graph (k-NNG) construction method. The proposed method uses Graphics Processing Units (GPUs) and exploits multiple levels of parallelism for distributed computational systems using GPUs. It is scalable for different cluster sizes, with each compute node in the cluster containing multiple GPUs. The distance computation is formulated as a basic matrix multiplication and reduction operation. The optimized CUBLAS matrix multiplication library is used for this purpose. Various distance metrics such as Euclidian, cosine, and Pearson are supported. For k-NNG construction, two different methods are presented. The first is based on an approach called batch index sorting to build the k-NNG with three sorting operations. This method uses the optimized radix sort implementation in the Thrust library for GPU. The second is an efficient implementation using the latest GPU functionalities of a variant of the quick select algorithm. Overall, the batch index sorting based k-NNG method is approximately 13x faster than a distributed MATLAB implementation. The quick select algorithm itself has a 5x speedup over state-of-the art GPU methods. This has enabled the processing of k-NNG construction on a data set containing 20 million image vectors, each with dimension 15,000, as part of a manifold embedding technique for analyzing the conformations of biomolecules
    • …
    corecore