13 research outputs found
BlueDBM: An Appliance for Big Data Analytics
Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data and daily twitter feeds where the datasets of interest are 5TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GBs of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a ram-cloud system falls sharply even if only 5%~10% of the references are to the secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.Quanta Computer (Firm)Samsung (Firm)Lincoln Laboratory (PO7000261350)Intel Corporatio
Runtime Adaptive Hybrid Query Engine based on FPGAs
This paper presents the fully integrated hardware-accelerated query engine for large-scale datasets in the context of Semantic Web databases. As queries are typically unknown at design time, a static approach is not feasible and not flexible to cover a wide range of queries at system runtime. Therefore, we introduce a runtime reconfigurable accelerator based on a Field Programmable Gate Array (FPGA), which transparently incorporates with the freely available Semantic Web database LUPOSDATE. At system runtime, the proposed approach dynamically generates an optimized hardware accelerator in terms of an FPGA configuration for each individual query and transparently retrieves the query result to be displayed to the user. During hardware-accelerated execution the host supplies triple data to the FPGA and retrieves the results from the FPGA via PCIe interface. The benefits and limitations are evaluated on large-scale synthetic datasets with up to 260 million triples as well as the widely known Billion Triples Challenge
Data-intensive Systems on Modern Hardware : Leveraging Near-Data Processing to Counter the Growth of Data
Over the last decades, a tremendous change toward using information technology in almost every daily routine of our lives can be perceived in our society, entailing an incredible growth of data collected day-by-day on Web, IoT, and AI applications.
At the same time, magneto-mechanical HDDs are being replaced by semiconductor storage such as SSDs, equipped with modern Non-Volatile Memories, like Flash, which yield significantly faster access latencies and higher levels of parallelism. Likewise, the execution speed of processing units increased considerably as nowadays server architectures comprise up to multiple hundreds of independently working CPU cores along with a variety of specialized computing co-processors such as GPUs or FPGAs.
However, the burden of moving the continuously growing data to the best fitting processing unit is inherently linked to today’s computer architecture that is based on the data-to-code paradigm. In the light of Amdahl's Law, this leads to the conclusion that even with today's powerful processing units, the speedup of systems is limited since the fraction of parallel work is largely I/O-bound.
Therefore, throughout this cumulative dissertation, we investigate the paradigm shift toward code-to-data, formally known as Near-Data Processing (NDP), which relieves the contention on the I/O bus by offloading processing to intelligent computational storage devices, where the data is originally located.
Firstly, we identified Native Storage Management as the essential foundation for NDP due to its direct control of physical storage management within the database. Upon this, the interface is extended to propagate address mapping information and to invoke NDP functionality on the storage device. As the former can become very large, we introduce Physical Page Pointers as one novel NDP abstraction for self-contained immutable database objects.
Secondly, the on-device navigation and interpretation of data are elaborated. Therefore, we introduce cross-layer Parsers and Accessors as another NDP abstraction that can be executed on the heterogeneous processing capabilities of modern computational storage devices. Thereby, the compute placement and resource configuration per NDP request is identified as a major performance criteria. Our experimental evaluation shows an improvement in the execution durations of 1.4x to 2.7x compared to traditional systems. Moreover, we propose a framework for the automatic generation of Parsers and Accessors on FPGAs to ease their application in NDP.
Thirdly, we investigate the interplay of NDP and modern workload characteristics like HTAP. Therefore, we present different offloading models and focus on an intervention-free execution. By propagating the Shared State with the latest modifications of the database to the computational storage device, it is able to process data with transactional guarantees. Thus, we achieve to extend the design space of HTAP with NDP by providing a solution that optimizes for performance isolation, data freshness, and the reduction of data transfers. In contrast to traditional systems, we experience no significant drop in performance when an OLAP query is invoked but a steady and 30% faster throughput.
Lastly, in-situ result-set management and consumption as well as NDP pipelines are proposed to achieve flexibility in processing data on heterogeneous hardware. As those produce final and intermediary results, we continue investigating their management and identified that an on-device materialization comes at a low cost but enables novel consumption modes and reuse semantics. Thereby, we achieve significant performance improvements of up to 400x by reusing once materialized results multiple times
Recommended from our members
SoC-Based In-Storage Processing: Bringing Flexibility and Efficiency to Near-Data Processing
Data are among the most valuable assets in the modern world, and they have caused a revolutionary stage in human life. Nowadays, companies make knowledge-based decisions by analyzing a huge volume of data, super-scale data centers are used to process customers’ data to suggest products to them, government services rely on the data people provide to them, and there are many similar cases wherein data are used as an important asset. Data are originally stored in storage systems. To process data, application servers need to fetch the data from storage units, which imposes the cost of moving the data to the system. This cost has a direct relationship to the distance of the processing engines from the data, and this is the key motivation for the emergence of distributed processing platforms such as Hadoop, which bring the process closer to the data.In-storage processing (ISP) pushes the “bring the process to data” paradigm to its ultimate boundaries by utilizing processing engines inside the storage units to process data. The architecture of modern solid-state drives (SSDs) provides a suitable environment for implementing such technology. Thus, this dissertation focuses on SSD architectures that are able to run user applications in-place, which are called computational storage devices (CSDs). In this dissertation, we propose CSD architectures and investigate the benefits of deploying CSDs for running different applications. This research uses a practical approach that includes building fully functional prototypes of the proposed CSD architectures, developing storage systems equipped with the CSDs, and running different benchmarks to investigate the benefits of deploying the CSDs in the systems. This research proposes two different CSD architectures, namely CompStor and Catalina.These are the first CSDs to be equipped with a dedicated ISP engine for running user applications in-place that includes a quad-core ARM Cortex-A53 processor together with FPGA- and application-specific integrated circuit (ASIC) based accelerators. The proposed architectures run a full-fledged operating system inside, which provides a flexible environment for running a wide range of user applications in-place. The system-on-chip (SOC) based architecture of Catalina CSD, together with a software stack developed for seamless deployment of the CSD, makes it a platform for the implementation of different ISP concepts and ideas.To the best of our knowledge, Catalina is the only ISP platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and message passing interface (MPI) based applications in-place without any modifications to the underlying distributed processing framework. We performed extensive experimental tests using several datasets on both CompStor and Catalina CSDs. The experimental results show up to 2.2x and 4.3x improvements in performance and energy consumption, respectively, for running Hadoop MapReduce benchmarks using Catalina CSDs and up to 5.4x and 8.9x improvements for running 1-, 2-, and 3-dimensional DFT algorithms due to the Neon SIMD engines inside Catalina. Additionally, using FPGA-based accelerators, Catalina CSDs can improve the performance and energy consumption of a highly demanding image similarity search application up to 11x and 7x, respectively