Moving Processing to Data: On the Influence of Processing in Memory on Data Management
Near-Data Processing refers to an architectural hardware and software
paradigm based on the co-location of storage and compute units. Ideally, it
allows application-defined data- or compute-intensive operations to be executed
in situ, i.e., within (or close to) the physical data storage. Near-Data
Processing thus seeks to minimize expensive data movement, improving performance,
scalability, and resource efficiency. Processing-in-Memory is a sub-class of
Near-Data Processing that targets data processing directly within memory (DRAM)
chips. The effective use of Near-Data Processing mandates new architectures,
algorithms, interfaces, and development toolchains.
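To make the in-situ idea concrete, here is a minimal, purely illustrative Python sketch; the NearDataStore class and its methods are hypothetical stand-ins, not an API from the abstract above.

```python
# Hypothetical sketch of the NDP idea: instead of shipping every record to
# the host and filtering there, the host ships the predicate to a compute
# unit co-located with storage and receives only the matches.

class NearDataStore:
    """Toy stand-in for storage with a co-located compute unit."""
    def __init__(self, records):
        self._records = records          # data "at rest" in the device

    def read_all(self):                  # conventional path: bulk data movement
        return list(self._records)

    def filter_in_situ(self, predicate): # NDP path: ship code, not data
        return [r for r in self._records if predicate(r)]

store = NearDataStore(range(1_000_000))

# Conventional: ~1M records cross the storage interface; the host does the work.
hot = [r for r in store.read_all() if r % 997 == 0]

# Near-data: the predicate runs inside the device; only ~1K results move.
hot_ndp = store.filter_in_situ(lambda r: r % 997 == 0)
assert hot == hot_ndp
```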
Polystore++: Accelerated Polystore System for Heterogeneous Workloads
Modern real-time business analytics consist of heterogeneous workloads (e.g.,
database queries, graph processing, and machine learning). These analytics
applications need programming environments that can capture all aspects of the
constituent workloads (including the data models they work on and the movement
of data across processing engines). Polystore systems suit such applications; however,
these systems currently execute on CPUs and the slowdown of Moore's Law means
they cannot meet the performance and efficiency requirements of modern
workloads. We envision Polystore++, an architecture to accelerate existing
polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and GPUs).
Polystore++ systems can achieve high performance at low power by identifying
and offloading components of a polystore system that are amenable to
acceleration using specialized hardware. Building a Polystore++ system is
challenging and introduces new research problems motivated by the use of
hardware accelerators (e.g., optimizing and mapping query plans across
heterogeneous computing units and exploiting hardware pipelining and
parallelism to improve performance). In this paper, we discuss these challenges
in detail and list possible approaches to address these problems.
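As a toy illustration of the query-plan mapping challenge mentioned above, the sketch below greedily assigns plan operators to devices under a made-up cost model; all operator names, costs, and the flat TRANSFER penalty are invented for illustration and are not from the paper.

```python
# Illustrative sketch (all costs hypothetical): greedily place a polystore
# query plan's operators onto heterogeneous devices, picking the device with
# the lowest estimated cost per operator plus a penalty for crossing devices.

EST_COST = {  # (operator, device) -> made-up relative cost
    ("scan",  "CPU"): 5, ("scan",  "FPGA"): 2, ("scan",  "GPU"): 4,
    ("join",  "CPU"): 8, ("join",  "FPGA"): 3, ("join",  "GPU"): 5,
    ("train", "CPU"): 9, ("train", "FPGA"): 6, ("train", "GPU"): 2,
}
TRANSFER = 1  # made-up penalty for moving data between devices

def place(plan):
    placement, prev = [], None
    for op in plan:
        best = min(("CPU", "FPGA", "GPU"),
                   key=lambda d: EST_COST[(op, d)]
                                 + (TRANSFER if prev and d != prev else 0))
        placement.append((op, best))
        prev = best
    return placement

print(place(["scan", "join", "train"]))
# [('scan', 'FPGA'), ('join', 'FPGA'), ('train', 'GPU')]
```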
Co-KV: A Collaborative Key-Value Store Using Near-Data Processing to Improve Compaction for the LSM-tree
Log-structured merge tree (LSM-tree) based key-value stores are widely
employed in large-scale storage systems. During compaction of the key-value
store, SSTables with overlapping key ranges are merged and sorted for data
queries. This, however, incurs write amplification and thus degrades system
performance, especially under update-intensive workloads. Current optimizations
focus mostly on reducing the compaction overhead on the host, but rarely make
full use of the computation available in the device. To address these issues, we
propose Co-KV, a Collaborative Key-Value store between the host and a near-data
processing (NDP) model-based SSD to improve compaction. Co-KV offers
three benefits: (1) reducing write amplification through a compaction offloading
scheme between host and device; (2) relieving the compaction overhead on the
host by leveraging computation in the SSD based on the NDP model; and (3)
improving the performance of LSM-tree based key-value stores under
update-intensive workloads.
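For readers unfamiliar with compaction, the sketch below shows the k-way merge it performs, which is the work Co-KV offloads to the SSD; this is a generic LSM-style merge in Python, not Co-KV's actual implementation.

```python
import heapq

# Minimal sketch of the work LSM compaction performs: a k-way merge of sorted
# SSTables with overlapping key ranges, keeping only the newest version of
# each key. This merge loop is the computation Co-KV proposes to offload to
# the NDP-capable SSD (the offloading itself is not shown).

def compact(*sstables):
    """Each SSTable is a list of (key, seq, value) sorted by key.
    Higher seq = newer write. Returns one merged, sorted SSTable."""
    merged, last_key = [], None
    # Order by key, then by descending seq so the newest version comes first.
    for key, seq, value in heapq.merge(*sstables, key=lambda e: (e[0], -e[1])):
        if key != last_key:              # first (newest) version wins
            merged.append((key, seq, value))
            last_key = key               # older duplicates are dropped
    return merged

old = [("a", 1, "v1"), ("c", 1, "v1")]
new = [("a", 2, "v2"), ("b", 2, "v2")]
print(compact(old, new))  # [('a', 2, 'v2'), ('b', 2, 'v2'), ('c', 1, 'v1')]
```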
Extensive db_bench experiments show that Co-KV achieves a 2.0x overall
throughput improvement and a write amplification reduction of up to 36.0% over
the state-of-the-art LevelDB. Under YCSB workloads, Co-KV increases
throughput by 1.7x - 2.4x while decreasing write amplification and average
latency by up to 30.0% and 43.0%, respectively.
Application-Driven Near-Data Processing for Similarity Search
Similarity search is key to a variety of applications including
content-based search for images and video, recommendation systems, data
deduplication, natural language processing, computer vision, databases,
computational biology, and computer graphics. At its core, similarity search
manifests as k-nearest neighbors (kNN), a computationally simple primitive
consisting of highly parallel distance calculations and a global top-k sort.
However, kNN is poorly supported by today's architectures because of its high
memory bandwidth requirements.
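A minimal NumPy sketch of the kNN primitive as just described: a batch of parallel distance computations followed by a top-k selection; streaming the full dataset past each query is what drives the bandwidth demand.

```python
import numpy as np

# Brute-force kNN as the abstract characterizes it: n parallel distance
# calculations, then a global top-k sort over the results.

def knn(query, dataset, k):
    """query: (d,) vector; dataset: (n, d) matrix; returns indices of the
    k nearest rows by Euclidean distance."""
    dists = np.linalg.norm(dataset - query, axis=1)   # n parallel distances
    top_k = np.argpartition(dists, k)[:k]             # unordered top-k
    return top_k[np.argsort(dists[top_k])]            # sorted by distance

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 128)).astype(np.float32)
print(knn(data[42], data, k=5))  # index 42 comes first (distance 0)
```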
This paper proposes an application-driven near-data processing accelerator
for similarity search: the Similarity Search Associative Memory (SSAM). By
instantiating compute units close to memory, SSAM benefits from the higher
memory bandwidth and density exposed by emerging memory technologies. We
evaluate the SSAM design down to layout on top of the Micron hybrid memory cube
(HMC), and show that SSAM can achieve up to two orders of magnitude
improvement in area-normalized throughput and energy efficiency over multicore
CPUs; we also show that SSAM is faster and more energy-efficient than competing
GPUs and FPGAs. Finally, we show that SSAM is also useful for other data-intensive
tasks like kNN index construction, and can be generalized to function
semantically as a high-capacity content-addressable memory.
Enabling Practical Processing in and near Memory for Data-Intensive Computing
Modern computing systems suffer from the dichotomy between computation on one
side, which is performed only in the processor (and accelerators), and data
storage/movement on the other, which all other parts of the system are
dedicated to. Due to this dichotomy, data moves a lot in order for the system
to perform computation on it. Unfortunately, data movement is extremely
expensive in terms of energy and latency, much more so than computation. As a
result, a large fraction of system energy is spent and performance is lost
solely on moving data in a modern computing system.
In this work, we re-examine the idea of reducing data movement by performing
Processing in Memory (PIM). PIM places computation mechanisms in or near where
the data is stored (i.e., inside the memory chips, in the logic layer of
3D-stacked logic and DRAM, or in the memory controllers), so that data movement
between the computation units and memory is reduced or eliminated. While the
idea of PIM is not new, we examine two new approaches to enabling PIM: 1)
exploiting analog properties of DRAM to perform massively-parallel operations
in memory, and 2) exploiting 3D-stacked memory technology design to provide
high bandwidth to in-memory logic. We conclude by discussing work on solving
key challenges to the practical adoption of PIM.
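To illustrate approach (1), the following NumPy sketch simulates the semantics of Ambit-style in-DRAM bulk bitwise operations (Seshadri et al.), where simultaneously activating three DRAM rows yields a bitwise majority and AND/OR follow by fixing one operand; it models functionality only, not the analog charge-sharing mechanism.

```python
import numpy as np

# Functional simulation of in-DRAM bulk bitwise ops via triple-row activation.

def maj(a, b, c):
    # Charge sharing across three activated rows settles to the majority bit.
    return (a & b) | (b & c) | (a & c)

def dram_and(a, b):
    return maj(a, b, np.zeros_like(a))   # MAJ(A, B, 0) == A AND B

def dram_or(a, b):
    return maj(a, b, ~np.zeros_like(a))  # MAJ(A, B, 1) == A OR B

row_a = np.random.randint(0, 2**32, size=1024, dtype=np.uint32)
row_b = np.random.randint(0, 2**32, size=1024, dtype=np.uint32)
assert np.array_equal(dram_and(row_a, row_b), row_a & row_b)
assert np.array_equal(dram_or(row_a, row_b), row_a | row_b)
```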
Evolutionary Cell Aided Design for Neural Network Architectures
Mathematical theory shows us that multilayer feedforward Artificial Neural
Networks (ANNs) are universal function approximators, capable of approximating
any measurable function to any desired degree of accuracy. In practice,
designing practical and efficient neural network architectures requires
significant effort and expertise. We present a novel software framework called
Evolutionary Cell Aided Design (ECAD) meant to aid in the exploration and design
of efficient Neural Network Architectures (NNAs) for reconfigurable hardware.
Given a general neural network structure and a set of constraints and fitness
functions, the framework will explore both the space of possible NNAs and the
space of possible hardware designs, using evolutionary algorithms, and attempt
to find the fittest co-design solutions according to a predefined set of goals.
We test the framework on an image classification task and use the MNIST data
set of handwritten digits with an Intel Arria 10 GX 1150 device as our target
platform. We design and implement a modular and scalable 2D systolic array with
enhancements for machine learning that can be used by the framework for the
hardware search space. Our results demonstrate the ability to pair neural
network design and hardware development together using an evolutionary
algorithm and removing traditional human-in-the-loop development tasks. By
running various experiments of the fittest solutions for neural network and
hardware searches, we demonstrate the full end-to-end capabilities of the ECAD
framework.
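The following sketch illustrates the general evolutionary co-design loop the abstract describes; the candidate encoding, fitness proxy, and mutation operator are our own stand-ins, not ECAD's actual interfaces.

```python
import random

# Toy evolutionary search over paired (network, hardware) candidates:
# selection keeps the fittest, mutation perturbs survivors.

random.seed(0)

def random_candidate():
    return {"layers": random.randint(2, 8),          # NNA search space
            "width": random.choice([32, 64, 128]),
            "pe_array": random.choice([(8, 8), (16, 16), (32, 32)])}  # HW space

def fitness(c):
    # Stand-in objective: reward model capacity, penalize hardware area.
    accuracy_proxy = c["layers"] * c["width"]
    area_proxy = c["pe_array"][0] * c["pe_array"][1]
    return accuracy_proxy / area_proxy

def mutate(c):
    child = dict(c)
    child["layers"] = max(2, child["layers"] + random.choice([-1, 1]))
    return child

pop = [random_candidate() for _ in range(20)]
for _ in range(50):                                   # generations
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                              # selection
    pop = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

print(max(pop, key=fitness))
```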
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as machine learning, speech
recognition, artificial intelligence, and DNA sequencing in recent years, the
computer architecture research community is facing explosive growth in data.
To achieve high efficiency in data-intensive computing, studies of
heterogeneous accelerators that focus on the latest applications have become a
hot topic in the computer architecture domain. At present, heterogeneous
accelerators are implemented mainly on heterogeneous computing units such as
Application-Specific Integrated Circuits (ASICs), Graphics Processing Units
(GPUs), and Field Programmable Gate Arrays (FPGAs). Among these typical
heterogeneous architectures, FPGA-based reconfigurable accelerators have two
merits. First, the FPGA architecture contains a large number of reconfigurable
circuits, which satisfy requirements for high performance and low power
consumption when specific applications are running. Second, FPGA-based
reconfigurable architectures enable rapid prototyping and feature excellent
customizability and reconfigurability. A batch of acceleration works based on
FPGAs or other reconfigurable architectures has recently emerged at top-tier
computer architecture conferences. To better review this recent work on
reconfigurable computing accelerators, this survey takes the latest high-level
research on reconfigurable accelerator architectures and algorithm applications
as its basis. In this survey, we compare hot research issues and domains of
concern, and analyze the advantages, disadvantages, and challenges of
reconfigurable accelerators. Finally, we project the future development of
accelerator architectures, hoping to provide a reference for computer
architecture researchers.
In-RDBMS Hardware Acceleration of Advanced Analytics
The data revolution is fueled by advances in machine learning, databases, and
hardware design. Programmable accelerators are making their way into each of
these areas independently. As such, there is a void of solutions that enable
hardware acceleration at the intersection of these disjoint fields. This paper
sets out to be the initial step towards a unifying solution for in-Database
Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such
as FPGAs, for in-database analytics currently requires hand-designing the
hardware and manually routing the data. Instead, DAnA automatically maps a
high-level specification of advanced analytics queries to an FPGA accelerator.
The accelerator implementation is generated for a User Defined Function (UDF),
expressed as a part of an SQL query using a Python-embedded Domain-Specific
Language (DSL). To realize an efficient in-database integration, DAnA
accelerators contain a novel hardware structure, Striders, that directly
interface with the buffer pool of the database. Striders extract, cleanse, and
process the training data tuples that are consumed by a multi-threaded FPGA
engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL
to generate hardware accelerators for a range of real-world and synthetic
datasets running diverse ML algorithms. Results show that DAnA-enhanced
PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets,
with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average,
4.0x faster than the multi-threaded Apache MADLib running on Greenplum. DAnA
provides these benefits while hiding the complexity of hardware design from
data scientists and allowing them to express the algorithm in ≈30-60 lines of
Python.
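As a purely hypothetical illustration of that workflow (not DAnA's actual DSL syntax), a UDF might look like a short Python update rule that the system would compile down to its multi-threaded FPGA execution engine; the function and registration names below are invented.

```python
import math

# Hypothetical UDF in the spirit of the abstract: a one-step logistic
# regression update, the kind of iterative analytics loop DAnA targets.

def logistic_regression_udf(features, label, w, lr=0.1):
    """One stochastic-gradient step over a single training tuple."""
    z = sum(wi * xi for wi, xi in zip(w, features))
    pred = 1.0 / (1.0 + math.exp(-z))
    return [wi + lr * (label - pred) * xi for wi, xi in zip(w, features)]

# Imagined registration + invocation from SQL (illustrative only):
# register_udf("train_lr", logistic_regression_udf)
# SELECT train_lr(features, label) FROM patient_records;
```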
An Open-Source Benchmark Suite for Cloud and IoT Microservices
Cloud services have recently started undergoing a major shift from monolithic
applications to graphs of hundreds of loosely-coupled microservices.
Microservices fundamentally change many of the assumptions current cloud systems
are designed with, and present both opportunities and challenges when
optimizing for quality of service (QoS) and utilization. In this paper, we
explore the implications microservices have across the cloud system stack. We
first present DeathStarBench, a novel, open-source benchmark suite built with
microservices that is representative of large end-to-end services, modular and
extensible. DeathStarBench includes a social network, a media service, an
e-commerce site, a banking system, and IoT applications for coordination
control of UAV swarms. We then use DeathStarBench to study the architectural
characteristics of microservices, their implications in networking and
operating systems, their challenges with respect to cluster management, and
their trade-offs in terms of application design and programming frameworks.
Finally, we explore the tail at scale effects of microservices in real
deployments with hundreds of users, and highlight the increased pressure they
put on performance predictability.
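The tail-at-scale pressure can be made concrete with a standard back-of-the-envelope calculation (after Dean and Barroso); the numbers below are illustrative, not results from the paper.

```python
# When a request fans out across many microservices, even rare per-service
# slowness becomes the common case end-to-end.

p_slow = 0.01                     # each service is slow 1% of the time
for n in (1, 10, 100):            # services on the request's critical path
    p_any = 1 - (1 - p_slow) ** n # P(at least one slow service)
    print(f"{n:>3} services -> {p_any:.1%} of requests see a slow hop")

#   1 services -> 1.0% of requests see a slow hop
#  10 services -> 9.6% of requests see a slow hop
# 100 services -> 63.4% of requests see a slow hop
```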
Untangling Blockchain: A Data Processing View of Blockchain Systems
Blockchain technologies have gained massive momentum over the last few years.
Blockchains are distributed ledgers that enable parties who do not fully trust
each other to maintain a set of global states. The parties agree on the
existence, values and histories of the states. As the technology landscape is
expanding rapidly, it is both important and challenging to have a firm grasp of
what the core technologies have to offer, especially with respect to their data
processing capabilities. In this paper, we first survey the state of the art,
focusing on private blockchains (in which parties are authenticated). We
analyze both in-production and research systems in four dimensions: distributed
ledger, cryptography, consensus protocol and smart contract. We then present
BLOCKBENCH, a benchmarking framework for understanding the performance of private
blockchains against data processing workloads. We conduct a comprehensive
evaluation of three major blockchain systems based on BLOCKBENCH, namely
Ethereum, Parity and Hyperledger Fabric. The results demonstrate several
trade-offs in the design space, as well as big performance gaps between
blockchain and database systems. Drawing from design principles of database
systems, we discuss several research directions for bringing blockchain
performance closer to the realm of databases.