Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for Hybrid Memory Cube
Three-dimensional (3D)-stacking technology, which enables the integration of
DRAM and logic dies, offers high bandwidth and low energy consumption. This
technology also empowers new memory designs for executing tasks not
traditionally associated with memories. A practical 3D-stacked memory is Hybrid
Memory Cube (HMC), which provides significant access bandwidth and low power
consumption in a small area. Although several studies have taken advantage of
the novel architecture of HMC, its latency and bandwidth characteristics and
their correlation with temperature and power consumption have not
been fully explored. This paper is the first, to the best of our knowledge, to
characterize the thermal behavior of HMC in a real environment using the AC-510
accelerator and to identify temperature as a new limitation for this
state-of-the-art design space. Moreover, besides bandwidth studies, we
deconstruct factors that contribute to latency and reveal their sources for
high- and low-load accesses. The results of this paper demonstrate essential
behaviors and performance bottlenecks for future explorations of
packet-switched and 3D-stacked memories.
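For context on how such latency and bandwidth numbers are usually obtained, the sketch below shows a conventional host-side microbenchmark: a pointer chase over a random single-cycle permutation exposes unloaded access latency, while a sequential streaming loop approximates sustainable bandwidth. The buffer size, permutation scheme, and timing code are illustrative assumptions, not the instrumentation used with the AC-510 accelerator in the paper.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N ((size_t)(64 * 1024 * 1024) / sizeof(uint64_t))  /* 64 MiB working set */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    uint64_t *buf = malloc(N * sizeof(uint64_t));
    if (!buf) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the
       pointer chase visits every element and defeats hardware prefetching. */
    for (size_t i = 0; i < N; i++) buf[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % i;
        uint64_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    /* Latency: each load depends on the previous one, so the loop keeps
       roughly one memory access in flight at a time. */
    size_t idx = 0;
    double t0 = now_sec();
    for (size_t i = 0; i < N; i++) idx = (size_t)buf[idx];
    double lat_ns = (now_sec() - t0) / (double)N * 1e9;

    /* Bandwidth: independent sequential loads keep many requests in flight. */
    uint64_t sum = 0;
    t0 = now_sec();
    for (size_t i = 0; i < N; i++) sum += buf[i];
    double bw_gbs = (double)(N * sizeof(uint64_t)) / (now_sec() - t0) / 1e9;

    printf("latency ~%.1f ns/access, bandwidth ~%.2f GB/s (checksum %llu %zu)\n",
           lat_ns, bw_gbs, (unsigned long long)sum, idx);
    free(buf);
    return 0;
}
```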
A Modern Primer on Processing in Memory
Modern computing systems are overwhelmingly designed to move data to
computation. This design choice goes directly against at least three key trends
in computing that cause performance, scalability and energy bottlenecks: (1)
data access is a key bottleneck as many important applications are increasingly
data-intensive, and memory bandwidth and energy do not scale well, (2) energy
consumption is a key limiter in almost all computing platforms, especially
server and mobile systems, (3) data movement, especially off-chip to on-chip,
is very expensive in terms of bandwidth, energy and latency, much more so than
computation. These trends are felt especially severely in the data-intensive
server and energy-constrained mobile systems of today. At the same time,
conventional memory technology is facing many technology scaling challenges in
terms of reliability, energy, and performance. As a result, memory system
architects are open to organizing memory in different ways and making it more
intelligent, at the expense of higher cost. The emergence of 3D-stacked memory
plus logic, the adoption of error correcting codes inside the latest DRAM
chips, the proliferation of different main memory standards and chips specialized
for different purposes (e.g., graphics, low-power, high bandwidth, low
latency), and the necessity of designing new solutions to serious reliability
and security issues, such as the RowHammer phenomenon, are evidence of this
trend. This chapter discusses recent research that aims to practically enable
computation close to data, an approach we call processing-in-memory (PIM). PIM
places computation mechanisms in or near where the data is stored (i.e., inside
the memory chips, in the logic layer of 3D-stacked memory, or in the memory
controllers), so that data movement between the computation units and memory is
reduced or eliminated.
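To make the data-movement argument concrete, here is a minimal, purely illustrative kernel of the kind PIM targets: it performs one addition per element but moves three 8-byte words per element across the memory channel, so its cost is dominated by data movement rather than computation. The commented-out offload call is a hypothetical interface, not an API defined in this chapter.

```c
#include <stddef.h>
#include <stdint.h>

/* A memory-bound kernel: one 64-bit addition per element, but three 8-byte
 * transfers per element (load a[i], load b[i], store c[i]). Arithmetic
 * intensity is roughly 1 operation per 24 bytes, so time and energy are
 * dominated by moving data between memory and the processor, not by adds. */
void vec_add(const uint64_t *a, const uint64_t *b, uint64_t *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* A PIM system would instead execute this loop in or near memory (e.g., in
 * the logic layer of a 3D-stacked memory), so that only a command descriptor
 * crosses the off-chip link instead of 3*n*8 bytes of operands and results.
 * The call below is hypothetical; real PIM offload interfaces differ.      */
/* pim_vec_add(c_region, a_region, b_region, n); */
```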
Retrospective: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
Our ISCA 2015 paper provides a new programmable processing-in-memory (PIM)
architecture and system design that can accelerate key data-intensive
applications, with a focus on graph processing workloads. Our major idea was to
completely rethink the system, including the programming model, data
partitioning mechanisms, system support, instruction set architecture, along
with near-memory execution units and their communication architecture, such
that an important workload can be accelerated at a maximum level using a
distributed system of well-connected near-memory accelerators. We built our
accelerator system, Tesseract, using 3D-stacked memories with logic layers,
where each logic layer contains general-purpose processing cores and cores
communicate with each other using a message-passing programming model. Cores
could be specialized for graph processing (or any other application to be
accelerated).
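To give a feel for this message-passing model, the sketch below shows one push-style iteration of a PageRank-like vertex program as it might run on a core in the logic layer: contributions to remote vertices are sent as messages to the core that owns them, rather than fetching remote data. The owner_of, send_update, and barrier calls and the data layout are illustrative assumptions, not Tesseract's actual ISA or runtime interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical near-memory runtime (not Tesseract's real interface):
 *   owner_of(v)              - core/cube that owns vertex v
 *   send_update(core, v, x)  - non-blocking message applied where v lives
 *   barrier()                - global synchronization between iterations  */
int  owner_of(uint64_t v);
void send_update(int core, uint64_t v, double contribution);
void barrier(void);

typedef struct {
    uint64_t *neighbors;  /* adjacency list stored in this core's vault */
    size_t    degree;
    double    rank;       /* current PageRank-style value               */
} vertex_t;

/* One push-style iteration over the vertices local to this core. Updates to
 * remote vertices travel as small messages and are applied by the owning
 * core, so adjacency data is never pulled across the interconnect. */
void pagerank_push_iteration(vertex_t *local, size_t count) {
    for (size_t i = 0; i < count; i++) {
        vertex_t *v = &local[i];
        if (v->degree == 0) continue;
        double share = v->rank / (double)v->degree;
        for (size_t e = 0; e < v->degree; e++) {
            uint64_t dst = v->neighbors[e];
            send_update(owner_of(dst), dst, share);  /* fire-and-forget */
        }
    }
    barrier();  /* ensure all incoming contributions are applied */
}
```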
To our knowledge, our paper was the first to completely design a near-memory
accelerator system from scratch such that it is both generally programmable and
specifically customizable to accelerate important applications, with a case
study on major graph processing workloads. Ensuing work in academia and
industry showed that similar approaches to system design can greatly benefit
both graph processing workloads and other applications, such as machine
learning, for which ideas from Tesseract seem to have been influential.
This short retrospective provides a brief analysis of our ISCA 2015 paper and
its impact. We briefly describe the major ideas and contributions of the work,
discuss later works that built on it or were influenced by it, and make some
educated guesses on what the future may bring on PIM and accelerator systems.