A Modern Primer on Processing in Memory
Modern computing systems are overwhelmingly designed to move data to
computation. This design choice goes directly against at least three key trends
in computing that cause performance, scalability and energy bottlenecks: (1)
data access is a key bottleneck as many important applications are increasingly
data-intensive, and memory bandwidth and energy do not scale well, (2) energy
consumption is a key limiter in almost all computing platforms, especially
server and mobile systems, (3) data movement, especially off-chip to on-chip,
is very expensive in terms of bandwidth, energy and latency, much more so than
computation. These trends are felt especially severely in today's data-intensive
server systems and energy-constrained mobile systems. At the same time,
conventional memory technology is facing many technology scaling challenges in
terms of reliability, energy, and performance. As a result, memory system
architects are open to organizing memory in different ways and making it more
intelligent, at the expense of higher cost. The emergence of 3D-stacked memory
plus logic, the adoption of error correcting codes inside the latest DRAM
chips, proliferation of different main memory standards and chips, specialized
for different purposes (e.g., graphics, low-power, high bandwidth, low
latency), and the necessity of designing new solutions to serious reliability
and security issues, such as the RowHammer phenomenon, are evidence of this
trend. This chapter discusses recent research that aims to practically enable
computation close to data, an approach we call processing-in-memory (PIM). PIM
places computation mechanisms in or near where the data is stored (i.e., inside
the memory chips, in the logic layer of 3D-stacked memory, or in the memory
controllers), so that data movement between the computation units and memory is
reduced or eliminated.
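The data-movement argument behind PIM can be made concrete with a toy energy model. The numbers below are illustrative assumptions, not measurements; the essential point, widely made in the literature, is that an off-chip DRAM transfer costs orders of magnitude more energy than an on-chip arithmetic operation, so moving computation to the data can save most of that cost.

```python
# Illustrative (assumed) energy figures, in picojoules per 64-bit word.
# Actual values vary by technology node; only the large ratio matters.
PJ_PER_ALU_OP = 1.0            # assumed energy of one on-chip add
PJ_PER_OFFCHIP_ACCESS = 640.0  # assumed energy of one off-chip DRAM transfer

def compute_centric_energy(n_words: int) -> float:
    """Move every word across the memory channel, then add it on the CPU."""
    return n_words * (PJ_PER_OFFCHIP_ACCESS + PJ_PER_ALU_OP)

def pim_energy(n_words: int) -> float:
    """Sum the words near memory; only the final result crosses the channel."""
    return n_words * PJ_PER_ALU_OP + PJ_PER_OFFCHIP_ACCESS

n = 1_000_000
ratio = compute_centric_energy(n) / pim_energy(n)
print(f"energy ratio (compute-centric / PIM): {ratio:.0f}x")
```

Under these assumed constants the ratio approaches the per-word transfer cost itself, which is why reduction-style operations are a natural first target for near-data computation.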
Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems
The number of functionalities controlled by software in every critical real-time product is on the rise in domains such as automotive, avionics, and space. To implement these advanced functionalities, software applications increasingly adopt artificial intelligence algorithms that manage massive amounts of data transmitted from various sensors. This translates into unprecedented memory performance requirements in critical systems that the commonly used DRAM memories struggle to provide. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In this work, we perform, to our knowledge, the first timing analysis of HBM. We show the unique structural and timing characteristics of HBM with respect to DRAM memories and how they can be exploited for better time predictability, with emphasis on increased isolation among tasks and reduced worst-case memory latency.
This work has been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB-C21/AEI/10.13039/501100011033; the European Union's Horizon 2020 Framework Programme under grant agreement No. 878752 (MASTECS) and agreement No. 779877 (Mont-Blanc 2020); the European Research Council (ERC) under grant agreement No. 772773 (SuPerCom); and the Natural Sciences and Engineering Research Council of Canada (NSERC). Peer reviewed. Postprint (author's final draft).
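A worst-case timing analysis of a DRAM-style memory like HBM typically starts from the device timing parameters. The sketch below uses the standard row-hit / row-closed / row-conflict decomposition; the nanosecond values are assumptions for illustration, not figures from any HBM datasheet.

```python
# Assumed DRAM-style timing parameters, in nanoseconds (illustrative only).
T_RP  = 14  # precharge: close the currently open row
T_RCD = 14  # activate: open the target row
T_CL  = 14  # column access (CAS) latency: read from the open row

row_hit      = T_CL                  # target row already open
row_closed   = T_RCD + T_CL          # bank idle: activate, then read
row_conflict = T_RP + T_RCD + T_CL   # wrong row open: precharge + activate + read

# A worst-case bound conservatively assumes every access is a row conflict.
# HBM's many independent channels and banks let an analysis assign tasks to
# disjoint banks, reducing cross-task conflicts and tightening this bound.
print(f"per-access bound: {row_conflict} ns (vs {row_hit} ns on a row hit)")
```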
Processor-In-Memory (PIM) Based Architectures for PetaFlops Potential Massively Parallel Processing
The report summarizes the work performed at the University of Notre Dame under a NASA grant from July 15, 1995 through July 14, 1996. Researchers involved in the work included the PI, Dr. Peter M. Kogge, and three graduate students under his direction in the Computer Science and Engineering Department: Stephen Dartt, Costin Iancu, and Lakshmi Narayanaswany. The organization of this report is as follows. Section 2 is a summary of the problem addressed by this work. Section 3 is a summary of the project's objectives and approach. Section 4 summarizes PIM technology briefly. Section 5 overviews the main results of the work. Section 6 then discusses the importance of the results and future directions. Also attached to this report are copies of several technical reports and publications whose contents directly reflect results developed during this study.
Doctor of Philosophy dissertation
The computing landscape is undergoing a major change, primarily enabled by ubiquitous wireless networks and the rapid increase in the use of mobile devices which access a web-based information infrastructure. It is expected that most intensive computing may either happen in servers housed in large datacenters (warehouse-scale computers), e.g., cloud computing and other web services, or in many-core high-performance computing (HPC) platforms in scientific labs. It is clear that the primary challenge to scaling such computing systems into the exascale realm is the efficient supply of large amounts of data to hundreds or thousands of compute cores, i.e., building an efficient memory system. Main memory systems are at an inflection point, due to the convergence of several major application and technology trends. Examples include the increasing importance of energy consumption, reduced access stream locality, increasing failure rates, limited pin counts, increasing heterogeneity and complexity, and the diminished importance of cost-per-bit. In light of these trends, the memory system requires a major overhaul. The key to architecting the next generation of memory systems is a combination of the prudent incorporation of novel technologies and a fundamental rethinking of certain conventional design decisions. In this dissertation, we study every major element of the memory system - the memory chip, the processor-memory channel, the memory access mechanism, and memory reliability - and identify the key bottlenecks to efficiency. Based on this, we propose a novel main memory system with the following innovative features: (i) overfetch-aware re-organized chips, (ii) low-cost silicon photonic memory channels, (iii) largely autonomous memory modules with a packet-based interface to the processor, and (iv) a RAID-based reliability mechanism.
Such a system is energy-efficient, high-performance, low-complexity, reliable, and cost-effective, making it ideally suited to meet the requirements of future large-scale computing systems.
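The RAID-based reliability idea in feature (iv) rests on standard XOR parity: a parity word stored on one chip lets the contents of any single failed chip be rebuilt from the survivors. The chip count and word values below are hypothetical, chosen only to illustrate the reconstruction step.

```python
# Hedged sketch of RAID-style parity across memory chips (layout hypothetical).
from functools import reduce

data_chips = [0xDEAD, 0xBEEF, 0xCAFE, 0xF00D]    # one data word per chip
parity = reduce(lambda a, b: a ^ b, data_chips)  # parity chip contents

failed = 2                                       # suppose chip 2 fails
survivors = [w for i, w in enumerate(data_chips) if i != failed]

# XOR of the parity with all surviving words yields the lost word,
# because every surviving word cancels itself out of the parity.
recovered = reduce(lambda a, b: a ^ b, survivors, parity)
assert recovered == data_chips[failed]
print(f"recovered word from failed chip: {recovered:#06x}")
```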
Doctor of Philosophy in Computing dissertation
Physically Dense Server Architectures.
Distributed, in-memory key-value stores have emerged as one of today's most
important data center workloads. Because key-value stores are critical to the
scalability of modern web services, vast resources are dedicated to them to
ensure that quality-of-service guarantees are met. These resources include:
many server racks to store terabytes of key-value data, the power necessary to
run all of the machines, networking equipment and bandwidth, and the data center
warehouses used to house the racks.
There is, however, a mismatch between the key-value store software and the
commodity servers on which it is run, leading to inefficient use of resources.
The primary cause of inefficiency is the overhead incurred from processing
individual network packets, which typically carry small payloads and require
minimal compute resources. Thus, one of the key challenges as we enter the
exascale era is how to best adjust to the paradigm shift from compute-centric
to storage-centric data centers.
This dissertation presents a hardware/software solution that addresses the
inefficiency issues present in the modern data centers on which key-value
stores are currently deployed. First, it proposes two physical server
designs, both of which use 3D-stacking technology and low-power CPUs to improve
density and efficiency. The first 3D architecture---Mercury---consists of stacks
of low-power CPUs with 3D-stacked DRAM. The second
architecture---Iridium---replaces DRAM with 3D NAND Flash to improve density.
The second portion of this dissertation proposes an enhanced version of the
Mercury server design---called KeyVault---that incorporates integrated,
zero-copy network interfaces along with an integrated switching fabric. In order
to utilize the integrated networking hardware, as well as reduce the
response time of requests, a custom networking protocol is proposed. Unlike
prior works on accelerating key-value stores---e.g., by completely bypassing the
CPU and OS when processing requests---this work only bypasses the CPU and OS
when placing network payloads into a process' memory. The insight behind this is
that because most of the overhead comes from processing packets in the OS
kernel---and not from the request processing itself---direct placement of packet
payloads is sufficient to provide higher throughput and lower latency than prior
approaches.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111414/1/atgutier_1.pd
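The insight behind the partial-bypass design can be illustrated with a toy per-request latency model: if kernel packet processing dominates the key-value lookup itself, removing only the kernel path from payload delivery already captures most of the available speedup. All of the microsecond costs below are assumptions, not measurements from the dissertation.

```python
# Toy per-request cost model (all values assumed, for illustration only).
KERNEL_PATH_US = 10.0  # per-packet cost of the OS kernel network stack
ZERO_COPY_US   = 1.0   # cost of the NIC placing the payload into user memory
LOOKUP_US      = 2.0   # cost of the key-value request processing itself

baseline = KERNEL_PATH_US + LOOKUP_US  # every packet traverses the kernel
direct   = ZERO_COPY_US + LOOKUP_US    # kernel bypassed for payload placement

print(f"throughput gain from direct placement: {baseline / direct:.1f}x per core")
```

Under these assumptions the gain is bounded by the lookup cost that remains on the CPU, which matches the abstract's claim that bypassing the kernel only for payload placement, rather than for all request processing, is sufficient.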