385 research outputs found
A High-performance, Energy-efficient Modular DMA Engine Architecture
Data transfers are essential in today's computing systems as latency and
complex memory access patterns are increasingly challenging to manage. Direct
memory access engines (DMAEs) are critically needed to transfer data
independently of the processing elements, hiding latency and achieving high
throughput even for complex access patterns to high-latency memory. With the
prevalence of heterogeneous systems, DMAEs must operate efficiently in
increasingly diverse environments. This work proposes a modular and highly
configurable open-source DMAE architecture called intelligent DMA (iDMA), split
into three parts that can be composed and customized independently. The
front-end implements the control plane binding to the surrounding system. The
mid-end accelerates complex data transfer patterns such as multi-dimensional
transfers, scattering, or gathering. The back-end interfaces with the on-chip
communication fabric (data plane). We assess the efficiency of iDMA in various
instantiations: In high-performance systems, we achieve speedups of up to 15.8x
with only 1 % additional area compared to a base system without a DMAE. We
achieve an area reduction of 10 % while improving ML inference performance by
23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We
provide area, timing, latency, and performance characterization to guide its
instantiation in various systems.Comment: 14 pages, 14 figures, accepted by an IEEE journal for publicatio
Deployment of Deep Neural Networks on Dedicated Hardware Accelerators
Deep Neural Networks (DNNs) have established themselves as powerful tools for
a wide range of complex tasks, for example computer vision or natural language
processing. DNNs are notoriously demanding on compute resources and as a
result, dedicated hardware accelerators for all use cases are developed. Different
accelerators provide solutions from hyper scaling cloud environments for the
training of DNNs to inference devices in embedded systems. They implement
intrinsics for complex operations directly in hardware. A common example
are intrinsics for matrix multiplication. However, there exists a gap between
the ecosystems of applications for deep learning practitioners and hardware
accelerators. HowDNNs can efficiently utilize the specialized hardware intrinsics
is still mainly defined by human hardware and software experts.
Methods to automatically utilize hardware intrinsics in DNN operators are a
subject of active research. Existing literature often works with transformationdriven
approaches, which aim to establish a sequence of program rewrites and
data-layout transformations such that the hardware intrinsic can be used to
compute the operator. However, the complexity this of task has not yet been
explored, especially for less frequently used operators like Capsule Routing. And
not only the implementation of DNN operators with intrinsics is challenging,
also their optimization on the target device is difficult. Hardware-in-the-loop
tools are often used for this problem. They use latency measurements of implementations
candidates to find the fastest one. However, specialized accelerators
can have memory and programming limitations, so that not every arithmetically
correct implementation is a valid program for the accelerator. These invalid
implementations can lead to unnecessary long the optimization time.
This work investigates the complexity of transformation-driven processes to
automatically embed hardware intrinsics into DNN operators. It is explored
with a custom, graph-based intermediate representation (IR). While operators
like Fully Connected Layers can be handled with reasonable effort, increasing
operator complexity or advanced data-layout transformation can lead to scaling issues.
Building on these insights, this work proposes a novel method to embed
hardware intrinsics into DNN operators. It is based on a dataflow analysis.
The dataflow embedding method allows the exploration of how intrinsics and
operators match without explicit transformations. From the results it can derive
the data layout and program structure necessary to compute the operator with
the intrinsic. A prototype implementation for a dedicated hardware accelerator
demonstrates state-of-the art performance for a wide range of convolutions, while
being agnostic to the data layout. For some operators in the benchmark, the
presented method can also generate alternative implementation strategies to
improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while
reducing the memory footprint. Lastly, by curating the initial set of possible
implementations for the hardware-in-the-loop optimization, the median timeto-
solution is reduced by a factor of ×2.40. At the same time, the possibility to
have prolonged searches due a bad initial set of implementations is reduced,
improving the optimization’s robustness by ×2.35
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is and than state-of-the-art CPU and GPU
versions, respectively.
To our knowledge, our work is the first one to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers & architects of future memory-centric
computing systems
Full Stack Optimization of Transformer Inference: a Survey
Recent advances in state-of-the-art DNN architecture design have been moving
toward Transformer models. These models achieve superior accuracy across a wide
range of applications. This trend has been consistent over the past several
years since Transformer models were originally introduced. However, the amount
of compute and bandwidth required for inference of recent Transformer models is
growing at a significant rate, and this has made their deployment in
latency-sensitive applications challenging. As such, there has been an
increased focus on making Transformer models more efficient, with methods that
range from changing the architecture design, all the way to developing
dedicated domain-specific accelerators. In this work, we survey different
approaches for efficient Transformer inference, including: (i) analysis and
profiling of the bottlenecks in existing Transformer architectures and their
similarities and differences with previous convolutional models; (ii)
implications of Transformer architecture on hardware, including the impact of
non-linear operations such as Layer Normalization, Softmax, and GELU, as well
as linear operations, on hardware design; (iii) approaches for optimizing a
fixed Transformer architecture; (iv) challenges in finding the right mapping
and scheduling of operations for Transformer models; and (v) approaches for
optimizing Transformer models by adapting the architecture using neural
architecture search. Finally, we perform a case study by applying the surveyed
optimizations on Gemmini, the open-source, full-stack DNN accelerator
generator, and we show how each of these approaches can yield improvements,
compared to previous benchmark results on Gemmini. Among other things, we find
that a full-stack co-design approach with the aforementioned methods can result
in up to 88.7x speedup with a minimal performance degradation for Transformer
inference
DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips
To understand and improve DRAM performance, reliability, security and energy
efficiency, prior works study characteristics of commodity DRAM chips.
Unfortunately, state-of-the-art open source infrastructures capable of
conducting such studies are obsolete, poorly supported, or difficult to use, or
their inflexibility limit the types of studies they can conduct.
We propose DRAM Bender, a new FPGA-based infrastructure that enables
experimental studies on state-of-the-art DRAM chips. DRAM Bender offers three
key features at the same time. First, DRAM Bender enables directly interfacing
with a DRAM chip through its low-level interface. This allows users to issue
DRAM commands in arbitrary order and with finer-grained time intervals compared
to other open source infrastructures. Second, DRAM Bender exposes easy-to-use
C++ and Python programming interfaces, allowing users to quickly and easily
develop different types of DRAM experiments. Third, DRAM Bender is easily
extensible. The modular design of DRAM Bender allows extending it to (i)
support existing and emerging DRAM interfaces, and (ii) run on new commercial
or custom FPGA boards with little effort.
To demonstrate that DRAM Bender is a versatile infrastructure, we conduct
three case studies, two of which lead to new observations about the DRAM
RowHammer vulnerability. In particular, we show that data patterns supported by
DRAM Bender uncovers a larger set of bit-flips on a victim row compared to the
data patterns commonly used by prior work. We demonstrate the extensibility of
DRAM Bender by implementing it on five different FPGAs with DDR4 and DDR3
support. DRAM Bender is freely and openly available at
https://github.com/CMU-SAFARI/DRAM-Bender.Comment: To appear in TCAD 202
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters
Efficient and Effective Multi-Objective Optimization for Real-Time Multi-Task Systems
Embedded real-time multi-task systems must often not only comply with timing constraints but also need to meet energy requirements. However, optimizing energy consumption might lead to higher Worst-Case Execution Time (WCET), leading to an un-schedulable system, as frequently executed code can easily differ from timing-critical code. To handle such an impasse in this paper, we formulate a Metaheuristic Algorithm-based Multi-objective Optimization (MAMO) for multi-task real-time systems. But, performing multiple WCET, energy, and schedulability analyses to solve a MAMO poses a bottleneck concerning compilation times. Therefore, we propose two novel approaches - Path-based Constraint Approach (PCA) and Impact-based Constraint Approach (ICA) - to reduce the solution search space size and to cope with this problem. Evaluations showed that PCA and ICA reduced compilation times by 85.31% and 77.31%, on average, over MAMO. For all the task sets, out of all solutions found by ICA-FPA, on average, 88.89% were on the final Pareto front
Massive Data-Centric Parallelism in the Chiplet Era
Traditionally, massively parallel applications are executed on distributed
systems, where computing nodes are distant enough that the parallelization
schemes must minimize communication and synchronization to achieve scalability.
Mapping communication-intensive workloads to distributed systems requires
complicated problem partitioning and dataset pre-processing. With the current
AI-driven trend of having thousands of interconnected processors per chip,
there is an opportunity to re-think these communication-bottlenecked workloads.
This bottleneck often arises from data structure traversals, which cause
irregular memory accesses and poor cache locality.
Recent works have introduced task-based parallelization schemes to accelerate
graph traversal and other sparse workloads. Data structure traversals are split
into tasks and pipelined across processing units (PUs). Dalorex demonstrated
the highest scalability (up to thousands of PUs on a single chip) by having the
entire dataset on-chip, scattered across PUs, and executing the tasks at the PU
where the data is local. However, it also raised questions on how to scale to
larger datasets when all the memory is on chip, and at what cost.
To address these challenges, we propose a scalable architecture composed of a
grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time
reconfiguration enables creating chip products that optimize for different
target metrics, such as time-to-solution, energy, or cost, while software
reconfigurations avoid network saturation when scaling to millions of PUs
across many chip packages. We evaluate six applications and four datasets, with
several configurations and memory technologies, to provide a detailed analysis
of the performance, power, and cost of data-local execution at scale. Our
parallelization of Breadth-First-Search with RMAT-26 across a million PUs
reaches 3323 GTEPS
TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems
Processing-in-memory (PIM) promises to alleviate the data movement bottleneck
in modern computing systems. However, current real-world PIM systems have the
inherent disadvantage that their hardware is more constrained than in
conventional processors (CPU, GPU), due to the difficulty and cost of building
processing elements near or inside the memory. As a result, general-purpose PIM
architectures support fairly limited instruction sets and struggle to execute
complex operations such as transcendental functions and other hard-to-calculate
operations (e.g., square root). These operations are particularly important for
some modern workloads, e.g., activation functions in machine learning
applications.
In order to provide support for transcendental (and other hard-to-calculate)
functions in general-purpose PIM systems, we present \emph{TransPimLib}, a
library that provides CORDIC-based and LUT-based methods for trigonometric
functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and
perform a thorough evaluation of TransPimLib's methods in terms of performance
and accuracy, using microbenchmarks and three full workloads (Blackscholes,
Sigmoid, Softmax). We open-source all our code and datasets
at~\url{https://github.com/CMU-SAFARI/transpimlib}.Comment: Our open-source software is available at
https://github.com/CMU-SAFARI/transpimli
- …