Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains, as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the length of the time series, resulting in two performance
implications. First, the amount of available data parallelism greatly exceeds
the small number of processing units provided by commodity systems (e.g.,
CPUs). Second, sDTW is memory-bound because it 1) has low arithmetic intensity
and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU, and PNM
architectures, respectively.
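To make the quadratic cost and the abundant parallelism concrete, below is a
minimal host-side sketch of the subsequence DTW recurrence. This is not
MATSA's in-memory implementation; the function and variable names are
illustrative.

```c
#include <stdlib.h>
#include <math.h>
#include <float.h>

/* Subsequence DTW (sDTW): align query q (length m) against the
 * best-matching subsequence of series t (length n). O(m*n) time,
 * O(n) space via two rolling rows of the DP matrix. */
double sdtw(const double *q, int m, const double *t, int n)
{
    double *prev = malloc((n + 1) * sizeof *prev);
    double *curr = malloc((n + 1) * sizeof *curr);
    for (int j = 0; j <= n; j++)
        prev[j] = 0.0;             /* row 0 = 0: match may start anywhere in t */
    for (int i = 1; i <= m; i++) {
        curr[0] = DBL_MAX;         /* the query itself must be fully consumed  */
        for (int j = 1; j <= n; j++) {
            double best = fmin(fmin(prev[j], prev[j - 1]), curr[j - 1]);
            curr[j] = fabs(q[i - 1] - t[j - 1]) + best;
        }
        double *tmp = prev; prev = curr; curr = tmp;   /* roll the two rows */
    }
    double best = DBL_MAX;         /* match may end anywhere in t */
    for (int j = 1; j <= n; j++)
        best = fmin(best, prev[j]);
    free(prev); free(curr);
    return best;
}
```

Each cell of the m-by-n matrix depends only on its left, upper, and upper-left
neighbors, so all cells on an anti-diagonal are mutually independent; that
independence is the massive parallelism a PuM substrate with column-wise
arithmetic can exploit.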
Reconfigurable computing for large-scale graph traversal algorithms
This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.
The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows.
First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems.
Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation.
Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case-study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data.
Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.
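The breadth-first search case study illustrates why graph traversal exhibits poor locality. Below is a minimal level-synchronous BFS over a graph in compressed sparse row (CSR) form, a plain software sketch rather than the thesis's FPGA design; the names are illustrative.

```c
#include <stdlib.h>

/* Level-synchronous BFS over a graph in compressed sparse row (CSR)
 * form: row_ptr[u] .. row_ptr[u+1] indexes u's neighbours in col_idx.
 * Fills level[] with each vertex's BFS depth (-1 if unreachable). */
void bfs(int n, const int *row_ptr, const int *col_idx,
         int src, int *level)
{
    int *queue = malloc((size_t)n * sizeof *queue);
    int head = 0, tail = 0;
    for (int v = 0; v < n; v++)
        level[v] = -1;
    level[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        /* The indirect loads through col_idx[] touch essentially random
         * vertices, defeating caches and prefetchers. */
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
            int v = col_idx[e];
            if (level[v] == -1) {
                level[v] = level[u] + 1;
                queue[tail++] = v;
            }
        }
    }
    free(queue);
}
```

Keeping many of these irregular requests in flight across memory banks is exactly what the proposed decoupled architecture targets.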
Memory-Centric Computing
Memory-centric computing aims to enable computation capability in and near
all places where data is generated and stored. As such, it can greatly reduce
the large negative performance and energy impact of data access and data
movement, by fundamentally avoiding data movement and reducing data access
latency & energy. Many recent studies show that memory-centric computing can
greatly improve system performance and energy efficiency. Major industrial
vendors and startup companies have also recently introduced memory chips that
have sophisticated computation capabilities.
This talk describes promising ongoing research and development efforts in
memory-centric computing. We classify such efforts into two major fundamental
categories: 1) processing using memory, which exploits analog operational
properties of memory structures to perform massively-parallel operations in
memory, and 2) processing near memory, which integrates processing capability
in memory controllers, the logic layer of 3D-stacked memory technologies, or
memory chips to enable high-bandwidth and low-latency memory access to
near-memory logic. We show that both types of architectures (and their
combination) can enable orders-of-magnitude improvements in the performance
and energy consumption of many important workloads, such as graph analytics,
databases, machine learning, video processing, climate modeling, and genome
analysis. We discuss adoption challenges for the memory-centric computing
paradigm and conclude with some research & development opportunities.
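As a concrete instance of the processing-using-memory category: DRAM-based
PuM proposals such as Ambit activate three rows simultaneously so that the
shared bitlines compute a bit-wise majority of the stored values. The sketch
below merely models that primitive in software (ROW_WORDS and the function
names are illustrative); fixing the third operand to all-0s or all-1s yields
AND and OR over an entire row at once.

```c
#include <stdint.h>
#include <string.h>

#define ROW_WORDS 1024  /* illustrative: model an 8 KiB DRAM row as 64-bit words */

/* Software model of the triple-row-activation primitive: a bit-wise
 * majority of three DRAM rows, computed here with ordinary logic ops. */
static void maj_rows(const uint64_t *a, const uint64_t *b,
                     const uint64_t *c, uint64_t *out)
{
    for (int i = 0; i < ROW_WORDS; i++)
        out[i] = (a[i] & b[i]) | (b[i] & c[i]) | (a[i] & c[i]);
}

/* Bulk AND and OR fall out by fixing the third operand:
 * MAJ(a, b, 0) = a AND b;  MAJ(a, b, 1) = a OR b. */
static void and_rows(const uint64_t *a, const uint64_t *b, uint64_t *out)
{
    static const uint64_t zeros[ROW_WORDS];   /* control row of all 0s */
    maj_rows(a, b, zeros, out);
}

static void or_rows(const uint64_t *a, const uint64_t *b, uint64_t *out)
{
    uint64_t ones[ROW_WORDS];
    memset(ones, 0xFF, sizeof ones);          /* control row of all 1s */
    maj_rows(a, b, ones, out);
}
```

In an actual PuM substrate this operation happens inside the array for an
entire row in a single command, which is where the massive parallelism and
the data-movement savings come from.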
Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud
Neural networks (NNs) are growing in importance and complexity. A neural
network's performance (and energy efficiency) can be bound either by
computation or memory resources. The processing-in-memory (PIM) paradigm, where
computation is placed near or within memory arrays, is a viable solution to
accelerate memory-bound NNs. However, PIM architectures vary in form, and
different PIM approaches lead to different trade-offs. Our goal is to analyze,
discuss, and contrast DRAM-based PIM architectures for NN performance and
energy efficiency. To do so, we analyze three state-of-the-art PIM
architectures: (1) UPMEM, which integrates processors and DRAM arrays into a
single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge
devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute
bit-serial operations. Our analysis reveals that PIM greatly benefits
memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when
the GPU requires memory oversubscription for a general matrix-vector
multiplication kernel; (2) Mensa improves energy efficiency and throughput by
3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3)
SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude
that the ideal PIM architecture for NN models depends on a model's distinct
attributes, due to the inherent design choices of each PIM architecture.
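To give a sense of why binary NNs map well onto bulk bit-wise substrates such
as SIMDRAM: with weights and activations in {-1, +1} packed one bit per lane,
a dot product reduces to XNOR plus popcount. A minimal host-side sketch (it
assumes n_bits is a multiple of 64 so there are no padding bits;
__builtin_popcountll is a GCC/Clang builtin):

```c
#include <stdint.h>

/* Binarised dot product: values in {-1,+1} packed as bits (1 => +1).
 * XNOR counts matching positions; the dot product is
 * matches - mismatches = 2*pop - n_bits. Assumes n_bits == 64*words. */
static int bnn_dot(const uint64_t *w, const uint64_t *x,
                   int words, int n_bits)
{
    int pop = 0;
    for (int i = 0; i < words; i++)
        pop += __builtin_popcountll(~(w[i] ^ x[i]));   /* XNOR, count 1s */
    return 2 * pop - n_bits;
}
```

A PuM design can evaluate the XNOR (and, bit-serially, the accumulation)
across thousands of DRAM columns in parallel, whereas a PnM design such as
UPMEM would run code much like the above on processors placed next to the
DRAM banks.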