The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Large Language Models (LLMs) such as GPT are state-of-the-art text generation models that provide significant assistance in daily tasks. However, LLM execution is inherently sequential: a model produces only one token at a time, which leads to low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques for improving GPU utilization in LLM inference. To study their synergy, we build a prototype and perform an extensive characterization across various LLMs and GPU architectures. We observe that the optimal speculation length depends on the batch size. We analyze this observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for each batch size. Our evaluations show that the proposed method matches or outperforms state-of-the-art speculative decoding schemes that use a fixed speculation length.
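To make the batch-size dependence concrete, below is a minimal sketch of how an adaptive policy could choose the speculation length from a simple analytical cost model. The latency model and every coefficient in it are illustrative assumptions rather than the paper's measured model; only the geometric-series formula for expected accepted tokens is standard speculative-decoding arithmetic.

```python
# Illustrative batch-size-aware speculation length selection.
# All coefficients below are made up for demonstration; they are not
# the paper's measured cost model.

def expected_tokens(k: int, alpha: float) -> float:
    """Expected tokens per verification step when each of the k drafted
    tokens is accepted independently with probability alpha (a standard
    speculative-decoding geometric series)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def step_latency(batch_size: int, k: int,
                 draft_cost=0.2, verify_base=1.0, verify_slope=0.05) -> float:
    """Toy latency model: drafting k tokens is cheap, while verifying
    k+1 positions for every sequence in the batch grows with both k and
    batch size as the GPU saturates."""
    return k * draft_cost + verify_base + verify_slope * batch_size * (k + 1)

def best_speculation_length(batch_size: int, alpha=0.8, k_max=16) -> int:
    """Pick the k that maximizes expected throughput (tokens/latency)."""
    return max(range(1, k_max + 1),
               key=lambda k: expected_tokens(k, alpha) / step_latency(batch_size, k))

if __name__ == "__main__":
    for b in (1, 4, 16, 64):
        print(f"batch={b:3d} -> best speculation length k={best_speculation_length(b)}")
```

With these toy numbers, the chosen length shrinks from k=3 at batch size 1 to k=1 at batch size 64, mirroring the observation that larger batches favor shorter speculation.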
Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture
Many modern workloads, such as neural networks, databases, and graph
processing, are fundamentally memory-bound. For such workloads, the data
movement between main memory and CPU cores imposes a significant overhead in
terms of both latency and energy. A major reason is that this communication
happens through a narrow bus with high latency and limited bandwidth, and the
low data reuse in memory-bound workloads is insufficient to amortize the cost
of main memory access. Fundamentally addressing this data movement bottleneck
requires a paradigm where the memory system assumes an active role in computing
by integrating processing capabilities. This paradigm is known as
processing-in-memory (PIM).
Recent research explores different forms of PIM architectures, motivated by
the emergence of new 3D-stacked memory technologies that integrate memory with
a logic layer where processing elements can be easily placed. Past works
evaluate these architectures in simulation or, at best, with simplified
hardware prototypes. In contrast, the UPMEM company has designed and
manufactured the first publicly-available real-world PIM architecture.
This paper provides the first comprehensive analysis of the first
publicly-available real-world PIM architecture. We make two key contributions.
First, we conduct an experimental characterization of the UPMEM-based PIM
system using microbenchmarks to assess various architecture limits such as
compute throughput and memory bandwidth, yielding new insights. Second, we
present PrIM, a benchmark suite of 16 workloads from different application
domains (e.g., linear algebra, databases, graph processing, neural networks,
bioinformatics).
Comment: Our open source software is available at https://github.com/CMU-SAFARI/prim-benchmark.
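To give a flavor of the characterization methodology, here is a host-side sketch of a STREAM-style copy microbenchmark of the kind used to measure sustained memory bandwidth. It runs with NumPy on a conventional CPU; the real UPMEM DPUs are programmed in C through the UPMEM SDK, so this only illustrates the measurement idea, not the platform itself.

```python
# STREAM-style copy microbenchmark: time a large array copy and report
# sustained bandwidth. Host-CPU sketch only; not UPMEM DPU code.
import time
import numpy as np

def copy_bandwidth_gbs(n_elems: int = 1 << 25, repeats: int = 5) -> float:
    """Return GB/s for a large copy, counting read + write traffic."""
    src = np.ones(n_elems, dtype=np.float64)   # 256 MB source array
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):                   # keep the best of several runs
        t0 = time.perf_counter()
        np.copyto(dst, src)                    # streams src in, dst out
        best = min(best, time.perf_counter() - t0)
    return 2 * src.nbytes / best / 1e9

if __name__ == "__main__":
    print(f"sustained copy bandwidth: {copy_bandwidth_gbs():.1f} GB/s")
```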
NATSA: A Near-Data Processing Accelerator for Time Series Analysis
Time series analysis is a key technique for extracting and predicting events
in domains as diverse as epidemiology, genomics, neuroscience, environmental
sciences, economics, and more. Matrix profile, the state-of-the-art algorithm
to perform time series analysis, computes the most similar subsequence for a
given query subsequence within a sliced time series. Matrix profile has low
arithmetic intensity, but it typically operates on large amounts of time series
data. In current computing systems, this data needs to be moved between the off-chip memory units and the on-chip computation units to compute the matrix profile. This causes a major performance bottleneck, as data movement is extremely costly in terms of both execution time and energy.
In this work, we present NATSA, the first Near-Data Processing accelerator
for time series analysis. The key idea is to exploit modern 3D-stacked High
Bandwidth Memory (HBM) to enable efficient and fast specialized matrix profile
computation near memory, where time series data resides. NATSA provides three
key benefits: 1) quickly computing the matrix profile for a wide range of
applications by building specialized energy-efficient floating-point arithmetic
processing units close to HBM, 2) improving the energy efficiency and execution
time by reducing the need for data movement over slow and energy-hungry buses
between the computation units and the memory units, and 3) analyzing time
series data at scale by exploiting low-latency, high-bandwidth, and
energy-efficient memory access provided by HBM. Our experimental evaluation
shows that NATSA improves performance by up to 14.2x (9.9x on average) and
reduces energy by up to 27.2x (19.4x on average), over the state-of-the-art
multi-core implementation. NATSA also improves performance by 6.3x and reduces
energy by 10.2x over a general-purpose NDP platform with 64 in-order cores.
Comment: To appear in the 38th IEEE International Conference on Computer Design (ICCD 2020).
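For reference, the computation NATSA accelerates can be written naively in a few lines: for every subsequence, find the z-normalized Euclidean distance to its nearest non-trivial neighbor. This O(n^2 * m) sketch is only a functional reference; optimized implementations (and NATSA's hardware) use streaming dot-product updates instead of recomputing distances.

```python
# Naive matrix profile (self-join) reference implementation.
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def matrix_profile(ts: np.ndarray, m: int) -> np.ndarray:
    n = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n)])
    mp = np.full(n, np.inf)
    excl = m // 2                          # exclusion zone: ignore trivial
    for i in range(n):                     # matches adjacent to i itself
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - excl):i + excl + 1] = np.inf
        mp[i] = d.min()                    # distance to nearest neighbor
    return mp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ts = rng.normal(size=512)
    ts[100:132] = ts[300:332]              # plant a repeated motif
    print("distance at planted motif (~0):", matrix_profile(ts, 32)[100])
```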
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains, as it enables the extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computation complexity increases
quadratically with the time series' length, resulting in two performance
implications. First, the available data parallelism far exceeds the small number of processing units in commodity systems (e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM
architectures, respectively.
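For concreteness, the sDTW recurrence that MATSA maps onto crossbar columns can be sketched with two dynamic-programming rows: the query may start anywhere in the series (free first row) and end anywhere (minimum over the last row). The absolute-difference point cost here is an illustrative choice, not necessarily the one MATSA uses.

```python
# Minimal subsequence DTW (sDTW): best alignment of a short query
# against any subsequence of a long series; O(len(q)*len(s)) time and
# two-row memory. Software sketch only, not MATSA's crossbar mapping.
import numpy as np

def sdtw(query: np.ndarray, series: np.ndarray) -> float:
    prev = np.abs(series - query[0])       # free start: row 0 costs
    for q in query[1:]:
        cur = np.empty_like(prev)
        cur[0] = prev[0] + abs(series[0] - q)
        for j in range(1, len(series)):
            cur[j] = abs(series[j] - q) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return float(prev.min())               # free end: best match anywhere

if __name__ == "__main__":
    series = np.sin(np.linspace(0, 20, 400))
    query = np.sin(np.linspace(5, 8, 60))  # pattern that occurs in series
    print("sDTW cost (near 0 for a present pattern):", round(sdtw(query, series), 4))
```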
SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Near-Data-Processing (NDP) architectures present a promising way to alleviate
data movement costs and can provide significant performance and energy benefits
to parallel applications. Typically, NDP architectures support several NDP
units, each including multiple simple cores placed close to memory. To fully
leverage the benefits of NDP and achieve high performance for parallel
workloads, efficient synchronization among the NDP cores of a system is
necessary. However, supporting synchronization in many NDP systems is
challenging because they lack shared caches and hardware cache coherence
support, which are commonly used for synchronization in multicore systems, and
communication across different NDP units can be expensive.
This paper comprehensively examines the synchronization problem in NDP
systems, and proposes SynCron, an end-to-end synchronization solution for NDP
systems. SynCron adds low-cost hardware support near memory for synchronization
acceleration, and avoids the need for hardware cache coherence support. SynCron
has three components: 1) a specialized cache memory structure to avoid memory
accesses for synchronization and minimize latency overheads, 2) a hierarchical
message-passing communication protocol to minimize expensive communication
across NDP units of the system, and 3) a hardware-only overflow management
scheme to avoid performance degradation when hardware resources for
synchronization tracking are exceeded.
We evaluate SynCron using a variety of parallel workloads, covering various
contention scenarios. SynCron improves performance by 1.27x on average (up to 1.78x) under high-contention scenarios, and by 1.35x on average (up to 2.29x) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08x on average (up to 4.25x).
Comment: To appear in the 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27).
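The hierarchical message-passing idea can be modeled in software as a two-level barrier: cores first synchronize inside their NDP unit, then a single leader per unit takes part in the global step, so cross-unit traffic scales with the number of units rather than the number of cores. The toy thread-based model below sketches only that structure, not SynCron's hardware mechanism.

```python
# Toy two-level (hierarchical) barrier: threads stand in for NDP cores.
import threading

class HierarchicalBarrier:
    def __init__(self, units: int, cores_per_unit: int):
        self.local = [threading.Barrier(cores_per_unit) for _ in range(units)]
        self.globl = threading.Barrier(units)

    def wait(self, unit: int):
        rank = self.local[unit].wait()   # cheap intra-unit synchronization
        if rank == 0:                    # one leader per unit goes global
            self.globl.wait()
        self.local[unit].wait()          # leader releases its whole unit

def worker(bar, unit, core, log, lock):
    with lock:
        log.append(f"unit {unit} core {core} arrived")
    bar.wait(unit)
    with lock:
        log.append(f"unit {unit} core {core} released")

if __name__ == "__main__":
    UNITS, CORES = 2, 4
    bar, log, lock = HierarchicalBarrier(UNITS, CORES), [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(bar, u, c, log, lock))
               for u in range(UNITS) for c in range(CORES)]
    for t in threads: t.start()
    for t in threads: t.join()
    # Every arrival is logged before any release: the barrier held.
    assert all("arrived" in line for line in log[:UNITS * CORES])
    print("\n".join(log))
```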
SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations
Important workloads, such as machine learning and graph analytics
applications, heavily involve sparse linear algebra operations. These
operations use sparse matrix compression as an effective means to avoid storing
zeros and performing unnecessary computation on zero elements. However,
compression techniques like Compressed Sparse Row (CSR) that are widely used
today introduce significant instruction overhead and expensive pointer-chasing
operations to discover the positions of the non-zero elements. In this paper,
we identify the discovery of the positions (i.e., indexing) of non-zero
elements as a key bottleneck in sparse matrix-based workloads, which greatly
reduces the benefits of compression. We propose SMASH, a hardware-software
cooperative mechanism that enables highly-efficient indexing and storage of
sparse matrices. The key idea of SMASH is to explicitly enable the hardware to
recognize and exploit sparsity in data. To this end, we devise a novel software
encoding based on a hierarchy of bitmaps. This encoding can be used to
efficiently compress any sparse matrix, regardless of the extent and structure
of sparsity. At the same time, the bitmap encoding can be directly interpreted
by the hardware. We design a lightweight hardware unit, the Bitmap Management
Unit (BMU), that buffers and scans the bitmap hierarchy to perform
highly-efficient indexing of sparse matrices. SMASH exposes an expressive and
rich ISA to communicate with the BMU, which enables its use in accelerating any
sparse matrix computation. We demonstrate the benefits of SMASH on four use
cases that include sparse matrix kernels and graph analytics applications.
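To illustrate the flavor of a bitmap-hierarchy encoding, the sketch below uses two levels: a top-level bitmap marks non-empty blocks, per-block leaf bitmaps mark non-zero elements, and values are stored contiguously. The block size, level count, and names here are assumptions for illustration rather than SMASH's exact format, and the software scan stands in for the hardware Bitmap Management Unit.

```python
# Two-level bitmap encoding of a sparse (flattened) matrix: a software
# sketch of the bitmap-hierarchy idea. BLOCK size is an arbitrary choice.
import numpy as np

BLOCK = 8  # elements covered by each leaf bitmap

def encode(flat: np.ndarray):
    """Return (top_bitmap, leaf_bitmaps, values)."""
    top, leaves, values = [], [], []
    for b in range(0, len(flat), BLOCK):
        nz = flat[b:b + BLOCK] != 0
        top.append(bool(nz.any()))
        if nz.any():
            leaves.append(nz)                     # per-element bits, stored
            values.extend(flat[b:b + BLOCK][nz])  # only for non-empty blocks
    return np.array(top), leaves, np.array(values)

def iter_nonzeros(top, leaves, values):
    """Yield (flat_index, value) by scanning the bitmap hierarchy."""
    li = vi = 0
    for b, occupied in enumerate(top):
        if not occupied:
            continue                       # skip an entire empty block
        for off in np.flatnonzero(leaves[li]):
            yield b * BLOCK + int(off), float(values[vi])
            vi += 1
        li += 1

if __name__ == "__main__":
    m = np.zeros(64)
    m[[3, 17, 40]] = (1.5, -2.0, 7.0)
    assert list(iter_nonzeros(*encode(m))) == [(3, 1.5), (17, -2.0), (40, 7.0)]
    print("decoded:", list(iter_nonzeros(*encode(m))))
```

Skipping whole empty blocks with a single top-level bit is what removes the per-element pointer chasing that CSR-style indexing incurs.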
Robust estimation of bacterial cell count from optical density
Optical density (OD) is widely used to estimate the density of cells in liquid culture, but it cannot be compared between instruments without a standardized calibration protocol and is challenging to relate to actual cell count. We address this with an interlaboratory study comparing three simple, low-cost, and highly accessible OD calibration protocols across 244 laboratories, applied to eight strains of constitutive GFP-expressing E. coli. Based on our results, we recommend calibrating OD to estimated cell count using serial dilution of silica microspheres, which produces highly precise calibration (95.5% of residuals <1.2-fold), is easily assessed for quality control, and also characterizes the instrument's effective linear range. This calibration can further be combined with fluorescence calibration to obtain units of Molecules of Equivalent Fluorescein (MEFL) per cell, allowing direct comparison and data fusion with flow cytometry measurements: in our study, fluorescence per cell measurements showed only a 1.07-fold mean difference between plate reader and flow cytometry data.
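As a sketch of the calibration arithmetic, one can fit a single particles-per-OD conversion factor from a serial dilution of microspheres of known concentration and then convert sample OD readings to estimated counts. All numbers below are invented for illustration; the actual protocol additionally involves blanking, replicate measurements, and an explicit check of the instrument's linear range.

```python
# Toy OD-to-particle-count calibration from a microsphere serial dilution.
import numpy as np

def fit_particles_per_od(particle_counts, od_readings) -> float:
    """Least-squares slope through the origin: particles ~ k * OD."""
    od = np.asarray(od_readings)
    n = np.asarray(particle_counts)
    return float((od @ n) / (od @ od))

if __name__ == "__main__":
    stock = 3.0e8                              # particles/mL (assumed)
    counts = stock / 2 ** np.arange(6)         # 2-fold serial dilution
    # Simulated OD readings with small measurement noise:
    od = counts / 1.2e9 + np.random.default_rng(1).normal(0, 0.002, 6)
    k = fit_particles_per_od(counts, od)
    print(f"calibration: ~{k:.3g} particles per OD unit")
    print(f"sample at OD 0.42 -> ~{k * 0.42:.3g} particles/mL equivalent")
```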