E-BATCH: Energy-efficient and high-throughput RNN batching
Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may vastly differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require executing different RNN layers in a short time span, decreasing energy efficiency. Hence, we propose E-BATCH, a low-latency and energy-efficient batching scheme tailored to RNN accelerators. It consists of a runtime system and effective hardware support. The runtime concatenates multiple sequences to create large batches, resulting in substantial energy savings. Furthermore, the accelerator notifies the runtime when the evaluation of an input sequence is done, so a new input sequence can be added to the batch immediately, largely reducing the amount of padding. E-BATCH dynamically controls the number of time-steps evaluated per batch to achieve the best trade-off between latency and energy efficiency for the given hardware platform. We evaluate E-BATCH on top of E-PUR and TPU. E-BATCH improves throughput by 1.8× and energy efficiency by 3.6× in E-PUR, whereas in TPU it improves throughput by 2.1× and energy efficiency by 1.6× over the state-of-the-art. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.
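To make the lane-refilling idea concrete, here is a minimal illustrative sketch (our own simplification, not E-BATCH itself, and with no hardware or energy modeling): each batch lane holds one active sequence, and as soon as a sequence finishes, the next queued request is slotted into that lane, which is what removes most of the padding a static batch would need.

```python
from collections import deque

def run_dynamic_batch(request_lengths, batch_lanes=4):
    """Simulate the time-steps needed when finished sequences are replaced immediately.

    request_lengths: lengths (in time-steps) of the queued input sequences.
    Returns the total number of time-steps the accelerator is busy.
    """
    queue = deque(request_lengths)
    # fill the lanes with the first requests
    lanes = [queue.popleft() if queue else 0 for _ in range(batch_lanes)]
    steps = 0
    while any(lanes):
        steps += 1
        for i in range(batch_lanes):
            if lanes[i] > 0:
                lanes[i] -= 1                  # advance this sequence by one time-step
            if lanes[i] == 0 and queue:
                lanes[i] = queue.popleft()     # refill the lane right away: no padding

    return steps

# 6 requests of very different lengths on 2 lanes: 19 steps for 38 time-steps of
# total work, i.e. near-full utilization, where a static batch would pad heavily.
print(run_dynamic_batch([12, 3, 7, 9, 2, 5], batch_lanes=2))
```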
Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs
Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct responsibilities for handling serving requests, i.e., general-purpose CPUs for input preprocessing and domain-specific GPUs for forward computation. Recurrent neural networks play an essential role in handling temporal inputs and display distinctive computation characteristics because of their high inter-operator parallelism. Hence, we propose Chrion to optimize recurrent neural network inference by collaboratively utilizing CPUs and GPUs. We formulate model deployment in the CPU-GPU cluster as an NP-hard scheduling problem of directed acyclic graphs on heterogeneous devices. Given an input model in the ONNX format and a user-defined SLO requirement, Chrion first preprocesses the model by parsing and profiling it, and then partitions the graph to select an execution device for each operator. When an online request arrives, Chrion performs forward computation according to the graph partition by executing the operators on the CPU and GPU in parallel. Our experimental results show that execution time can be reduced by up to 19.4% in the latency-optimal pattern, and GPU memory footprint by 67.5% in the memory-optimal pattern, compared with execution on the GPU alone.
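As a rough illustration of the operator-placement step, the sketch below assigns each operator (in topological order) to the device with the earliest predicted finish time, charging a penalty when data crosses the CPU-GPU boundary. The networkx graph, cost dictionaries, and transfer penalty are assumptions made for illustration; Chrion's actual ONNX parsing, profiling, and SLO-aware partitioning are not modeled.

```python
import networkx as nx  # assumption: the operator graph is held in a networkx DAG

def place_operators(dag, cpu_cost, gpu_cost, transfer_cost=0.1):
    """Greedy earliest-finish-time placement of DAG operators on CPU or GPU."""
    placement, finish = {}, {}
    device_free = {"cpu": 0.0, "gpu": 0.0}
    for op in nx.topological_sort(dag):
        best = None
        for dev, cost in (("cpu", cpu_cost[op]), ("gpu", gpu_cost[op])):
            # an operator may start once its inputs are ready (plus a transfer
            # penalty if a producer ran on the other device) and the device is free
            ready = max(
                [finish[p] + (transfer_cost if placement[p] != dev else 0.0)
                 for p in dag.predecessors(op)],
                default=0.0,
            )
            start = max(ready, device_free[dev])
            if best is None or start + cost < best[2]:
                best = (dev, start, start + cost)
        dev, start, end = best
        placement[op], finish[op] = dev, end
        device_free[dev] = end
    return placement, max(finish.values())

# toy 3-operator graph with made-up profiled latencies (ms) per device
graph = nx.DiGraph([("embed", "lstm"), ("lstm", "dense")])
cpu = {"embed": 0.2, "lstm": 5.0, "dense": 0.3}
gpu = {"embed": 0.5, "lstm": 1.0, "dense": 0.2}
print(place_operators(graph, cpu, gpu))
```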
Reconfigurable acceleration of Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies in RNN computations can lead to system stall, resulting in low throughput and high latency.
This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, which take into account the data dependencies and high computational costs of RNNs.
The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large dimensions of weight matrices, while supporting element-based parallelism and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs.
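The column-wise idea can be illustrated in a few lines of NumPy (an illustrative sketch, not code from the thesis): a row-wise product needs the entire input vector before any output element is complete, whereas column-wise accumulation consumes the input one element at a time, so hardware can start on W·h(t-1) as soon as the first elements of h(t-1) are produced.

```python
import numpy as np

def mv_row_wise(W, x):
    # each output element needs the *entire* input vector before it can finish
    return np.array([np.dot(W[i, :], x) for i in range(W.shape[0])])

def mv_column_wise(W, x):
    y = np.zeros(W.shape[0])
    for j, xj in enumerate(x):       # as soon as x[j] becomes available...
        y += xj * W[:, j]            # ...its whole column contribution is accumulated
    return y

W = np.random.randn(4, 3)
x = np.random.randn(3)
assert np.allclose(mv_row_wise(W, x), mv_column_wise(W, x))
```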
The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. Performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. In addition to speedups over CPU and GPU designs, promising power-efficiency improvements are achieved.
The third contribution of this thesis is a low-latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances the initiation intervals of multi-layer RNN inference to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection.
To facilitate the use of this approach, we open-source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
CRIME: Input-Dependent Collaborative Inference for Recurrent Neural Networks
The excellent accuracy of Recurrent Neural Networks (RNNs) for time-series and natural language processing comes at the cost of computational complexity. Therefore, the choice between edge and cloud computing for RNN inference, with the goal of minimizing response time or energy consumption, is not trivial. An edge approach must deal with the aforementioned complexity, while a cloud solution pays large time and energy costs for data transmission. Collaborative inference is a technique that tries to obtain the best of both worlds by splitting the inference task among a network of collaborating devices. While already investigated for other types of neural networks, collaborative inference for RNNs poses completely new challenges, such as the strong influence of input length on processing time and energy, and remains largely unexplored. In this paper, we introduce a Collaborative RNN Inference Mapping Engine (CRIME), which automatically selects the best inference device for each input. CRIME is flexible with respect to the connection topology among collaborating devices, and adapts to changes in connection status and device load. With experiments on several RNNs and datasets, we show that CRIME can reduce execution time (or end-node energy) by more than 25% compared to any single-device approach.
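A much-simplified sketch of per-input device selection in the spirit of CRIME is shown below; the linear latency models, the constants, and the load term are illustrative assumptions, not the paper's actual predictors. The point is only that the best device flips with input length once transmission cost is accounted for.

```python
def predict_latency(device, seq_len):
    """Toy response-time model (ms); the coefficients are made up for illustration."""
    if device == "edge":
        return 2.0 * seq_len                 # compute-bound: scales with sequence length
    if device == "cloud":
        return 0.2 * seq_len + 120.0         # fast compute, but a fixed network cost
    raise ValueError(device)

def select_device(seq_len, devices=("edge", "cloud"), load=None):
    load = load or {}
    # add each device's current queueing delay, mimicking adaptation to device load
    return min(devices, key=lambda d: predict_latency(d, seq_len) + load.get(d, 0.0))

print(select_device(30))    # short input  -> edge wins
print(select_device(500))   # long input   -> cloud wins despite transmission cost
```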
The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since the model produces only one token at a time, thus incurring low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques that improve GPU hardware utilization in LLM inference. To study their synergy, we implement a prototype and perform an extensive characterization analysis on various LLM models and GPU architectures. We observe that the optimal speculation length depends on the batch size used. We analyze this key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for different batch sizes. Our evaluations show that our proposed method can achieve equal or better performance than state-of-the-art speculative decoding schemes with a fixed speculation length.
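The batch-size/speculation-length trade-off can be sketched with a toy cost model (all constants and formulas here are illustrative assumptions, not the paper's quantitative model): extra speculated tokens raise the expected number of accepted tokens per verification step, but they also make the verification pass more expensive, and that extra cost grows with batch size, so larger batches favor shorter speculation.

```python
def expected_tokens(k, accept_rate=0.7):
    # expected accepted tokens per verification step, plus the one bonus token
    return sum(accept_rate ** i for i in range(1, k + 1)) + 1.0

def step_cost(k, batch_size, draft_cost=0.1, verify_base=1.0, verify_per_token=0.05):
    # larger batches make the verify pass more compute-bound, so each extra
    # speculated token gets proportionally more expensive
    return k * draft_cost + verify_base + verify_per_token * k * batch_size

def best_speculation_length(batch_size, max_k=8):
    # maximize tokens generated per unit of cost
    return max(range(1, max_k + 1),
               key=lambda k: expected_tokens(k) / step_cost(k, batch_size))

for bs in (1, 8, 32):
    print(bs, best_speculation_length(bs))   # the chosen length shrinks as batch size grows
```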
- …