CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access
CXL has emerged as the technology for expanding memory for both the host CPU and device accelerators through a load/store interface. Extending memory coherency to the PCIe root complex makes co-design more flexible, since near-device compute can access the memory coherently. As demand grows for capacity at tolerable latency and bandwidth, a new hardware-software co-design approach is needed: offload synthesized memory operations to the CXL endpoint, the CXL switch, or cores near the CXL root complex (such as Intel DSA) to fetch data, while the CPU or accelerators perform other computation in the background. When the CXL load completes, the data is placed into L1 if capacity permits, and the in-core ROB is notified through a mailbox so that computation resumes on the previous hardware context. Because the distance (timing window) of the load instruction sequence is unknown, profile-guided code generation that adaptively updates the offloaded code is required for long-running jobs. We propose to evaluate CXLMemUring on a modified BOOMv3 with added in-core logic and CXL endpoint access simulated over CHI; a weaker RISC-V core will be added near the endpoint for code offloading, and code generation will be based on program analysis with traditional profile guidance.
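A rough software sketch of the intended asynchronous access pattern follows. Every name and structure in it is an illustrative assumption rather than the CXLMemUring interface, and the endpoint-side completion is simulated in host code for readability.

/* Minimal io_uring-style submission/completion sketch for offloaded CXL
 * memory fetches. All names and layouts here are illustrative assumptions,
 * not the actual CXLMemUring ABI. The "completion" is simulated in software;
 * in the proposed design the offloaded fetch would run on a weak RISC-V core
 * near the CXL endpoint, which posts the completion via a mailbox that wakes
 * the original hardware context. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RING_SIZE 16

/* Submission queue entry: one synthesized memory operation to offload. */
struct cxl_sqe {
    uint64_t remote_addr;   /* address in the CXL memory pool            */
    uint32_t len;           /* bytes to fetch                            */
    void    *local_buf;     /* destination (ideally L1-resident)         */
    uint64_t user_tag;      /* echoed back in the completion             */
};

/* Completion queue entry: posted by the near-endpoint core via mailbox. */
struct cxl_cqe {
    uint64_t user_tag;
    int32_t  status;        /* 0 on success                              */
};

struct cxl_ring {
    struct cxl_sqe sq[RING_SIZE];
    struct cxl_cqe cq[RING_SIZE];
    unsigned sq_head, sq_tail, cq_head, cq_tail;
};

static void ring_submit(struct cxl_ring *r, const struct cxl_sqe *e) {
    r->sq[r->sq_tail++ % RING_SIZE] = *e;           /* enqueue fetch request */
}

/* Stand-in for the offloaded code: drain the SQ, "fetch" the data, post CQEs. */
static void simulate_endpoint(struct cxl_ring *r, const uint8_t *pool) {
    while (r->sq_head != r->sq_tail) {
        struct cxl_sqe *e = &r->sq[r->sq_head++ % RING_SIZE];
        memcpy(e->local_buf, pool + e->remote_addr, e->len);
        struct cxl_cqe c = { .user_tag = e->user_tag, .status = 0 };
        r->cq[r->cq_tail++ % RING_SIZE] = c;        /* mailbox notification  */
    }
}

static int ring_poll(struct cxl_ring *r, struct cxl_cqe *out) {
    if (r->cq_head == r->cq_tail) return 0;
    *out = r->cq[r->cq_head++ % RING_SIZE];
    return 1;
}

int main(void) {
    uint8_t pool[256];                    /* pretend CXL-attached memory     */
    for (int i = 0; i < 256; i++) pool[i] = (uint8_t)i;

    struct cxl_ring ring = {0};
    uint8_t buf[8];
    struct cxl_sqe req = { .remote_addr = 64, .len = 8,
                           .local_buf = buf, .user_tag = 42 };
    ring_submit(&ring, &req);             /* offload the fetch               */

    /* ... the core would keep executing independent work here ...           */

    simulate_endpoint(&ring, pool);       /* done by the near-endpoint core  */

    struct cxl_cqe done;
    if (ring_poll(&ring, &done))          /* resume dependent computation    */
        printf("tag %llu done, first byte = %u\n",
               (unsigned long long)done.user_tag, (unsigned)buf[0]);
    return 0;
}

In the proposed hardware, the simulate_endpoint step is where the profile-guided, adaptively regenerated code would run, and the poll is replaced by the mailbox notification to the in-core ROB.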
D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs
Hardware accelerators such as GPUs are required for real-time, low-latency
inference with Deep Neural Networks (DNN). However, due to the inherent limits
to the parallelism they can exploit, DNNs often under-utilize the capacity of
today's high-end accelerators. Although spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, a number of challenges remain: finding the GPU percentage that right-sizes the GPU for each DNN through profiling, determining a batching of requests that improves throughput while still meeting application-specific deadlines and service-level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs. This paper introduces a
dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs
to run in the GPU concurrently. To help allocate the appropriate GPU percentage
(we call it the "Knee"), we develop and validate a model that estimates the
parallelism each DNN can utilize. We also develop a lightweight optimization
formulation to find an efficient batch size for each DNN operating with
D-STACK. We bring together our optimizations and our spatio-temporal scheduler
to provide a holistic inference framework. We demonstrate its ability to
provide high throughput while meeting application SLOs. We compare D-STACK with
an ideal scheduler that can allocate the right GPU percentage for every DNN
kernel. D-STACK achieves more than 90 percent of the ideal scheduler's throughput and GPU utilization. We also compare D-STACK with other GPU
multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus),
using popular DNN models. Our controlled experiments multiplexing several of these models achieve up to a 1.6X improvement in GPU utilization and up to a 4X improvement in inference throughput.
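As a sketch of how the "Knee" allocation and per-DNN batch sizing could be written down (the notation below is assumed for illustration, not taken from the paper): let T_i(g) be the profiled throughput of DNN i at GPU fraction g, l_i(b, g) its execution latency at batch size b, and w_i its worst-case queueing delay. One plausible formulation is:

\[
  g_i^{\mathrm{knee}} \;=\; \min\bigl\{\, g \;:\; T_i(g) \ge (1-\epsilon)\,\max_{g'} T_i(g') \,\bigr\},
\]
\[
  b_i^{*} \;=\; \arg\max_{b}\; \frac{b}{\ell_i\!\left(b,\, g_i^{\mathrm{knee}}\right)}
  \quad \text{s.t.} \quad w_i + \ell_i\!\left(b,\, g_i^{\mathrm{knee}}\right) \le \mathrm{SLO}_i .
\]

The first line picks the smallest GPU fraction at which a DNN's throughput is within a small tolerance of its peak (the point past which extra GPU share is wasted), and the second maximizes its per-batch throughput subject to end-to-end latency staying within the SLO.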