Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can
generate many parallel memory requests at a time. The processing of these
parallel requests in the DRAM controller greatly affects the memory
interference delay experienced by running tasks on the platform. In this paper,
we model a modern COTS multicore system which has a nonblocking last-level
cache (LLC) and a DRAM controller that prioritizes reads over writes. To
minimize interference, we focus on LLC and DRAM bank partitioned systems. Based
on the model, we propose an analysis that computes a safe upper bound for the
worst-case memory interference delay. We validated our analysis on a real COTS
multicore platform with a set of carefully designed synthetic benchmarks as
well as SPEC2006 benchmarks. Evaluation results show that our analysis more
accurately captures the worst-case memory interference delay and provides safer
upper bounds than a recently proposed analysis, which significantly
underestimates the delay.
Comment: Technical Report
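The abstract does not reproduce the analysis itself, but the flavor of such a bound can be illustrated with a deliberately simplified toy model (the function name and the formula below are hypothetical, not the paper's actual analysis): with LLC and DRAM bank partitioning in place, the remaining interference comes from other cores' outstanding requests sharing the DRAM controller, so a crude bound multiplies the per-request service time by the number of requests the other cores can have in flight.

```python
def interference_bound_ns(num_cores, reqs_per_core, service_ns):
    # Toy composition: assume every outstanding request from each of
    # the other cores may be serviced ahead of ours exactly once.
    return (num_cores - 1) * reqs_per_core * service_ns

# e.g., 4 cores, 6 outstanding requests per core, 50 ns per request
bound = interference_bound_ns(4, 6, 50)  # 900 ns
```

A real analysis must additionally account for read/write prioritization and request reordering inside the controller, which is precisely what makes the parallelism-aware bound nontrivial.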
Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms
Integrated CPU-GPU architecture provides excellent acceleration capabilities for data parallel applications on embedded platforms while meeting the size, weight and power (SWaP) requirements. However, sharing of main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by the GPU. For example, on the NVIDIA Jetson TX2 platform, an integrated CPU-GPU architecture, we observed that, in the worst case, GPU kernels can suffer as much as a 3X slowdown in the presence of co-running memory-intensive CPU applications. In this paper, we propose a software mechanism, which we call BWLOCK++, to protect the performance of GPU kernels from co-scheduled memory-intensive CPU applications.
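BWLOCK++ itself is a kernel-level framework; as a rough illustration of the underlying idea only (the class and numbers below are mine, and a real implementation programs hardware performance counters to count memory accesses and stalls cores inside the kernel), a per-core memory budget can be modeled as a token bucket that is refilled at each regulation period:

```python
class BandwidthRegulator:
    """Simplified simulation of per-core memory bandwidth regulation."""

    def __init__(self, budget_per_period):
        self.budget = budget_per_period  # allowed accesses per period
        self.used = 0

    def request(self, n):
        # True if n memory accesses still fit in this period's budget;
        # False means the core would be throttled until the next period.
        if self.used + n <= self.budget:
            self.used += n
            return True
        return False

    def new_period(self):
        # Periodic timer interrupt: replenish the budget.
        self.used = 0
```

Throttling CPU cores this way bounds the memory traffic that can interfere with a running GPU kernel, at the cost of slowing down the best-effort CPU applications.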
Analysis and Mitigation of Shared Resource Contention on Heterogeneous Multicore: An Industrial Case Study
In this paper, we address the industrial challenge put forth by ARM in ECRTS
2022. We systematically analyze the effect of shared resource contention on an
augmented reality head-up display (AR-HUD) case-study application of the
industrial challenge on a heterogeneous multicore platform, NVIDIA Jetson Nano.
We configure the AR-HUD application such that it can process incoming image
frames in real-time at 20Hz on the platform. We use micro-architectural
denial-of-service (DoS) attacks as aggressor tasks of the challenge and show
that they can dramatically impact the latency and accuracy of the AR-HUD
application, which results in significant deviations of the estimated
trajectories from the ground truth, despite our best effort to mitigate their
influence by using cache partitioning and real-time scheduling of the AR-HUD
application. We show that dynamic LLC (or DRAM depending on the aggressor)
bandwidth throttling of the aggressor tasks is an effective means to ensure
real-time performance of the AR-HUD application without resorting to
over-provisioning the system.
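As a back-of-the-envelope sketch of the 20 Hz requirement (the trigger logic below is hypothetical, not the paper's actual mechanism): each frame must complete within a 50 ms period, so dynamic throttling of the aggressors could be keyed off the rate of recent deadline misses.

```python
FRAME_PERIOD_MS = 1000 / 20  # 20 Hz -> 50 ms budget per frame

def should_throttle(frame_latencies_ms, miss_threshold=0.1):
    # Throttle the aggressors when more than miss_threshold of the
    # recent frames overran the 50 ms frame period.
    misses = sum(1 for t in frame_latencies_ms if t > FRAME_PERIOD_MS)
    return misses / len(frame_latencies_ms) > miss_threshold
```

Triggering throttling only on observed misses is what makes the approach cheaper than statically over-provisioning the platform.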
Protecting Real-Time GPU Kernels on Integrated CPU-GPU SoC Platforms (Artifact)
This artifact is based on BWLOCK++, a software framework to protect the performance of GPU kernels from co-scheduled memory-intensive CPU applications on platforms containing integrated GPUs. The artifact is designed to support the claims of the companion paper and contains instructions on how to build and execute BWLOCK++ on a target hardware platform.
DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car
We present DeepPicar, a low-cost deep neural network based autonomous car
platform. DeepPicar is a small scale replication of a real self-driving car
called DAVE-2 by NVIDIA. DAVE-2 uses a deep convolutional neural network (CNN),
which takes images from a front-facing camera as input and produces car
steering angles as output. DeepPicar uses the same network architecture---9
layers, 27 million connections and 250K parameters---and can drive itself in
real-time using a web camera and a Raspberry Pi 3 quad-core platform. Using
DeepPicar, we analyze the Pi 3's computing capabilities to support end-to-end
deep learning based real-time control of autonomous vehicles. We also
systematically compare other contemporary embedded computing platforms using
the DeepPicar's CNN-based real-time control workload. We find that all tested
platforms, including the Pi 3, are capable of supporting the CNN-based
real-time control, from 20 Hz up to 100 Hz, depending on the hardware platform.
However, we find that shared resource contention remains an important issue
that must be considered in applying CNN models on shared memory based embedded
computing platforms; we observe up to 11.6X execution time increase in the CNN
based control loop due to shared resource contention. To protect the CNN
workload, we also evaluate state-of-the-art cache partitioning and memory
bandwidth throttling techniques on the Pi 3. We find that cache partitioning is
ineffective, while memory bandwidth throttling is an effective solution.
Comment: To be published as a conference paper at RTCSA 2018
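The feasibility numbers above follow from a simple latency-to-rate conversion (the 9.6 ms solo inference time below is an illustrative assumption; the 11.6X factor is the worst contention slowdown reported in the abstract):

```python
def max_control_hz(inference_ms):
    # A control rate is sustainable only if one CNN inference
    # fits inside each control period.
    return 1000.0 / inference_ms

solo_ms = 9.6                  # assumed solo inference time (hypothetical)
contended_ms = solo_ms * 11.6  # worst-case slowdown from the abstract
```

Under this assumption the platform sustains over 100 Hz in isolation, but under worst-case contention the achievable rate drops below the 20 Hz floor, which is why contention mitigation matters.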
Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
NVDLA is an open-source deep neural network (DNN) accelerator which has
received a lot of attention by the community since its introduction by Nvidia.
It is a full-featured hardware IP and can serve as a good reference for
conducting research and development of SoCs with integrated accelerators.
However, an expensive FPGA board is required to do experiments with this IP in
a real SoC. Moreover, since NVDLA is clocked at a lower frequency on an FPGA,
it would be hard to do accurate performance analysis with such a setup. To
overcome these limitations, we integrate NVDLA into a real RISC-V SoC on the
Amazon cloud FPGA using FireSim, a cycle-exact FPGA-accelerated simulator. We
then evaluate the performance of NVDLA by running YOLOv3 object-detection
algorithm. Our results show that NVDLA can sustain 7.5 fps when running YOLOv3.
We further analyze the performance by showing that sharing the last-level cache
with NVDLA can result in up to 1.56x speedup. We then identify that sharing the
memory system with the accelerator can result in unpredictable execution time
for the real-time tasks running on this platform. We believe this is an
important issue that must be addressed in order for on-chip DNN accelerators to
be incorporated into real-time embedded systems.
Comment: Presented at the 2nd Workshop on Energy Efficient Machine Learning
and Cognitive Computing for Embedded Applications (EMC2'19)
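The reported throughput translates directly into a per-frame latency budget (a trivial conversion, shown only to make the numbers concrete):

```python
def frame_latency_ms(fps):
    # Per-frame processing time implied by a sustained frame rate.
    return 1000.0 / fps
```

At the reported 7.5 fps, NVDLA spends roughly 133 ms per YOLOv3 frame, so any added memory interference on a shared memory system directly eats into that budget.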