Hybrid prototyping of multicore embedded systems
Multicore platforms are becoming increasingly pervasive in modern embedded systems. System-level modeling techniques have enabled the creation of fast software models of multicore platforms, commonly known as Virtual Prototypes, for early functional validation of embedded software before the hardware is available. On the other hand, for accurate performance validation, the complete multicore platform can be implemented as a physical prototype on FPGA. Both virtual platforms and FPGA prototypes have their respective pros and cons. Virtual platforms have the advantage of high-speed functional simulation and typically scale well with the number of cores. However, the accuracy of performance estimation is sacrificed. FPGA prototypes provide cycle-accurate performance estimation, because the software executes directly on an FPGA implementation of the target cores. However, it takes a significant amount of time to design, implement, and test the inter-core communication architecture on the FPGA.
In this thesis, we propose a novel system-level modeling framework called Hybrid Prototyping. Our goal is to provide the benefits of both virtual platforms and FPGA prototypes. The framework aims to provide early, fast, and scalable models, similar to virtual platforms, along with the cycle accuracy of FPGA prototypes. Using hybrid prototyping, embedded software designers will be able to create concurrent applications and accurately analyze the performance implications of their optimizations before the chip is delivered. At the same time, multicore architects will be able to modify the platform model without having to do full system prototyping. Therefore, hybrid prototyping will enable early and reliable multicore embedded system design, resulting in large productivity gains for both embedded software designers and multicore chip architects.
DeepliteRT: Computer Vision at the Edge
The proliferation of edge devices has unlocked unprecedented opportunities
for deep learning model deployment in computer vision applications. However,
these complex models require considerable power, memory and compute resources
that are typically not available on edge platforms. Ultra low-bit quantization
presents an attractive solution to this problem by scaling down the model
weights and activations from 32-bit to less than 8-bit. We implement highly
optimized ultra low-bit convolution operators for ARM-based targets that
outperform existing methods by up to 4.34x. Our operator is implemented within
Deeplite Runtime (DeepliteRT), an end-to-end solution for the compilation,
tuning, and inference of ultra low-bit models on ARM devices. Compiler passes
in DeepliteRT automatically convert a fake-quantized model in full precision to
a compact ultra low-bit representation, easing the process of quantized model
deployment on commodity hardware. We analyze the performance of DeepliteRT on
classification and detection models against optimized 32-bit floating-point,
8-bit integer, and 2-bit baselines, achieving significant speedups of up to
2.20x, 2.33x and 2.17x, respectively.
Comment: Accepted at British Machine Vision Conference (BMVC) 202
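To make the idea of ultra low-bit quantization concrete, the following is a minimal sketch, not DeepliteRT's implementation: it maps floating-point weights to 2-bit codes with a uniform scale and packs four codes into each byte. The function names (`quantize_2bit`, `pack_2bit`, `unpack_2bit`, `dequantize`) and the min/max scaling scheme are assumptions chosen for illustration; the abstract does not specify the quantization scheme or storage layout.

```python
def quantize_2bit(weights):
    """Uniformly map floats to integer codes in [0, 3] (assumed min/max scaling)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = [min(3, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def pack_2bit(codes):
    """Pack four 2-bit codes into each byte, lowest-order bits first."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        packed[i // 4] |= (c & 0b11) << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, count):
    """Recover `count` 2-bit codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.4, 0.3, 1.0]
codes, scale, lo = quantize_2bit(weights)
packed = pack_2bit(codes)
restored = dequantize(unpack_2bit(packed, len(codes)), scale, lo)
```

In this sketch, four weights that would occupy 16 bytes as 32-bit floats fit in a single byte, at the cost of the rounding error visible in `restored`.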
Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime
Deep Learning has been one of the most disruptive technological advancements
in recent times. The high performance of deep learning models comes at the
expense of high computational, storage and power requirements. Sensing the
immediate need for accelerating and compressing these models to improve
on-device performance, we introduce Deeplite Neutrino for production-ready
optimization of the models and Deeplite Runtime for deployment of ultra-low bit
quantized models on Arm-based platforms. We implement low-level quantization
kernels for Armv7 and Armv8 architectures enabling deployment on the vast array
of 32-bit and 64-bit Arm-based devices. With efficient implementations using
vectorization, parallelization, and tiling, we realize speedups of up to 2x and
2.2x compared to TensorFlow Lite with XNNPACK backend on classification and
detection models, respectively. We also achieve significant speedups of up to
5x and 3.2x compared to ONNX Runtime for classification and detection models,
respectively.
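As an illustration of the tiling technique mentioned above, here is a small Python sketch of a tiled (blocked) matrix multiplication. It only shows the loop structure and is not the Deeplite Runtime kernels, which the abstract describes as low-level Armv7 and Armv8 implementations; the function name `matmul_tiled` and the `tile` parameter are hypothetical.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting the operands tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # Outer loops walk over tiles; inner loops work inside one tile so that
    # a small block of a, b and c is reused while it is likely still cached.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


product = matmul_tiled([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

The result is identical to an untiled triple loop; only the order of memory accesses changes, which is where a native kernel would gain cache locality.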