Sculpting Efficiency: Pruning Medical Imaging Models for On-Device Inference
Applying ML advancements to healthcare can improve patient outcomes. However,
the sheer operational complexity of ML models, combined with legacy hardware
and multi-modal gigapixel images, poses a severe deployment limitation for
real-time, on-device inference. We consider filter pruning as a solution,
exploring segmentation models in cardiology and ophthalmology. Our preliminary
results show a compression rate of up to 1148x with minimal loss in quality,
stressing the need to consider task complexity and architectural details when
using off-the-shelf models. At high compression rates, filter-pruned models
exhibit faster inference on a CPU than the GPU baseline. We also demonstrate
that such models' robustness and generalisability exceed those of their
baseline and weight-pruned counterparts. We uncover intriguing questions
and take a step towards realising cost-effective solutions for disease
diagnosis, monitoring, and prevention.
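For context, filter pruning removes whole convolutional filters (output channels) rather than individual weights, so the pruned model stays dense and runs on stock hardware. Below is a minimal PyTorch sketch of one common variant, L1-norm filter pruning on a single layer; the layer shape, criterion, and keep ratio are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of L1-norm filter pruning on one conv layer (PyTorch).
# The 3->64-channel layer and the 25% keep ratio are hypothetical.
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Keep the output filters with the largest L1 norms; drop the rest."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    # L1 norm of each filter, summed over (in_channels, kH, kW).
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep = torch.topk(norms, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
smaller = prune_conv_filters(conv, keep_ratio=0.25)   # 64 -> 16 filters
print(tuple(conv.weight.shape), "->", tuple(smaller.weight.shape))
```

In a full network, removing a layer's output filters also requires slicing the matching input channels of the following layer; applying this end to end across a segmentation model is what yields compression rates of the magnitude reported above.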
HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator
Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, they are usually deployed onto customized hardware accelerators. Among various accelerator designs, the dataflow architecture has shown promising performance due to its layer-pipelined structure and its scalability in data parallelism.

Exploiting sparsity in weights and activations can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators and cannot be applied to dataflow accelerators because of the large hardware design space the latter introduce. As such, they may miss opportunities to find an optimal combination of sparsity features and hardware designs.

In this paper, we propose a novel approach to exploit unstructured weight and activation sparsity for dataflow accelerators, using software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve efficiency improvements ranging from 1.3× to 4.2× compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. In particular, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: https://github.com/Yu-Zhewen/HASS
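As a rough illustration of the search itself, the sketch below exhaustively scores per-layer sparsity configurations against a toy pipeline-bottleneck latency model, keeping only configurations within an accuracy budget. The layer workloads, accuracy proxy, and budget are invented for this sketch; HASS co-optimizes against an actual dataflow accelerator design space rather than a closed-form cost model.

```python
# Toy hardware-aware sparsity search over per-layer pruning levels.
# All numbers below are hypothetical stand-ins for a real cost model.
import itertools

LAYERS = [("conv1", 120.0), ("conv2", 240.0), ("fc", 30.0)]  # (name, M MACs)
SPARSITY_LEVELS = [0.0, 0.5, 0.75]   # fraction of weights pruned per layer
ACCURACY_BUDGET = 0.02               # max tolerated accuracy drop (proxy)

def latency(config):
    # Dataflow-style assumption: each layer is a pipeline stage whose work
    # shrinks with sparsity, and throughput is set by the slowest stage.
    return max(macs * (1.0 - s) for (_, macs), s in zip(LAYERS, config))

def accuracy_drop(config):
    # Crude proxy: quality degrades superlinearly with per-layer sparsity.
    return sum(0.01 * s * s for s in config)

best = min(
    (c for c in itertools.product(SPARSITY_LEVELS, repeat=len(LAYERS))
     if accuracy_drop(c) <= ACCURACY_BUDGET),
    key=latency,
)
print("per-layer sparsity:", dict(zip((name for name, _ in LAYERS), best)))
print("bottleneck stage (M MACs of work):", latency(best))
```

The dataflow-specific twist is visible in the cost function: because every layer is its own pipeline stage, sparsifying the bottleneck layer helps while sparsifying an already-fast layer buys nothing, which is exactly the kind of interaction a hardware-unaware search would miss.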
Exploring the Relative Value of Collaborative Optimisation Pathways (Student Abstract)
Compression techniques in machine learning (ML) independently improve a model's inference efficiency by reducing its memory footprint while aiming to maintain its quality. This paper lays the groundwork for questioning whether a compression pipeline that applies every technique offers real merit over one that skips a few, through a case study on a keyword-spotting model, DS-CNN-S. In addition, it documents improvements to the model's training and dataset infrastructure. For this model, preliminary findings suggest that a full-scale pipeline is not required to achieve a competitive memory footprint and accuracy, but a more comprehensive study is required.
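The trade-off in question can be caricatured with simple footprint arithmetic; the parameter count below is only roughly DS-CNN-S-sized, and the pruning ratio and bit-width are illustrative assumptions, not the paper's measured pathways.

```python
# Back-of-the-envelope memory footprints for different compression pathways
# on a small keyword-spotting model. All figures are hypothetical.
PARAMS = 24_000   # rough weight count for a DS-CNN-S-like model

def footprint_kib(params, bits=32, pruned_fraction=0.0):
    """Storage footprint in KiB after pruning then quantisation."""
    remaining = params * (1.0 - pruned_fraction)
    return remaining * bits / 8 / 1024

pathways = {
    "baseline (fp32)":          footprint_kib(PARAMS),
    "quantise only (int8)":     footprint_kib(PARAMS, bits=8),
    "prune 50% + int8 (full)":  footprint_kib(PARAMS, bits=8,
                                              pruned_fraction=0.5),
}
for name, kib in pathways.items():
    print(f"{name:26s} {kib:6.1f} KiB")
```

If the quantise-only pathway already fits the deployment budget, the extra pruning stage mainly buys margin, which is the kind of skipped-step comparison the case study sets out to make.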