3 research outputs found

    Feature Map Transform Coding for Energy-Efficient CNN Inference

    Convolutional neural networks (CNNs) achieve state-of-the-art accuracy in a variety of tasks in computer vision and beyond. One of the major obstacles hindering the ubiquitous use of CNNs for inference on low-power edge devices is their high computational complexity and memory bandwidth requirements. The latter often dominates the energy footprint on modern hardware. In this paper, we introduce a lossy transform coding approach, inspired by image and video compression, designed to reduce the memory bandwidth due to the storage of intermediate activation calculation results. Our method does not require fine-tuning the network weights and halves the data transfer volume to main memory by compressing feature maps, which are highly correlated, with variable-length coding. Our method outperforms previous approaches in terms of the number of bits per value, with minor accuracy degradation, on ResNet-34 and MobileNetV2. We analyze the performance of our approach on a variety of CNN architectures and demonstrate that an FPGA implementation of ResNet-18 with our approach reduces the memory energy footprint by around 40% compared to the quantized network, with negligible impact on accuracy. When an accuracy degradation of up to 2% is allowed, a reduction of 60% is achieved. A reference implementation is available at https://github.com/CompressTeam/TransformCodingInferenc
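    The pipeline the abstract describes (a decorrelating transform over highly correlated feature maps, quantization, then variable-length coding) can be sketched in a few lines of numpy. The channel-wise DCT basis, the quantization step, and the entropy-based rate estimate below are illustrative assumptions, not the paper's exact design:

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis; decorrelates highly correlated channels.
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
        m[0, :] = np.sqrt(1.0 / n)
        return m

    def compress_feature_map(x, step=0.1):
        # x: (C, H, W) activation tensor; transform across the channel axis.
        T = dct_matrix(x.shape[0])
        coeffs = np.tensordot(T, x, axes=([1], [0]))   # (C, H, W) coefficients
        q = np.round(coeffs / step).astype(np.int32)   # uniform quantization
        # Entropy of the quantized symbols approximates the variable-length
        # coding rate in bits per value.
        _, counts = np.unique(q, return_counts=True)
        p = counts / counts.sum()
        bits_per_value = -(p * np.log2(p)).sum()
        x_hat = np.tensordot(T.T, q * step, axes=([1], [0]))  # dequantize + invert
        return x_hat, bits_per_value

    # Synthetic feature map with strongly correlated channels.
    rng = np.random.default_rng(0)
    base = rng.standard_normal((1, 16, 16))
    x = np.repeat(base, 32, axis=0) + 0.1 * rng.standard_normal((32, 16, 16))
    x_hat, bpv = compress_feature_map(x)
    print(f"bits/value ~ {bpv:.2f}, reconstruction MSE = {np.mean((x - x_hat) ** 2):.5f}")

    Because the channels are correlated, most signal energy concentrates in a few transform coefficients, so the quantized symbols have low entropy and the estimated rate falls well below the raw bit-width.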

    Robust Quantization: One Model to Rule Them All

    Neural network quantization methods often involve simulating the quantization process during training, making the trained model highly dependent on the target bit-width and on the precise way quantization is performed. Robust quantization offers an alternative approach with improved tolerance to different classes of data types and quantization policies. It opens up exciting new applications where the quantization process is not static and can vary to meet different circumstances and implementations. To address this need, we propose a method that provides intrinsic robustness to the model against a broad range of quantization processes. Our method is motivated by theoretical arguments and enables us to store a single generic model capable of operating at various bit-widths and quantization policies. We validate our method's effectiveness on different ImageNet models.
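    The scenario the abstract targets, one stored model evaluated under many post-training quantizers, can be illustrated with a plain symmetric uniform quantizer swept over bit-widths. The quantizer and the random stand-in weights below are assumptions for illustration; the paper's actual robustness mechanism (how training is shaped so the model tolerates these perturbations) is not reproduced here:

    import numpy as np

    def uniform_quantize(w, bits):
        # Symmetric uniform quantizer: maps weights onto 2**bits integer levels.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        return np.round(w / scale).clip(-qmax, qmax) * scale

    # Hypothetical weight tensor standing in for one layer of a stored model.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(10_000) * 0.05

    # A robustly quantized model should degrade gracefully across this sweep.
    for bits in (8, 6, 4, 2):
        err = np.mean((w - uniform_quantize(w, bits)) ** 2)
        print(f"{bits}-bit quantization, weight MSE = {err:.2e}")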

    Inference time evaluation and quantization error analysis on the Edge TPU processor

    Master's thesis (Trabajo Fin de Máster) in Computer Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2022/2023.

    The rise of neural networks has led to the emergence of special-purpose architectures for their computations. The tensor computations that dominate these networks can be performed efficiently by tensor processing units (TPUs). Network inference is commonly subject to strict time constraints, and TPUs are used for edge computing to reduce latency in IoT environments. This work studies the performance of the Edge TPU processor, designed by Google specifically for edge computing. This processor performs inference with 8-bit integer arithmetic, which yields significant performance and energy-efficiency benefits. However, the use of reduced precision requires model quantization, which introduces some error into the inference. This work also analyzes the quantization error for models trained by reinforcement learning. The internal memory of the Edge TPU (8 MiB) is too small even for models that are not excessively large. If a model does not fit completely in this memory, a portion is stored on the host and transferred to the TPU during inference, which degrades performance significantly. This bottleneck is alleviated considerably by segmenting the model and running the fragments in a pipeline of TPUs. Compared to a single TPU, segmentation across up to four of them yielded performance improvements of 6× on convolutional layers and almost 50× on dense layers. Furthermore, the influence of the width of the weight distribution, relative to the dispersion of its values, on the quantization error is observed and justified. Moreover, for several network architectures, the same patterns of error evolution are observed as training progresses. The impact of this error on the reward obtained by the quantized model versus the unquantized model is also examined. Finally, it is observed and justified that depth scaling of the neural network (adding more layers) notably increases the quantization error.
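    The thesis's observation that the width of the weight distribution drives the quantization error can be reproduced with a generic 8-bit affine (scale plus zero-point) quantizer, the kind of scheme used by integer-only runtimes such as the Edge TPU's. The quantizer below is a common textbook formulation assumed for illustration, not code from the thesis:

    import numpy as np

    def affine_quantize_uint8(x):
        # Asymmetric 8-bit affine quantization over the tensor's value range.
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255.0
        zero_point = int(round(-lo / scale))
        q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return (q.astype(np.float32) - zero_point) * scale

    rng = np.random.default_rng(0)
    # Wider weight distributions coarsen the quantization step and raise the error.
    for spread in (0.1, 1.0, 10.0):
        w = rng.standard_normal(100_000).astype(np.float32) * spread
        q, s, zp = affine_quantize_uint8(w)
        err = np.mean((w - dequantize(q, s, zp)) ** 2)
        print(f"weight std = {spread:>4}: step = {s:.4f}, MSE = {err:.2e}")

    With 256 levels spread over the full value range, the step size, and hence the squared error, grows with the width of the distribution, which matches the trend the thesis observes and justifies.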