3 research outputs found

    Feature Map Transform Coding for Energy-Efficient CNN Inference

    Convolutional neural networks (CNNs) achieve state-of-the-art accuracy in a variety of tasks in computer vision and beyond. One of the major obstacles hindering the ubiquitous use of CNNs for inference on low-power edge devices is their high computational complexity and memory bandwidth requirements. The latter often dominates the energy footprint on modern hardware. In this paper, we introduce a lossy transform coding approach, inspired by image and video compression, designed to reduce the memory bandwidth due to the storage of intermediate activation calculation results. Our method does not require fine-tuning the network weights and halves the data transfer volume to main memory by compressing feature maps, which are highly correlated, with variable-length coding. Our method outperforms previous approaches in terms of the number of bits per value, with minor accuracy degradation, on ResNet-34 and MobileNetV2. We analyze the performance of our approach on a variety of CNN architectures and demonstrate that an FPGA implementation of ResNet-18 with our approach reduces the memory energy footprint by around 40% compared to the quantized network, with negligible impact on accuracy. When an accuracy degradation of up to 2% is allowed, a reduction of 60% is achieved. A reference implementation is available at https://github.com/CompressTeam/TransformCodingInferenc
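    The pipeline the abstract describes (a decorrelating transform over highly correlated feature maps, quantization, then variable-length coding) can be sketched in a few lines of numpy. The channel-wise DCT basis, the quantization step, and the entropy-based rate estimate below are illustrative assumptions, not the paper's exact design:

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis; decorrelates highly correlated channels.
        k = np.arange(n).reshape(-1, 1)
        i = np.arange(n).reshape(1, -1)
        m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
        m[0, :] = np.sqrt(1.0 / n)
        return m

    def compress_feature_map(x, step=0.1):
        # x: (C, H, W) activation tensor; transform across the channel axis.
        T = dct_matrix(x.shape[0])
        coeffs = np.tensordot(T, x, axes=([1], [0]))   # (C, H, W) coefficients
        q = np.round(coeffs / step).astype(np.int32)   # uniform quantization
        # Entropy of the quantized symbols approximates the variable-length
        # coding rate in bits per value.
        _, counts = np.unique(q, return_counts=True)
        p = counts / counts.sum()
        bits_per_value = -(p * np.log2(p)).sum()
        x_hat = np.tensordot(T.T, q * step, axes=([1], [0]))  # dequantize + invert
        return x_hat, bits_per_value

    # Synthetic feature map with strongly correlated channels.
    rng = np.random.default_rng(0)
    base = rng.standard_normal((1, 16, 16))
    x = np.repeat(base, 32, axis=0) + 0.1 * rng.standard_normal((32, 16, 16))
    x_hat, bpv = compress_feature_map(x)
    print(f"bits/value ~ {bpv:.2f}, reconstruction MSE = {np.mean((x - x_hat) ** 2):.5f}")

    Because the channels are correlated, most signal energy concentrates in a few transform coefficients, so the quantized symbols have low entropy and the estimated rate falls well below the raw bit-width.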

    Robust Quantization: One Model to Rule Them All

    Neural network quantization methods often involve simulating the quantization process during training, making the trained model highly dependent on the target bit-width and on the precise way quantization is performed. Robust quantization offers an alternative approach with improved tolerance to different classes of data types and quantization policies. It opens up exciting new applications where the quantization process is not static and can vary to meet different circumstances and implementations. To address this need, we propose a method that provides intrinsic robustness to the model against a broad range of quantization processes. Our method is motivated by theoretical arguments and enables us to store a single generic model capable of operating at various bit-widths and quantization policies. We validate our method's effectiveness on different ImageNet models.
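    The scenario the abstract targets, one stored model evaluated under many post-training quantizers, can be illustrated with a plain symmetric uniform quantizer swept over bit-widths. The quantizer and the random stand-in weights below are assumptions for illustration; the paper's actual robustness mechanism (how training is shaped so the model tolerates these perturbations) is not reproduced here:

    import numpy as np

    def uniform_quantize(w, bits):
        # Symmetric uniform quantizer: maps weights onto 2**bits integer levels.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max() / qmax
        return np.round(w / scale).clip(-qmax, qmax) * scale

    # Hypothetical weight tensor standing in for one layer of a stored model.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(10_000) * 0.05

    # A robustly quantized model should degrade gracefully across this sweep.
    for bits in (8, 6, 4, 2):
        err = np.mean((w - uniform_quantize(w, bits)) ** 2)
        print(f"{bits}-bit quantization, weight MSE = {err:.2e}")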

    Inference time evaluation and quantization error analysis on the Edge TPU processor

    Master's thesis (Trabajo Fin de Máster) in Computer Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2022/2023.

    The rise of neural networks has led to the emergence of special-purpose architectures for their computations. The tensor computations that dominate these networks can be performed efficiently by tensor processing units (TPUs). Network inference is commonly subject to strict time constraints, and TPUs are used for edge computing to reduce latency in IoT environments. This work studies the performance of the Edge TPU processor, designed by Google specifically for edge computing. This processor performs inference with 8-bit integer arithmetic, which yields significant performance and energy-efficiency benefits. However, the use of reduced precision requires model quantization, which introduces some error into the inference. This work also analyzes the quantization error for models trained by reinforcement learning. The internal memory of the Edge TPU (8 MiB) is too small even for models that are not excessively large. If a model does not fit completely in this memory, a portion is stored on the host and transferred to the TPU during inference, which degrades performance significantly. This bottleneck is alleviated considerably by segmenting the model and running the fragments in a pipeline of TPUs. Compared to a single TPU, segmentation across up to four of them yielded performance improvements of 6× on convolutional layers and almost 50× on dense layers. Furthermore, the influence of the width of the weight distribution, relative to the dispersion of its values, on the quantization error is observed and justified. Moreover, for several network architectures, the same patterns of error evolution are observed as training progresses. The impact of this error on the reward obtained by the quantized model versus the unquantized model is also examined. Finally, it is observed and justified that depth scaling of the neural network (adding more layers) notably increases the quantization error.
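    The thesis's observation that the width of the weight distribution drives the quantization error can be reproduced with a generic 8-bit affine (scale plus zero-point) quantizer, the kind of scheme used by integer-only runtimes such as the Edge TPU's. The quantizer below is a common textbook formulation assumed for illustration, not code from the thesis:

    import numpy as np

    def affine_quantize_uint8(x):
        # Asymmetric 8-bit affine quantization over the tensor's value range.
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255.0
        zero_point = int(round(-lo / scale))
        q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return (q.astype(np.float32) - zero_point) * scale

    rng = np.random.default_rng(0)
    # Wider weight distributions coarsen the quantization step and raise the error.
    for spread in (0.1, 1.0, 10.0):
        w = rng.standard_normal(100_000).astype(np.float32) * spread
        q, s, zp = affine_quantize_uint8(w)
        err = np.mean((w - dequantize(q, s, zp)) ** 2)
        print(f"weight std = {spread:>4}: step = {s:.4f}, MSE = {err:.2e}")

    With 256 levels spread over the full value range, the step size, and hence the squared error, grows with the width of the distribution, which matches the trend the thesis observes and justifies.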