
    MultiQuant: A Novel Multi-Branch Topology Method for Arbitrary Bit-width Network Quantization

    Arbitrary bit-width network quantization has received significant attention for its adaptability to varying bit-width requirements at runtime. In this paper, we investigate existing methods and observe a significant accumulation of quantization errors caused by frequent bit-width switching of weights and activations, which limits performance. To address this issue, we propose MultiQuant, a novel method that uses a multi-branch topology for arbitrary bit-width quantization. MultiQuant duplicates the network body into multiple independent branches and quantizes the weights of each branch to a fixed 2 bits while keeping the input activations at the requested bit-width. This keeps the computational cost unchanged while avoiding weight bit-width switching, thereby substantially reducing weight quantization error. Additionally, we introduce an amortization branch selection strategy that distributes the quantization errors caused by activation bit-width switching among the branches to improve performance. Finally, we design an in-place distillation strategy in which branches guide one another, further improving MultiQuant's performance. Extensive experiments demonstrate that MultiQuant achieves significant performance gains over existing arbitrary bit-width quantization methods. Code is at https://github.com/zysxmu/MultiQuant
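
    To make the branch scheme concrete, the sketch below builds a toy linear layer that holds several fixed 2-bit weight copies and quantizes only the activations to the requested bit-width. The class and helper names (`MultiBranchLinear`, `quantize`) are illustrative, not the paper's API, and in the real method each branch would be trained independently rather than copied.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given signed bit-width."""
    levels = 2 ** (bits - 1) - 1              # e.g. 1 for 2-bit signed (ternary)
    scale = np.abs(x).max() / levels + 1e-12
    return np.round(x / scale).clip(-levels, levels) * scale

class MultiBranchLinear:
    """Toy layer in the spirit of MultiQuant: the weight tensor is duplicated
    into independent branches, each kept at a fixed 2-bit precision, while the
    input activation is quantized to whatever bit-width the runtime requests.
    Branch outputs are averaged. Illustrative sketch, not the authors' code."""
    def __init__(self, w, n_branches=4):
        # Each branch owns its own 2-bit weight copy, so no weight bit-width
        # switching ever happens at inference time.
        self.branches = [quantize(w.copy(), bits=2) for _ in range(n_branches)]

    def forward(self, x, act_bits):
        xq = quantize(x, act_bits)            # activations follow the requested bit-width
        return np.mean([xq @ wq for wq in self.branches], axis=0)

rng = np.random.default_rng(0)
layer = MultiBranchLinear(rng.normal(size=(8, 4)))
print(layer.forward(rng.normal(size=(2, 8)), act_bits=4).shape)  # (2, 4)
```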

    HEQuant: Marrying Homomorphic Encryption and Quantization for Communication-Efficient Private Inference

    Secure two-party computation with homomorphic encryption (HE) protects data privacy with a formal security guarantee but suffers from high communication overhead. While previous works, e.g., Cheetah and Iron, have proposed efficient HE-based protocols for different neural network (NN) operations, they still assume high precision, e.g., 37-bit fixed point, for the NN operations and ignore NNs' native robustness against quantization error. In this paper, we propose HEQuant, which features low-precision-quantization-aware optimization for HE-based protocols. We observe that the benefit of naively combining quantization and HE quickly saturates as bit precision decreases. Hence, to further improve communication efficiency, we propose a series of optimizations, including an intra-coefficient packing algorithm and a quantization-aware tiling algorithm, to simultaneously reduce the number and precision of the transferred data. Compared with prior-art HE-based protocols, e.g., CrypTFlow2, Cheetah, and Iron, HEQuant achieves 3.5∼23.4× communication reduction and 3.0∼9.3× latency reduction. Meanwhile, compared with prior-art network optimization frameworks, e.g., SENet and SNL, HEQuant also achieves 3.1∼3.6× communication reduction.
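
    The intra-coefficient packing idea can be illustrated with plain integers: once values are quantized to a few bits, several of them fit inside a single plaintext coefficient, so fewer coefficients (and hence ciphertexts) cross the wire. The sketch below is a hypothetical bit-level illustration of that principle, not the HEQuant protocol itself.

```python
def pack_pair(a, b, bits=4):
    """Pack two unsigned `bits`-wide integers into one coefficient.
    Hypothetical illustration of intra-coefficient packing: values quantized
    far below the plaintext modulus can share a single plaintext slot,
    halving the number of coefficients that must be transferred."""
    assert 0 <= a < 2 ** bits and 0 <= b < 2 ** bits
    return (a << bits) | b

def unpack_pair(c, bits=4):
    """Recover the two packed integers from one coefficient."""
    mask = (1 << bits) - 1
    return (c >> bits) & mask, c & mask

packed = pack_pair(9, 3)
print(packed, unpack_pair(packed))  # 147 (9, 3)
```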

    Sequence-To-Sequence Neural Networks Inference on Embedded Processors Using Dynamic Beam Search

    Sequence-to-sequence deep neural networks have become the state of the art for a variety of machine learning applications, ranging from neural machine translation (NMT) to speech recognition. Many mobile and Internet of Things (IoT) applications would benefit from the ability to perform sequence-to-sequence inference directly on embedded devices, reducing the amount of raw data transmitted to the cloud and yielding benefits in response latency, energy consumption, and security. However, due to the high computational complexity of these models, specific optimization techniques are needed to achieve acceptable performance and energy consumption on single-core embedded processors. In this paper, we present a new optimization technique called dynamic beam search, in which the inference complexity is tuned at runtime to the difficulty of the processed input sequence. Measurements on a real embedded device with three state-of-the-art deep learning models show that our method reduces inference time and energy by up to 25% without loss of accuracy.
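
    One plausible reading of the technique is a decoder whose beam width adapts per step: confident steps use a narrow beam, uncertain ones widen it. The sketch below implements that idea over a toy next-token function; names such as `step_fn` and `conf_thresh` are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def dynamic_beam_search(step_fn, bos, eos, max_len=20,
                        min_beam=1, max_beam=4, conf_thresh=0.8):
    """Beam search whose width adapts at runtime: when every live beam is
    confident (top-token probability >= conf_thresh), the beam narrows to
    min_beam; otherwise it widens to max_beam. step_fn(prefix) returns a
    probability vector over the vocabulary."""
    beams = [([bos], 0.0)]                      # (token list, log-probability)
    for _ in range(max_len):
        candidates, widths = [], []
        for tokens, lp in beams:
            if tokens[-1] == eos:               # finished hypotheses pass through
                candidates.append((tokens, lp))
                continue
            probs = step_fn(tokens)
            widths.append(max_beam if probs.max() < conf_thresh else min_beam)
            for tok in np.argsort(probs)[::-1][:max_beam]:
                candidates.append((tokens + [int(tok)], lp + np.log(probs[tok])))
        width = max(widths, default=min_beam)   # widen only if some step was hard
        beams = sorted(candidates, key=lambda c: -c[1])[:width]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams[0]

def toy_step(prefix):                           # peaky toy "decoder"
    p = np.ones(5)
    p[len(prefix) % 5] = 5.0
    return p / p.sum()

tokens, logp = dynamic_beam_search(toy_step, bos=0, eos=4)
print(tokens, logp)
```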

    PD-Quant: Post-Training Quantization based on Prediction Difference Metric

    Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types. Although it can reduce the size and computational cost of deep neural networks, it also introduces quantization noise and reduces prediction accuracy, especially in extremely low-bit settings. How to determine the appropriate quantization parameters (e.g., scaling factors and rounding of weights) is the main open problem. Existing methods attempt to determine these parameters by minimizing the distance between features before and after quantization, but such an approach considers only local information and may not yield the optimal quantization parameters. We analyze this issue and propose PD-Quant, a method that addresses this limitation by considering global information: it determines the quantization parameters using the difference between network predictions before and after quantization. In addition, PD-Quant can alleviate the overfitting problem in PTQ caused by the small calibration set by adjusting the distribution of activations. Experiments show that PD-Quant yields better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, PD-Quant pushes the accuracy of ResNet-18 up to 53.14% and RegNetX-600MF up to 40.67% with 2-bit weights and 2-bit activations. The code is released at https://github.com/hustvl/PD-Quant
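
    The prediction-difference idea fits in a few lines: score each candidate quantization scale not by how well the layer's own features match, but by how little the end prediction moves. Below is a minimal sketch with a one-layer "network" and a KL divergence as the difference measure; the function names and the grid search are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def quantize(w, scale, bits=2):
    q = 2 ** (bits - 1) - 1
    return np.round(w / scale).clip(-q, q) * scale

def pick_scale_by_prediction_difference(x, w_fc, scales, bits=2):
    """Toy version of the PD-Quant criterion: choose the scale whose
    quantized weights move the *network prediction* the least (global),
    rather than matching the layer's features (local). Here the network
    is one linear layer + softmax and the difference is a KL divergence."""
    p_fp = softmax(x @ w_fc)                  # full-precision predictions
    best, best_kl = None, np.inf
    for s in scales:
        p_q = softmax(x @ quantize(w_fc, s, bits))
        kl = np.sum(p_fp * (np.log(p_fp + 1e-12) - np.log(p_q + 1e-12)))
        if kl < best_kl:
            best, best_kl = s, kl
    return best

rng = np.random.default_rng(0)
x, w = rng.normal(size=(16, 8)), rng.normal(size=(8, 4))
print(pick_scale_by_prediction_difference(x, w, np.linspace(0.1, 3.0, 30)))
```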

    Reducing Computational Complexity of Neural Networks in Optical Channel Equalization: From Concepts to Implementation

    In this paper, we propose a new methodology for the low-complexity development of neural network (NN) based equalizers that mitigate impairments in high-speed coherent optical transmission systems. We provide a comprehensive description and comparison of various deep model compression approaches applied to feed-forward and recurrent NN designs, and we evaluate the influence of these strategies on the performance of each NN equalizer. Quantization, weight clustering, pruning, and other cutting-edge model compression strategies are considered. We also propose and evaluate a Bayesian optimization-assisted compression, in which the hyperparameters of the compression are chosen to simultaneously reduce complexity and improve performance. Next, we present four distinct metrics (RMpS, BoP, NABS, and NLGs) for evaluating the computational complexity of the different compression algorithms; these metrics can serve as a benchmark for comparing NN equalizers when compression approaches are used. Finally, the trade-off between the complexity and performance of each compression approach is evaluated using both simulated and experimental data. By applying optimal compression approaches, we show that it is possible to design an NN-based equalizer that is simpler to implement and performs better than the conventional digital back-propagation (DBP) equalizer with only one step per span; this is accomplished by reducing the number of multipliers in the NN equalizer through weight clustering and pruning. Furthermore, we demonstrate that an NN-based equalizer can also achieve superior performance while maintaining the same degree of complexity as the full electronic chromatic dispersion compensation block. We conclude by highlighting open questions, existing challenges, and possible future research directions.
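
    A sketch of the multiplier-reduction mechanism: magnitude pruning zeroes small weights, and k-means clustering leaves each layer with only a handful of distinct weight values, so multiplications can be shared or replaced by table lookups. This is a minimal illustration of weight clustering in general, not the authors' Bayesian-optimized pipeline.

```python
import numpy as np

def cluster_weights(w, k=8, iters=20, seed=0):
    """Minimal 1-D k-means weight clustering: replace every weight with its
    cluster centroid so the layer holds at most k distinct values. With few
    distinct multiplicands, a hardware multiplier can be shared or swapped
    for a small lookup table (sketch only)."""
    flat = w.ravel()
    rng = np.random.default_rng(seed)
    centroids = rng.choice(flat, size=k, replace=False)
    for _ in range(iters):
        # assign each weight to its nearest centroid, then recenter
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = flat[assign == j].mean()
    return centroids[assign].reshape(w.shape), centroids

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))
w[np.abs(w) < 0.5] = 0.0                 # magnitude pruning first
wc, centers = cluster_weights(w, k=8)
print(len(np.unique(wc)))                # at most 8 distinct multiplicands
```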

    A Mutual Learning Framework for Pruned and Quantized Networks

    Model compression is an important topic in deep learning research. It can be divided mainly into two directions: model pruning and model quantization. However, both methods affect the original accuracy of the model to some degree. In this paper, we propose a mutual learning framework for pruned and quantized networks. We regard the pruned network and the quantized network as two sets of features that are not parallel. The purpose of our mutual learning framework is to better integrate the two sets of features and achieve complementary advantages, which we call feature augmentation. To verify the effectiveness of our framework, we select pairwise combinations of 3 state-of-the-art pruning algorithms and 3 state-of-the-art quantization algorithms. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show the benefits of our framework: through the mutual learning of the two networks, we obtain a pruned network and a quantized network with higher accuracy than traditional approaches.
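
    A common way to realize such mutual learning is to give each compressed network a cross-entropy loss plus a KL term pulling it toward the other network's predictions. The sketch below writes that objective down for one batch of logits; `alpha` and the exact loss composition are assumptions for illustration, since the paper's formulation may differ.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Mean per-sample KL divergence KL(p || q)."""
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()

def mutual_learning_losses(logits_pruned, logits_quant, labels, alpha=0.5):
    """One plausible mutual-learning objective: each network minimizes its own
    cross-entropy plus a KL mimicry term toward the other network, so the two
    compressed models exchange complementary features (illustrative sketch)."""
    p, q = softmax(logits_pruned), softmax(logits_quant)
    n = np.arange(len(labels))
    ce_p = -np.log(p[n, labels] + 1e-12).mean()
    ce_q = -np.log(q[n, labels] + 1e-12).mean()
    loss_pruned = ce_p + alpha * kl(q, p)    # pruned net mimics quantized net
    loss_quant = ce_q + alpha * kl(p, q)     # quantized net mimics pruned net
    return loss_pruned, loss_quant

rng = np.random.default_rng(0)
lp_, lq_ = rng.normal(size=(8, 10)), rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
print(mutual_learning_losses(lp_, lq_, labels))
```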