Network Pruning for Low-Rank Binary Indexing
Pruning is an efficient model compression technique to remove redundancy in
the connectivity of deep neural networks (DNNs). Computations using sparse
matrices obtained by pruning parameters, however, exhibit vastly different
parallelism depending on the index representation scheme. As a result,
fine-grained pruning has not gained much attention: its irregular index
form leads to a large memory footprint and low parallelism for convolutions and
matrix multiplications. In this paper, we propose a new network pruning
technique that generates a low-rank binary index matrix to compress index data,
while decompression of the index data is performed by a simple binary matrix
multiplication. The proposed compression method finds a particular
fine-grained pruning mask that can be decomposed into two binary matrices. We
also propose a tile-based factorization technique that not only lowers memory
requirements but also enhances the compression ratio. Various DNN models can be
pruned with significantly less index data than previous sparse matrix formats
while maintaining the same pruning rate.
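A minimal NumPy sketch of the decompression step may help: two small binary factors are multiplied in Boolean arithmetic to recover the full fine-grained pruning mask, which then gates the dense weights. The shapes, the rank r, and the random factors here are illustrative assumptions; the paper's actual contribution, finding a mask that admits such a factorization, is not shown.

```python
import numpy as np

# Illustrative sizes (assumption): a 512x512 pruning mask stored as two rank-8 binary factors.
m, n, r = 512, 512, 8
rng = np.random.default_rng(0)

B1 = rng.integers(0, 2, size=(m, r), dtype=np.uint8)   # m x r binary factor
B2 = rng.integers(0, 2, size=(r, n), dtype=np.uint8)   # r x n binary factor

# Decompression: Boolean matrix multiplication (logical OR over AND)
# reconstructs the full fine-grained pruning mask from the two factors.
mask = (B1.astype(np.int32) @ B2.astype(np.int32)) > 0  # m x n boolean mask

# The reconstructed mask gates the dense weight matrix element-wise.
W = rng.standard_normal((m, n)).astype(np.float32)
W_pruned = W * mask

# Index storage drops from m*n mask bits to r*(m + n) bits for the two factors.
print(f"index bits: {r * (m + n)} (factored) vs {m * n} (dense mask)")
```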
BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-based Quantized DNNs
The number of parameters in deep neural networks (DNNs) is rapidly increasing
to support complicated tasks and to improve model accuracy. Correspondingly,
the amount of computation and the required memory footprint increase as well.
Quantization is an efficient method to address such concerns: it compresses
DNNs so that computations are simplified and the required storage footprint
is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully
support quantization because only fixed-width data transfers (such as 32 bits)
are allowed. As a result, even if weights are quantized into a few bits, CPUs
and GPUs cannot access multiple quantized weights without wasting memory bandwidth.
The practical success of quantization therefore relies on an efficient computation
engine design, especially for matrix multiplication, which is a basic computation
kernel in most DNNs. In this paper, we propose a novel matrix multiplication
method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access
multiple quantized weights simultaneously in one instruction. In addition,
BiQGEMM pre-computes intermediate results that are highly redundant when
quantization limits the space of available computation results. Since the
pre-computed values are stored in lookup tables and reused, BiQGEMM reduces the
overall amount of computation. Our extensive experimental results show that
BiQGEMM delivers higher performance than conventional schemes when DNNs are quantized.
Comment: 13 pages, 12 figures
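As a rough illustration of the lookup-table idea, the sketch below implements a LUT-based GEMV for 1-bit binary-coded weights (W approximated by alpha * B with B in {-1, +1}): signed partial sums over groups of mu activations are pre-computed once into a table and then merely looked up per output row. The group size mu, the code packing, and the function name are assumptions for illustration; BiQGEMM's actual instruction-level weight packing and multi-bit binary coding are not reproduced here.

```python
import numpy as np

def lut_gemv(B_codes, alpha, x, mu=4):
    """Compute y = alpha * (B @ x) for 1-bit binary-coded weights via lookup tables.

    B_codes: (out_dim, n_groups) uint8 array; each entry is a mu-bit pattern
             selecting signs (+1 for a set bit, -1 for a clear bit) of a group.
    alpha:   (out_dim,) per-row scale factors.
    x:       (in_dim,) input activations, with in_dim = n_groups * mu.
    """
    n_groups = x.size // mu
    # Pre-compute the lookup table: for every group and every mu-bit sign
    # pattern, store the signed sum of that group's activations. This work is
    # shared by all output rows, which is where the redundancy is removed.
    patterns = np.arange(2 ** mu)
    signs = np.where((patterns[:, None] >> np.arange(mu)) & 1, 1.0, -1.0)  # (2^mu, mu)
    x_groups = x.reshape(n_groups, mu)                                     # (G, mu)
    lut = signs @ x_groups.T                                               # (2^mu, G)

    # Each output element is then just n_groups table lookups plus additions.
    y = lut[B_codes, np.arange(n_groups)].sum(axis=1)
    return alpha * y

# Example (hypothetical sizes): verify against the dense equivalent.
rng = np.random.default_rng(0)
out_dim, mu, n_groups = 8, 4, 16
in_dim = mu * n_groups
B_codes = rng.integers(0, 2 ** mu, size=(out_dim, n_groups), dtype=np.uint8)
alpha = rng.standard_normal(out_dim)
x = rng.standard_normal(in_dim)

# Expand the packed codes back to a dense {-1, +1} matrix for reference.
bits = (B_codes[..., None] >> np.arange(mu)) & 1
B_dense = np.where(bits, 1.0, -1.0).reshape(out_dim, in_dim)
assert np.allclose(lut_gemv(B_codes, alpha, x, mu), alpha * (B_dense @ x))
```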