Embedding tables dominate the size of industrial-scale recommendation models,
consuming up to terabytes of memory. A popular benchmark, and the largest
publicly available in MLPerf for recommendation data, is a Deep Learning
Recommendation Model (DLRM) trained on a terabyte of click-through data; it
contains 100GB of embedding memory (25+ billion parameters). Due to their sheer
size and the associated volume of data, DLRMs are difficult to train, difficult
to deploy for inference, and memory-bound by their large embedding tables.
This paper analyzes and extensively evaluates a generic parameter sharing setup
(PSS) for compressing DLRM models. We show theoretical upper bounds on the
learnable memory required to achieve a (1±ϵ) approximation of the
embedding table. Our bounds indicate that exponentially fewer parameters
suffice for good accuracy. Empirically, we demonstrate a PSS DLRM reaching
10000× compression on criteo-tb without losing quality. Such
compression, however, comes with a caveat: it requires 4.5× more
iterations to reach the same saturation quality. The paper argues that this
tradeoff warrants further investigation, as it may be significantly favorable.
Leveraging the small size of the compressed model, we show a 4.3×
improvement in training latency, leading to similar overall training times.
Thus, in the tradeoff between the system advantages of a small DLRM model and
its slower convergence, we show that the scales are tipped towards the smaller
DLRM model, which yields faster inference, easier deployment, and similar
training times.
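To make the idea of parameter sharing concrete, the following is a minimal,
hypothetical sketch in Python (NumPy): every entry of the virtual embedding
table is mapped by a fixed hash into one small array of learnable weights. The
class name, hash constants, and sizes are illustrative assumptions, not the
paper's exact PSS construction.

```python
import numpy as np

class HashedSharedEmbedding:
    """Minimal sketch of a parameter-sharing embedding table.

    Every entry of a virtual (vocab_size x dim) table is hashed into one
    flat array of `compressed_size` learnable weights, so memory scales
    with the compressed size rather than with vocab_size * dim.
    """

    # Hypothetical hash constants (any large primes work for a sketch).
    P_ROW = 1_000_000_007
    P_COL = 998_244_353

    def __init__(self, vocab_size, dim, compressed_size, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab_size = vocab_size
        self.dim = dim
        self.compressed_size = compressed_size
        # The only learnable memory: far smaller than the full table.
        self.weights = rng.normal(0.0, 0.01, size=compressed_size).astype(np.float32)

    def lookup(self, row_ids):
        """Gather embedding vectors for a batch of categorical ids."""
        rows = np.asarray(row_ids, dtype=np.int64)[:, None]      # (batch, 1)
        cols = np.arange(self.dim, dtype=np.int64)[None, :]      # (1, dim)
        # Hash each (row, column) coordinate to a slot in the shared array.
        slots = (rows * self.P_ROW + cols * self.P_COL) % self.compressed_size
        return self.weights[slots]                               # (batch, dim)


# Example: a table that would otherwise need billions of parameters is backed
# by only 2.5M shared floats, several orders of magnitude smaller.
emb = HashedSharedEmbedding(vocab_size=10**9, dim=128, compressed_size=2_500_000)
vectors = emb.lookup([3, 17, 123_456_789])   # shape (3, 128)
```

In a trainable version, the shared array would be a learnable tensor whose
slots accumulate gradients from every table entry hashed to them, which is what
allows the compressed model to recover the full table's accuracy over more
iterations.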