The problem of reducing the communication cost of distributed training through gradient quantization is considered. For gradient descent on smooth and strongly convex objective functions on R^n, we characterize the fundamental rate function: the minimum linear convergence rate achievable for a given number of bits per problem dimension n. We propose Differentially Quantized Gradient Descent, a quantization algorithm with error compensation, and prove that it attains the rate function as the dimension n goes to infinity. In contrast, a naive quantizer that directly compresses the current gradient fails to achieve this optimal tradeoff. Experimental results on both simulated and real-world least-squares problems confirm our theoretical analysis.
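To make the error-compensation idea concrete, the following is a minimal sketch of a generic error-feedback quantized gradient descent loop, not the paper's exact Differentially Quantized Gradient Descent recursion or quantizer; the uniform scalar quantizer, the function names, and the toy least-squares setup are illustrative assumptions.

```python
import numpy as np

def uniform_quantize(v, bits, dynamic_range):
    """Illustrative uniform scalar quantizer on [-dynamic_range, dynamic_range] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 * dynamic_range / (levels - 1)
    return np.clip(np.round(v / step) * step, -dynamic_range, dynamic_range)

def error_compensated_quantized_gd(grad, x0, eta, bits, dynamic_range, iters):
    """Generic error-feedback quantized GD: quantize the gradient plus the carried-over quantization error."""
    x = x0.copy()
    e = np.zeros_like(x0)              # accumulated quantization error
    for _ in range(iters):
        u = grad(x) + e                # compensate for past quantization error
        q = uniform_quantize(u, bits, dynamic_range)
        e = u - q                      # error fed back at the next iteration
        x = x - eta * q                # descent step uses only the quantized message
    return x

# Toy least-squares example (hypothetical data): f(x) = 0.5 * ||A x - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2          # smoothness constant (largest singular value squared)
x_hat = error_compensated_quantized_gd(grad, np.zeros(10), eta=1.0 / L,
                                        bits=4, dynamic_range=10.0, iters=200)
```

The key design choice mirrored here is that the descent step uses only the quantized message, while the unquantized residual is retained locally and added to the next gradient, so quantization errors are compensated over time rather than discarded.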