Neural personalized recommendation is the cornerstone of a wide collection
of cloud services and products, constituting a significant share of the compute
demand on cloud infrastructure. Thus, improving the execution efficiency of
neural recommendation directly translates into infrastructure capacity
savings. In this
paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that
adopts an algorithm and system co-design methodology to custom-design systems
for recommendation use cases. Leveraging insights from this recommendation
characterization, we propose DeepRecSched, a new dynamic scheduler that
maximizes latency-bounded throughput by taking into account the
characteristics of inference query sizes and arrival patterns, recommendation
model architectures, and the underlying hardware systems. By doing so,
DeepRecSched doubles system throughput across eight industry-representative
recommendation models. Finally,
the design, deployment, and evaluation of DeepRecSched in an at-scale
production datacenter show over a 30% latency reduction across a wide variety
of recommendation models running on hundreds of machines.
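As a rough illustration of the latency-bounded throughput objective that DeepRecSched optimizes (not the paper's actual algorithm), the following self-contained Python sketch sweeps a hypothetical per-query batch-size setting against a toy latency model and keeps the setting that sustains the highest offered load under a p99 tail-latency SLA. The latency model, SLA value, and all names here are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed names and a toy latency model; this is
# NOT the DeepRecSched implementation): sweep a batch-size setting and keep
# the one sustaining the highest load under a p99 tail-latency SLA.
import random

SLA_MS = 100.0  # hypothetical p99 tail-latency target


def simulate_p99(batch_size, qps, trials=2000):
    """Toy model: bigger batches amortize fixed per-batch cost, but
    queueing delay grows with offered load (qps)."""
    latencies = []
    for _ in range(trials):
        service = 5.0 + 0.05 * batch_size  # per-batch compute time (ms)
        queueing = random.expovariate(1.0) * qps / batch_size
        latencies.append(service + queueing)
    latencies.sort()
    return latencies[int(0.99 * trials)]


def max_qps(batch_size, step=50):
    """Highest offered load (queries/s) this batch size sustains under SLA."""
    qps = 0
    while simulate_p99(batch_size, qps + step) <= SLA_MS:
        qps += step
    return qps


if __name__ == "__main__":
    random.seed(0)
    best = max(range(16, 257, 16), key=max_qps)
    print(f"best batch size: {best}, sustained QPS under "
          f"{SLA_MS} ms p99: {max_qps(best)}")
```

The sketch only conveys the throughput-versus-tail-latency trade-off; the actual scheduler is dynamic, reacting to inference query sizes and arrival patterns at run time rather than sweeping a fixed setting offline.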