Presto is an open-source distributed SQL query engine for OLAP, aiming for
"SQL on everything". Since it was open-sourced in 2013, Presto has steadily
gained popularity in large-scale data analytics and has been adopted by a
wide range of enterprises. While developing and operating Presto, we
observed that parsing column-oriented data files consumes a significant amount
of CPU on Presto worker nodes. This overhead prevents some companies, including
Meta, from increasing their analytical data volumes.
In this paper, we present a metadata caching layer, built on top of the
Alluxio SDK cache and incorporated in each Presto worker node, to cache the
intermediate results of file parsing. The metadata cache provides two caching
methods: caching the decompressed metadata bytes from raw data files and
caching the deserialized metadata objects. Our evaluation of the TPC-DS
benchmark on Presto demonstrates that, when the cache is warm, the first method
can reduce a query's CPU consumption by 10%-20%, whereas the second method
can reduce it by 20%-40%.
Comment: 5 pages, 8 figures
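
The difference between the two caching methods can be illustrated with a
minimal sketch. This is not the Presto or Alluxio implementation: the class
`MetadataCache`, the function `simulate`, the `file_id` key, and the use of
`zlib` and `json` as stand-ins for file decompression and metadata
deserialization are all assumptions made for illustration. The sketch only
counts how often each parsing step runs when the same metadata is read
repeatedly.

```python
import json
import zlib


class MetadataCache:
    """Toy per-worker metadata cache (hypothetical, for illustration only).

    mode="bytes":   cache the decompressed metadata bytes, so a hit skips
                    decompression but still deserializes.
    mode="objects": cache the deserialized metadata object, so a hit skips
                    both decompression and deserialization.
    """

    def __init__(self, mode):
        if mode not in ("bytes", "objects"):
            raise ValueError("mode must be 'bytes' or 'objects'")
        self.mode = mode
        self._store = {}
        self.decompress_calls = 0
        self.deserialize_calls = 0

    def _decompress(self, raw):
        # Stand-in for decompressing a metadata section of a data file.
        self.decompress_calls += 1
        return zlib.decompress(raw)

    def _deserialize(self, data):
        # Stand-in for turning metadata bytes into an in-memory object.
        self.deserialize_calls += 1
        return json.loads(data)

    def read_metadata(self, file_id, raw):
        if self.mode == "objects":
            # Cache miss: run both steps once and keep the object.
            if file_id not in self._store:
                self._store[file_id] = self._deserialize(self._decompress(raw))
            return self._store[file_id]
        # "bytes" mode: decompress once, deserialize on every read.
        if file_id not in self._store:
            self._store[file_id] = self._decompress(raw)
        return self._deserialize(self._store[file_id])


def simulate(mode, reads=3):
    """Read the same metadata `reads` times and return the work counters."""
    footer = {"columns": ["a", "b"], "row_count": 1000}
    raw = zlib.compress(json.dumps(footer).encode("utf-8"))
    cache = MetadataCache(mode)
    for _ in range(reads):
        assert cache.read_metadata("part-0", raw) == footer
    return cache.decompress_calls, cache.deserialize_calls
```

In this toy setting, the byte-level cache removes repeated decompression, while
the object-level cache also removes repeated deserialization, which is
consistent in direction with the larger CPU reduction reported for the second
method. A real cache would additionally need eviction, memory accounting, and
invalidation when files change, none of which is modeled here.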