Corra: Correlation-Aware Column Compression

Kipf, Andreas; Liu, Hanwen; Stoian, Mihail; van Renen, Alexander

Authors: Andreas Kipf, Hanwen Liu, Mihail Stoian, Alexander van Renen
Publication date: 17 June 2024

Abstract

Column encoding schemes have seen a surge of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising: as data volume increases, it becomes more and more important to reduce storage cost on object storage (such as S3) as well as reduce memory pressure in multi-tenant in-memory buffers of cloud databases. However, single-column encoding schemes have reached a plateau in terms of the compressed size they can achieve. We argue that this is due to the neglect of cross-column correlations. For instance, consider the column pair (\texttt{city}, \texttt{zip_code}). Typically, cities have only a few dozen unique zip codes. If this information is properly exploited, it can significantly reduce the space consumption of the latter column. In this work, we depart from the established path of compressing data using only single-column encoding schemes and introduce several what we call