Using memory located on remote machines, or far memory, as a swap space is a
promising approach to meet the increasing memory demands of modern datacenter
applications. Operating systems have long relied on prefetchers to mask the
increased latency of fetching pages from swap space to main memory.
Unfortunately, with traditional prefetching heuristics, performance still
degrades when applications use far memory. In this paper we propose a new
prefetching technique for far-memory applications. We focus our efforts on
memory-intensive, oblivious applications whose memory access patterns are
independent of their inputs, such as matrix multiplication. For this class of
applications we observe that we can perfectly prefetch pages without relying on
heuristics. However, prefetching perfectly without requiring significant
application modifications is challenging.
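
As a minimal sketch of what "oblivious" means here (an illustration only, not 3PO's implementation), the Python snippet below records the page-level access trace of a naive matrix multiplication and checks that the trace is the same for different inputs. The names `PAGE_ELEMS`, `matmul_page_trace`, and `plan_prefetches`, the tiny page size, and the fixed-lookahead plan are all assumptions made for this example.

```python
import random

PAGE_ELEMS = 4  # hypothetical: elements per page, kept tiny for illustration


def matmul_page_trace(a, b, n):
    """Multiply two n x n matrices stored as flat row-major lists and record
    the sequence of (array, page) pairs touched, collapsing immediate repeats."""
    c = [0.0] * (n * n)
    trace = []

    def touch(name, index):
        page = (name, index // PAGE_ELEMS)
        if not trace or trace[-1] != page:
            trace.append(page)

    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                touch("A", i * n + k)
                touch("B", k * n + j)
                acc += a[i * n + k] * b[k * n + j]
            touch("C", i * n + j)
            c[i * n + j] = acc
    return c, trace


def plan_prefetches(trace, lookahead):
    """For each step t, name the page needed `lookahead` steps later
    (None once the trace runs out)."""
    return [trace[t + lookahead] if t + lookahead < len(trace) else None
            for t in range(len(trace))]


def random_matrix(n, seed):
    rng = random.Random(seed)
    return [rng.random() for _ in range(n * n)]


n = 4
_, trace_1 = matmul_page_trace(random_matrix(n, 1), random_matrix(n, 2), n)
_, trace_2 = matmul_page_trace(random_matrix(n, 3), random_matrix(n, 4), n)

# The access pattern depends only on n, never on the matrix contents.
same_trace = trace_1 == trace_2

# A trace recorded once can therefore serve as a prefetch plan for later runs.
plan = plan_prefetches(trace_1, lookahead=2)
```

Because the trace is a function of the problem size alone, a plan derived from one run stays valid for any input of the same shape, which is why no heuristic is needed for this class of applications.
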

We describe the design and implementation of 3PO, a system that
provides pre-planned prefetching for general oblivious applications. We
demonstrate that 3PO can accelerate applications, e.g., running them 30-150%
faster than Linux's prefetcher does when local memory is limited to 20%. We also use 3PO to
understand the fundamental software overheads of prefetching in a paging-based
system, and the minimum performance penalty that they impose when we run
applications under constrained local memory.

Comment: 14 page