We present XMem, a video object segmentation architecture for long videos
with unified feature memory stores inspired by the Atkinson-Shiffrin memory
model. Prior work on video object segmentation typically uses only one type of
feature memory. For videos longer than a minute, a single feature memory model
tightly links memory consumption and accuracy. In contrast, following the
Atkinson-Shiffrin model, we develop an architecture that incorporates multiple
independent yet deeply connected feature memory stores: a rapidly updated
sensory memory, a high-resolution working memory, and a compact and thus sustained
long-term memory. Crucially, we develop a memory potentiation algorithm that
routinely consolidates actively used working memory elements into the long-term
memory, which avoids memory explosion and minimizes performance decay for
long-term prediction. Combined with a new memory reading mechanism, XMem
greatly exceeds state-of-the-art performance on long-video datasets while being
on par with state-of-the-art methods (that do not work on long videos) on
short-video datasets. Code is available at https://hkchengrex.github.io/XMem
Comment: Accepted to ECCV 2022. Project page:
https://hkchengrex.github.io/XMe
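
As a rough illustration of the consolidation idea described in the abstract, the toy sketch below keeps a small working memory with per-element usage counts and, once it fills up, promotes the most frequently read elements into a long-term store while discarding the rest. This is not XMem's actual algorithm: the class `ToyMemory`, the parameters `working_capacity` and `keep_top`, and the usage-count ranking are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    # Hypothetical parameters, not taken from the paper.
    working_capacity: int = 4   # max elements held in working memory
    keep_top: int = 2           # elements promoted per consolidation
    working: dict = field(default_factory=dict)    # key -> [feature, usage count]
    long_term: dict = field(default_factory=dict)  # key -> feature

    def write(self, key, feature):
        """Store a new element, consolidating first if working memory is full."""
        if len(self.working) >= self.working_capacity:
            self.consolidate()
        self.working[key] = [feature, 0]

    def read(self, key):
        """Return a stored feature; reads from working memory count as usage."""
        if key in self.working:
            self.working[key][1] += 1
            return self.working[key][0]
        return self.long_term.get(key)

    def consolidate(self):
        """Promote the most-used working elements to long-term memory, drop the rest."""
        ranked = sorted(self.working, key=lambda k: self.working[k][1], reverse=True)
        for key in ranked[: self.keep_top]:
            self.long_term[key] = self.working[key][0]
        self.working.clear()


mem = ToyMemory()
for i in range(4):
    mem.write(i, f"feat{i}")
mem.read(1)
mem.read(1)
mem.read(3)
mem.write(4, "feat4")  # working memory is full, so consolidation runs first
print(sorted(mem.long_term), list(mem.working))
```

In this toy version the working memory never exceeds `working_capacity` elements, and only elements that were actually read are likely to survive into the long-term store, which mirrors the stated goal of keeping memory bounded while retaining actively used information.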