Recent years have witnessed remarkable advances in spatiotemporal predictive
learning, incorporating auxiliary inputs, elaborate neural architectures, and
sophisticated training strategies. Although these advances are impressive, the
system complexity of mainstream methods is increasing as well, which may hinder
their practical application. This paper proposes SimVP, a simple spatiotemporal predictive
baseline model that is built entirely on convolutional networks, without
recurrent architectures, and trained with a standard mean squared error loss in
an end-to-end fashion. Without introducing any extra tricks or strategies, SimVP
can achieve superior performance on various benchmark datasets. To further
improve performance, we derive variants of SimVP that use a gated
spatiotemporal attention translator. We
demonstrate that SimVP has strong generalization and extensibility on
real-world datasets through extensive experiments. Its significantly reduced
training cost makes it easier to scale to complex scenarios. We believe SimVP
can serve as a solid baseline to benefit the spatiotemporal predictive learning
community.
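
As a minimal illustration of the training objective named above, the sketch below computes a mean squared error between a predicted and a ground-truth frame sequence. It is not the SimVP implementation: the function name `mse_loss`, the nested-list frame layout (T frames of H x W single-channel pixels), and the toy values are assumptions made for this example only.

```python
from typing import Sequence

# One H x W frame; a single channel is assumed here for brevity.
Frame = Sequence[Sequence[float]]


def mse_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean squared error averaged over all frames and pixels.

    Hypothetical helper for illustration; shapes are assumed to match
    (zip would silently truncate mismatched inputs).
    """
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(pred, target):
        for p_row, t_row in zip(p_frame, t_frame):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    if count == 0:
        raise ValueError("empty input")
    return total / count


# Toy data: two predicted 2x2 frames and their ground truth.
pred = [[[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]]
target = [[[0.0, 1.0], [1.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
loss = mse_loss(pred, target)
```

In an end-to-end setup, a single scalar of this kind would be the only quantity minimized, with no auxiliary losses or staged training.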