Markov state models (MSMs) are a widely used method for approximating the
eigenspectrum of the molecular dynamics propagator, yielding insight into the
long-timescale statistical kinetics and slow dynamical modes of biomolecular
systems. However, the lack of a unified theoretical framework for choosing
between alternative models has hampered progress, especially for non-experts
applying these methods to novel biological systems. Here, we consider
cross-validation with a new objective function for estimators of these slow
dynamical modes, a generalized matrix Rayleigh quotient (GMRQ), which measures
the ability of a rank-m projection operator to capture the slow subspace of
the system. We show that a variational theorem bounds the GMRQ from above
by the sum of the first m eigenvalues of the system's propagator, but that
this bound can be violated when the requisite matrix elements are estimated
subject to statistical uncertainty. This overfitting can be detected and
avoided through cross-validation. These results make it possible to construct
Markov state models for protein dynamics in a way that appropriately captures
the tradeoff between systematic and statistical errors.
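
As a rough illustration of the variational bound described above, the following sketch evaluates a GMRQ on a small synthetic reversible Markov chain. It assumes the trace form GMRQ(A) = Tr[(A^T C A)(A^T S A)^{-1}], where C is a time-lagged correlation matrix, S is the overlap matrix, and the columns of the n x m matrix A span the trial subspace; this abstract does not state the formula, so treat that form as an assumption. The toy chain, the function name `gmrq`, and all variable names are hypothetical and chosen only for illustration. Because the matrices here are exact rather than estimated from data, the bound holds; with noisy estimates a score computed on the training data may exceed it, which is the overfitting that cross-validation is meant to detect.

```python
import numpy as np

def gmrq(A, C, S):
    """Assumed trace form: Tr[(A^T C A)(A^T S A)^{-1}] for an n x m matrix A."""
    P = A.T @ C @ A
    Q = A.T @ S @ A
    # Tr(P Q^{-1}) equals Tr(Q^{-1} P); solve() avoids forming the inverse.
    return float(np.trace(np.linalg.solve(Q, P)))

# Toy reversible chain built from a symmetric positive weight matrix W
# (hypothetical data, not from any simulation).
rng = np.random.default_rng(0)
n, m = 6, 3
W = rng.random((n, n))
W = W + W.T
pi = W.sum(axis=1) / W.sum()   # stationary distribution of T = W / row sums
C = W / W.sum()                # C = diag(pi) @ T, symmetric by construction
S = np.diag(pi)                # overlap matrix

# Solve C v = lambda S v through the symmetrized matrix S^{-1/2} C S^{-1/2}.
s_inv_sqrt = np.diag(1.0 / np.sqrt(pi))
evals, U = np.linalg.eigh(s_inv_sqrt @ C @ s_inv_sqrt)  # ascending order
bound = float(evals[-m:].sum())      # sum of the m largest eigenvalues

A_opt = s_inv_sqrt @ U[:, -m:]       # top-m generalized eigenvectors
A_rand = rng.standard_normal((n, m)) # arbitrary rank-m trial subspace

score_opt = gmrq(A_opt, C, S)        # attains the bound
score_rand = gmrq(A_rand, C, S)      # cannot exceed the bound
print(bound, score_opt, score_rand)
```

In a cross-validation setting one would, under the same assumptions, fit A on matrices estimated from training trajectories and evaluate `gmrq` with C and S estimated from held-out trajectories.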