Empirical risk minimization (ERM) is a fundamental learning rule for
statistical learning problems where the data is generated according to some
unknown distribution P and returns a hypothesis f chosen from a fixed class
F with small loss ℓ. In the parametric setting, depending upon (ℓ, F, P),
ERM can have slow (1/√n) or fast (1/n) rates of convergence of the excess
risk as a function of the sample size n.
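To fix ideas, ERM and the excess risk whose rate is at issue can be written
as follows; this is our own standard-notation sketch (the sample Z_1, ..., Z_n
and the symbol f̂_n are our labels, not fixed by the text above):
\[
\hat{f}_n \in \operatorname*{arg\,min}_{f \in \mathcal{F}}
\frac{1}{n} \sum_{i=1}^{n} \ell(f, Z_i),
\qquad
\mathbb{E}_{Z \sim P}\bigl[\ell(\hat{f}_n, Z)\bigr]
- \inf_{f \in \mathcal{F}} \mathbb{E}_{Z \sim P}\bigl[\ell(f, Z)\bigr],
\]
where Z_1, ..., Z_n are i.i.d. draws from P; the second expression is the
excess risk, which decays like 1/√n in the slow-rate regime and like 1/n in
the fast-rate regime.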
There exist several results that give sufficient conditions for fast rates
in terms of joint properties of ℓ, F, and P, such as the margin condition
and the Bernstein condition. In the non-statistical prediction with expert
advice setting, there is an analogous slow and fast rate phenomenon, and it
is entirely characterized in terms of the mixability of the loss ℓ (there
being no role there for F or P). The notion of stochastic mixability builds
a bridge between these two models of learning, reducing to classical
mixability in a special case.
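For reference, one common formulation of stochastic mixability (our
rendering from the related literature, not a display taken from this
abstract): (ℓ, F, P) is η-stochastically mixable for some η > 0 if there
exists f* ∈ F such that, for all f ∈ F,
\[
\mathbb{E}_{Z \sim P}\Bigl[\exp\bigl(-\eta \bigl(\ell(f, Z)
- \ell(f^{*}, Z)\bigr)\bigr)\Bigr] \le 1.
\]
Classical η-mixability is recovered, roughly, when the condition is required
to hold uniformly over distributions and the class is rich enough; hence the
bridge between the two settings.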
The present paper presents a direct proof of fast rates for ERM in terms of
stochastic mixability of (ℓ, F, P), and
in so doing provides new insight into the fast-rates phenomenon. The proof
exploits an old result of Kemperman on the solution to the general moment
problem.
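For orientation, the general moment problem in question can be stated as
follows (our paraphrase): over all distributions of a random variable X
satisfying finitely many moment constraints, determine the extreme values of
one further moment,
\[
\sup / \inf \ \mathbb{E}[h(X)]
\quad \text{subject to} \quad
\mathbb{E}[g_i(X)] = c_i, \quad i = 1, \dots, k,
\]
where the supremum and infimum range over all probability distributions of X
meeting the constraints; Kemperman's geometric treatment characterizes these
extreme values.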
We also show a partial converse that suggests a characterization of fast
rates for ERM in terms of stochastic mixability is possible.

Comment: 21 pages, accepted to NIPS 2014