Structural support vector machines (SSVMs) are among the best-performing
models for structured computer vision tasks, such as semantic image
segmentation or human pose estimation. Training SSVMs, however, is
computationally costly, because it requires repeated calls to a structured
prediction subroutine (called \emph{max-oracle}), which itself has to solve an
optimization problem, e.g., a graph cut.
In this work, we introduce a new algorithm for SSVM training that is more
efficient than earlier techniques when the max-oracle is computationally
expensive, as is frequently the case in computer vision tasks. The main idea
is to (i) combine the recent stochastic Block-Coordinate Frank-Wolfe algorithm
with efficient hyperplane caching, and (ii) use an automatic selection rule for
deciding whether to call the exact max-oracle or to rely on an approximate one
based on the cached hyperplanes.
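
As a rough illustration of idea (ii), the following minimal sketch shows how an approximate oracle could be answered from cached hyperplanes. It is not the released C++ implementation: the names `Plane`, `plane_value`, `approximate_oracle`, and `oracle_with_cache` are hypothetical, each hyperplane is assumed to be stored as a pair (linear part, offset), and the automatic selection rule is deliberately left as a caller-supplied flag `use_exact`, since the rule itself is not specified here.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a cached hyperplane is (linear_part, offset); its value at w
# is <linear_part, w> + offset.
Plane = Tuple[List[float], float]


def plane_value(plane: Plane, w: Sequence[float]) -> float:
    """Evaluate a cached hyperplane at the weight vector w."""
    linear_part, offset = plane
    return sum(p * wi for p, wi in zip(linear_part, w)) + offset


def approximate_oracle(cache: List[Plane], w: Sequence[float]) -> Plane:
    """Return the cached hyperplane with the largest value at w.

    This costs one dot product per cached plane and makes no call to the
    (expensive) exact max-oracle.
    """
    return max(cache, key=lambda plane: plane_value(plane, w))


def oracle_with_cache(
    cache: List[Plane],
    w: Sequence[float],
    exact_oracle: Callable[[Sequence[float]], Plane],
    use_exact: bool,
) -> Plane:
    """Call the exact oracle or fall back to the cache.

    `use_exact` stands in for the automatic selection rule (hypothetical
    placeholder). An empty cache always forces an exact call.
    """
    if use_exact or not cache:
        plane = exact_oracle(w)
        if plane not in cache:
            cache.append(plane)  # remember the plane for later cheap reuse
        return plane
    return approximate_oracle(cache, w)
```

In this sketch every exact call enlarges the cache, so later approximate calls choose among all hyperplanes seen so far for that training example; a practical implementation would presumably also bound the cache size.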
We show experimentally that this strategy leads to faster convergence to the
optimum with respect to the number of required oracle calls, and that this
translates into faster convergence with respect to the total runtime when the
max-oracle is slow compared to the other steps of the algorithm.
A publicly available C++ implementation is provided at
http://pub.ist.ac.at/~vnk/papers/SVM.html