Marginal Weighted Maximum Log-likelihood for Efficient Learning of Perturb-and-Map models
We consider the structured-output prediction problem through probabilistic approaches and generalize the “perturb-and-MAP” framework to more challenging weighted Hamming losses, which are crucial in applications. While in principle our approach is a straightforward marginalization, it requires solving many related MAP inference problems. We show that for log-supermodular pairwise models these operations can be performed efficiently using the machinery of dynamic graph cuts. We also propose to use doubly stochastic gradient descent, both on the data and on the perturbations, for efficient learning. Our framework can naturally take weak supervision (e.g., partial labels) into account. We conduct a set of experiments on medium-scale character recognition and image segmentation, showing the benefits of our algorithms.
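As a concrete illustration of the learning recipe, here is a minimal sketch of doubly stochastic perturb-and-MAP gradient steps, assuming a log-linear binary model with unary potentials only; `toy_map_solver`, the feature shapes, and the learning rate are illustrative stand-ins, and the paper's weighted Hamming marginalization and dynamic graph cuts are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_map_solver(unaries):
    """Stand-in for a graph-cut oracle: exact MAP for a unary-only model."""
    return (unaries > 0).astype(float)

def perturb_and_map(theta, features, map_solver):
    """One perturb-and-MAP sample: add Gumbel noise to the unary scores,
    then hand the perturbed potentials to the (black-box) MAP solver."""
    unaries = features @ theta                 # linear unary potentials
    return map_solver(unaries + rng.gumbel(size=unaries.shape))

def sgd_step(theta, x_feats, y_true, map_solver, lr=0.1):
    """Doubly stochastic step (one data point, one perturbation): the
    surrogate log-likelihood gradient is the feature gap between the
    observed labeling and the perturbed MAP labeling."""
    y_map = perturb_and_map(theta, x_feats, map_solver)
    return theta + lr * x_feats.T @ (y_true - y_map)

# Toy usage: 5 binary variables with 3 features each, labels generated
# from a hypothetical ground-truth parameter vector.
x = rng.normal(size=(5, 3))
y = (x @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
theta = np.zeros(3)
for _ in range(100):
    theta = sgd_step(theta, x, y, toy_map_solver)
print("learned parameters:", theta)
```

For pairwise models, a graph-cut solver would replace `toy_map_solver`; each step stays doubly stochastic because it draws one data point and one perturbation.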
Parameter Learning for Log-supermodular Distributions
We consider log-supermodular models on binary variables, which are probabilistic models with negative log-densities that are submodular. These models provide probabilistic interpretations of common combinatorial optimization tasks such as image segmentation. In this paper, we focus primarily on parameter estimation in these models from known upper bounds on the intractable log-partition function. We show that the bound based on separable optimization on the base polytope of the submodular function is always inferior to a bound based on “perturb-and-MAP” ideas. Then, to learn parameters, given that our approximation of the log-partition function is an expectation (over our own randomization), we use a stochastic subgradient technique to maximize a lower bound on the log-likelihood. This can also be extended to conditional maximum likelihood. We illustrate our new results in a set of experiments in binary image denoising, where we highlight the flexibility of a probabilistic model to learn with missing data.
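The bound can be made concrete on a toy model: the sketch below Monte Carlo-estimates the perturb-and-MAP upper bound on the log-partition function using per-variable logistic perturbations and checks it against the exact value. The chain model, the brute-force maximization (standing in for a graph-cut MAP oracle), and the sample count are all illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 8
s = rng.normal(size=n)      # unary parameters of the log-supermodular model
w = 0.5                     # attractive coupling: f is a submodular cut

def score(y):
    """s.y - f(y), with f(y) = w * (number of disagreeing neighbor pairs)."""
    return float(s @ y) - w * np.sum(np.abs(np.diff(y)))

configs = [np.array(c) for c in itertools.product([0, 1], repeat=n)]
exact_A = np.log(sum(np.exp(score(y)) for y in configs))

def pm_bound(num_samples=200):
    """Monte Carlo estimate of E_z[max_y ((s+z).y - f(y))] with i.i.d.
    logistic z, an upper bound on the log-partition function; brute force
    stands in for the graph-cut MAP oracle that would be used at scale."""
    vals = []
    for _ in range(num_samples):
        z = rng.logistic(size=n)
        vals.append(max(score(y) + float(z @ y) for y in configs))
    return np.mean(vals)

print(f"exact log-partition {exact_A:.3f} <= PM bound {pm_bound():.3f}")
```

Because the bound is an expectation over the perturbations, each sampled maximizer yields a subgradient, which is what the stochastic subgradient learning scheme exploits.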
Barrier Frank-Wolfe for Marginal Inference
We introduce a globally-convergent algorithm for optimizing the
tree-reweighted (TRW) variational objective over the marginal polytope. The
algorithm is based on the conditional gradient method (Frank-Wolfe) and moves
pseudomarginals within the marginal polytope through repeated maximum a
posteriori (MAP) calls. This modular structure enables us to leverage black-box
MAP solvers (both exact and approximate) for variational inference, and to obtain
more accurate results than tree-reweighted algorithms that optimize over the
local consistency relaxation. Theoretically, we bound the sub-optimality for
the proposed algorithm despite the TRW objective having unbounded gradients at
the boundary of the marginal polytope. Empirically, we demonstrate the
increased quality of results found by tightening the relaxation over the
marginal polytope as well as the spanning tree polytope on synthetic and
real-world instances.
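A minimal sketch of the core loop follows, assuming a toy model small enough to enumerate configurations, so the variational vector lives on the simplex over configurations and the linear-minimization step reduces to an exact MAP call; the exact entropy replaces the TRW surrogate here, and the paper's barrier treatment of the entropy's unbounded gradient is approximated by an interior start with a fixed step schedule.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
configs = np.array(list(itertools.product([0, 1], repeat=n)))
theta = rng.normal(size=n)
score = configs @ theta                          # score of each configuration

mu = np.full(len(configs), 1.0 / len(configs))   # uniform interior start
for t in range(200):
    grad = score - (np.log(mu) + 1.0)            # gradient of <score,mu> + H(mu)
    vertex = np.argmax(grad)                     # MAP call: best configuration
    gamma = 2.0 / (t + 2.0)                      # standard Frank-Wolfe step
    direction = np.zeros_like(mu)
    direction[vertex] = 1.0
    mu = (1.0 - gamma) * mu + gamma * direction  # stay inside the polytope

log_Z = np.log(np.sum(np.exp(score)))
print(f"FW objective {score @ mu - mu @ np.log(mu):.4f} vs log Z {log_Z:.4f}")
```

The convex combination keeps every coordinate of `mu` strictly positive, which is what makes the entropy gradient finite at each iterate, a cheap proxy for the contraction the paper analyzes.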
Greedy Bayesian Posterior Approximation with Deep Ensembles
Ensembles of independently trained neural networks are a state-of-the-art
approach to estimate predictive uncertainty in Deep Learning, and can be
interpreted as an approximation of the posterior distribution via a mixture of
delta functions. The training of ensembles relies on the non-convexity of the loss
landscape and random initialization of their individual members, making the
resulting posterior approximation uncontrolled. This paper proposes a novel and
principled method to tackle this limitation, minimizing an f-divergence
between the true posterior and a kernel density estimator in a function space.
We analyze this objective from a combinatorial point of view, and show that it
is submodular with respect to mixture components for any f. Subsequently, we
consider the problem of ensemble construction, and from the marginal gain of
the total objective, we derive a novel diversity term for training ensembles
greedily. The performance of our approach is demonstrated on computer vision
out-of-distribution detection benchmarks in a range of architectures trained on
multiple datasets. The source code of our method is publicly available at
https://github.com/MIPT-Oulu/greedy_ensembles_training
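The greedy construction can be illustrated with a small stand-alone sketch: candidates are scored by a marginal gain that trades per-member quality against similarity to already-selected members, a submodular stand-in for the diversity term the paper derives from the f-divergence objective. The candidate pool, quality score, and RBF kernel below are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(2)
pool = rng.normal(size=(10, 50))       # 10 candidates x 50 held-out predictions
quality = -np.mean(pool**2, axis=1)    # toy per-candidate quality (neg. loss)

def kernel(a, b, h=1.0):
    """RBF similarity between two prediction vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * h**2))

def marginal_gain(i, chosen):
    """Quality of candidate i minus its redundancy with chosen members;
    the gain shrinks as the ensemble grows, the submodular signature."""
    redundancy = sum(kernel(pool[i], pool[j]) for j in chosen)
    return quality[i] - redundancy

chosen = []
for _ in range(4):                     # greedily build a 4-member ensemble
    gains = [(marginal_gain(i, chosen), i)
             for i in range(len(pool)) if i not in chosen]
    best_gain, best_i = max(gains)
    chosen.append(best_i)
print("selected members:", chosen)
```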