On Implicit Bias in Overparameterized Bilevel Optimization
Many problems in machine learning involve bilevel optimization (BLO),
including hyperparameter optimization, meta-learning, and dataset distillation.
Bilevel problems consist of two nested sub-problems, called the outer and inner
problems, respectively. In practice, often at least one of these sub-problems
is overparameterized. In this case, there are many ways to choose among optima
that achieve equivalent objective values. Inspired by recent studies of the
implicit bias induced by optimization algorithms in single-level optimization,
we investigate the implicit bias of gradient-based algorithms for bilevel
optimization. We delineate two standard BLO methods -- cold-start and
warm-start -- and show that the converged solution or long-run behavior depends
to a large degree on these and other algorithmic choices, such as the
hypergradient approximation. We also show that the inner solutions obtained by
warm-start BLO can encode a surprising amount of information about the outer
objective, even when the outer parameters are low-dimensional. We believe that
implicit bias deserves as central a role in the study of bilevel optimization
as it has attained in the study of single-level neural net optimization.

Comment: ICML 2022
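For concreteness, the cold-start versus warm-start distinction can be sketched as follows. This is a minimal illustration, assuming a toy ridge-regression inner problem and unrolled differentiation as the hypergradient approximation; the function names and hyperparameters are illustrative, not from the paper.

```python
import torch

def inner_loss(w, lam, X, y):
    # Ridge-style inner objective: fit w for a given outer parameter lam.
    return ((X @ w - y) ** 2).mean() + lam.exp() * (w ** 2).sum()

def outer_loss(w, X_val, y_val):
    # Outer (validation) objective evaluated at the inner solution.
    return ((X_val @ w - y_val) ** 2).mean()

def hypergradient(lam, w_init, data, val_data, warm_start, K=50, lr_in=0.05):
    X, y = data
    X_val, y_val = val_data
    # Warm-start: continue from the previous outer iteration's inner solution;
    # cold-start: re-initialize the inner parameters at every outer iteration.
    w = w_init if warm_start else torch.zeros_like(w_init)
    w = w.detach().requires_grad_(True)
    # Unrolled differentiation through K inner steps, one common
    # hypergradient approximation.
    for _ in range(K):
        g = torch.autograd.grad(inner_loss(w, lam, X, y), w, create_graph=True)[0]
        w = w - lr_in * g
    hg = torch.autograd.grad(outer_loss(w, X_val, y_val), lam)[0]
    return hg, w.detach()

torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randn(64)
X_val, y_val = torch.randn(32, 10), torch.randn(32)
lam = torch.tensor(0.0, requires_grad=True)   # outer parameter (log of reg. strength)
w = torch.zeros(10)                           # inner parameters
for _ in range(100):                          # outer loop
    hg, w = hypergradient(lam, w, (X, y), (X_val, y_val), warm_start=True)
    with torch.no_grad():
        lam -= 0.1 * hg
```

Switching `warm_start` to `False` restarts the inner problem from scratch at every outer step; when the inner problem is overparameterized, the two variants can settle on different inner solutions even though both reach equivalent inner objective values.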
Learning Neural Point Processes with Latent Graphs
Neural point processes (NPPs) employ neural networks to capture the complicated dynamics of asynchronous event sequences. Existing NPPs feed all history events into neural networks, assuming that all event types contribute to the prediction of the target type. However, this assumption can be problematic because in reality some event types do not contribute to the predictions of another type. To correct this defect, we learn to omit those types of events that do not contribute to the prediction of one target type during the formulation of NPPs. To this end, we simultaneously consider the tasks of (1) finding event types that contribute to predictions of the target types and (2) learning an NPP model from event sequences. For the former, we formulate a latent graph, with event types being vertices and non-zero contributing relationships being directed edges; then we propose a probabilistic graph generator, from which we sample a latent graph. For the latter, the sampled graph can be readily used as a plug-in to modify an existing NPP model. Because these two tasks are nested, we propose to optimize the model parameters through bilevel programming, and develop an efficient solution based on truncated gradient back-propagation. Experimental results on both synthetic and real-world datasets show improved performance over state-of-the-art baselines. This work removes the disturbance of non-contributing event types with the aid of a validation procedure, similar to the practice used to mitigate overfitting when training machine learning models.
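As a rough illustration of the plug-in idea (the class and function names below are assumptions, not the paper's code), a probabilistic graph generator can hold one Bernoulli logit per directed pair of event types, sample a relaxed adjacency, and use it to mask out history events of non-contributing types before they enter an existing NPP encoder.

```python
import torch

class LatentGraphGenerator(torch.nn.Module):
    def __init__(self, num_types):
        super().__init__()
        # One logit per directed edge (source type -> target type).
        self.edge_logits = torch.nn.Parameter(torch.zeros(num_types, num_types))

    def sample(self, temperature=0.5):
        # Gumbel-sigmoid (binary Concrete) relaxation, so the sampled adjacency
        # stays differentiable for gradient-based (bilevel) training.
        u = torch.rand_like(self.edge_logits).clamp(1e-6, 1 - 1e-6)
        gumbel = torch.log(u) - torch.log1p(-u)
        return torch.sigmoid((self.edge_logits + gumbel) / temperature)

def mask_history(history_embeds, history_types, target_type, adjacency):
    # history_embeds: (L, d) embeddings of past events
    # history_types:  (L,)   integer event types
    # adjacency[s, t]: soft indicator that type s contributes to type t
    weights = adjacency[history_types, target_type]   # (L,)
    return history_embeds * weights.unsqueeze(-1)     # down-weight non-contributors
```

Under the bilevel formulation described above, the generator's edge logits would be updated against a validation objective while the NPP parameters are fitted on the training sequences, with truncated gradient back-propagation keeping the nested gradients tractable.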
Embarrassingly Simple Dataset Distillation
Dataset distillation extracts a small set of synthetic training samples from
a large dataset with the goal of achieving competitive performance on test data
when trained on this sample. In this work, we tackle dataset distillation at
its core by treating it directly as a bilevel optimization problem.
Re-examining the foundational back-propagation through time method, we study
the pronounced variance in the gradients, computational burden, and long-term
dependencies. To address these issues, we introduce an improved method: Random
Truncated Backpropagation Through Time (RaT-BPTT). RaT-BPTT incorporates a truncation
coupled with a random window, effectively stabilizing the gradients and
speeding up the optimization while covering long dependencies. This allows us
to establish new state-of-the-art for a variety of standard dataset benchmarks.
A deeper dive into the nature of distilled data unveils pronounced
intercorrelation. In particular, subsets of distilled datasets tend to exhibit
much worse performance than directly distilled smaller datasets of the same
size. Leveraging RaT-BPTT, we devise a boosting mechanism that generates
distilled datasets that contain subsets with near optimal performance across
different data budgets.

Comment: Short version appears at the NeurIPS 2023 WANT workshop
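The random-window truncation idea can be sketched as follows. This is a minimal illustration in which a linear classifier stands in for the student network; the unroll schedule, windowing details, and names are simplifications, not the paper's implementation.

```python
import random
import torch
import torch.nn.functional as F

def rat_bptt_outer_loss(distilled_x, distilled_y, real_x, real_y,
                        dim, num_classes, max_steps=60, window=20, lr=0.1):
    # Student parameters (a linear classifier stands in for the real network).
    w = torch.zeros(dim, num_classes, requires_grad=True)
    b = torch.zeros(num_classes, requires_grad=True)
    unroll = random.randint(window, max_steps)  # random total unroll length

    for step in range(unroll):
        in_window = step >= unroll - window     # keep the graph only inside the window
        logits = distilled_x @ w + b
        inner = F.cross_entropy(logits, distilled_y)
        gw, gb = torch.autograd.grad(inner, (w, b), create_graph=in_window)
        w, b = w - lr * gw, b - lr * gb
        if not in_window:
            # Truncation: iterates before the window are treated as constants.
            w = w.detach().requires_grad_(True)
            b = b.detach().requires_grad_(True)

    # Outer objective: the unrolled student's loss on real data. Its gradient
    # w.r.t. distilled_x flows only through the last `window` inner steps.
    return F.cross_entropy(real_x @ w + b, real_y)
```

Calling `.backward()` on the returned loss and stepping an optimizer over `distilled_x` (and, if learned, the labels) gives one outer update; because the window position is re-drawn on every call, successive updates collectively cover long-range dependencies while each individual update remains cheap and low-variance.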