Abstraction in decision-makers with limited information processing capabilities
A distinctive property of human and animal intelligence is the ability to form abstractions by neglecting irrelevant information, which allows structure to be separated from noise. From an information-theoretic point of view, abstractions are desirable because they allow for very efficient information processing. In artificial systems, abstractions are often implemented through the computationally costly formation of groups or clusters. In this work we establish the relation between the free-energy framework for decision-making and rate-distortion theory, and demonstrate how applying rate-distortion theory to decision-making leads to the emergence of abstractions. We argue that abstractions are induced by a limit in information-processing capacity.
Comment: Presented at the NIPS 2013 Workshop on Planning with Information Constraints
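A minimal sketch of the rate-distortion view of bounded-rational decision-making described above: a Blahut-Arimoto-style iteration alternates between a softmax policy and its action marginal. The variable names, the toy utility matrix, and the specific update schedule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def bounded_rational_policy(U, p_s, beta, iters=200):
    """U[s, a]: utility; p_s: prior over world states; beta: resource parameter."""
    n_s, n_a = U.shape
    p_a = np.full(n_a, 1.0 / n_a)  # action marginal (the emerging "abstraction")
    for _ in range(iters):
        # Conditional policy: p(a|s) proportional to p(a) * exp(beta * U(s, a))
        p_a_given_s = p_a[None, :] * np.exp(beta * U)
        p_a_given_s /= p_a_given_s.sum(axis=1, keepdims=True)
        # Marginal update: p(a) = sum_s p(s) p(a|s)
        p_a = p_s @ p_a_given_s
    return p_a_given_s, p_a

# A small beta (tight information-processing limit) collapses the policy onto a
# few abstract actions; a large beta recovers the fully state-specific optimum.
U = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
policy_low, _ = bounded_rational_policy(U, np.full(3, 1 / 3), beta=0.5)
policy_high, _ = bounded_rational_policy(U, np.full(3, 1 / 3), beta=50.0)
```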
An information-theoretic on-line update principle for perception-action coupling
Inspired by findings of sensorimotor coupling in humans and animals, there
has recently been a growing interest in the interaction between action and
perception in robotic systems [Bogh et al., 2016]. Here we consider perception
and action as two serial information channels with limited
information-processing capacity. We follow [Genewein et al., 2015] and
formulate a constrained optimization problem that maximizes utility under
limited information-processing capacity in the two channels. As a solution we
obtain an optimal perceptual channel and an optimal action channel that are
coupled such that perceptual information is optimized with respect to
downstream processing in the action module. The main novelty of this study is
that we propose an online optimization procedure to find bounded-optimal
perception and action channels in parameterized serial perception-action
systems. In particular, we implement the perceptual channel as a multi-layer
neural network and the action channel as a multinomial distribution. We
illustrate our method in a NAO robot simulator with a simplified cup lifting
task.
Comment: 8 pages, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
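A hedged sketch of the serial perception-action objective described above: expected utility penalized by the information cost of each channel, J = E[U] - I(W;X)/beta1 - I(X;A)/beta2. The small discrete setup and all names are assumptions for illustration; the paper optimizes parameterized (neural-network and multinomial) channels with an online procedure.

```python
import numpy as np

def mutual_info(p_in, p_out_given_in):
    """I(in; out) for a discrete input distribution and conditional channel."""
    p_out = p_in @ p_out_given_in
    ratio = p_out_given_in / np.clip(p_out[None, :], 1e-12, None)
    return float(np.sum(p_in[:, None] * p_out_given_in * np.log(np.clip(ratio, 1e-12, None))))

def objective(p_w, p_x_given_w, p_a_given_x, U, beta1, beta2):
    # Expected utility through the serial channels w -> x -> a
    EU = float(np.einsum('w,wx,xa,wa->', p_w, p_x_given_w, p_a_given_x, U))
    return EU - mutual_info(p_w, p_x_given_w) / beta1 \
              - mutual_info(p_w @ p_x_given_w, p_a_given_x) / beta2

# Toy world with two states, two percepts, two actions
p_w = np.array([0.5, 0.5])
p_x_given_w = np.array([[0.9, 0.1], [0.1, 0.9]])   # perceptual channel
p_a_given_x = np.array([[0.8, 0.2], [0.2, 0.8]])   # action channel
J = objective(p_w, p_x_given_w, p_a_given_x, np.eye(2), beta1=1.0, beta2=1.0)
```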
A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
We propose a novel Bayesian approach to solve stochastic optimization
problems that involve finding extrema of noisy, nonlinear functions. Previous
work has focused on representing possible functions explicitly, which leads to
a two-step procedure of first, doing inference over the function space and
second, finding the extrema of these functions. Here we skip the representation
step and directly model the distribution over extrema. To this end, we devise a
non-parametric conjugate prior based on a kernel regressor. The resulting
posterior distribution directly captures the uncertainty over the maximum of
the unknown function. We illustrate the effectiveness of our model by
optimizing a noisy, high-dimensional, non-convex objective function.
Comment: 9 pages, 5 figures
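As a rough illustration of modeling the maximizer directly rather than the whole function, the sketch below scores candidate locations with a Nadaraya-Watson kernel regressor fitted to noisy observations and turns the scores into a belief over the argmax. This is an illustrative stand-in under assumed kernels and temperatures, not the paper's exact conjugate-prior construction.

```python
import numpy as np

def kernel_regressor(x_obs, y_obs, x_query, lengthscale=0.2):
    """Nadaraya-Watson estimate of the function value at each query point."""
    d2 = (x_query[:, None] - x_obs[None, :]) ** 2
    w = np.exp(-0.5 * d2 / lengthscale ** 2)
    return (w @ y_obs) / np.clip(w.sum(axis=1), 1e-12, None)

def argmax_belief(x_obs, y_obs, candidates, temperature=0.1):
    # Softmax over regressed values: sharpens as evidence accumulates
    m = kernel_regressor(np.asarray(x_obs), np.asarray(y_obs), candidates)
    z = np.exp((m - m.max()) / temperature)
    return z / z.sum()

# Noisy observations of f(x) = -(x - 0.7)^2; the belief should peak near 0.7
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 30)
ys = -(xs - 0.7) ** 2 + 0.05 * rng.normal(size=30)
belief = argmax_belief(xs, ys, np.linspace(0, 1, 101))
```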
Grandmaster-Level Chess Without Search
The recent breakthrough successes in machine learning are mainly attributed
to scale: namely large-scale attention-based architectures and datasets of
unprecedented scale. This paper investigates the impact of training at scale
for chess. Unlike traditional chess engines that rely on complex heuristics,
explicit search, or a combination of both, we train a 270M parameter
transformer model with supervised learning on a dataset of 10 million chess
games. We annotate each board in the dataset with action-values provided by the
powerful Stockfish 16 engine, leading to roughly 15 billion data points. Our
largest model reaches a Lichess blitz Elo of 2895 against humans, and
successfully solves a series of challenging chess puzzles, without any
domain-specific tweaks or explicit search algorithms. We also show that our
model outperforms AlphaZero's policy and value networks (without MCTS) and
GPT-3.5-turbo-instruct. A systematic investigation of model and dataset size
shows that strong chess performance only arises at sufficient scale. To
validate our results, we perform an extensive series of ablations of design
choices and hyperparameters.
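A small sketch of the supervised-target construction described above: engine-provided action-values (win probabilities) are discretized into K bins and the model is trained as a classifier over those bins with cross-entropy. The bin count, the toy logits, and the helper names are illustrative assumptions.

```python
import numpy as np

K = 128  # number of value bins (illustrative choice)

def win_prob_to_bin(p):
    """Map a win probability in [0, 1] to one of K discrete class labels."""
    return np.minimum((np.asarray(p) * K).astype(int), K - 1)

def cross_entropy(logits, labels):
    # Standard classification loss over the K value bins
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Two (state, action) pairs annotated with engine win probabilities
labels = win_prob_to_bin([0.73, 0.20])
logits = np.random.default_rng(0).normal(size=(2, K))  # stand-in for transformer outputs
loss = cross_entropy(logits, labels)
```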
Learning Universal Predictors
Meta-learning has emerged as a powerful approach to train neural networks to
learn new tasks quickly from limited data. Broad exposure to different tasks
leads to versatile representations enabling general problem solving. But, what
are the limits of meta-learning? In this work, we explore the potential of
amortizing the most powerful universal predictor, namely Solomonoff Induction
(SI), into neural networks via leveraging meta-learning to its limits. We use
Universal Turing Machines (UTMs) to generate training data used to expose
networks to a broad range of patterns. We provide theoretical analysis of the
UTM data generation processes and meta-training protocols. We conduct
comprehensive experiments with neural architectures (e.g. LSTMs, Transformers)
and algorithmic data generators of varying complexity and universality. Our
results suggest that UTM data is a valuable resource for meta-learning, and
that it can be used to train neural networks capable of learning universal
prediction strategies.
Comment: 32 pages, 11 figures
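To make the data-generation idea concrete, the sketch below samples random "programs" and uses their output sequences as meta-training episodes for a sequence predictor. The random finite-state transducers here are a toy stand-in for a Universal Turing Machine, and all names are assumptions for illustration.

```python
import numpy as np

def sample_program(rng, n_states=8, alphabet=2):
    """A random 'program': a state-transition table plus an emission table."""
    trans = rng.integers(0, n_states, size=n_states)
    emit = rng.integers(0, alphabet, size=n_states)
    return trans, emit

def run_program(program, length=64):
    trans, emit = program
    state, out = 0, []
    for _ in range(length):
        out.append(int(emit[state]))
        state = int(trans[state])
    return out

rng = np.random.default_rng(0)
# Each episode is the output of a freshly sampled program; a memory-based
# meta-learner is trained to predict the next symbol across many such episodes.
episodes = [run_program(sample_program(rng)) for _ in range(1000)]
```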
Language Modeling Is Compression
It has long been established that predictive models can be transformed into
lossless compressors and vice versa. Incidentally, in recent years, the machine
learning community has focused on training increasingly large and powerful
self-supervised (language) models. Since these large language models exhibit
impressive predictive capabilities, they are well-positioned to be strong
compressors. In this work, we advocate for viewing the prediction problem
through the lens of compression and evaluate the compression capabilities of
large (foundation) models. We show that large language models are powerful
general-purpose predictors and that the compression viewpoint provides novel
insights into scaling laws, tokenization, and in-context learning. For example,
Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to
43.4% and LibriSpeech samples to 16.4% of their raw size, beating
domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
Finally, we show that the prediction-compression equivalence allows us to use
any compressor (like gzip) to build a conditional generative model.
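A brief sketch of the prediction-compression equivalence invoked above: an arithmetic coder driven by a predictive model spends roughly -sum log2 p(x_t | x_<t) bits on a sequence, so better predictors are better lossless compressors. The bigram "model" below is a toy stand-in for a large language model; names and smoothing are illustrative assumptions.

```python
import math
from collections import defaultdict

def fit_bigram(text):
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def code_length_bits(text, counts, alphabet_size=256):
    """Ideal arithmetic-coding length under the model, with add-one smoothing."""
    bits = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values()) + alphabet_size
        p = (counts[a][b] + 1) / total
        bits += -math.log2(p)
    return bits

data = "abababababababcabababab"
model = fit_bigram(data)
print(code_length_bits(data, model), "bits under the model vs", 8 * len(data), "raw bits")
```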
Meta-learning of Sequential Strategies
In this report we review memory-based meta-learning as a tool for building
sample-efficient strategies that learn from past experience to adapt to any
task within a target class. Our goal is to equip the reader with the conceptual
foundations of this tool for building new, scalable agents that operate on
broad domains. To do so, we present basic algorithmic templates for building
near-optimal predictors and reinforcement learners which behave as if they had
a probabilistic model that allowed them to efficiently exploit task structure.
Furthermore, we recast memory-based meta-learning within a Bayesian framework,
showing that the meta-learned strategies are near-optimal because they amortize
Bayes-filtered data, where the adaptation is implemented in the memory dynamics
as a state-machine of sufficient statistics. Essentially, memory-based
meta-learning translates the hard problem of probabilistic sequential inference
into a regression problem.
Comment: DeepMind Technical Report (15 pages, 6 figures)
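A minimal sketch of the Bayesian framing above: when task parameters are drawn from a prior, the log-loss-minimizing sequential predictor is the Bayes-mixture posterior predictive, which a memory-based meta-learner amortizes by carrying sufficient statistics in its memory. The Beta-Bernoulli task family is an illustrative assumption.

```python
import numpy as np

def bayes_predictor(observations, alpha=1.0, beta=1.0):
    """Posterior predictive P(next = 1 | history) under a Beta(alpha, beta) prior."""
    heads = sum(observations)
    return (heads + alpha) / (len(observations) + alpha + beta)

rng = np.random.default_rng(0)
theta = rng.beta(1.0, 1.0)                            # sample a task (coin bias)
history = (rng.random(20) < theta).astype(int).tolist()
# A meta-trained memory-based predictor should match this value in expectation;
# its "state machine" only needs the sufficient statistics (heads, length).
print(bayes_predictor(history))
```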