STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Constructing AI models that respond to text instructions is challenging,
especially for sequential decision-making tasks. This work introduces an
instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1,
demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective
for creating instruction-following sequential decision-making agents. STEVE-1
is trained in two steps: adapting the pretrained VPT model to follow commands
in MineCLIP's latent space, then training a prior to predict latent codes from
text. This allows us to finetune VPT through self-supervised behavioral cloning
and hindsight relabeling, bypassing the need for costly human text annotations.
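As a rough sketch of the hindsight-relabeling step (illustrative only; the policy and
embedding interfaces below are assumptions, not the released training code), the
"instruction" for each segment is simply an embedding of an observation from later in
the same episode, so the policy can be cloned on unlabeled gameplay:

    import torch
    import torch.nn.functional as F

    def hindsight_relabeled_bc_loss(policy, embed_goal, frames, actions, horizon=200):
        # frames:  (B, T, ...) observations from logged gameplay
        # actions: (B, T)      discretized logged actions
        # embed_goal: callable mapping a future observation to a MineCLIP-style
        #             latent vector (interface assumed for this sketch)
        B, T = actions.shape
        t = torch.randint(0, T - 1, (B,))          # current timestep
        dt = torch.randint(1, horizon + 1, (B,))   # look-ahead offset
        goal_t = torch.clamp(t + dt, max=T - 1)

        # The "instruction" is what the player actually ended up doing:
        # embed a future observation and condition the policy on it.
        goal_latent = embed_goal(frames[torch.arange(B), goal_t])

        # Self-supervised behavioral cloning against the logged actions,
        # with no human text annotations required.
        logits = policy(frames[torch.arange(B), t], goal_latent)
        return F.cross_entropy(logits, actions[torch.arange(B), t])
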
By leveraging pretrained models like VPT and MineCLIP and employing best
practices from text-conditioned image generation, STEVE-1 costs just $60 to
train and can follow a wide range of short-horizon open-ended text and visual
instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction
following in Minecraft with low-level controls (mouse and keyboard) and raw
pixel inputs, far outperforming previous baselines. We provide experimental
evidence highlighting key factors for downstream performance, including
pretraining, classifier-free guidance, and data scaling. All resources,
including our model weights, training scripts, and evaluation tools, are made
available for further research.
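
As context for the unCLIP-style prior and the classifier-free guidance mentioned above,
a minimal inference-time sketch (the interfaces and the guidance scale are assumptions
for illustration, not the released code):

    import torch

    @torch.no_grad()
    def guided_action_logits(policy, obs, goal_latent, guidance_scale=6.0):
        # Classifier-free guidance: run the policy with and without the goal
        # conditioning, then push the logits toward the conditioned prediction.
        cond = policy(obs, goal_latent)
        uncond = policy(obs, None)  # assumes the policy was trained with goal dropout
        return uncond + guidance_scale * (cond - uncond)

    @torch.no_grad()
    def act_from_text(policy, prior, text_encoder, obs, text, guidance_scale=6.0):
        # unCLIP-style pipeline: text -> text embedding -> prior samples a
        # goal latent -> guided policy picks a low-level action.
        text_emb = text_encoder(text)
        goal_latent = prior.sample(text_emb)
        logits = guided_action_logits(policy, obs, goal_latent, guidance_scale)
        return torch.distributions.Categorical(logits=logits).sample()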