Teaching Large Language Models to Reason with Reinforcement Learning
Reinforcement Learning from Human Feedback (RLHF) has emerged as a
dominant approach for aligning LLM outputs with human preferences. Inspired by
the success of RLHF, we study the performance of multiple algorithms that learn
from feedback (Expert Iteration, Proximal Policy Optimization (PPO),
Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate
both sparse and dense rewards provided to the LLM both heuristically and via a
learned reward model. We additionally start from multiple model sizes and
initializations both with and without supervised fine-tuning (SFT)
data. Overall, we find all algorithms perform comparably, with Expert Iteration
performing best in most cases. Surprisingly, we find the sample complexity of
Expert Iteration is similar to that of PPO, requiring at most on the order of
10^6 samples to converge from a pretrained checkpoint. We investigate why
this is the case, concluding that during RL training models fail to explore
significantly beyond solutions already produced by SFT models. Additionally, we
discuss a trade-off between maj@1 and pass@96 metric performance during SFT
training and how RL training, in contrast, improves both simultaneously. We then
conclude by discussing the implications of our findings for RLHF and the future
role of RL in LLM fine-tuning.
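For readers unfamiliar with the two metrics mentioned in this abstract: maj@1 scores the majority-vote answer over a set of samples, while pass@k asks whether at least one of k samples is correct. The Python sketch below is illustrative only and is not code from the paper; the function names and the 96-sample example are assumptions, and the pass@k formula is the standard unbiased combinatorial estimator.

from collections import Counter
from math import comb

def maj_at_1(sampled_answers):
    # Majority voting: return the most frequent answer among the samples.
    # maj@1 accuracy is the fraction of problems whose modal answer is correct.
    return Counter(sampled_answers).most_common(1)[0][0]

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: n samples drawn, c of them correct, budget k.
    # Probability that at least one of k randomly chosen samples is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 96 samples per problem, 12 of them correct.
print(pass_at_k(n=96, c=12, k=96))  # 1.0 -> at least one of the 96 is correct
print(pass_at_k(n=96, c=12, k=1))   # 0.125 -> the per-sample accuracy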
Stable Code Technical Report
We introduce Stable Code, the first in our new generation of code language
models, which serves as a general-purpose base code language model
targeting code completion, reasoning, math, and other software
engineering-based tasks. Additionally, we introduce an instruction variant
named Stable Code Instruct that allows conversing with the model in a natural
chat interface for performing question-answering and instruction-based tasks.
In this technical report, we detail the data and training procedure leading to
both models. Their weights are available via Hugging Face for anyone to
download and use at https://huggingface.co/stabilityai/stable-code-3b and
https://huggingface.co/stabilityai/stable-code-instruct-3b. This report
contains thorough evaluations of the models, including multilingual programming
benchmarks, and the MT benchmark focusing on multi-turn dialogues. At the time
of its release, Stable Code is the state-of-the-art open model under 3B
parameters and even performs comparably to larger models of sizes 7 billion and
15 billion parameters on the popular Multi-PL benchmark. Stable Code Instruct
also exhibits state-of-the-art performance on the MT-Bench coding tasks and on
Multi-PL completion compared to other instruction tuned models. Given its
appealing small size, we also provide throughput measurements on a number of
edge devices. In addition, we open source several quantized checkpoints and
provide their performance metrics compared to the original model.
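As a usage note (not part of the report), loading the released weights from the Hugging Face URL above with the transformers library might look roughly like the following sketch; the prompt, precision, and generation settings are assumptions, not the report's configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load the released Stable Code 3B checkpoint and complete a prompt.
# Pass trust_remote_code=True if your transformers version requires it for this model.
model_id = "stabilityai/stable-code-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # add torch_dtype=torch.bfloat16 on GPU to save memory
model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))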
Stable LM 2 1.6B Technical Report
We introduce StableLM 2 1.6B, the first in a new generation of our language
model series. In this technical report, we present in detail the data and
training procedure leading to the base and instruction-tuned versions of
StableLM 2 1.6B. The weights for both models are available via Hugging Face for
anyone to download and use. The report contains thorough evaluations of these
models, including zero- and few-shot benchmarks, multilingual benchmarks, and
the MT benchmark focusing on multi-turn dialogues. At the time of publishing
this report, StableLM 2 1.6B was the state-of-the-art open model under 2B
parameters by a significant margin. Given its appealing small size, we also
provide throughput measurements on a number of edge devices. In addition, we
open source several quantized checkpoints and provide their performance metrics
compared to the original model.
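As an illustration of how throughput numbers like those mentioned above can be measured in principle (this is not the report's benchmarking protocol), a rough tokens-per-second timing with transformers could look like the sketch below; the model id, prompt, and token budget are assumptions.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough decode-throughput measurement; not the protocol used in the report.
# Pass trust_remote_code=True if your transformers version requires it.
model_id = "stabilityai/stablelm-2-1_6b"  # assumed id of the released base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")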
Formal appreciation of art
Our objective is to construct a simple computational model of perception in order to
express a subjective comparison between two objects based on their extrinsic ability
to be perceived in the eyes of an observer and their effect on the observer’s world
perception. The basis of the comparison we’re interested in is often referred to as
beauty, salience, interestingness, or aesthetic preference, yet we concede the
completeness with which these notions are tasked to deal in favor of a core, conceptual
formalism. We express perception’s end goal to be making short descriptions of
objects within some language and formalize this process with equality saturation.
We examine mechanisms aiding the improvement of language to keep descriptions
short and how it relates to perceived objects’ relative worthiness given an observer’s
language and history of experience.
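To make the idea of short descriptions via rewriting concrete, the following Python sketch searches for the smallest equivalent form of a toy arithmetic expression under a handful of rewrite rules. It is a simplified stand-in for equality saturation (a full treatment would use an e-graph, e.g. the egg library); the rules, the node-count size measure, and all names are illustrative assumptions, not the paper's formalism.

def size(e):
    # Description length: number of nodes in the expression tree.
    return 1 if not isinstance(e, tuple) else 1 + sum(size(a) for a in e[1:])

def rewrites(e):
    # Yield expressions reachable from e by one local rewrite.
    if not isinstance(e, tuple):
        return
    op, *args = e
    if op == "*" and args[1] == 1:        # x * 1 -> x
        yield args[0]
    if op == "+" and args[1] == 0:        # x + 0 -> x
        yield args[0]
    if op == "*" and args[1] == 2:        # x * 2 -> x + x
        yield ("+", args[0], args[0])
    if op == "+" and args[0] == args[1]:  # x + x -> x * 2
        yield ("*", args[0], 2)
    for i, a in enumerate(args):          # rewrite inside sub-expressions
        for r in rewrites(a):
            new_args = list(args)
            new_args[i] = r
            yield (op, *new_args)

def shortest_equivalent(expr, max_rounds=10):
    # Breadth-first exploration of the rewrite space, keeping the smallest term.
    seen, frontier, best = {expr}, [expr], expr
    for _ in range(max_rounds):
        frontier = [r for cur in frontier for r in rewrites(cur) if r not in seen]
        for r in frontier:
            seen.add(r)
            if size(r) < size(best):
                best = r
        if not frontier:
            break
    return best

# ("*", ("+", "x", 0), 1) is a verbose description of "x"; the search recovers it.
print(shortest_equivalent(("*", ("+", "x", 0), 1)))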