Teaching Large Language Models to Reason with Reinforcement Learning
Reinforcement Learning from Human Feedback (RLHF) has emerged as a
dominant approach for aligning LLM outputs with human preferences. Inspired by
the success of RLHF, we study the performance of multiple algorithms that learn
from feedback (Expert Iteration, Proximal Policy Optimization (PPO),
Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate
both sparse and dense rewards provided to the LLM both heuristically and via a
learned reward model. We additionally start from multiple model sizes and
initializations both with and without supervised fine-tuning (SFT)
data. Overall, we find all algorithms perform comparably, with Expert Iteration
performing best in most cases. Surprisingly, we find the sample complexity of
Expert Iteration is similar to that of PPO, requiring at most on the order of
10^6 samples to converge from a pretrained checkpoint. We investigate why
this is the case, concluding that during RL training models fail to explore
significantly beyond solutions already produced by SFT models. Additionally, we
discuss a trade-off between maj@1 and pass@96 metric performance during SFT
training and how RL training, in contrast, improves both simultaneously. We then
conclude by discussing the implications of our findings for RLHF and the future
role of RL in LLM fine-tuning.
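For readers unfamiliar with the two metrics mentioned in this abstract: maj@1 scores the majority-vote answer over a set of samples, while pass@k asks whether at least one of k samples is correct. The Python sketch below is illustrative only and is not code from the paper; the function names and the 96-sample example are assumptions, and the pass@k formula is the standard unbiased combinatorial estimator.

from collections import Counter
from math import comb

def maj_at_1(sampled_answers):
    # Majority voting: return the most frequent answer among the samples.
    # maj@1 accuracy is the fraction of problems whose modal answer is correct.
    return Counter(sampled_answers).most_common(1)[0][0]

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: n samples drawn, c of them correct, budget k.
    # Probability that at least one of k randomly chosen samples is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 96 samples per problem, 12 of them correct.
print(pass_at_k(n=96, c=12, k=96))  # 1.0 -> at least one of the 96 is correct
print(pass_at_k(n=96, c=12, k=1))   # 0.125 -> the per-sample accuracy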
Stable Code Technical Report
We introduce Stable Code, the first in our new generation of code language
models, which serves as a general-purpose base code language model
targeting code completion, reasoning, math, and other software
engineering-based tasks. Additionally, we introduce an instruction variant
named Stable Code Instruct that allows conversing with the model in a natural
chat interface for performing question-answering and instruction-based tasks.
In this technical report, we detail the data and training procedure leading to
both models. Their weights are available via Hugging Face for anyone to
download and use at https://huggingface.co/stabilityai/stable-code-3b and
https://huggingface.co/stabilityai/stable-code-instruct-3b. This report
contains thorough evaluations of the models, including multilingual programming
benchmarks, and the MT benchmark focusing on multi-turn dialogues. At the time
of its release, Stable Code is the state-of-the-art open model under 3B
parameters and even performs comparably to larger models of sizes 7 billion and
15 billion parameters on the popular Multi-PL benchmark. Stable Code Instruct
also exhibits state-of-the-art performance on the MT-Bench coding tasks and on
Multi-PL completion compared to other instruction tuned models. Given its
appealing small size, we also provide throughput measurements on a number of
edge devices. In addition, we open source several quantized checkpoints and
provide their performance metrics compared to the original model.
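As a usage note (not part of the report), loading the released weights from the Hugging Face URL above with the transformers library might look roughly like the following sketch; the prompt, precision, and generation settings are assumptions, not the report's configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load the released Stable Code 3B checkpoint and complete a prompt.
# Pass trust_remote_code=True if your transformers version requires it for this model.
model_id = "stabilityai/stable-code-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # add torch_dtype=torch.bfloat16 on GPU to save memory
model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))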
Stable LM 2 1.6B Technical Report
We introduce StableLM 2 1.6B, the first in a new generation of our language
model series. In this technical report, we present in detail the data and
training procedure leading to the base and instruction-tuned versions of
StableLM 2 1.6B. The weights for both models are available via Hugging Face for
anyone to download and use. The report contains thorough evaluations of these
models, including zero- and few-shot benchmarks, multilingual benchmarks, and
the MT benchmark focusing on multi-turn dialogues. At the time of publishing
this report, StableLM 2 1.6B was the state-of-the-art open model under 2B
parameters by a significant margin. Given its appealing small size, we also
provide throughput measurements on a number of edge devices. In addition, we
open source several quantized checkpoints and provide their performance metrics
compared to the original model.
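As an illustration of how throughput numbers like those mentioned above can be measured in principle (this is not the report's benchmarking protocol), a rough tokens-per-second timing with transformers could look like the sketch below; the model id, prompt, and token budget are assumptions.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough decode-throughput measurement; not the protocol used in the report.
# Pass trust_remote_code=True if your transformers version requires it.
model_id = "stabilityai/stablelm-2-1_6b"  # assumed id of the released base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")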
Formal appreciation of art
Our objective is to construct a simple computational model of perception in order to
express a subjective comparison between two objects based on their extrinsic ability
to be perceived in the eyes of an observer and their effect on the observer’s world
perception. The basis of the comparison we’re interested in is often referred to as
beauty, salience, interestingness, or aesthetic preference, yet we concede the
completeness with which these notions are tasked to deal in favor of a core, conceptual
formalism. We express perception’s end goal to be making short descriptions of
objects within some language and formalize this process with equality saturation.
We examine mechanisms aiding the improvement of language to keep descriptions
short and how it relates to perceived objects’ relative worthiness given an observer’s
language and history of experience.
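To make the idea of short descriptions via rewriting concrete, the following Python sketch searches for the smallest equivalent form of a toy arithmetic expression under a handful of rewrite rules. It is a simplified stand-in for equality saturation (a full treatment would use an e-graph, e.g. the egg library); the rules, the node-count size measure, and all names are illustrative assumptions, not the paper's formalism.

def size(e):
    # Description length: number of nodes in the expression tree.
    return 1 if not isinstance(e, tuple) else 1 + sum(size(a) for a in e[1:])

def rewrites(e):
    # Yield expressions reachable from e by one local rewrite.
    if not isinstance(e, tuple):
        return
    op, *args = e
    if op == "*" and args[1] == 1:        # x * 1 -> x
        yield args[0]
    if op == "+" and args[1] == 0:        # x + 0 -> x
        yield args[0]
    if op == "*" and args[1] == 2:        # x * 2 -> x + x
        yield ("+", args[0], args[0])
    if op == "+" and args[0] == args[1]:  # x + x -> x * 2
        yield ("*", args[0], 2)
    for i, a in enumerate(args):          # rewrite inside sub-expressions
        for r in rewrites(a):
            new_args = list(args)
            new_args[i] = r
            yield (op, *new_args)

def shortest_equivalent(expr, max_rounds=10):
    # Breadth-first exploration of the rewrite space, keeping the smallest term.
    seen, frontier, best = {expr}, [expr], expr
    for _ in range(max_rounds):
        frontier = [r for cur in frontier for r in rewrites(cur) if r not in seen]
        for r in frontier:
            seen.add(r)
            if size(r) < size(best):
                best = r
        if not frontier:
            break
    return best

# ("*", ("+", "x", 0), 1) is a verbose description of "x"; the search recovers it.
print(shortest_equivalent(("*", ("+", "x", 0), 1)))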