Nonlinear spiked covariance matrices and signal propagation in deep neural networks
Many recent works have studied the eigenvalue spectrum of the Conjugate
Kernel (CK) defined by the nonlinear feature map of a feedforward neural
network. However, existing results only establish weak convergence of the
empirical eigenvalue distribution, and fall short of providing precise
quantitative characterizations of the "spike" eigenvalues and eigenvectors
that often capture the low-dimensional signal structure of the learning
problem. In this work, we characterize these signal eigenvalues and
eigenvectors for a nonlinear version of the spiked covariance model, including
the CK as a special case. Using this general result, we give a quantitative
description of how spiked eigenstructure in the input data propagates through
the hidden layers of a neural network with random weights. As a second
application, we study a simple regime of representation learning where the
weight matrix develops a rank-one signal component over training and
characterize the alignment of the target function with the spike eigenvector of
the CK on test data.
Comment: 55 pages.
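As a rough illustration of the setting (notation chosen here for exposition, not quoted from the paper), the classical spiked covariance model and the conjugate kernel of a one-layer random feature map can be written as

    \Sigma = I_d + \sum_{k=1}^{r} \theta_k\, u_k u_k^\top, \qquad x_1, \dots, x_n \sim \mathcal{N}(0, \Sigma),

    K_{\mathrm{CK}} = \frac{1}{d_1}\, \sigma(X W)\, \sigma(X W)^\top, \qquad X \in \mathbb{R}^{n \times d}, \; W \in \mathbb{R}^{d \times d_1},

where the rank-r spike carries the low-dimensional signal and sigma is the entrywise activation; the question studied is, roughly, how the spikes of Sigma reappear in the eigenvalues and eigenvectors of K_CK after one or several random layers.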
Teaching Large Language Models to Self-Debug
Large language models (LLMs) have achieved impressive performance on code
generation. However, for complex programming tasks, generating the correct
solution in one go becomes challenging, thus some prior works have designed
program repair approaches to improve code generation performance. In this work,
we propose Self-Debugging, which teaches a large language model to debug its
predicted program via few-shot demonstrations. In particular, we demonstrate
that Self-Debugging can teach the large language model to perform rubber duck
debugging; i.e., without any feedback on the code correctness or error
messages, the model is able to identify its mistakes by explaining the
generated code in natural language. Self-Debugging achieves the
state-of-the-art performance on several code generation benchmarks, including
the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python
translation, and MBPP for text-to-Python generation. On the Spider benchmark
where there are no unit tests to verify the correctness of predictions,
Self-Debugging with code explanation consistently improves the baseline by
2-3%, and improves the prediction accuracy on problems of the hardest level by
9%. On TransCoder and MBPP where unit tests are available, Self-Debugging
improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback
messages and reusing failed predictions, Self-Debugging notably improves sample
efficiency, and can match or outperform baseline models that generate more than
10x as many candidate programs.
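A minimal sketch of such a debugging loop is below; llm() is a hypothetical helper standing in for whatever model API is used, and the loop folds together the unit-test and rubber-duck (code-explanation) variants described above rather than reproducing the paper's exact prompts.

    def llm(prompt: str) -> str:
        """Hypothetical helper: call whatever LLM API is available."""
        raise NotImplementedError

    def run_tests(program: str, tests: list[str]) -> tuple[bool, str]:
        """Execute the candidate program against unit tests; return (passed, feedback)."""
        namespace = {}
        try:
            exec(program, namespace)      # define the candidate solution
            for test in tests:
                exec(test, namespace)     # each test raises on failure
            return True, "all tests passed"
        except Exception as exc:          # collect the error message as feedback
            return False, f"{type(exc).__name__}: {exc}"

    def self_debug(task: str, tests: list[str], max_turns: int = 3) -> str:
        program = llm(f"Write a Python solution for this task:\n{task}")
        for _ in range(max_turns):
            passed, feedback = run_tests(program, tests)
            if passed:
                break
            # Rubber-duck step: the model explains its own code, then revises it.
            explanation = llm(f"Explain this code line by line:\n{program}")
            program = llm(
                f"Task: {task}\nCode:\n{program}\nExplanation:\n{explanation}\n"
                f"Feedback: {feedback}\nFix the code."
            )
        return program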
Recitation-Augmented Language Models
We propose a new paradigm to help Large Language Models (LLMs) generate more
accurate factual knowledge without retrieving from an external corpus, called
RECITation-augmented gEneration (RECITE). Different from retrieval-augmented
language models that retrieve relevant documents before generating the outputs,
given an input, RECITE first recites one or several relevant passages from
LLMs' own memory via sampling, and then produces the final answers. We show
that RECITE is a powerful paradigm for knowledge-intensive NLP tasks.
Specifically, we show that by utilizing recitation as the intermediate step, a
recite-and-answer scheme can achieve new state-of-the-art performance in
various closed-book question answering (CBQA) tasks. In experiments, we verify
the effectiveness of RECITE on three pre-trained models (PaLM, UL2, and OPT)
and three CBQA tasks (Natural Questions, TriviaQA, and HotpotQA).
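A minimal sketch of the recite-and-answer scheme, reusing the hypothetical llm() helper from the sketch above; the number of samples and the majority vote are illustrative choices, not the paper's exact configuration.

    from collections import Counter

    def recite_and_answer(question: str, num_samples: int = 5) -> str:
        answers = []
        for _ in range(num_samples):
            # Step 1: recite a relevant passage purely from the model's own
            # memory (no external retrieval), via sampling.
            passage = llm(f"Recite a passage you know that helps answer:\n{question}")
            # Step 2: answer conditioned on the recited passage.
            answer = llm(f"Passage:\n{passage}\n\nQuestion: {question}\nAnswer:")
            answers.append(answer.strip())
        # Aggregate the sampled answers with a simple majority vote.
        return Counter(answers).most_common(1)[0][0]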
Large Language Models as Tool Makers
Recent research has highlighted the potential of large language models (LLMs)
to improve their problem-solving capabilities with the aid of suitable external
tools. In our work, we further advance this concept by introducing a
closed-loop framework, referred to as LLMs As Tool Makers (LATM), where LLMs
create their own reusable tools for problem-solving. Our approach consists of
two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for
a set of tasks. 2) tool using: another LLM acts as the tool user, which applies
the tool built by the tool maker for problem-solving. On the problem-solving
server side, tool-making enables continual tool generation and caching as new
requests emerge. This framework enables subsequent requests to access cached
tools via their corresponding APIs, enhancing the efficiency of task
resolution. Recognizing that tool-making requires more sophisticated
capabilities, we assign this task to a powerful, albeit resource-intensive,
model. Conversely, the simpler tool-using phase is delegated to a lightweight
model. This strategic division of labor allows the once-off cost of tool-making
to be spread over multiple instances of tool-using, significantly reducing
average costs while maintaining strong performance. Furthermore, our method
offers a functional cache through the caching and reuse of tools, which stores
the functionality of a class of requests instead of the natural language
responses from LLMs, thus extending the applicability of the conventional cache
mechanism. We evaluate our approach across various complex reasoning tasks,
including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool
user, LATM demonstrates performance equivalent to using GPT-4 for both roles,
but with a significantly reduced inference cost.
Comment: Code available at https://github.com/ctlllll/LLM-ToolMake
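A minimal sketch of the two-phase division of labor (tool maker vs. tool user); strong_llm() is a hypothetical helper standing in for the resource-intensive model, and the verification and caching logic here is illustrative rather than the released implementation.

    tool_cache: dict[str, str] = {}  # task type -> verified tool source code

    def strong_llm(prompt: str) -> str:
        """Hypothetical helper for the powerful tool-maker model."""
        raise NotImplementedError

    def make_tool(task_type: str, demos: list[tuple[str, str]]) -> str:
        """Tool-making phase: the strong model writes a reusable solve() function,
        which is then verified on the demonstration examples and cached."""
        source = strong_llm(
            f"Write a Python function solve(query) for tasks of type '{task_type}'.\n"
            f"Examples (query, answer): {demos}"
        )
        namespace = {}
        exec(source, namespace)                  # define solve()
        for query, expected in demos:            # check it on the demonstrations
            assert str(namespace["solve"](query)) == expected
        tool_cache[task_type] = source           # cache for all later requests
        return source

    def use_tool(task_type: str, query: str) -> str:
        """Tool-using phase: later requests reuse the cached tool; a lightweight
        model would only need to emit the call solve(query)."""
        namespace = {}
        exec(tool_cache[task_type], namespace)
        return str(namespace["solve"](query))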
TEMPERA: Test-Time Prompting via Reinforcement Learning
Careful prompt design is critical to the use of large language models in
zero-shot or few-shot learning. As a consequence, there is a growing interest
in automated methods to design optimal prompts. In this work, we propose
Test-time Prompt Editing using Reinforcement learning (TEMPERA). In contrast to
prior prompt generation methods, TEMPERA can efficiently leverage prior
knowledge, is adaptive to different queries and provides an interpretable
prompt for every query. To achieve this, we design a novel action space that
allows flexible editing of the initial prompts covering a wide set of
commonly-used components like instructions, few-shot exemplars, and
verbalizers. The proposed method achieves significant gains compared with
recent SoTA approaches like prompt tuning, AutoPrompt, and RLPrompt, across a
variety of tasks including sentiment analysis, topic classification, natural
language inference, and reading comprehension. Our method achieves a 5.33x
average improvement in sample efficiency compared to traditional
fine-tuning methods.
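A minimal sketch of what such an edit action space over prompt components might look like; the data structures and action names are chosen here for illustration, and the RL-trained policy itself is abstracted away as a policy callable.

    from dataclasses import dataclass

    @dataclass
    class Prompt:
        instruction: str
        exemplars: list[str]
        verbalizer: dict[str, str]       # label -> surface word, e.g. {"positive": "great"}

    @dataclass
    class EditAction:
        kind: str                        # "swap_exemplars" | "set_instruction" | "set_verbalizer"
        args: tuple = ()

    def apply_edit(prompt: Prompt, action: EditAction) -> Prompt:
        if action.kind == "swap_exemplars":
            i, j = action.args
            prompt.exemplars[i], prompt.exemplars[j] = prompt.exemplars[j], prompt.exemplars[i]
        elif action.kind == "set_instruction":
            (prompt.instruction,) = action.args
        elif action.kind == "set_verbalizer":
            label, word = action.args
            prompt.verbalizer[label] = word
        return prompt

    def edit_prompt_for_query(prompt: Prompt, query: str, policy, num_steps: int = 3) -> Prompt:
        # The trained policy proposes one edit per step, conditioned on the query,
        # so each test query receives its own interpretable, edited prompt.
        for _ in range(num_steps):
            prompt = apply_edit(prompt, policy(prompt, query))
        return prompt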
Instruction-Following Evaluation for Large Language Models
One core capability of Large Language Models (LLMs) is to follow natural
language instructions. However, the evaluation of such abilities is not
standardized: Human evaluations are expensive, slow, and not objectively
reproducible, while LLM-based auto-evaluation is potentially biased or limited
by the ability of the evaluator LLM. To overcome these issues, we introduce
Instruction-Following Eval (IFEval) for large language models. IFEval is a
straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set
of "verifiable instructions" such as "write in more than 400 words" and
"mention the keyword of AI at least 3 times". We identified 25 types of those
verifiable instructions and constructed around 500 prompts, with each prompt
containing one or more verifiable instructions. We show evaluation results of
two widely available LLMs on the market. Our code and data can be found at
https://github.com/google-research/google-research/tree/master/instruction_following_eva
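The two instructions quoted above are directly checkable with code; a minimal sketch of such verifiable-instruction checkers is below (the functions are written here for illustration and are not the repository's actual checkers).

    import re

    def check_min_word_count(response: str, min_words: int = 400) -> bool:
        """'write in more than 400 words'"""
        return len(response.split()) > min_words

    def check_keyword_count(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
        """'mention the keyword AI at least 3 times'"""
        return len(re.findall(rf"\b{re.escape(keyword)}\b", response)) >= min_count

    def instruction_following_score(response: str, checks) -> float:
        """Fraction of the prompt's verifiable instructions the response satisfies."""
        results = [check(response) for check in checks]
        return sum(results) / len(results)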