Inductive reasoning is a core problem-solving capacity: humans can identify
underlying principles from a few examples, which can then be robustly
generalized to novel scenarios. Recent work has evaluated large language models
(LLMs) on inductive reasoning tasks by directly prompting them, yielding
"in-context learning." This approach can work well for straightforward inductive tasks, but it
performs very poorly on more complex tasks such as the Abstraction and
Reasoning Corpus (ARC). In this work, we propose to improve the inductive
reasoning ability of LLMs by generating explicit hypotheses at multiple levels
of abstraction: we prompt the LLM to propose multiple abstract hypotheses about
the problem, in natural language, then implement the natural language
hypotheses as concrete Python programs. These programs can be directly verified
by running them on the observed examples and generalized to novel inputs. Because
generation with state-of-the-art LLMs is prohibitively costly, we introduce an
intermediate filtering step before hypotheses are implemented as programs: we
either ask the LLM to summarize the hypotheses into a smaller set, or ask human
annotators to select a subset of them. We verify our
pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its
variant 1D-ARC, and the string-transformation dataset SyGuS. On a random 40-problem
subset of ARC, our automated pipeline using LLM summaries achieves 27.5%
accuracy, significantly outperforming the direct-prompting baseline (12.5%
accuracy). With minimal human input (selecting from LLM-generated candidates),
performance is boosted to 37.5%. (We argue this is a lower bound on the
performance of our approach without filtering.) Our ablation
studies show that abstract hypothesis generation and concrete program
representations are both beneficial for LLMs to perform inductive reasoning
tasks.
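
Below is a minimal sketch, in Python, of the search-and-verify loop the pipeline describes. The helper functions (generate_hypotheses, summarize_hypotheses, implement_program) are hypothetical placeholders standing in for prompted LLM calls; the paper's actual prompts and interfaces may differ.

```python
# A minimal sketch of the hypothesis-search pipeline, assuming
# hypothetical helpers that wrap prompted LLM calls.
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]       # an ARC-style grid of color indices
Example = Tuple[Grid, Grid]  # (input grid, output grid)

def hypothesis_search(
    train_examples: List[Example],
    test_input: Grid,
    generate_hypotheses: Callable[[List[Example], int], List[str]],
    summarize_hypotheses: Callable[[List[str], int], List[str]],
    implement_program: Callable[[str], Callable[[Grid], Grid]],
    n_hypotheses: int = 64,
    n_kept: int = 8,
) -> Optional[Grid]:
    # Step 1: propose many abstract hypotheses in natural language.
    hypotheses = generate_hypotheses(train_examples, n_hypotheses)

    # Step 2 (filtering): summarize into a smaller candidate set to
    # limit the number of costly program-generation calls.
    candidates = summarize_hypotheses(hypotheses, n_kept)

    # Step 3: implement each candidate as a concrete Python program
    # and verify it by executing it on the observed examples.
    for hypothesis in candidates:
        program = implement_program(hypothesis)
        try:
            if all(program(x) == y for x, y in train_examples):
                # A program consistent with every training example is
                # generalized to the novel test input.
                return program(test_input)
        except Exception:
            continue  # discard programs that crash on the examples
    return None  # no verified hypothesis found
```

Because verification executes the candidate programs directly, any hypothesis whose implementation fails on even one training example is rejected before it is ever applied to the test input.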