Large language models show an emergent ability to learn a new task from a
small number of input-output demonstrations. However, recent work shows that
in-context learners largely rely on their pre-trained knowledge, such as the
sentiment of the labels, instead of finding new associations in the input.
Moreover, the commonly used few-shot evaluation setting, which selects
in-context demonstrations at random, cannot isolate models' ability to
learn a new skill from demonstrations, as most randomly selected
demonstrations do not present relations that are informative for prediction
beyond exposing the new task distribution.
To assess models' in-context learning ability independently of their
memory, we introduce a Conceptual few-shot learning method that selects
demonstrations sharing a possibly informative concept with the predicted
sample. We extract a set of such concepts from annotated explanations and
measure how much models can benefit from presenting these concepts in few-shot
demonstrations.
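As an illustration only, concept-sharing demonstration selection can be sketched as follows, next to the random-selection baseline. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Example` structure, the function names, the rule of requiring at least one shared concept, and the prompt template are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one annotated sample."""
    text: str
    label: str
    concepts: frozenset  # concepts assumed to be extracted from the explanation


def select_conceptual(target, pool, k, rng):
    """Pick up to k demonstrations sharing at least one concept with the target."""
    candidates = [
        ex for ex in pool
        if ex is not target and ex.concepts & target.concepts
    ]
    rng.shuffle(candidates)
    return candidates[:k]


def select_random(target, pool, k, rng):
    """Baseline: pick up to k demonstrations uniformly at random."""
    candidates = [ex for ex in pool if ex is not target]
    return rng.sample(candidates, min(k, len(candidates)))


def build_prompt(demos, target):
    """Format demonstrations and the target in an assumed Input/Output template."""
    blocks = [f"Input: {d.text}\nOutput: {d.label}" for d in demos]
    blocks.append(f"Input: {target.text}\nOutput:")
    return "\n\n".join(blocks)
```

Comparing a model's accuracy on prompts built with `select_conceptual` against prompts built with `select_random` would give one possible measure of how much the model benefits from the presented concept.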
We find that smaller models are more sensitive to the presented concepts.
While some of the models are able to benefit from concept-presenting
demonstrations for each assessed concept, we find that none of the assessed
in-context learners can benefit from all presented reasoning concepts
consistently, leaving in-context concept learning an open challenge.