State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural
language processing (NLP). Yet, what 'good generalisation' entails and how it
should be evaluated is not well understood, nor are there any common standards
to evaluate it. In this paper, we aim to lay the groundwork to improve both of
these issues. We present a taxonomy for characterising and understanding
generalisation research in NLP, we use that taxonomy to present a comprehensive
map of published generalisation studies, and we make recommendations for which
areas might deserve attention in the future. Our taxonomy is based on an
extensive literature review of generalisation research, and contains five axes
along which studies can differ: their main motivation, the type of
generalisation they aim to solve, the type of data shift they consider, the
source by which this data shift is obtained, and the locus of the shift within
the modelling pipeline. We use our taxonomy to classify over 400 previous
papers that test generalisation, for a total of more than 600 individual
experiments. Considering the results of this review, we present an in-depth
analysis of the current state of generalisation research in NLP, and make
recommendations for the future. Along with this paper, we release a webpage
where the results of our review can be dynamically explored, and which we
intend to update as new NLP generalisation studies are published. With this
work, we aim to take steps towards making state-of-the-art generalisation
testing the new status quo in NLP.
Comment: 35 pages of content + 53 pages of references
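The five axes described above amount to a structured record per study. Below is a minimal Python sketch of how one entry in such a taxonomy map might be encoded; the field names and example values are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical encoding of the paper's five taxonomy axes. Field names
# and example values are illustrative assumptions, not the authors'
# actual schema.
@dataclass
class GeneralisationStudy:
    motivation: str            # why generalisation is tested, e.g. "practical"
    generalisation_type: str   # e.g. "compositional", "cross-lingual"
    shift_type: str            # e.g. "covariate", "label", "full"
    shift_source: str          # e.g. "naturally occurring", "generated"
    shift_locus: str           # stage of the pipeline, e.g. "train-test"

# One entry in a map of published generalisation studies:
entry = GeneralisationStudy(
    motivation="practical",
    generalisation_type="compositional",
    shift_type="covariate",
    shift_source="generated",
    shift_locus="train-test",
)
```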
InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems
In the field of artificial intelligence for science, the scarcity of labeled
data for real-world problems is a persistent and fundamental challenge. The
prevailing approach is to pretrain a powerful task-agnostic model on a large
unlabeled corpus, but such models may still struggle to transfer knowledge to
downstream tasks. In this study, we propose InstructBio, a semi-supervised
learning algorithm that takes better advantage of unlabeled examples. It
introduces an instructor model that provides confidence ratios as a measure
of the reliability of pseudo-labels. These confidence scores then guide the
target model to weight data points differently, avoiding over-reliance on
labeled data and the negative influence of incorrect
pseudo-annotations. Comprehensive experiments show that InstructBio
substantially improves the generalization ability of molecular models, not
only in molecular property prediction but also in activity cliff estimation,
demonstrating the superiority of the proposed method. Furthermore, our evidence
indicates that InstructBio can be equipped with cutting-edge pretraining
methods and used to establish large-scale and task-specific pseudo-labeled
molecular datasets, which reduces the predictive errors and shortens the
training process. Our work provides strong evidence that semi-supervised
learning can be a promising tool to overcome the data scarcity limitation and
advance molecular representation learning.
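The mechanism described here, an instructor model whose confidence ratios down-weight unreliable pseudo-labels in the target model's loss, can be sketched generically. This is a minimal illustration under assumed names, not InstructBio's actual objective.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(target_logits: torch.Tensor,
                             pseudo_labels: torch.Tensor,
                             confidence: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on pseudo-labelled examples, scaled per example by an
    instructor-provided confidence ratio in [0, 1]. A hypothetical sketch of
    the idea in the abstract, not InstructBio's actual objective."""
    # Keep per-example losses (no reduction) so each one can be reweighted.
    per_example = F.cross_entropy(target_logits, pseudo_labels, reduction="none")
    # Low-confidence pseudo-labels contribute less to the gradient, limiting
    # the influence of incorrect pseudo-annotations.
    return (confidence * per_example).mean()

# Usage: logits come from the target model; pseudo-labels and confidence
# ratios come from the instructor model (all values below are dummies).
logits = torch.randn(8, 2, requires_grad=True)  # 8 molecules, binary property
pseudo = torch.randint(0, 2, (8,))
conf = torch.rand(8)
loss = confidence_weighted_loss(logits, pseudo, conf)
loss.backward()
```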
A Survey on In-context Learning
With the increasing ability of large language models (LLMs), in-context
learning (ICL) has become a new paradigm for natural language processing (NLP),
where LLMs make predictions based only on contexts augmented with a few
examples. Exploring ICL to evaluate and extrapolate the abilities of LLMs has
become a new trend. In this paper, we aim to survey and summarize the progress
and challenges of ICL. We first present a formal definition of ICL and clarify
its relationship to related studies. Then, we organize and discuss advanced
techniques, including training strategies, demonstration design strategies,
as well as related analysis. Finally, we discuss the challenges of ICL and
provide potential directions for further research. We hope that our work can
encourage more research on uncovering how ICL works and improving ICL.
Comment: Papers collected until 2023/05/2
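Concretely, "contexts augmented with a few examples" means the model conditions on a prompt that concatenates labelled demonstrations with the test query. Below is a minimal sketch of such a prompt builder; the template and labels are illustrative assumptions, not a format prescribed by the survey.

```python
def build_icl_prompt(demonstrations, query):
    """Concatenate few-shot demonstrations with the test query. The model
    is expected to continue the pattern and emit a label for the query.
    The template is an illustrative assumption, not one from the survey."""
    parts = [f"Input: {x}\nLabel: {y}" for x, y in demonstrations]
    parts.append(f"Input: {query}\nLabel:")  # model completes the label
    return "\n\n".join(parts)

demos = [("The movie was wonderful.", "positive"),
         ("A dull, plodding mess.", "negative")]
print(build_icl_prompt(demos, "Sharp writing and great pacing."))
```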