Learning to Skim Text
Recurrent Neural Networks are showing much promise in many sub-areas of
natural language processing, ranging from document classification to machine
translation to automatic question answering. Despite their promise, many
recurrent models have to read the whole text word by word, making it slow to
handle long documents. For example, it is difficult to use a recurrent network
to read a book and answer questions about it. In this paper, we present an
approach to reading text that skips irrelevant information when needed. The
underlying model is a recurrent network that learns how far to jump after
reading a few words of the input text. We employ a standard policy gradient
method to train the model to make discrete jumping decisions. In our benchmarks
on four different tasks, including number prediction, sentiment analysis, news
article classification and automatic Q&A, our proposed model, a modified LSTM
with jumping, is up to 6 times faster than the standard sequential LSTM while
maintaining the same or even better accuracy.
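A rough, self-contained sketch of the jumping mechanism described above (not the authors' implementation) is given below. It assumes a PyTorch LSTM cell, illustrative hyper-parameters such as read_size and max_jump, and a single unbatched document; the sampled jump sizes are trained with REINFORCE, a standard policy-gradient estimator, while the prediction head is trained with ordinary cross-entropy.

```python
# Sketch only: hyper-parameters and the reward definition are illustrative assumptions.
import torch
import torch.nn as nn

class SkimLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128,
                 read_size=3, max_jump=5, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.jump_head = nn.Linear(hidden_dim, max_jump + 1)  # skip 0..max_jump tokens
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.read_size = read_size

    def forward(self, tokens):
        # tokens: (seq_len,) LongTensor for one document (batching omitted for brevity)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        jump_log_probs = []                    # log-probs of sampled jumps, for REINFORCE
        pos = 0
        while pos < tokens.size(0):
            # Read a small window of tokens word by word.
            for tok in tokens[pos:pos + self.read_size]:
                h, c = self.cell(self.embed(tok).unsqueeze(0), (h, c))
            pos += self.read_size
            # Sample how many upcoming tokens to skip before the next read.
            dist = torch.distributions.Categorical(logits=self.jump_head(h))
            jump = dist.sample()
            jump_log_probs.append(dist.log_prob(jump))
            pos += int(jump.item())
        return self.classifier(h), torch.stack(jump_log_probs)

# One training step: cross-entropy on the prediction plus a REINFORCE term on the
# discrete jump decisions, using the (negative) classification loss as the reward.
model = SkimLSTM(vocab_size=10_000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 10_000, (200,))      # dummy document
label = torch.tensor([1])
logits, jump_log_probs = model(tokens)
ce = nn.functional.cross_entropy(logits, label)
reward = -ce.detach()                          # higher reward for lower loss
loss = ce - reward * jump_log_probs.sum()      # CE minus reward-weighted log-probs
opt.zero_grad(); loss.backward(); opt.step()
```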
Does GNN Pretraining Help Molecular Representation?
Extracting informative representations of molecules using graph neural
networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph
research community has been trying to replicate the success of self-supervised
pretraining in natural language processing, with several successes claimed.
However, we find that the benefit brought by self-supervised pretraining on small
molecular data can be negligible in many cases. We conduct thorough ablation
studies on the key components of GNN pretraining, including pretraining
objectives, data splitting methods, input features, pretraining dataset scales,
and GNN architectures, to see how they affect the accuracy of the downstream
tasks. Our first important finding is that self-supervised graph pretraining does
not have statistically significant advantages over non-pretraining methods in
many settings. Secondly, although a noticeable improvement can be observed
with additional supervised pretraining, the improvement may diminish with
richer features or more balanced data splits. Thirdly, hyper-parameters can
have a larger impact on downstream accuracy than the choice of
pretraining task, especially when the downstream datasets are small.
Finally, we offer the conjecture that the complexity of some pretraining
methods on small molecules may be insufficient, followed by empirical
evidence on different pretraining datasets.
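The core evaluation pattern the abstract describes, checking whether a pretraining gain survives a significance test across seeds, can be sketched as below. This is a toy skeleton under stated assumptions: the model, the synthetic pretraining and downstream tasks, and the accuracy metric are stand-ins for the paper's molecular GNNs, datasets, and ROC-AUC evaluation; only the compare-with-and-without-pretraining pattern is illustrated.

```python
# Toy ablation skeleton: every dataset and model here is a synthetic stand-in.
import numpy as np
import torch
import torch.nn as nn
from scipy import stats

def make_model():
    # Stand-in for a molecular GNN encoder plus prediction head.
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def train(model, X, y, steps=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.binary_cross_entropy_with_logits(model(X), y)
        opt.zero_grad(); loss.backward(); opt.step()

def accuracy(model, X, y):
    with torch.no_grad():
        return ((model(X) > 0).float() == y).float().mean().item()

def run(seed, use_pretraining):
    torch.manual_seed(seed)
    model = make_model()
    # Synthetic "pretraining" corpus and a small downstream task (placeholders
    # for molecular pretraining data and a property-prediction benchmark).
    X_pre = torch.randn(512, 16)
    y_pre = (X_pre.sum(dim=1, keepdim=True) > 0).float()
    X_down, X_test = torch.randn(128, 16), torch.randn(256, 16)
    y_down = (X_down[:, :1] > 0).float()
    y_test = (X_test[:, :1] > 0).float()
    if use_pretraining:
        train(model, X_pre, y_pre)               # pretraining stage
    train(model, X_down, y_down, steps=100)      # fine-tuning stage
    return accuracy(model, X_test, y_test)

pretrained = np.array([run(s, True) for s in range(5)])
scratch = np.array([run(s, False) for s in range(5)])
# Two-sample t-test: is the pretraining gain distinguishable from seed noise?
stat, p_value = stats.ttest_ind(pretrained, scratch)
print(f"mean gain = {pretrained.mean() - scratch.mean():+.3f}, p = {p_value:.3f}")
```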