Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural
Language Processing (NLP) during the past decade. However, the demands of long
document analysis are quite different from those of shorter texts, while the
ever-increasing size of documents uploaded online renders automated
understanding of long texts a critical area of research. This article has two
goals: a) it overviews the relevant neural building blocks, thus serving as a
short tutorial, and b) it surveys the state-of-the-art in long document NLP,
mainly focusing on two central tasks: document classification and document
summarization. Sentiment analysis for long texts is also covered, since it is
typically treated as a particular case of document classification.
Additionally, this article discusses the main challenges, issues and current
solutions related to long document NLP. Finally, the relevant, publicly
available, annotated datasets are presented, in order to facilitate further
research.
Comment: 53 pages, 2 figures, 171 citations
Attention over pre-trained Sentence Embeddings for Long Document Classification
Despite being the current de-facto models in most NLP tasks, transformers are
often limited to short sequences due to the quadratic complexity of
self-attention in the number of tokens. Several approaches to this issue have
been studied, either reducing the cost of the self-attention computation, or
modeling smaller sequences and combining them through a recurrence mechanism
or a new transformer model. In this paper, we propose taking advantage of
pre-trained sentence transformers to start from semantically meaningful
embeddings of the individual sentences, and then combine them through a small
attention layer that scales linearly with the document length. We report the
results obtained by this simple architecture on three standard document
classification datasets. When compared with the current state-of-the-art models
using standard fine-tuning, the studied method obtains competitive results
(even if there is no clear best model in this configuration). We also show
that the studied architecture obtains better results when the underlying
transformers are frozen, a configuration that is useful when complete
fine-tuning must be avoided (e.g., when the same frozen transformer is shared
by different applications). Finally, two additional experiments are provided
to further evaluate the relevance of the studied architecture over simpler
baselines.
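To make the architecture concrete, here is a minimal sketch, assuming a frozen pre-trained sentence transformer; the model name, embedding dimension, and class count below are illustrative assumptions, not the paper's exact setup. Each sentence is embedded once, and a single learned attention query pools the sentence embeddings in time linear in the number of sentences.

```python
# Illustrative sketch: freeze a pre-trained sentence transformer, embed each
# sentence, then pool the sentence embeddings with one learned attention
# query that scales linearly in document length.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class AttentionOverSentences(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))   # learned attention query
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (num_sentences, dim) for one document
        scores = sent_embs @ self.query               # (num_sentences,)
        weights = torch.softmax(scores, dim=0)        # attention over sentences
        doc_emb = weights @ sent_embs                 # (dim,) weighted average
        return self.classifier(doc_emb)               # class logits

# Usage: the encoder stays frozen; only the attention query and the
# classifier head are trained. Model name is an assumed choice.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["First sentence of a long document.", "Second sentence."]
embs = torch.tensor(encoder.encode(sentences))        # (num_sentences, 384)
model = AttentionOverSentences(dim=embs.shape[1], num_classes=5)
logits = model(embs)
```

Because the underlying transformer is never updated, this sketch matches the shared-frozen-encoder use case described above: the trainable part is just the attention query and the classification head.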
Evaluating neural multi-field document representations for patent classification
Patent classification constitutes a long-tailed hierarchical learning problem. Prior work has demonstrated the efficacy of neural representations based on pre-trained transformers; however, due to the limited input size of these models, it used only the title and abstract of patents as input. Patent documents consist of several textual fields, some of which are quite long. We show that a baseline using simple tf.idf-based methods can easily leverage this additional information. We propose a new architecture that combines the neural transformer-based representations of the various fields into a meta-embedding, which we demonstrate to outperform the tf.idf-based counterparts, especially on less frequent classes. Using a relatively simple architecture, we outperform the previous state of the art on CPC classification by a margin of 1.2 macro-avg. F1 and 2.6 micro-avg. F1. We identify the textual field giving a “brief-summary” of the patent as the most informative with regard to CPC classification, which points to interesting future directions of research on less computation-intensive models, e.g., by summarizing long documents before neural classification.
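As a hedged illustration of the meta-embedding idea, the sketch below assumes each textual field (e.g., title, abstract, brief summary, claims) has already been encoded to a fixed-size vector by a length-limited transformer; the fusion layer and the label count are hypothetical choices for illustration, not the authors' exact architecture.

```python
# Hypothetical sketch: fuse per-field transformer embeddings into one
# meta-embedding for multi-label CPC classification.
import torch
import torch.nn as nn

class FieldMetaEmbedding(nn.Module):
    def __init__(self, num_fields: int, dim: int, num_labels: int):
        super().__init__()
        # one projection per field, then a fusion layer over the concatenation
        self.field_proj = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_fields)
        )
        self.fuse = nn.Linear(num_fields * dim, dim)
        self.classifier = nn.Linear(dim, num_labels)  # multi-label CPC head

    def forward(self, field_embs: list[torch.Tensor]) -> torch.Tensor:
        # field_embs: one (batch, dim) tensor per textual field
        projected = [proj(e) for proj, e in zip(self.field_proj, field_embs)]
        meta = torch.relu(self.fuse(torch.cat(projected, dim=-1)))
        return self.classifier(meta)  # logits; pair with BCEWithLogitsLoss

# Usage with four assumed fields, each pre-encoded to 768 dims; the label
# count here is arbitrary, not the real CPC label space.
fields = [torch.randn(2, 768) for _ in range(4)]
model = FieldMetaEmbedding(num_fields=4, dim=768, num_labels=600)
logits = model(fields)  # (2, 600)
```

Encoding each field separately keeps every individual input within the transformer's length limit, while the fused meta-embedding still exposes all fields to the classifier.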