93 research outputs found
Improving reference mining in patents with BERT
In this paper we address the challenge of extracting scientific references
from patents. We approach the problem as a sequence labelling task and
investigate the merits of BERT models for the extraction of these long
sequences. References in patents to scientific literature are relevant to study
the connection between science and industry. Most prior work only uses the
front-page citations for this analysis, which are provided in the metadata of
patent archives. In this paper we build on prior work using Conditional Random
Fields (CRF) and Flair for reference extraction. We improve the quality of the
training data and train three BERT-based models on the labelled data (BERT,
bioBERT, sciBERT). We find that the improved training data leads to a large
improvement in the quality of the trained models. In addition, the BERT models
beat CRF and Flair, with recall scores around 97% obtained with
cross-validation. With the best model we label a large collection of 33 thousand
patents, extract the citations, and match them to publications in the Web of
Science database. We extract 50% more references than with the old training
data and methods: 735 thousand references in total. With these
patent-publication links, follow-up research will further analyze which types
of scientific work lead to inventions.
Comment: 10 pages, 3 figures
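The sequence-labelling framing above can be illustrated with a minimal BIO-tag decoder: given per-token labels predicted by a model such as BERT, contiguous B-REF/I-REF spans are merged back into reference strings. The tag names and helper below are illustrative assumptions, not the paper's actual code.

```python
def decode_references(tokens, labels):
    """Merge contiguous B-REF/I-REF labelled tokens into reference strings.

    Illustrative BIO decoder; the tag set is an assumption, not the
    paper's actual label scheme.
    """
    references, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-REF":                 # a new reference starts here
            if current:
                references.append(" ".join(current))
            current = [token]
        elif label == "I-REF" and current:   # continue the open reference
            current.append(token)
        else:                                # "O" tag: close any open span
            if current:
                references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references

tokens = ["See", "Smith", "et", "al.", "2010", ",", "page", "3"]
labels = ["O", "B-REF", "I-REF", "I-REF", "I-REF", "O", "O", "O"]
print(decode_references(tokens, labels))  # ['Smith et al. 2010']
```

In a full pipeline, extracted strings like this would then be matched against publication records, as the paper does with the Web of Science database.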
The reach of commercially motivated junk news on Facebook
Commercially motivated junk news -- i.e. money-driven, highly shareable
clickbait with low journalistic production standards -- constitutes a vast and
largely unexplored news media ecosystem. Using publicly available Facebook
data, we compared the reach of junk news on Facebook pages in the Netherlands
to the reach of Dutch mainstream news on Facebook. During the period 2013-2017
the total number of user interactions with junk news significantly exceeded
that with mainstream news. Over 5 million of the 10 million Dutch Facebook
users have interacted with a junk news post at least once. Junk news Facebook
pages also had a significantly stronger increase in the number of user
interactions over time than mainstream news. Since the beginning of 2016 the
average number of user interactions per junk news post has consistently
exceeded the average number of user interactions per mainstream news post.
Comment: 18 pages, 4 figures, submitted pre-print
The merits of Universal Language Model Fine-tuning for Small Datasets -- a case with Dutch book reviews
We evaluated the effectiveness of using language models, that were
pre-trained in one domain, as the basis for a classification model in another
domain: Dutch book reviews. Pre-trained language models have opened up new
possibilities for classification tasks with limited labelled data, because
representation can be learned in an unsupervised fashion. In our experiments we
have studied the effects of training set size (100-1600 items) on the
prediction accuracy of a ULMFiT classifier, based on a language model that we
pre-trained on the Dutch Wikipedia. We also compared ULMFiT to Support Vector
Machines, which are traditionally considered suitable for small collections. We
found that ULMFiT outperforms SVM for all training set sizes and that
satisfactory results (~90%) can be achieved using training sets that can be
manually annotated within a few hours. We deliver both our new benchmark
collection of Dutch book reviews for sentiment classification as well as the
pre-trained Dutch language model to the community.
Comment: 5 pages, 2 figures
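An SVM baseline of the kind compared against above typically operates on TF-IDF features rather than learned representations. A minimal stdlib sketch of TF-IDF weighting (the smoothing variant is one common choice, not the paper's exact setup; real pipelines would use e.g. scikit-learn's TfidfVectorizer):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed TF-IDF weights per document.

    Minimal illustration of the feature representation an SVM baseline
    would consume; the smoothing scheme is an assumption, not the
    paper's exact configuration.
    """
    n = len(docs)
    term_docs = Counter()
    for doc in docs:
        term_docs.update(set(doc))           # document frequency per term
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in term_docs.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return weighted

# Two toy Dutch book-review fragments: "boek" occurs in both reviews,
# so it is down-weighted relative to the discriminative sentiment words.
weights = tfidf([["goed", "boek"], ["slecht", "boek"]])
```

Unlike a pre-trained language model, such a representation learns nothing from unlabelled text, which is why ULMFiT can have an edge on small training sets.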
Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain
The amount of archaeological literature is growing rapidly. Until recently,
these data were only accessible through metadata search. We implemented a text
retrieval engine for a large archaeological text collection (… million
words). In archaeological IR, domain-specific entities such as locations, time
periods, and artefacts, play a central role. This motivated the development of
a named entity recognition (NER) model to annotate the full collection with
archaeological named entities. In this paper, we present ArcheoBERTje, a BERT
model pre-trained on Dutch archaeological texts. We compare the model's quality
and output on a Named Entity Recognition task to a generic multilingual model
and a generic Dutch model. We also investigate ensemble methods for combining
multiple BERT models, and combining the best BERT model with a domain thesaurus
using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms
both the multilingual and Dutch model significantly with a smaller standard
deviation between runs, reaching an average F1 score of 0.735. The model also
outperforms ensemble methods combining the three models. Combining ArcheoBERTje
predictions and explicit domain knowledge from the thesaurus did not increase
the F1 score. We quantitatively and qualitatively analyse the differences
between the vocabulary and output of the BERT models on the full collection and
provide some valuable insights in the effect of fine-tuning for specific
domains. Our results indicate that for a highly specific text domain such as
archaeology, further pre-training on domain-specific data increases the model's
quality on NER by a much larger margin than shown for other domains in the
literature, and that domain-specific pre-training makes the addition of domain
knowledge from a thesaurus unnecessary.
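The model-combination experiments described above can be illustrated with a simple per-token majority vote over the label sequences of several NER models. This is a generic ensembling sketch, not ArcheoBERTje's exact combination method, and the model roles in the comments are assumptions.

```python
from collections import Counter

def majority_vote(label_sequences):
    """Per-token majority vote over several models' NER label sequences.

    Ties fall back to the first model's label. A generic ensembling
    sketch, not the paper's exact method.
    """
    voted = []
    for labels in zip(*label_sequences):
        counts = Counter(labels)
        top_label, top_count = counts.most_common(1)[0]
        if list(counts.values()).count(top_count) > 1:
            voted.append(labels[0])   # tie: trust the first (best) model
        else:
            voted.append(top_label)
    return voted

preds = [
    ["B-LOC", "O", "B-PER"],   # e.g. the domain-adapted model
    ["B-LOC", "O", "O"],       # e.g. a multilingual model
    ["O",     "O", "O"],       # e.g. a generic Dutch model
]
print(majority_vote(preds))  # ['B-LOC', 'O', 'O']
```

The example also shows the risk the paper observes: when the weaker models agree, they can vote down a correct prediction from the strongest model, so the ensemble need not beat the best single model.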
Retrieval for Extremely Long Queries and Documents with RPRS: a Highly Efficient and Effective Transformer-based Re-Ranker
Retrieval with extremely long queries and documents is a well-known and
challenging task in information retrieval and is commonly known as
Query-by-Document (QBD) retrieval. Specifically designed Transformer models
that can handle long input sequences have not shown high effectiveness in QBD
tasks in previous work. We propose a Re-Ranker based on the novel Proportional
Relevance Score (RPRS) to compute the relevance score between a query and the
top-k candidate documents. Our extensive evaluation shows RPRS obtains
significantly better results than the state-of-the-art models on five different
datasets. Furthermore, RPRS is highly efficient since all documents can be
pre-processed, embedded, and indexed before query time which gives our
re-ranker the advantage of having a complexity of O(N) where N is the total
number of sentences in the query and candidate documents. Furthermore, our
method solves the problem of the low-resource training in QBD retrieval tasks
as it does not need large amounts of training data, and has only three
parameters with a limited range that can be optimized with a grid search even
if a small amount of labeled data is available. Our detailed analysis shows
that RPRS benefits from covering the full length of candidate documents and
queries.
Comment: Accepted at ACM Transactions on Information Systems (ACM TOIS journal)
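The three-parameter tuning mentioned above can be sketched generically: enumerate every combination over small ranges and keep the one that maximizes a validation metric. The parameter names and the scoring function below are placeholders, not RPRS's actual parameters.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive search over a small parameter grid.

    Feasible here because, as with RPRS, there are only a few
    parameters with limited ranges.
    """
    best_params, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Placeholder validation metric that peaks at a=2, b=0.5, c=10;
# a real search would score ranking quality on held-out labels.
def toy_score(p):
    return -((p["a"] - 2) ** 2 + (p["b"] - 0.5) ** 2 + (p["c"] - 10) ** 2)

grid = {"a": [1, 2, 3], "b": [0.25, 0.5, 0.75], "c": [5, 10, 20]}
best, _ = grid_search(grid, toy_score)
print(best)  # {'a': 2, 'b': 0.5, 'c': 10}
```

With three parameters of three values each, the grid has only 27 cells, which is why even a small labelled set suffices for tuning.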
Improving BERT-based query-by-document retrieval with multi-task optimization
Query-by-document (QBD) retrieval is an Information Retrieval task in which
a seed document acts as the query and the goal is to retrieve related
documents; it is particularly common in professional search tasks. In this
work we improve the retrieval effectiveness of the BERT re-ranker, proposing
an extension to its fine-tuning step to better exploit the context of
queries. To this end, we use an additional document-level representation
learning objective besides the ranking objective when fine-tuning the BERT
re-ranker. Our experiments on two QBD retrieval benchmarks show that the
proposed multi-task optimization significantly improves the ranking
effectiveness without changing the BERT re-ranker or using additional
training samples. In future work, the generalizability of our approach to
other retrieval tasks should be further investigated.
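Multi-task fine-tuning of the kind described above combines the ranking objective with an auxiliary document-level representation objective in a single training loss. A minimal sketch of such a weighted combination (the fixed weighting is an illustrative assumption; the paper's exact formulation may differ):

```python
def multitask_loss(ranking_loss, representation_loss, weight=0.5):
    """Weighted sum of the ranking loss and an auxiliary document-level
    representation loss, as used in multi-task fine-tuning.

    The fixed scalar weight is an illustrative assumption.
    """
    return ranking_loss + weight * representation_loss

# During fine-tuning, each batch would contribute both terms:
loss = multitask_loss(ranking_loss=0.8, representation_loss=0.4)
print(loss)  # 1.0
```

The key property is that only the training signal changes: at inference time the re-ranker itself is unmodified, which matches the abstract's claim of improved effectiveness without changing the BERT re-ranker.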
Fine-grained Affective Processing Capabilities Emerging from Large Language Models
Large language models, in particular generative pre-trained transformers
(GPTs), show impressive results on a wide variety of language-related tasks. In
this paper, we explore ChatGPT's zero-shot ability to perform affective
computing tasks using prompting alone. We show that ChatGPT a) performs
meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions,
b) has meaningful emotion representations in terms of emotion categories and
these affective dimensions, and c) can perform basic appraisal-based emotion
elicitation of situations based on a prompt-based computational implementation
of the OCC appraisal model. These findings are highly relevant: First, they
show that the ability to solve complex affect processing tasks emerges from
language-based token prediction trained on extensive data sets. Second, they
show the potential of large language models for simulating, processing and
analyzing human emotions, which has important implications for various
applications such as sentiment analysis, socially interactive agents, and
social robotics
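The zero-shot setup described above can be sketched as a templated instruction asking the model to rate a text on the Valence, Arousal and Dominance dimensions. The wording and scale below are a hypothetical prompt, not the paper's exact one.

```python
def vad_prompt(text):
    """Build a zero-shot prompt asking an LLM to rate a text on the
    Valence, Arousal and Dominance dimensions.

    Illustrative wording and scale, not the paper's exact prompt.
    """
    return (
        "Rate the following text on Valence, Arousal and Dominance, "
        "each on a scale from 1 (low) to 9 (high). "
        "Answer as 'V=<n>, A=<n>, D=<n>'.\n\n"
        f"Text: {text}"
    )

print(vad_prompt("What a wonderful surprise!"))
```

Such a prompt would be sent to the model's chat API; parsing the structured "V=, A=, D=" reply then yields numeric affect ratings with no task-specific training.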