Nonparametric Uncertainty Quantification for Single Deterministic Neural Network
This paper proposes a fast and scalable method for uncertainty quantification
of machine learning models' predictions. First, we show a principled way to
measure the uncertainty of a classifier's predictions based on the
Nadaraya-Watson nonparametric estimate of the conditional label distribution.
Importantly, the proposed approach makes it possible to explicitly disentangle
aleatoric and epistemic uncertainties. The resulting method works directly in the feature
space. However, one can apply it to any neural network by considering an
embedding of the data induced by the network. We demonstrate the strong
performance of the method in uncertainty estimation tasks on text
classification problems and a variety of real-world image datasets, such as
MNIST, SVHN, CIFAR-100, and several versions of ImageNet.
Comment: NeurIPS 2022 paper
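To make the estimator concrete, here is a minimal sketch of a Nadaraya-Watson estimate of the conditional label distribution in a hypothetical embedding space, using a Gaussian kernel; the entropy/density split at the end is a common illustrative convention, not necessarily the paper's exact decomposition of aleatoric and epistemic uncertainty.

```python
import numpy as np

def nw_class_probs(z, train_z, train_y, n_classes, h=1.0):
    """Nadaraya-Watson estimate of p(y | z) with a Gaussian kernel."""
    d2 = np.sum((train_z - z) ** 2, axis=1)      # squared distances to the query
    w = np.exp(-d2 / (2.0 * h ** 2))             # kernel weights K((z - z_i) / h)
    density = w.sum()
    if density == 0.0:
        return np.full(n_classes, 1.0 / n_classes), 0.0
    probs = np.array([w[train_y == c].sum() for c in range(n_classes)]) / density
    return probs, density

# Toy usage in a hypothetical 8-dimensional embedding space.
rng = np.random.default_rng(0)
train_z = rng.normal(size=(200, 8))              # embeddings of labeled points
train_y = rng.integers(0, 3, size=200)           # their class labels
p, density = nw_class_probs(rng.normal(size=8), train_z, train_y, n_classes=3)

aleatoric = -np.sum(p * np.log(p + 1e-12))       # entropy of the label estimate
epistemic = 1.0 / (1.0 + density)                # low density = far from data
```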
Towards Computationally Feasible Deep Active Learning
Active learning (AL) is a prominent technique for reducing the annotation
effort required for training machine learning models. Deep learning offers a
solution for several essential obstacles to deploying AL in practice but
introduces many others. One such problem is the excessive computational
resources required to train an acquisition model and estimate its uncertainty
on instances in the unlabeled pool. We propose two techniques that tackle this
issue for text classification and tagging tasks, offering a substantial
reduction of AL iteration duration and the computational overhead introduced by
deep acquisition models in AL. We also demonstrate that our algorithm, which
leverages pseudo-labeling and distilled models, overcomes an essential obstacle
previously revealed in the literature. Namely, it was shown that due
to differences between an acquisition model used to select instances during AL
and a successor model trained on the labeled data, the benefits of AL can
diminish. We show that our algorithm, despite using a smaller and faster
acquisition model, is capable of training a more expressive successor model
with higher performance.
Comment: Accepted at Findings of NAACL-2022
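As an illustration of the general recipe (not the paper's exact algorithm), the sketch below runs one AL iteration with a cheap acquisition model and then trains a larger successor on real plus pseudo-labels; `small_model`, `big_model`, and `oracle` are hypothetical stand-ins with assumed train/predict_proba interfaces.

```python
import numpy as np

def al_iteration(small_model, big_model, labeled, unlabeled, oracle, k=100):
    # 1. Train the cheap acquisition model on the current labeled pool.
    small_model.train(labeled)

    # 2. Score the unlabeled pool by uncertainty (least confidence).
    probs = small_model.predict_proba(unlabeled)        # shape (n, n_classes)
    uncertainty = 1.0 - probs.max(axis=1)
    query_idx = np.argsort(-uncertainty)[:k]            # top-k most uncertain

    # 3. Ask the oracle (human annotator) for labels on the queried items.
    labeled = labeled + [(unlabeled[i], oracle(unlabeled[i])) for i in query_idx]

    # 4. Pseudo-label confident leftovers with the small model, then train
    #    the larger successor model on real plus pseudo-labels.
    queried = set(int(i) for i in query_idx)
    pseudo = [(unlabeled[i], int(probs[i].argmax()))
              for i in range(len(unlabeled))
              if i not in queried and probs[i].max() > 0.95]
    big_model.train(labeled + pseudo)
    return labeled
```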
LM-Polygraph: Uncertainty Estimation for Language Models
Recent advancements in the capabilities of large language models (LLMs) have
paved the way for a myriad of groundbreaking applications in various fields.
However, a significant challenge arises as these models often "hallucinate",
i.e., fabricate facts without providing users an apparent means to discern the
veracity of their statements. Uncertainty estimation (UE) methods are one path
to safer, more responsible, and more effective use of LLMs. However, to date,
research on UE methods for LLMs has been focused primarily on theoretical
rather than engineering contributions. In this work, we tackle this issue by
introducing LM-Polygraph, a framework with implementations of a battery of
state-of-the-art UE methods for LLMs in text generation tasks, with unified
program interfaces in Python. Additionally, it introduces an extendable
benchmark for consistent evaluation of UE techniques by researchers, and a demo
web application that enriches the standard chat dialog with confidence scores,
empowering end-users to discern unreliable responses. LM-Polygraph is
compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and
GPT-4, and is designed to support future releases of similarly styled LMs.
Comment: Accepted at EMNLP-2023
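The following sketch is not the LM-Polygraph API; it shows one common information-based UE baseline (mean token entropy of the generator's next-token distributions) written directly against Hugging Face transformers, the kind of method such a framework would wrap behind a unified interface. The model name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                   # example model only
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                     output_scores=True, return_dict_in_generate=True,
                     pad_token_id=tok.eos_token_id)

# out.scores holds one logits tensor per generated token, shape (1, vocab).
entropies = []
for logits in out.scores:
    p = torch.softmax(logits[0], dim=-1)
    entropies.append(-(p * torch.log(p + 1e-12)).sum())

# Higher mean entropy over the generation = less confident answer.
uncertainty = float(torch.stack(entropies).mean())
```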
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Large language models (LLMs) have demonstrated remarkable capability to
generate fluent responses to a wide variety of user queries, but this has also
resulted in concerns regarding the potential misuse of such texts in
journalistic, educational, and academic contexts. In this work, we aim to develop
automatic systems to identify machine-generated text and to detect potential
misuse. We first introduce M4, a large-scale multi-generator, multi-domain,
and multi-lingual benchmark corpus for machine-generated
text detection. Using this dataset, we experiment with a number of methods and
show that it is challenging for detectors to generalize well on unseen
examples if they are either from different domains or are generated by
different large language models. In such cases, detectors tend to misclassify
machine-generated text as human-written. These results show that the problem is
far from solved and there is a lot of room for improvement. We believe that our
dataset M4, which covers different generators, domains, and languages, will
enable future research towards more robust approaches for this pressing
societal problem. The M4 dataset is available at
https://github.com/mbzuai-nlp/M4.
Comment: 11 pages
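For orientation, here is a toy baseline detector (TF-IDF features with logistic regression), not one of the paper's systems; a real experiment would train and evaluate on the M4 corpus instead of the inline examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Inline toy examples; a real run would load texts and labels from M4.
texts = [
    "rain battered the coastal town for a third straight day",    # human
    "The weather was characterized by sustained precipitation.",  # machine
    "i honestly can't believe the match ended like that",         # human
    "In conclusion, the outcome of the match was unexpected.",    # machine
]
labels = [0, 1, 0, 1]             # 0 = human-written, 1 = machine-generated

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["some unseen text to classify"]))
```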
Word Sense Disambiguation for 158 Languages using Word Embeddings Only
Disambiguation of word senses in context is easy for humans, but is a major
challenge for automatic approaches. Sophisticated supervised and
knowledge-based models were developed to solve this task. However, (i) the
inherent Zipfian distribution of supervised training instances for a given word
and/or (ii) the quality of linguistic knowledge representations motivate the
development of completely unsupervised and knowledge-free approaches to word
sense disambiguation (WSD). They are particularly useful for under-resourced
languages, which lack the resources needed to build either supervised or
knowledge-based models. In this paper, we present a method that takes as input
a standard pre-trained word embedding model and induces a fully-fledged word
sense inventory, which can be used for disambiguation in context. We use this
method to induce a collection of sense inventories for 158 languages on the
basis of the original pre-trained fastText word embeddings by Grave et al.
(2018), enabling WSD in these languages. Models and the system are available online.
Comment: 10 pages, 5 figures, 4 tables, accepted at LREC 2020
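A rough sketch of the overall idea, assuming a gensim KeyedVectors object loaded from pre-trained fastText vectors: cluster a word's nearest neighbours into putative senses, then disambiguate by comparing an averaged context vector with the sense centroids. The paper's actual graph-based induction procedure differs in detail.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def induce_senses(kv, word, topn=50, n_senses=2):
    """Cluster a word's nearest neighbours into putative sense clusters."""
    neighbours = [w for w, _ in kv.most_similar(word, topn=topn)]
    vecs = np.stack([kv[w] for w in neighbours])
    labels = AgglomerativeClustering(n_clusters=n_senses).fit_predict(vecs)
    # Represent each sense by the centroid of its neighbour cluster.
    return [vecs[labels == s].mean(axis=0) for s in range(n_senses)]

def disambiguate(kv, senses, context_words):
    """Pick the sense whose centroid is closest to the averaged context."""
    ctx = np.mean([kv[w] for w in context_words if w in kv], axis=0)
    sims = [ctx @ s / (np.linalg.norm(ctx) * np.linalg.norm(s)) for s in senses]
    return int(np.argmax(sims))
```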
Towards Text Processing System for Emergency Event Detection in the Arctic Zone
We present ongoing work on a text processing system for the detection and analysis of events related to emergencies in the Arctic zone. The task is complicated by data sparseness and by the scarcity of tools and language resources for processing such specific texts. The system performs focused crawling of documents related to emergencies in the Arctic region, text parsing including named entity recognition and geotagging, and indexing of texts with their metadata for faceted search. The system aims at processing both English and Russian text messages and documents. We report preliminary results of the experimental evaluation of the system components on Twitter data.
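An illustrative skeleton of such a pipeline (not the system's actual code), using spaCy for named entity recognition, a toy gazetteer for geotagging, and an in-memory list as a stand-in for a faceted search index:

```python
import spacy

nlp = spacy.load("en_core_web_sm")         # English NER model (must be installed)
GAZETTEER = {"Murmansk": (68.97, 33.08)}   # toy place-name -> (lat, lon) lookup
index = []                                 # stand-in for a faceted search index

def process(text, source):
    doc = nlp(text)
    places = [e.text for e in doc.ents if e.label_ == "GPE"]
    index.append({
        "text": text,
        "source": source,                                    # facet: origin
        "entities": [(e.text, e.label_) for e in doc.ents],  # facet: entities
        "geo": [GAZETTEER[p] for p in places if p in GAZETTEER],
    })

process("Oil spill reported near Murmansk on Tuesday.", source="twitter")
```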
Active learning with deep pre-trained models for sequence tagging of clinical and biomedical texts
Active learning is a technique that helps to minimize the annotation budget required for the creation of a labeled dataset while maximizing the performance of a model trained on this dataset. It has been shown that active learning can be successfully applied to sequence tagging tasks of text processing in conjunction with deep learning models even when a limited amount of labeled data is available. Recent advances in transfer learning methods for natural language processing based on deep pre-trained models such as ELMo and BERT offer a much better ability to generalize on small annotated datasets compared to their shallow counterparts. The combination of deep pre-trained models and active learning leads to a powerful approach to dealing with annotation scarcity. In this work, we investigate the potential of this approach on clinical and biomedical data. The experimental evaluation shows that the combination of active learning and deep pre-trained models outperforms the standard methods of active learning. We also suggest a modification to a standard uncertainty sampling strategy and empirically show that it could be beneficial for the annotation of very skewed datasets. Finally, we propose an annotation tool empowered with active learning and deep pre-trained models that can be used for entity annotation directly from the Jupyter IDE.
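One standard uncertainty sampling strategy for sequence tagging in this line of work is MNLP (Maximum Normalized Log-Probability); the sketch below, which may differ from the paper's exact variant, assumes a hypothetical tagger that exposes per-token log-probabilities of its predicted tags.

```python
import numpy as np

def mnlp_scores(token_log_probs):
    """token_log_probs: one array per sequence, holding the log-probability
    of the tagger's predicted tag at each token position."""
    # Normalize by length so long sequences are not unfairly penalized.
    return np.array([lp.mean() for lp in token_log_probs])

def select_batch(token_log_probs, k):
    scores = mnlp_scores(token_log_probs)
    return np.argsort(scores)[:k]          # least confident sequences first

# Toy usage: three "sequences" with per-token log-probs of predicted tags.
lps = [np.log([0.9, 0.8]), np.log([0.5, 0.4, 0.6]), np.log([0.99])]
print(select_batch(lps, k=1))              # selects the most uncertain sequence
```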