ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models
Large pre-trained vision-language models have shown great prominence in
transferring pre-acquired knowledge to various domains and downstream tasks
with appropriate prompting or tuning. Existing prevalent tuning methods fall
into three categories: 1) prompt engineering, which hand-crafts suitable prompt
texts but is time-consuming and requires domain expertise; 2) fine-tuning the
whole model, which is extremely inefficient; and 3) prompt tuning, which learns
parameterized prompt embeddings through the text encoder.
Nevertheless, all methods rely on the text encoder for bridging the modality
gap between vision and language. In this work, we question the necessity of the
cumbersome text encoder for a more lightweight and efficient tuning paradigm as
well as more representative prompt embeddings closer to the image
representations. To achieve this, we propose a Concept Embedding Search (ConES)
approach by optimizing prompt embeddings -- without the need of the text
encoder -- to capture the 'concept' of the image modality through a variety of
task objectives. By dropping the text encoder, we are able to significantly
speed up the learning process, e.g., from about an hour to just ten minutes in
our experiments for personalized text-to-image generation without impairing the
generation quality. Moreover, our proposed approach is orthogonal to current
existing tuning methods since the searched concept embeddings can be further
utilized in the next stage of fine-tuning the pre-trained large models for
boosting performance. Extensive experiments show that our approach can beat the
prompt tuning and textual inversion methods in a variety of downstream tasks
including object detection, instance segmentation, and image generation. Our
approach also shows better generalization to unseen concepts in specialized
domains, such as the medical domain.
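The central mechanism, optimizing a concept embedding directly against a task objective with no text encoder in the loop, can be sketched in a toy form. Everything below (the random stand-in "image features", the cosine objective, the dimensions) is illustrative rather than the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Frozen "image encoder" outputs for a few images of one concept
# (stand-ins for CLIP-style visual features; purely synthetic here).
image_feats = rng.normal(size=(8, dim))
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)

# Learnable concept embedding, optimized directly -- no text encoder involved.
concept = rng.normal(size=dim)

def loss_and_grad(v):
    """Negative mean cosine similarity to the image features, plus its gradient."""
    n = np.linalg.norm(v)
    u = v / n
    cos = image_feats @ u                    # per-image cosine similarity
    loss = -float(cos.mean())
    # d/dv of (f . v/|v|) = (f - (f.u) u) / |v|
    grad = -(image_feats - np.outer(cos, u)).mean(axis=0) / n
    return loss, grad

lr = 0.5
init_loss, _ = loss_and_grad(concept)
for _ in range(200):
    _, grad = loss_and_grad(concept)
    concept -= lr * grad
```

In the paper's setting the task objective would come from the downstream loss (detection, segmentation, or generation) rather than raw cosine similarity, but the optimization pattern is the same: gradients flow into the embedding itself.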
Investigating the Effects of Word Substitution Errors on Sentence Embeddings
A key initial step in several natural language processing (NLP) tasks
involves embedding phrases of text to vectors of real numbers that preserve
semantic meaning. To that end, several methods have been recently proposed with
impressive results on semantic similarity tasks. However, all of these
approaches assume that perfect transcripts are available when generating the
embeddings. While this is a reasonable assumption for analysis of written text,
it is limiting for analysis of transcribed text. In this paper we investigate
the effects of word substitution errors, such as those arising from automatic
speech recognition (ASR), on several state-of-the-art sentence embedding
methods. To do this, we propose a new simulator that allows the experimenter to
induce ASR-plausible word substitution errors in a corpus at a desired word
error rate. We use this simulator to evaluate the robustness of several
sentence embedding methods. Our results show that pre-trained neural sentence
encoders are both robust to ASR errors and perform well on textual similarity
tasks after errors are introduced. Meanwhile, unweighted averages of word
vectors perform well with perfect transcriptions, but their performance
degrades rapidly on textual similarity tasks for text with word substitution
errors.
Comment: 4 pages, 2 figures. Copyright IEEE 2019. Accepted and to appear in
the Proceedings of the 44th International Conference on Acoustics, Speech,
and Signal Processing 2019 (IEEE-ICASSP-2019), May 12-17 in Brighton, U.K.
Personal use of this material is permitted. However, permission to
reprint/republish this material must be obtained from the IEEE.
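A minimal sketch of such a simulator, assuming a hand-made confusion table: the real simulator would derive its confusions from an ASR confusion model, and the words and function names here are hypothetical:

```python
import random

def induce_substitutions(words, confusions, wer, seed=0):
    """Replace roughly a `wer` fraction of the words with an ASR-plausible
    confusion. Words with no entry in `confusions` are left unchanged, so the
    realized error rate can fall below the requested `wer`."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if w in confusions and rng.random() < wer:
            out.append(rng.choice(confusions[w]))   # phonetically similar word
        else:
            out.append(w)
    return out

# Toy confusion table of homophone-style substitutions.
confusions = {"their": ["there"], "two": ["to", "too"], "see": ["sea"]}
clean = "their plan was to see the two boats".split()
noisy = induce_substitutions(clean, confusions, wer=1.0)
```

Because substitutions preserve sentence length, the corrupted corpus can be fed to any sentence embedding method unchanged, which is what makes the robustness comparison clean.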
Transformers and the representation of biomedical background knowledge
BioBERT and BioMegatron are Transformer models adapted for the biomedical
domain based on publicly available biomedical corpora. As such, they have the
potential to encode large-scale biological knowledge. We investigate the
encoding and representation of biological knowledge in these models, and its
potential utility to support inference in cancer precision medicine - namely,
the interpretation of the clinical significance of genomic alterations. We
compare the performance of different transformer baselines; we use probing to
determine the consistency of encodings for distinct entities; and we use
clustering methods to compare and contrast the internal properties of the
embeddings for genes, variants, drugs and diseases. We show that these models
do indeed encode biological knowledge, although some of this is lost in
fine-tuning for specific tasks. Finally, we analyse how the models behave with
regard to biases and imbalances in the dataset.
Comment: 22 pages, 12 figures, supplementary methods, tables and figures at
the end of the manuscript.
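One simple probe of whether entity types (genes, drugs, and so on) are separated in embedding space, in the spirit of the clustering analysis described, is to compare within-type against cross-type cosine similarity. The vectors below are synthetic stand-ins, not actual BioBERT or BioMegatron embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 32

def mean_cos(A, B):
    """Mean pairwise cosine similarity between the rows of A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).mean())

# Simulate embeddings in which each entity type shares a type direction;
# in a real probe these rows would be model hidden states for entity names.
gene_axis = rng.normal(size=dim)
drug_axis = rng.normal(size=dim)
genes = rng.normal(size=(20, dim)) + 2.0 * gene_axis
drugs = rng.normal(size=(20, dim)) + 2.0 * drug_axis

within = (mean_cos(genes, genes) + mean_cos(drugs, drugs)) / 2
across = mean_cos(genes, drugs)
```

A clear within-versus-across gap is weak evidence that the model's representation space groups entities by biological type; the paper's clustering methods make the same comparison with richer machinery.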
Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
Text embedding models from Natural Language Processing can map text data
(e.g. words, sentences, documents) to supposedly meaningful numerical
representations (a.k.a. text embeddings). While such models are increasingly
applied in social science research, one important issue is often not addressed:
the extent to which these embeddings are valid representations of constructs
relevant for social science research. We therefore propose the use of the
classic construct validity framework to evaluate the validity of text
embeddings. We show how this framework can be adapted to the opaque and
high-dimensional nature of text embeddings, with application to survey
questions. We include several popular text embedding methods (e.g. fastText,
GloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct
validity analyses. We find evidence of convergent and discriminant validity in
some cases. We also show that embeddings can be used to predict respondents'
answers to completely new survey questions. Furthermore, BERT-based embedding
techniques and the Universal Sentence Encoder provide more valid
representations of survey questions than do others. Our results thus highlight
the necessity to examine the construct validity of text embeddings before
deploying them in social science research.
Comment: Under review
Automatic information search for countering covid-19 misinformation through semantic similarity
Master's thesis in Bioinformatics and Computational Biology.
Information quality in social media is an increasingly important issue, and
the misinformation problem has become even more critical during the current
COVID-19 pandemic, leaving people exposed to false and potentially harmful
claims and rumours. Civil society organizations, such as the
World Health Organization, have demanded a global call for action to promote access to health
information and mitigate harm from health misinformation. Consequently, this project pursues
countering the spread of COVID-19 infodemic and its potential health hazards.
In this work, we give an overall view of models and methods that have been employed in the
NLP field from its foundations to the latest state-of-the-art approaches. Focusing on deep learning methods, we propose applying multilingual Transformer models based on siamese networks,
also called bi-encoders, combined with ensemble and PCA dimensionality reduction techniques.
The goal is to counter COVID-19 misinformation by analyzing the semantic similarity between
a claim and tweets from a collection gathered from official fact-checkers verified by the International Fact-Checking Network of the Poynter Institute.
The number of Internet users grows every year, and the language a person
speaks determines their access to information online. For this reason, we put
special effort into applying multilingual models to tackle misinformation
across the globe. Regarding semantic similarity, we first evaluate these
multilingual ensemble models and improve on the STS Benchmark results of
monolingual and single models. Secondly, we enhance the interpretability of
the models' performance through the SentEval toolkit. Lastly, we compare
these models' performance against biomedical models in TREC-COVID task round
1, using the BM25
Okapi ranking method as the baseline. Moreover, we are interested in understanding the ins
and outs of misinformation. For that purpose, we extend interpretability using machine learning
and deep learning approaches for sentiment analysis and topic modelling. Finally, we develop
a dashboard to ease visualization of the results.
In our view, the results obtained in this project constitute an excellent
initial step toward incorporating multilingualism, and they will assist
researchers and the public in countering COVID-19 misinformation.
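The claim-versus-tweet matching step reduces to ranking tweets by embedding cosine similarity. Below, a toy hashed bag-of-words embedding stands in for the multilingual bi-encoder (the real system uses Sentence-BERT-style sentence embeddings); only the ranking logic carries over, and the claim and tweets are invented examples:

```python
import hashlib
import numpy as np

def embed(text, dim=1024):
    """Toy hashed bag-of-words vector standing in for a sentence embedding."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def rank_tweets(claim, tweets):
    """Return (score, tweet) pairs sorted by cosine similarity to the claim."""
    c = embed(claim)
    scored = [(float(embed(t) @ c), t) for t in tweets]
    return sorted(scored, reverse=True)

claim = "garlic cures covid-19"
tweets = [
    "who says eating garlic does not cure covid-19",
    "stock markets fell again today",
    "vaccination campaign expands to new regions",
]
ranked = rank_tweets(claim, tweets)
```

With a genuine bi-encoder, the top-ranked fact-checked tweet would then serve as the counter-evidence surfaced against the claim.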
Probing Pre-Trained Language Models for Disease Knowledge
Pre-trained language models such as ClinicalBERT have achieved impressive
results on tasks such as medical Natural Language Inference. At first glance,
this may suggest that these models are able to perform medical reasoning tasks,
such as mapping symptoms to diseases. However, we find that standard benchmarks
such as MedNLI contain relatively few examples that require such forms of
reasoning. To better understand the medical reasoning capabilities of existing
language models, in this paper we introduce DisKnE, a new benchmark for Disease
Knowledge Evaluation. To construct this benchmark, we annotated each positive
MedNLI example with the types of medical reasoning that are needed. We then
created negative examples by corrupting these positive examples in an
adversarial way. Furthermore, we define training-test splits per disease,
ensuring that no knowledge about test diseases can be learned from the training
data, and we canonicalize the formulation of the hypotheses to avoid the
presence of artefacts. This leads to a number of binary classification
problems, one for each type of reasoning and each disease. When analysing
pre-trained models for the clinical/biomedical domain on the proposed
benchmark, we find that their performance drops considerably.
Comment: Accepted by ACL 2021 Findings
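The per-disease split idea, that no training example may mention the held-out test disease, can be sketched as follows; the data structures and field names are invented for illustration, not the benchmark's actual format:

```python
def disease_splits(examples, diseases):
    """For each held-out disease, keep only training examples that do not
    mention it, so no knowledge of the test disease leaks from training."""
    splits = {}
    for d in diseases:
        train = [e for e in examples if d not in e["diseases"]]
        test = [e for e in examples if d in e["diseases"]]
        splits[d] = (train, test)
    return splits

# Toy examples with the diseases each one mentions.
examples = [
    {"text": "fever and cough", "diseases": {"influenza"}},
    {"text": "chest pain on exertion", "diseases": {"angina"}},
    {"text": "fever with chest pain", "diseases": {"influenza", "angina"}},
]
splits = disease_splits(examples, ["influenza", "angina"])
```

Note that an example mentioning several diseases is excluded from the training side of every split it appears in, which is what forces the model to rely on genuine disease knowledge rather than memorized associations.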
- âŠ