mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Modular vision-language models (Vision-LLMs) align pretrained image encoders
with (pretrained) large language models (LLMs), representing a computationally
much more efficient alternative to end-to-end training of large vision-language
models from scratch, which is prohibitively expensive for most. Vision-LLMs
instead condition LLMs post hoc to "understand" the output of an image encoder.
With the abundance of readily available high-quality English image-text data as
well as monolingual English LLMs, the research focus has been on English-only
Vision-LLMs. Multilingual vision-language models are still predominantly
obtained via expensive end-to-end pretraining, resulting in comparatively
smaller models, trained on limited multilingual image data supplemented with
text-only multilingual corpora. In this work, we present mBLIP, the first
multilingual Vision-LLM, which we obtain in a computationally efficient manner
-- on consumer hardware using only a few million training examples -- by
leveraging a pretrained multilingual LLM. To this end, we re-align an
image encoder previously tuned to an English LLM to a new, multilingual LLM --
for this, we leverage multilingual data from a mix of vision-and-language
tasks, which we obtain by machine-translating high-quality English data to 95
languages. On the IGLUE benchmark, mBLIP yields results competitive with
state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP
(zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to
these very large multilingual vision-language models trained from scratch, we
obtain mBLIP by training orders of magnitude fewer parameters on orders of
magnitude less data. We release our model and code at
https://github.com/gregor-ge/mBLIP.
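As a rough illustration of the re-alignment idea described above, the following PyTorch sketch trains only a small projection that maps frozen image-encoder features into the embedding space of a multilingual LLM. All module names, dimensions, and the stand-in encoders are illustrative assumptions, not the released mBLIP code.

```python
# Conceptual sketch (not the authors' code): re-align a frozen image encoder
# to a new multilingual LLM by training only a small projection module.
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Maps frozen image-encoder features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)

# Dimensions (1408 visual, 4096 LLM) are illustrative assumptions.
projection = VisualProjection(vision_dim=1408, llm_dim=4096)

# One "re-alignment" step: project visual features and prepend them to the
# embedded (machine-translated) text before it would be fed to the frozen LLM.
image_feats = torch.randn(2, 32, 1408)   # stand-in for frozen visual features
text_embeds = torch.randn(2, 20, 4096)   # embedded multilingual prompt/target
llm_inputs = torch.cat([projection(image_feats), text_embeds], dim=1)
print(llm_inputs.shape)                  # torch.Size([2, 52, 4096])
```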
TWEAC: Transformer with Extendable QA Agent Classifiers
Question answering systems should help users to access knowledge on a broad
range of topics and to answer a wide array of different questions. Most systems
fall short of this expectation as they are only specialized in one particular
setting, e.g., answering factual questions with Wikipedia data. To overcome
this limitation, we propose composing multiple QA agents within a meta-QA
system. We argue that a wide range of specialized QA agents already exists in
the literature. Thus, we address the central research question of how to
effectively and efficiently identify suitable QA agents for any given question.
We study both supervised and unsupervised approaches to address this challenge,
showing that TWEAC - Transformer with Extendable Agent Classifiers - achieves
the best performance overall with 94% accuracy. We provide extensive insights
into the scalability of TWEAC, demonstrating that it scales robustly to over 100
QA agents, each providing just 1,000 examples of questions they can answer.
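The PyTorch sketch below illustrates one plausible reading of "extendable agent classifiers": a shared question encoder with one small head per QA agent, so a new agent can be registered simply by attaching and training a new head. The modules and dimensions are placeholders, not the authors' implementation.

```python
# Illustrative sketch (not the released TWEAC code): one classifier head per
# QA agent on top of a shared encoder; routing picks the highest-scoring agent.
import torch
import torch.nn as nn

class AgentRouter(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Stand-in for a pretrained Transformer question encoder (assumption).
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        self.agent_heads = nn.ModuleDict()   # one lightweight head per agent

    def add_agent(self, name: str, hidden_dim: int = 768):
        self.agent_heads[name] = nn.Linear(hidden_dim, 1)

    def forward(self, question_repr: torch.Tensor) -> dict:
        h = self.encoder(question_repr)
        return {name: head(h).squeeze(-1) for name, head in self.agent_heads.items()}

router = AgentRouter()
router.add_agent("wikipedia_factoid")
router.add_agent("community_qa")

question = torch.randn(1, 768)               # encoded question (stand-in)
scores = router(question)
best_agent = max(scores, key=lambda k: scores[k].item())
print(best_agent)
```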
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs, of different architectures, trained on different data and objectives, are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e., whether providing the model with features from multiple VEs can improve performance on a target task, and how such features are best combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, and that the improvements are not due to simple ensemble effects (i.e., performance does not always improve when the number of encoders is increased). We demonstrate that future VEs, which are not repurposed but explicitly designed for V+L tasks, have the potential to improve performance on the target V+L tasks.
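A minimal sketch of the feature-combination idea analyzed above, assuming simple per-encoder projections, concatenation, and a "VE-dropout" that randomly zeroes out an entire encoder during training; the encoders and dimensions are placeholders, not the paper's exact setup.

```python
# Hedged sketch: fuse features from several vision encoders and randomly drop
# whole encoders during training (VE-dropout). All dimensions are illustrative.
import torch
import torch.nn as nn

class MultiVEFusion(nn.Module):
    def __init__(self, ve_dims, out_dim: int = 768, ve_drop_prob: float = 0.1):
        super().__init__()
        self.projections = nn.ModuleList(nn.Linear(d, out_dim) for d in ve_dims)
        self.ve_drop_prob = ve_drop_prob

    def forward(self, ve_features):
        fused = []
        for proj, feats in zip(self.projections, ve_features):
            x = proj(feats)
            if self.training and torch.rand(()) < self.ve_drop_prob:
                x = torch.zeros_like(x)       # drop this encoder entirely
            fused.append(x)
        return torch.cat(fused, dim=1)        # concatenate along the region axis

fusion = MultiVEFusion(ve_dims=[2048, 768, 512])
features = [torch.randn(2, 36, 2048), torch.randn(2, 197, 768), torch.randn(2, 49, 512)]
print(fusion(features).shape)                 # torch.Size([2, 282, 768])
```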
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Current state-of-the-art approaches to cross-modal retrieval process text and
visual input jointly, relying on Transformer-based architectures with
cross-attention mechanisms that attend over all words and objects in an image.
While offering unmatched retrieval performance, such models: 1) are typically
pretrained from scratch and thus less scalable, and 2) suffer from huge retrieval
latency and inefficiency issues, which make them impractical in realistic
applications. To address these crucial gaps towards both improved and efficient
cross-modal retrieval, we propose a novel fine-tuning framework that turns any
pretrained text-image multi-modal model into an efficient retrieval model. The
framework is based on a cooperative retrieve-and-rerank approach which
combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items
of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder
component for a more nuanced (i.e., smarter) ranking of the retrieved small set
of items. We also propose to jointly fine-tune the two components with shared
weights, yielding a more parameter-efficient model. Our experiments on a series
of standard cross-modal retrieval benchmarks in monolingual, multilingual, and
zero-shot setups, demonstrate improved accuracy and huge efficiency benefits
over the state-of-the-art cross-encoders.
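A minimal sketch of the cooperative retrieve-and-rerank scheme: a bi-encoder scores the full corpus cheaply, and a cross-encoder rescores only the top-k candidates. The encoders here are trivial stand-ins; the actual framework starts from a pretrained text-image model and can share weights between the two components.

```python
# Minimal retrieve-and-rerank sketch (not the authors' code). Encoders are
# placeholder linear layers standing in for the bi- and cross-encoder models.
import torch
import torch.nn as nn
import torch.nn.functional as F

bi_text_encoder = nn.Linear(768, 256)    # stand-in: embeds a text query
bi_image_encoder = nn.Linear(768, 256)   # stand-in: embeds each image
cross_encoder = nn.Linear(768 * 2, 1)    # stand-in: joint query-image scorer

def retrieve_and_rerank(query, corpus, k: int = 5):
    # Stage 1: efficient bi-encoder retrieval over the whole corpus.
    q = F.normalize(bi_text_encoder(query), dim=-1)      # (256,)
    c = F.normalize(bi_image_encoder(corpus), dim=-1)    # (N, 256)
    top_scores, top_idx = (c @ q).topk(k)
    # Stage 2: expensive cross-encoder rescoring of the k candidates only.
    pairs = torch.cat([query.expand(k, -1), corpus[top_idx]], dim=-1)
    rerank_scores = cross_encoder(pairs).squeeze(-1)
    return top_idx[rerank_scores.argsort(descending=True)]

query = torch.randn(768)
corpus = torch.randn(1000, 768)           # raw features for 1,000 images
print(retrieve_and_rerank(query, corpus)[:3])
```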
xGQA: Cross-Lingual Visual Question Answering
Recent advances in multimodal vision and language modeling have predominantly
focused on the English language, mostly due to the lack of multilingual
multimodal datasets to steer modeling efforts. In this work, we address this
gap and provide xGQA, a new multilingual evaluation benchmark for the visual
question answering task. We extend the established English GQA dataset to 7
typologically diverse languages, enabling us to detect and explore crucial
challenges in cross-lingual visual question answering. We further propose new
adapter-based approaches to adapt multimodal transformer-based models to become
multilingual, and -- vice versa -- multilingual models to become multimodal.
Our proposed methods outperform current state-of-the-art multilingual
multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the
accuracy remains low across the board; a performance drop of around 38 accuracy
points in target languages showcases the difficulty of zero-shot cross-lingual
transfer for this task. Our results suggest that simple cross-lingual transfer
of multimodal models yields latent multilingual multimodal misalignment,
calling for more sophisticated methods for vision and multilingual language
modeling.
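As a rough illustration of the adapter-based adaptation mentioned above, the sketch below shows a standard bottleneck adapter that could be inserted into a frozen Transformer layer; it reflects the general adapter recipe, not the exact xGQA modules.

```python
# Illustrative bottleneck adapter (an assumption about the general recipe, not
# the exact xGQA implementation): a small residual module trained while the
# backbone stays frozen, to add a new language or a new modality cheaply.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: only 'down'/'up' are trained.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

adapter = Adapter()
hidden = torch.randn(2, 16, 768)          # hidden states from a frozen layer
print(adapter(hidden).shape)              # torch.Size([2, 16, 768])
```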
UKP-SQUARE: An Online Platform for Question Answering Research
Recent advances in NLP and information retrieval have given rise to a diverse
set of question answering tasks that are of different formats (e.g.,
extractive, abstractive), require different model architectures (e.g.,
generative, discriminative), and setups (e.g., with or without retrieval).
Although a large number of powerful, specialized QA pipelines (which we refer
to as Skills) exist, each considering a single domain, model, or setup, there is
no framework in which users can easily explore, compare, and extend such
pipelines according to their needs. To address this issue, we present
UKP-SQUARE, an extensible online QA platform for researchers which allows users
to query and analyze a large collection of modern Skills via a user-friendly
web interface and integrated behavioural tests. In addition, QA researchers can
develop, manage, and share their custom Skills using our microservices that
support a wide range of models (Transformers, Adapters, ONNX), datastores, and
retrieval techniques (e.g., sparse and dense). UKP-SQUARE is available at
https://square.ukp-lab.de.