mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Modular vision-language models (Vision-LLMs) align pretrained image encoders
with (pretrained) large language models (LLMs), representing a computationally
much more efficient alternative to end-to-end training of large vision-language
models from scratch, which is prohibitively expensive for most. Vision-LLMs
instead condition LLMs post hoc to "understand" the output of an image encoder.
With the abundance of readily available high-quality English image-text data as
well as monolingual English LLMs, the research focus has been on English-only
Vision-LLMs. Multilingual vision-language models are still predominantly
obtained via expensive end-to-end pretraining, resulting in comparatively
smaller models, trained on limited multilingual image data supplemented with
text-only multilingual corpora. In this work, we present mBLIP, the first
multilingual Vision-LLM, which we obtain in a computationally efficient manner
-- on consumer hardware using only a few million training examples -- by
leveraging a pretrained multilingual LLM. To this end, we re-align an
image encoder previously tuned to an English LLM to a new, multilingual LLM --
for this, we leverage multilingual data from a mix of vision-and-language
tasks, which we obtain by machine-translating high-quality English data to 95
languages. On the IGLUE benchmark, mBLIP yields results competitive with
state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP
(zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to
these very large multilingual vision-language models trained from scratch, we
obtain mBLIP by training orders of magnitude fewer parameters on orders of
magnitude less data. We release our model and code at
https://github.com/gregor-ge/mBLIP.
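As a rough illustration of the re-alignment idea described above, the following PyTorch sketch trains only a small projection that maps frozen image-encoder features into the embedding space of a multilingual LLM. All module names, dimensions, and the stand-in encoders are illustrative assumptions, not the released mBLIP code.

```python
# Conceptual sketch (not the authors' code): re-align a frozen image encoder
# to a new multilingual LLM by training only a small projection module.
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Maps frozen image-encoder features into the LLM's embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)

# Dimensions (1408 visual, 4096 LLM) are illustrative assumptions.
projection = VisualProjection(vision_dim=1408, llm_dim=4096)

# One "re-alignment" step: project visual features and prepend them to the
# embedded (machine-translated) text before it would be fed to the frozen LLM.
image_feats = torch.randn(2, 32, 1408)   # stand-in for frozen visual features
text_embeds = torch.randn(2, 20, 4096)   # embedded multilingual prompt/target
llm_inputs = torch.cat([projection(image_feats), text_embeds], dim=1)
print(llm_inputs.shape)                  # torch.Size([2, 52, 4096])
```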
TWEAC: Transformer with Extendable QA Agent Classifiers
Question answering systems should help users to access knowledge on a broad
range of topics and to answer a wide array of different questions. Most systems
fall short of this expectation as they are only specialized in one particular
setting, e.g., answering factual questions with Wikipedia data. To overcome
this limitation, we propose composing multiple QA agents within a meta-QA
system. We argue that a wide range of specialized QA agents already exists in
the literature. Thus, we address the central research question of how to
effectively and efficiently identify suitable QA agents for any given question.
We study both supervised and unsupervised approaches to address this challenge,
showing that TWEAC - Transformer with Extendable Agent Classifiers - achieves
the best performance overall with 94% accuracy. We provide extensive insights
into the scalability of TWEAC, demonstrating that it scales robustly to over 100
QA agents, each providing just 1,000 examples of questions they can answer.
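The PyTorch sketch below illustrates one plausible reading of "extendable agent classifiers": a shared question encoder with one small head per QA agent, so a new agent can be registered simply by attaching and training a new head. The modules and dimensions are placeholders, not the authors' implementation.

```python
# Illustrative sketch (not the released TWEAC code): one classifier head per
# QA agent on top of a shared encoder; routing picks the highest-scoring agent.
import torch
import torch.nn as nn

class AgentRouter(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Stand-in for a pretrained Transformer question encoder (assumption).
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        self.agent_heads = nn.ModuleDict()   # one lightweight head per agent

    def add_agent(self, name: str, hidden_dim: int = 768):
        self.agent_heads[name] = nn.Linear(hidden_dim, 1)

    def forward(self, question_repr: torch.Tensor) -> dict:
        h = self.encoder(question_repr)
        return {name: head(h).squeeze(-1) for name, head in self.agent_heads.items()}

router = AgentRouter()
router.add_agent("wikipedia_factoid")
router.add_agent("community_qa")

question = torch.randn(1, 768)               # encoded question (stand-in)
scores = router(question)
best_agent = max(scores, key=lambda k: scores[k].item())
print(best_agent)
```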
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs, of different architectures, trained on different data and objectives, are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e., whether providing the model with features from multiple VEs can improve performance on a target task, and how such features are best combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, and that the improvements are not due to simple ensemble effects (i.e., performance does not always improve when the number of encoders is increased). We demonstrate that future VEs, which are not repurposed but explicitly designed for V+L tasks, have the potential to improve performance on the target V+L tasks.
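A minimal sketch of the feature-combination idea analyzed above, assuming simple per-encoder projections, concatenation, and a "VE-dropout" that randomly zeroes out an entire encoder during training; the encoders and dimensions are placeholders, not the paper's exact setup.

```python
# Hedged sketch: fuse features from several vision encoders and randomly drop
# whole encoders during training (VE-dropout). All dimensions are illustrative.
import torch
import torch.nn as nn

class MultiVEFusion(nn.Module):
    def __init__(self, ve_dims, out_dim: int = 768, ve_drop_prob: float = 0.1):
        super().__init__()
        self.projections = nn.ModuleList(nn.Linear(d, out_dim) for d in ve_dims)
        self.ve_drop_prob = ve_drop_prob

    def forward(self, ve_features):
        fused = []
        for proj, feats in zip(self.projections, ve_features):
            x = proj(feats)
            if self.training and torch.rand(()) < self.ve_drop_prob:
                x = torch.zeros_like(x)       # drop this encoder entirely
            fused.append(x)
        return torch.cat(fused, dim=1)        # concatenate along the region axis

fusion = MultiVEFusion(ve_dims=[2048, 768, 512])
features = [torch.randn(2, 36, 2048), torch.randn(2, 197, 768), torch.randn(2, 49, 512)]
print(fusion(features).shape)                 # torch.Size([2, 282, 768])
```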
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Current state-of-the-art approaches to cross-modal retrieval process text and
visual input jointly, relying on Transformer-based architectures with
cross-attention mechanisms that attend over all words and objects in an image.
While offering unmatched retrieval performance, such models: 1) are typically
pretrained from scratch and thus less scalable, and 2) suffer from huge retrieval
latency and inefficiency issues, which make them impractical in realistic
applications. To address these crucial gaps towards both improved and efficient
cross-modal retrieval, we propose a novel fine-tuning framework that turns any
pretrained text-image multi-modal model into an efficient retrieval model. The
framework is based on a cooperative retrieve-and-rerank approach which
combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items
of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder
component for a more nuanced (i.e., smarter) ranking of the retrieved small set
of items. We also propose to jointly fine-tune the two components with shared
weights, yielding a more parameter-efficient model. Our experiments on a series
of standard cross-modal retrieval benchmarks in monolingual, multilingual, and
zero-shot setups, demonstrate improved accuracy and huge efficiency benefits
over the state-of-the-art cross-encoders.
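A minimal sketch of the cooperative retrieve-and-rerank scheme: a bi-encoder scores the full corpus cheaply, and a cross-encoder rescores only the top-k candidates. The encoders here are trivial stand-ins; the actual framework starts from a pretrained text-image model and can share weights between the two components.

```python
# Minimal retrieve-and-rerank sketch (not the authors' code). Encoders are
# placeholder linear layers standing in for the bi- and cross-encoder models.
import torch
import torch.nn as nn
import torch.nn.functional as F

bi_text_encoder = nn.Linear(768, 256)    # stand-in: embeds a text query
bi_image_encoder = nn.Linear(768, 256)   # stand-in: embeds each image
cross_encoder = nn.Linear(768 * 2, 1)    # stand-in: joint query-image scorer

def retrieve_and_rerank(query, corpus, k: int = 5):
    # Stage 1: efficient bi-encoder retrieval over the whole corpus.
    q = F.normalize(bi_text_encoder(query), dim=-1)      # (256,)
    c = F.normalize(bi_image_encoder(corpus), dim=-1)    # (N, 256)
    top_scores, top_idx = (c @ q).topk(k)
    # Stage 2: expensive cross-encoder rescoring of the k candidates only.
    pairs = torch.cat([query.expand(k, -1), corpus[top_idx]], dim=-1)
    rerank_scores = cross_encoder(pairs).squeeze(-1)
    return top_idx[rerank_scores.argsort(descending=True)]

query = torch.randn(768)
corpus = torch.randn(1000, 768)           # raw features for 1,000 images
print(retrieve_and_rerank(query, corpus)[:3])
```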
xGQA: Cross-Lingual Visual Question Answering
Recent advances in multimodal vision and language modeling have predominantly
focused on the English language, mostly due to the lack of multilingual
multimodal datasets to steer modeling efforts. In this work, we address this
gap and provide xGQA, a new multilingual evaluation benchmark for the visual
question answering task. We extend the established English GQA dataset to 7
typologically diverse languages, enabling us to detect and explore crucial
challenges in cross-lingual visual question answering. We further propose new
adapter-based approaches to adapt multimodal transformer-based models to become
multilingual, and -- vice versa -- multilingual models to become multimodal.
Our proposed methods outperform current state-of-the-art multilingual
multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the
accuracy remains low across the board; a performance drop of around 38 accuracy
points in target languages showcases the difficulty of zero-shot cross-lingual
transfer for this task. Our results suggest that simple cross-lingual transfer
of multimodal models yields latent multilingual multimodal misalignment,
calling for more sophisticated methods for vision and multilingual language
modeling.
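As a rough illustration of the adapter-based adaptation mentioned above, the sketch below shows a standard bottleneck adapter that could be inserted into a frozen Transformer layer; it reflects the general adapter recipe, not the exact xGQA modules.

```python
# Illustrative bottleneck adapter (an assumption about the general recipe, not
# the exact xGQA implementation): a small residual module trained while the
# backbone stays frozen, to add a new language or a new modality cheaply.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: only 'down'/'up' are trained.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

adapter = Adapter()
hidden = torch.randn(2, 16, 768)          # hidden states from a frozen layer
print(adapter(hidden).shape)              # torch.Size([2, 16, 768])
```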
UKP-SQUARE: An Online Platform for Question Answering Research
Recent advances in NLP and information retrieval have given rise to a diverse
set of question answering tasks that are of different formats (e.g.,
extractive, abstractive), require different model architectures (e.g.,
generative, discriminative), and setups (e.g., with or without retrieval).
Although a large number of powerful, specialized QA pipelines (which we refer
to as Skills) exist, each considering a single domain, model, or setup, there is
no framework in which users can easily explore, compare, and extend such
pipelines according to their needs. To address this issue, we present
UKP-SQUARE, an extensible online QA platform for researchers which allows users
to query and analyze a large collection of modern Skills via a user-friendly
web interface and integrated behavioural tests. In addition, QA researchers can
develop, manage, and share their custom Skills using our microservices that
support a wide range of models (Transformers, Adapters, ONNX), datastores, and
retrieval techniques (e.g., sparse and dense). UKP-SQUARE is available at
https://square.ukp-lab.de.