mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Modular vision-language models (Vision-LLMs) align pretrained image encoders
with (pretrained) large language models (LLMs), representing a computationally
much more efficient alternative to end-to-end training of large vision-language
models from scratch, which is prohibitively expensive for most. Vision-LLMs
instead post-hoc condition LLMs to "understand" the output of an image encoder.
With the abundance of readily available high-quality English image-text data as
well as monolingual English LLMs, the research focus has been on English-only
Vision-LLMs. Multilingual vision-language models are still predominantly
obtained via expensive end-to-end pretraining, resulting in comparatively
smaller models, trained on limited multilingual image data supplemented with
text-only multilingual corpora. In this work, we present mBLIP, the first
multilingual Vision-LLM, which we obtain in a computationally efficient manner
-- on consumer hardware using only a few million training examples -- by
leveraging a pretrained multilingual LLM. To this end, we re-align an
image encoder previously tuned to an English LLM to a new, multilingual LLM --
for this, we leverage multilingual data from a mix of vision-and-language
tasks, which we obtain by machine-translating high-quality English data to 95
languages. On the IGLUE benchmark, mBLIP yields results competitive with
state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP
(zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to
these very large multilingual vision-language models trained from scratch, we
obtain mBLIP by training orders of magnitude fewer parameters on orders of
magnitude less data. We release our model and code at
https://github.com/gregor-ge/mBLIP.
VISCOUNTH: A Large-Scale Multilingual Visual Question Answering Dataset for Cultural Heritage
Visual question answering has recently become established as a fundamental multi-modal reasoning task in artificial intelligence, one that allows users to get information about visual content by asking questions in natural language. In the cultural heritage domain, this task can help assist visitors in museums and at cultural sites, thus increasing engagement. However, the development of visual question answering models for cultural heritage is hindered by the lack of suitable large-scale datasets. To meet this demand, we built a large-scale, heterogeneous, and multilingual (Italian and English) dataset for cultural heritage that comprises approximately 500K Italian cultural assets and 6.5M question-answer pairs. We propose a novel formulation of the task that requires reasoning over both the visual content and an associated natural language description, and present baselines for this task. Results show that the current state of the art is reasonably effective but still far from satisfactory; further research in this area is therefore recommended. Nonetheless, we also present a holistic baseline to address visual and contextual questions and foster future research on the topic.
Information extraction pipelines for knowledge graphs
In the last decade, a large number of knowledge graph (KG) completion approaches have been proposed. Albeit effective, these efforts are disjoint, and their collective strengths and weaknesses in effective KG completion have not been studied in the literature. We extend Plumber, a framework that brings together the research community’s disjoint efforts on KG completion. We add more components to the architecture of Plumber, which now comprises 40 reusable components for various KG completion subtasks, such as coreference resolution, entity linking, and relation extraction. Using these components, Plumber dynamically generates suitable knowledge extraction pipelines and offers 432 distinct pipelines overall. We study the optimization problem of choosing optimal pipelines based on input sentences. To do so, we train a transformer-based classification model that extracts contextual embeddings from the input and finds an appropriate pipeline. We study the efficacy of Plumber for extracting KG triples using standard datasets over three KGs: DBpedia, Wikidata, and Open Research Knowledge Graph. Our results demonstrate the effectiveness of Plumber in dynamically generating KG completion pipelines, outperforming all baselines regardless of the underlying KG. Furthermore, we provide an analysis of collective failure cases, study the similarities and synergies among integrated components, and discuss their limitations.