Learning Blind Motion Deblurring
As handheld video cameras are now commonplace and available in every
smartphone, images and videos can be recorded almost everywhere at any time.
However, taking a quick shot frequently yields a blurry result due to unwanted
camera shake during recording or moving objects in the scene. Removing these
artifacts from the blurry recordings is a highly ill-posed problem as neither
the sharp image nor the motion blur kernel is known. Propagating information
between multiple consecutive blurry observations can help restore the desired
sharp image or video. Solutions for blind deconvolution based on neural
networks rely on a massive amount of ground-truth data, which is hard to
acquire. In this work, we propose an efficient approach to produce a
significant amount of realistic training data and introduce a novel recurrent
network architecture that deblurs frames by taking temporal information into
account and can efficiently handle arbitrary spatial and temporal input sizes. We
demonstrate the versatility of our approach in a comprehensive comparison on a
number of challenging real-world examples.
Comment: International Conference on Computer Vision (ICCV) (2017)
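As a rough illustration of the kind of fully convolutional recurrent design the abstract alludes to, the following minimal sketch unrolls a single convolutional cell over a burst of blurry frames; all layer sizes and names are illustrative assumptions, not the authors' actual architecture.

# Minimal sketch of a fully convolutional recurrent deblurring step.
# Layer sizes and names are illustrative assumptions, not the authors'
# actual architecture.
import torch
import torch.nn as nn

class RecurrentDeblurCell(nn.Module):
    """Deblurs one frame while carrying a hidden state across time.

    Being fully convolutional, the cell accepts arbitrary spatial sizes;
    unrolling it over any number of frames handles arbitrary temporal sizes.
    """

    def __init__(self, channels: int = 32):
        super().__init__()
        self.channels = channels
        self.encode = nn.Conv2d(3 + channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)
        self.decode = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, blurry, state=None):
        if state is None:
            b, _, h, w = blurry.shape
            state = blurry.new_zeros(b, self.channels, h, w)
        x = torch.relu(self.encode(torch.cat([blurry, state], dim=1)))
        state = torch.relu(self.fuse(x))
        # Predict a residual so the cell only has to learn the correction.
        return blurry + self.decode(state), state

# Unroll over consecutive blurry frames (any burst length, any resolution).
cell = RecurrentDeblurCell()
frames = torch.rand(1, 5, 3, 120, 160)   # batch, time, channels, H, W
state = None
for t in range(frames.shape[1]):
    sharp, state = cell(frames[:, t], state)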
Efficient Large-scale Approximate Nearest Neighbor Search on the GPU
We present a new approach for efficient approximate nearest neighbor (ANN)
search in high dimensional spaces, extending the idea of Product Quantization.
We propose a two-level product and vector quantization tree that reduces the
number of vector comparisons required during tree traversal. Our approach also
includes a novel highly parallelizable re-ranking method for candidate vectors
by efficiently reusing already computed intermediate values. Due to its small
memory footprint during traversal, the method lends itself to an efficient,
parallel GPU implementation. This Product Quantization Tree (PQT) approach
significantly outperforms recent state-of-the-art methods for high-dimensional
nearest neighbor queries on standard reference datasets. Ours is the first work
to demonstrate GPU performance superior to CPU performance on high-dimensional,
large-scale ANN problems in time-critical real-world applications, such as
loop-closing in videos.
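The product-quantization machinery that PQT extends can be pictured with a short sketch: vectors are split into sub-vectors, each part is quantized against its own small codebook, and candidates are scored through per-query lookup tables. The codebooks below are random stand-ins for trained ones, and the paper's two-level tree and GPU re-ranking are not reproduced.

# Sketch of product-quantization encoding and asymmetric distance
# computation (ADC). Codebooks are random stand-ins for trained ones.
import numpy as np

d, m, k = 128, 8, 256            # dimension, sub-quantizers, centroids each
sub = d // m
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, sub))      # one codebook per sub-space

def encode(x):
    """Assign each sub-vector to its nearest centroid (one byte each)."""
    parts = x.reshape(m, sub)
    return np.array([np.argmin(((codebooks[j] - parts[j]) ** 2).sum(1))
                     for j in range(m)], dtype=np.uint8)

def adc_table(query):
    """Per-query table: distance from each query part to every centroid."""
    parts = query.reshape(m, 1, sub)
    return ((codebooks - parts) ** 2).sum(-1)  # shape (m, k)

database = rng.normal(size=(10_000, d))
codes = np.stack([encode(x) for x in database])   # (N, m) compact codes

query = rng.normal(size=d)
table = adc_table(query)
# Approximate distance = sum of m table lookups per candidate; no full
# d-dimensional comparison is needed, which maps well onto GPU threads.
approx = table[np.arange(m), codes].sum(axis=1)
print("best candidate:", approx.argmin())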
At-Most-Hexa Meshes
Volumetric polyhedral meshes are required in many applications, especially for solving partial differential equations in finite element simulations. Still, their construction bears several additional challenges compared to boundary-based representations. Tetrahedral meshes and (pure) hex-meshes are two popular formats in scenarios like CAD applications, offering opposite advantages and disadvantages. Hex-meshes are more intricate to construct due to the global structure of the meshing, but feature much better regularity and alignment, are more expressive, and offer the same simulation accuracy with fewer elements. Hex-dominant meshes, where most but not all cell elements have a hexahedral structure, constitute an attractive compromise, potentially unlocking benefits from both structures, but their generality makes their employment in downstream applications difficult. In this work, we introduce a strict subset of general hex-dominant meshes, which we term 'at-most-hexa meshes', in which most cells are still hexahedral, but no cell has more than six boundary faces and no face has more than four sides. We exemplify the ease of construction of at-most-hexa meshes by proposing a frugal and straightforward method to generate high-quality meshes of this kind, starting directly from hulls or point clouds, for example from a 3D scan. In contrast to existing methods for (pure) hexahedral meshing, ours requires neither an intermediate parameterization nor other costly pre-computations, and can start directly from surfaces or samples. We leverage a Lloyd relaxation process to exploit the synergistic effects of aligning an orientation field in a modified 3D Voronoi diagram using the L∞ norm for cubical cells. The extracted geometry incorporates regularity as well as feature alignment, following sharp edges and curved boundary surfaces. We introduce specialized operations on the three-dimensional graph structure to enforce consistency during the relaxation. The resulting algorithm allows for efficient evaluation with parallel algorithms on GPU hardware and completes even large reconstructions within minutes
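A toy version of the Lloyd-style relaxation under the L∞ (Chebyshev) norm, whose Voronoi cells tend toward axis-aligned cubes, might look as follows; it omits the orientation field, feature alignment, and graph-consistency operations described above.

# Toy Lloyd relaxation in 3D under the Chebyshev (L-infinity) distance.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(size=(20_000, 3))   # stand-in for a dense point cloud
sites = rng.uniform(size=(64, 3))         # generator sites of the diagram

for _ in range(20):
    # Chebyshev distance from every sample to every site.
    dist = np.abs(samples[:, None, :] - sites[None, :, :]).max(axis=2)
    owner = dist.argmin(axis=1)
    # Lloyd step: move each site to the centroid of the samples it owns;
    # under L-infinity the resulting cells approach axis-aligned cubes.
    for i in range(len(sites)):
        mine = samples[owner == i]
        if len(mine):
            sites[i] = mine.mean(axis=0)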
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Providing explanations in the context of Visual Question Answering (VQA)
presents a fundamental problem in machine learning. To obtain detailed insights
into the process of generating natural language explanations for VQA, we
introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with
natural language explanations. For each image-question pair in the CLEVR
dataset, CLEVR-X contains multiple structured textual explanations which are
derived from the original scene graphs. By construction, the CLEVR-X
explanations are correct and describe the reasoning and visual information that
is necessary to answer a given question. We conducted a user study to confirm
that the ground-truth explanations in our proposed dataset are indeed complete
and relevant. We present baseline results for generating natural language
explanations in the context of VQA using two state-of-the-art frameworks on the
CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation
generation quality for different question and answer types. Additionally, we
study the influence of using different numbers of ground-truth explanations on
the convergence of natural language generation (NLG) metrics. The CLEVR-X
dataset is publicly available at
https://explainableml.github.io/CLEVR-X/
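One analysis mentioned above, studying how NLG metrics behave as the number of ground-truth explanations grows, can be sketched with standard tooling; the sentences below are invented placeholders, not actual CLEVR-X annotations.

# Score one generated explanation against a growing set of references.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "the large rubber cube is red".split(),
    "the big cube made of rubber is red".split(),
    "there is a large red cube made of rubber".split(),
]
hypothesis = "the big rubber cube is red".split()

# More references typically raise n-gram recall, so the score tends to
# improve and stabilize as additional ground-truth explanations are used.
for n in range(1, len(references) + 1):
    score = sentence_bleu(references[:n], hypothesis)
    print(f"{n} reference(s): BLEU = {score:.3f}")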
GGNN: Graph-based GPU Nearest Neighbor Search
Approximate nearest neighbor (ANN) search in high dimensions is an integral
part of several computer vision systems and gains importance in deep learning
with explicit memory representations. Since PQT and FAISS started to leverage
the massive parallelism offered by GPUs, GPU-based implementations are a
crucial resource for today's state-of-the-art ANN methods. While most of these
methods allow for faster queries, less emphasis has been devoted to accelerating
the construction of the underlying index structures. In this paper, we propose a
novel search structure based on nearest neighbor graphs and information
propagation on graphs. Our method is designed to take advantage of GPU
architectures to accelerate the hierarchical building of the index structure
and for performing the query. Empirical evaluation shows that GGNN
significantly surpasses the state-of-the-art GPU- and CPU-based systems in
terms of build time, accuracy, and search speed.
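The greedy traversal at the heart of graph-based ANN search can be sketched in a few lines; the brute-force graph construction below only keeps the toy self-contained, and accelerating exactly this build step on the GPU is where the paper's contribution lies.

# Greedy nearest-neighbor descent on a kNN graph (toy, CPU-only).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 32))
k = 8
# Brute-force kNN graph; column 0 of the argsort is the point itself.
d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(-1)
graph = np.argsort(d2, axis=1)[:, 1:k + 1]

def search(query, start=0):
    """Hop to the closest neighbor until no neighbor improves."""
    dist = lambda i: float(((data[i] - query) ** 2).sum())
    node, best = start, dist(start)
    improved = True
    while improved:
        improved = False
        for nb in graph[node]:
            d = dist(nb)
            if d < best:
                node, best = nb, d
                improved = True
    return node

query = rng.normal(size=32)
print("greedy:", search(query),
      "exact:", ((data - query) ** 2).sum(axis=1).argmin())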
Language with Vision: a Study on Grounded Word and Sentence Embeddings
Grounding language in vision is an active field of research seeking to
construct cognitively plausible word and sentence representations by
incorporating perceptual knowledge from vision into text-based representations.
Despite many attempts at language grounding, achieving an optimal balance
between textual representations of language and our embodied experiences
remains an open problem. Common questions include the following: Is visual
grounding advantageous for abstract words, or is its effectiveness restricted
to concrete words? What is the optimal way of bridging the gap between text and
vision? To what extent is perceptual knowledge from images advantageous for
acquiring high-quality embeddings? Leveraging the current advances in machine
learning and natural language processing, the present study addresses these
questions by proposing a simple yet very effective computational grounding
model for pre-trained word embeddings. Our model effectively balances the
interplay between language and vision by aligning textual embeddings with
visual information while simultaneously preserving the distributional
statistics that characterize word usage in text corpora. By applying a learned
alignment, we are able to indirectly ground unseen words including abstract
words. A series of evaluations on a range of behavioural datasets shows that
visual grounding is beneficial not only for concrete words but also for
abstract words, lending support to the indirect theory of abstract concepts.
Moreover, our approach offers advantages for contextualized embeddings, such as
those generated by BERT, but only when trained on corpora of modest,
cognitively plausible sizes. Code and grounded embeddings for English are
available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2
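The indirect-grounding step, learning an alignment on words that have images and applying it to words that do not, can be sketched with a simple ridge regression standing in for the paper's alignment model; all embeddings below are random placeholders.

# Learn a linear map from the text space to the visual space on seen
# words, then project unseen (e.g. abstract) words through it.
import numpy as np

rng = np.random.default_rng(0)
n_seen, d_text, d_img, lam = 5_000, 300, 512, 1.0
T = rng.normal(size=(n_seen, d_text))   # text embeddings of seen words
V = rng.normal(size=(n_seen, d_img))    # matching visual embeddings

# Closed-form ridge solution: W = (T^T T + lam * I)^{-1} T^T V
W = np.linalg.solve(T.T @ T + lam * np.eye(d_text), T.T @ V)

def ground(text_vec):
    """Project any word, seen or unseen, into the visual space."""
    return text_vec @ W

unseen = rng.normal(size=d_text)        # e.g. an abstract word's embedding
print("grounded vector shape:", ground(unseen).shape)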
How direct is the link between words and images?
Current word embedding models, despite their success, still suffer from a
lack of grounding in the real world. In this line of research, Gunther et al.
(2022) proposed a behavioral experiment to investigate the relationship between
words and images. In their setup, participants were presented with a target
noun and a pair of images, one chosen by their model and another chosen
randomly. Participants were asked to select the image that best matched the
target noun. In most cases, participants preferred the image selected by the
model. Gunther et al. therefore concluded that a direct link between words and
embodied experience is possible. We took their experiment as a point of
departure and addressed the following questions. 1. Apart from utilizing
visually embodied simulation of given images, what other strategies might
subjects have used to solve this task? To what extent does this setup rely on
visual information from images? Can it be solved using purely textual
representations? 2. Do current visually grounded embeddings explain subjects'
selection behavior better than textual embeddings? 3. Does visual grounding
improve the semantic representations of both concrete and abstract words? To
address these questions, we designed novel experiments by using pre-trained
textual and visually grounded word embeddings. Our experiments reveal that
subjects' selection behavior is explained to a large extent by purely
text-based embeddings and word-based similarities, suggesting only minor
involvement of active embodied experiences. Visually grounded embeddings
offered modest advantages over textual embeddings only in certain cases. These
findings indicate that the experiment by Gunther et al. may not be well suited
for tapping into the perceptual experience of participants, and therefore the
extent to which it measures visually grounded knowledge is unclear.
Comment: Accepted in the Mental Lexicon Journal:
https://benjamins.com/catalog/m
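Whether the forced-choice task can be solved from text alone reduces, in the simplest reading, to a similarity comparison in embedding space; the sketch below uses random placeholder vectors in place of real word and image embeddings.

# Replay the two-alternative forced choice with embeddings alone.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

word = rng.normal(size=300)         # embedding of the target noun
img_model = rng.normal(size=300)    # image chosen by the model
img_random = rng.normal(size=300)   # randomly paired image

# If text-derived similarities alone reproduce participants' choices,
# the task needs little active embodied simulation.
choice = ("model" if cosine(word, img_model) > cosine(word, img_random)
          else "random")
print("embedding-based choice:", choice)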
S.T.A.R.-Track: Latent Motion Models for End-to-End 3D Object Tracking with Adaptive Spatio-Temporal Appearance Representations
Following the tracking-by-attention paradigm, this paper introduces an
object-centric, transformer-based framework for tracking in 3D. Traditional
model-based tracking approaches incorporate the geometric effect of object- and
ego motion between frames with a geometric motion model. Inspired by this, we
propose S.T.A.R.-Track, which uses a novel latent motion model (LMM) to
additionally adjust object queries to account for changes in viewing direction
and lighting conditions directly in the latent space, while still modeling the
geometric motion explicitly. Combined with a novel learnable track embedding
that aids in modeling the existence probability of tracks, this results in a
generic tracking framework that can be integrated with any query-based
detector. Extensive experiments on the nuScenes benchmark demonstrate the
benefits of our approach, showing state-of-the-art performance for DETR3D-based
trackers while drastically reducing the number of identity switches of tracks
at the same time.
Comment: © 2023 IEEE. Personal use of this material is permitted. Permission
from IEEE must be obtained for all other uses.
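A minimal sketch of the latent-motion-model idea, a small network that updates each object query conditioned on the ego-motion between frames, is given below; the dimensions and the ego-motion encoding are illustrative assumptions, not the paper's specification.

# Residual update of track queries conditioned on ego-motion.
import torch
import torch.nn as nn

class LatentMotionModel(nn.Module):
    def __init__(self, query_dim: int = 256, ego_dim: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(query_dim + ego_dim, query_dim),
            nn.ReLU(),
            nn.Linear(query_dim, query_dim),
        )

    def forward(self, queries, ego_motion):
        # queries: (num_tracks, query_dim); ego_motion: (ego_dim,)
        ego = ego_motion.unsqueeze(0).expand(queries.shape[0], -1)
        # Residual update keeps each query close to its previous state
        # while compensating for viewpoint and lighting changes in latent
        # space.
        return queries + self.net(torch.cat([queries, ego], dim=1))

lmm = LatentMotionModel()
queries = torch.rand(10, 256)   # one query per active track
ego = torch.rand(6)             # e.g. translation + rotation parameters
next_queries = lmm(queries, ego)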
Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification
Whole slide image (WSI) assessment is a challenging and crucial step in
cancer diagnosis and treatment planning. WSIs require high magnifications to
facilitate sub-cellular analysis. Precise annotations for patch- or even
pixel-level classifications in the context of gigapixel WSIs are tedious to
acquire and require domain experts. Coarse-grained labels, on the other hand,
are easily accessible, which makes WSI classification an ideal use case for
multiple instance learning (MIL). In our work, we propose a novel
embedding-based Dual-Query MIL pipeline (DQ-MIL). We contribute to both the
embedding and aggregation steps. Since all-purpose visual feature
representations are not yet available, embedding models are currently limited
in terms of generalizability. With our work, we explore the potential of
dynamic meta-embedding based on cutting-edge self-supervised pre-trained models
in the context of MIL. Moreover, we propose a new MIL architecture capable of
combining MIL-attention with correlated self-attention. The Dual-Query
Perceiver design of our approach allows us to leverage the concept of
self-distillation and to combine the advantages of a small model in the context
of a low data regime with the rich feature representation of a larger model. We
demonstrate the superior performance of our approach on three histopathological
datasets, where we show improvements of up to 10% over state-of-the-art
approaches.
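The attention-based MIL pooling such pipelines build on can be sketched as follows; the dimensions are illustrative, and the dual-query and self-distillation design of DQ-MIL is not reproduced here.

# Attention-based MIL pooling: a bag of patch embeddings from one WSI is
# reduced to a single slide-level feature for classification.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 384, hidden: int = 128, classes: int = 2):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.head = nn.Linear(feat_dim, classes)

    def forward(self, bag):
        # bag: (num_patches, feat_dim) -- all patch embeddings of one slide.
        weights = torch.softmax(self.attn(bag), dim=0)   # (num_patches, 1)
        slide_feature = (weights * bag).sum(dim=0)       # (feat_dim,)
        return self.head(slide_feature), weights

model = AttentionMIL()
patches = torch.rand(500, 384)   # e.g. self-supervised patch features
logits, attention = model(patches)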
- …