Learning Blind Motion Deblurring
As handheld video cameras are now commonplace and available in every
smartphone, images and videos can be recorded almost everywhere at any time.
However, taking a quick shot frequently yields a blurry result due to unwanted
camera shake during recording or moving objects in the scene. Removing these
artifacts from the blurry recordings is a highly ill-posed problem as neither
the sharp image nor the motion blur kernel is known. Propagating information
between multiple consecutive blurry observations can help restore the desired
sharp image or video. Solutions for blind deconvolution based on neural
networks rely on a massive amount of ground-truth data, which is hard to
acquire. In this work, we propose an efficient approach to produce a
significant amount of realistic training data and introduce a novel recurrent
network architecture that deblurs frames by taking temporal information into
account and can efficiently handle arbitrary spatial and temporal input sizes. We
demonstrate the versatility of our approach in a comprehensive comparison on a
number of challenging real-world examples.
Comment: International Conference on Computer Vision (ICCV) (2017)
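As a rough illustration of the kind of fully convolutional recurrent design the abstract alludes to, the following minimal sketch unrolls a single convolutional cell over a burst of blurry frames; all layer sizes and names are illustrative assumptions, not the authors' actual architecture.

# Minimal sketch of a fully convolutional recurrent deblurring step.
# Layer sizes and names are illustrative assumptions, not the authors'
# actual architecture.
import torch
import torch.nn as nn

class RecurrentDeblurCell(nn.Module):
    """Deblurs one frame while carrying a hidden state across time.

    Being fully convolutional, the cell accepts arbitrary spatial sizes;
    unrolling it over any number of frames handles arbitrary temporal sizes.
    """

    def __init__(self, channels: int = 32):
        super().__init__()
        self.channels = channels
        self.encode = nn.Conv2d(3 + channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)
        self.decode = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, blurry, state=None):
        if state is None:
            b, _, h, w = blurry.shape
            state = blurry.new_zeros(b, self.channels, h, w)
        x = torch.relu(self.encode(torch.cat([blurry, state], dim=1)))
        state = torch.relu(self.fuse(x))
        # Predict a residual so the cell only has to learn the correction.
        return blurry + self.decode(state), state

# Unroll over consecutive blurry frames (any burst length, any resolution).
cell = RecurrentDeblurCell()
frames = torch.rand(1, 5, 3, 120, 160)   # batch, time, channels, H, W
state = None
for t in range(frames.shape[1]):
    sharp, state = cell(frames[:, t], state)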
Efficient Large-scale Approximate Nearest Neighbor Search on the GPU
We present a new approach for efficient approximate nearest neighbor (ANN)
search in high dimensional spaces, extending the idea of Product Quantization.
We propose a two-level product and vector quantization tree that reduces the
number of vector comparisons required during tree traversal. Our approach also
includes a novel highly parallelizable re-ranking method for candidate vectors
by efficiently reusing already computed intermediate values. Due to its small
memory footprint during traversal, the method lends itself to an efficient,
parallel GPU implementation. This Product Quantization Tree (PQT) approach
significantly outperforms recent state-of-the-art methods for high-dimensional
nearest neighbor queries on standard reference datasets. Ours is the first work
to demonstrate GPU performance superior to CPU performance on high-dimensional,
large-scale ANN problems in time-critical real-world applications, such as
loop-closing in videos.
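The product-quantization machinery that PQT extends can be pictured with a short sketch: vectors are split into sub-vectors, each part is quantized against its own small codebook, and candidates are scored through per-query lookup tables. The codebooks below are random stand-ins for trained ones, and the paper's two-level tree and GPU re-ranking are not reproduced.

# Sketch of product-quantization encoding and asymmetric distance
# computation (ADC). Codebooks are random stand-ins for trained ones.
import numpy as np

d, m, k = 128, 8, 256            # dimension, sub-quantizers, centroids each
sub = d // m
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(m, k, sub))      # one codebook per sub-space

def encode(x):
    """Assign each sub-vector to its nearest centroid (one byte each)."""
    parts = x.reshape(m, sub)
    return np.array([np.argmin(((codebooks[j] - parts[j]) ** 2).sum(1))
                     for j in range(m)], dtype=np.uint8)

def adc_table(query):
    """Per-query table: distance from each query part to every centroid."""
    parts = query.reshape(m, 1, sub)
    return ((codebooks - parts) ** 2).sum(-1)  # shape (m, k)

database = rng.normal(size=(10_000, d))
codes = np.stack([encode(x) for x in database])   # (N, m) compact codes

query = rng.normal(size=d)
table = adc_table(query)
# Approximate distance = sum of m table lookups per candidate; no full
# d-dimensional comparison is needed, which maps well onto GPU threads.
approx = table[np.arange(m), codes].sum(axis=1)
print("best candidate:", approx.argmin())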
At-Most-Hexa Meshes
Volumetric polyhedral meshes are required in many applications, especially for solving partial differential equations in finite element simulations. Still, their construction bears several additional challenges compared to boundary-based representations. Tetrahedral meshes and (pure) hex-meshes are two popular formats in scenarios like CAD applications, offering opposite advantages and disadvantages. Hex-meshes are more intricate to construct due to the global structure of the meshing, but feature much better regularity and alignment, are more expressive, and offer the same simulation accuracy with fewer elements. Hex-dominant meshes, where most but not all cell elements have a hexahedral structure, constitute an attractive compromise, potentially unlocking benefits from both structures, but their generality makes their employment in downstream applications difficult. In this work, we introduce a strict subset of general hex-dominant meshes, which we term 'at-most-hexa meshes', in which most cells are still hexahedral, but no cell has more than six boundary faces and no face has more than four sides. We exemplify the ease of construction of at-most-hexa meshes by proposing a frugal and straightforward method to generate high-quality meshes of this kind, starting directly from hulls or point clouds, for example from a 3D scan. In contrast to existing methods for (pure) hexahedral meshing, ours requires neither an intermediate parameterization nor other costly pre-computations, and can start directly from surfaces or samples. We leverage a Lloyd relaxation process to exploit the synergistic effects of aligning an orientation field in a modified 3D Voronoi diagram using the L∞ norm for cubical cells. The extracted geometry incorporates regularity as well as feature alignment, following sharp edges and curved boundary surfaces. We introduce specialized operations on the three-dimensional graph structure to enforce consistency during the relaxation. The resulting algorithm allows for efficient evaluation with parallel algorithms on GPU hardware and completes even large reconstructions within minutes
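A toy version of the Lloyd-style relaxation under the L∞ (Chebyshev) norm, whose Voronoi cells tend toward axis-aligned cubes, might look as follows; it omits the orientation field, feature alignment, and graph-consistency operations described above.

# Toy Lloyd relaxation in 3D under the Chebyshev (L-infinity) distance.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(size=(20_000, 3))   # stand-in for a dense point cloud
sites = rng.uniform(size=(64, 3))         # generator sites of the diagram

for _ in range(20):
    # Chebyshev distance from every sample to every site.
    dist = np.abs(samples[:, None, :] - sites[None, :, :]).max(axis=2)
    owner = dist.argmin(axis=1)
    # Lloyd step: move each site to the centroid of the samples it owns;
    # under L-infinity the resulting cells approach axis-aligned cubes.
    for i in range(len(sites)):
        mine = samples[owner == i]
        if len(mine):
            sites[i] = mine.mean(axis=0)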
CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Providing explanations in the context of Visual Question Answering (VQA)
presents a fundamental problem in machine learning. To obtain detailed insights
into the process of generating natural language explanations for VQA, we
introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with
natural language explanations. For each image-question pair in the CLEVR
dataset, CLEVR-X contains multiple structured textual explanations which are
derived from the original scene graphs. By construction, the CLEVR-X
explanations are correct and describe the reasoning and visual information that
is necessary to answer a given question. We conducted a user study to confirm
that the ground-truth explanations in our proposed dataset are indeed complete
and relevant. We present baseline results for generating natural language
explanations in the context of VQA using two state-of-the-art frameworks on the
CLEVR-X dataset. Furthermore, we provide a detailed analysis of the explanation
generation quality for different question and answer types. Additionally, we
study the influence of using different numbers of ground-truth explanations on
the convergence of natural language generation (NLG) metrics. The CLEVR-X
dataset is publicly available at
https://explainableml.github.io/CLEVR-X/
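One analysis mentioned above, studying how NLG metrics behave as the number of ground-truth explanations grows, can be sketched with standard tooling; the sentences below are invented placeholders, not actual CLEVR-X annotations.

# Score one generated explanation against a growing set of references.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "the large rubber cube is red".split(),
    "the big cube made of rubber is red".split(),
    "there is a large red cube made of rubber".split(),
]
hypothesis = "the big rubber cube is red".split()

# More references typically raise n-gram recall, so the score tends to
# improve and stabilize as additional ground-truth explanations are used.
for n in range(1, len(references) + 1):
    score = sentence_bleu(references[:n], hypothesis)
    print(f"{n} reference(s): BLEU = {score:.3f}")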
GGNN: Graph-based GPU Nearest Neighbor Search
Approximate nearest neighbor (ANN) search in high dimensions is an integral
part of several computer vision systems and gains importance in deep learning
with explicit memory representations. Since PQT and FAISS started to leverage
the massive parallelism offered by GPUs, GPU-based implementations are a
crucial resource for today's state-of-the-art ANN methods. While most of these
methods allow for faster queries, less emphasis has been devoted to accelerating
the construction of the underlying index structures. In this paper, we propose a
novel search structure based on nearest neighbor graphs and information
propagation on graphs. Our method is designed to take advantage of GPU
architectures to accelerate the hierarchical building of the index structure
and for performing the query. Empirical evaluation shows that GGNN
significantly surpasses the state-of-the-art GPU- and CPU-based systems in
terms of build time, accuracy, and search speed.
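The greedy traversal at the heart of graph-based ANN search can be sketched in a few lines; the brute-force graph construction below only keeps the toy self-contained, and accelerating exactly this build step on the GPU is where the paper's contribution lies.

# Greedy nearest-neighbor descent on a kNN graph (toy, CPU-only).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 32))
k = 8
# Brute-force kNN graph; column 0 of the argsort is the point itself.
d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(-1)
graph = np.argsort(d2, axis=1)[:, 1:k + 1]

def search(query, start=0):
    """Hop to the closest neighbor until no neighbor improves."""
    dist = lambda i: float(((data[i] - query) ** 2).sum())
    node, best = start, dist(start)
    improved = True
    while improved:
        improved = False
        for nb in graph[node]:
            d = dist(nb)
            if d < best:
                node, best = nb, d
                improved = True
    return node

query = rng.normal(size=32)
print("greedy:", search(query),
      "exact:", ((data - query) ** 2).sum(axis=1).argmin())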
Language with Vision: a Study on Grounded Word and Sentence Embeddings
Grounding language in vision is an active field of research seeking to
construct cognitively plausible word and sentence representations by
incorporating perceptual knowledge from vision into text-based representations.
Despite many attempts at language grounding, achieving an optimal balance
between textual representations of language and our embodied experiences
remains an open problem. Common questions include the following: Is visual
grounding advantageous for abstract words, or is its effectiveness restricted
to concrete words? What is the optimal way of bridging the gap between text and
vision? To what extent is perceptual knowledge from images advantageous for
acquiring high-quality embeddings? Leveraging the current advances in machine
learning and natural language processing, the present study addresses these
questions by proposing a simple yet very effective computational grounding
model for pre-trained word embeddings. Our model effectively balances the
interplay between language and vision by aligning textual embeddings with
visual information while simultaneously preserving the distributional
statistics that characterize word usage in text corpora. By applying a learned
alignment, we are able to indirectly ground unseen words including abstract
words. A series of evaluations on a range of behavioural datasets shows that
visual grounding is beneficial not only for concrete words but also for
abstract words, lending support to the indirect theory of abstract concepts.
Moreover, our approach offers advantages for contextualized embeddings, such as
those generated by BERT, but only when trained on corpora of modest,
cognitively plausible sizes. Code and grounded embeddings for English are
available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2
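The indirect-grounding step, learning an alignment on words that have images and applying it to words that do not, can be sketched with a simple ridge regression standing in for the paper's alignment model; all embeddings below are random placeholders.

# Learn a linear map from the text space to the visual space on seen
# words, then project unseen (e.g. abstract) words through it.
import numpy as np

rng = np.random.default_rng(0)
n_seen, d_text, d_img, lam = 5_000, 300, 512, 1.0
T = rng.normal(size=(n_seen, d_text))   # text embeddings of seen words
V = rng.normal(size=(n_seen, d_img))    # matching visual embeddings

# Closed-form ridge solution: W = (T^T T + lam * I)^{-1} T^T V
W = np.linalg.solve(T.T @ T + lam * np.eye(d_text), T.T @ V)

def ground(text_vec):
    """Project any word, seen or unseen, into the visual space."""
    return text_vec @ W

unseen = rng.normal(size=d_text)        # e.g. an abstract word's embedding
print("grounded vector shape:", ground(unseen).shape)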
How direct is the link between words and images?
Current word embedding models, despite their success, still suffer from a
lack of grounding in the real world. In this line of research, Gunther et al.
(2022) proposed a behavioral experiment to investigate the relationship between
words and images. In their setup, participants were presented with a target
noun and a pair of images, one chosen by their model and another chosen
randomly. Participants were asked to select the image that best matched the
target noun. In most cases, participants preferred the image selected by the
model. Gunther et al. therefore concluded that a direct link between words and
embodied experience is possible. We took their experiment as a point of
departure and addressed the following questions. 1. Apart from utilizing
visually embodied simulation of given images, what other strategies might
subjects have used to solve this task? To what extent does this setup rely on
visual information from images? Can it be solved using purely textual
representations? 2. Do current visually grounded embeddings explain subjects'
selection behavior better than textual embeddings? 3. Does visual grounding
improve the semantic representations of both concrete and abstract words? To
address these questions, we designed novel experiments by using pre-trained
textual and visually grounded word embeddings. Our experiments reveal that
subjects' selection behavior is explained to a large extent by purely
text-based embeddings and word-based similarities, suggesting only minor
involvement of active embodied experiences. Visually grounded embeddings
offered modest advantages over textual embeddings only in certain cases. These
findings indicate that the experiment by Gunther et al. may not be well suited
for tapping into the perceptual experience of participants, and therefore the
extent to which it measures visually grounded knowledge is unclear.
Comment: Accepted in the Mental Lexicon Journal:
https://benjamins.com/catalog/m
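Whether the forced-choice task can be solved from text alone reduces, in the simplest reading, to a similarity comparison in embedding space; the sketch below uses random placeholder vectors in place of real word and image embeddings.

# Replay the two-alternative forced choice with embeddings alone.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

word = rng.normal(size=300)         # embedding of the target noun
img_model = rng.normal(size=300)    # image chosen by the model
img_random = rng.normal(size=300)   # randomly paired image

# If text-derived similarities alone reproduce participants' choices,
# the task needs little active embodied simulation.
choice = ("model" if cosine(word, img_model) > cosine(word, img_random)
          else "random")
print("embedding-based choice:", choice)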
S.T.A.R.-Track: Latent Motion Models for End-to-End 3D Object Tracking with Adaptive Spatio-Temporal Appearance Representations
Following the tracking-by-attention paradigm, this paper introduces an
object-centric, transformer-based framework for tracking in 3D. Traditional
model-based tracking approaches incorporate the geometric effect of object- and
ego motion between frames with a geometric motion model. Inspired by this, we
propose S.T.A.R.-Track, which uses a novel latent motion model (LMM) to
additionally adjust object queries to account for changes in viewing direction
and lighting conditions directly in the latent space, while still modeling the
geometric motion explicitly. Combined with a novel learnable track embedding
that aids in modeling the existence probability of tracks, this results in a
generic tracking framework that can be integrated with any query-based
detector. Extensive experiments on the nuScenes benchmark demonstrate the
benefits of our approach, showing state-of-the-art performance for DETR3D-based
trackers while drastically reducing the number of identity switches of tracks
at the same time.
Comment: © 2023 IEEE. Personal use of this material is permitted. Permission
from IEEE must be obtained for all other uses.
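A minimal sketch of the latent-motion-model idea, a small network that updates each object query conditioned on the ego-motion between frames, is given below; the dimensions and the ego-motion encoding are illustrative assumptions, not the paper's specification.

# Residual update of track queries conditioned on ego-motion.
import torch
import torch.nn as nn

class LatentMotionModel(nn.Module):
    def __init__(self, query_dim: int = 256, ego_dim: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(query_dim + ego_dim, query_dim),
            nn.ReLU(),
            nn.Linear(query_dim, query_dim),
        )

    def forward(self, queries, ego_motion):
        # queries: (num_tracks, query_dim); ego_motion: (ego_dim,)
        ego = ego_motion.unsqueeze(0).expand(queries.shape[0], -1)
        # Residual update keeps each query close to its previous state
        # while compensating for viewpoint and lighting changes in latent
        # space.
        return queries + self.net(torch.cat([queries, ego], dim=1))

lmm = LatentMotionModel()
queries = torch.rand(10, 256)   # one query per active track
ego = torch.rand(6)             # e.g. translation + rotation parameters
next_queries = lmm(queries, ego)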
Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification
Whole slide image (WSI) assessment is a challenging and crucial step in
cancer diagnosis and treatment planning. WSIs require high magnifications to
facilitate sub-cellular analysis. Precise annotations for patch- or even
pixel-level classifications in the context of gigapixel WSIs are tedious to
acquire and require domain experts. Coarse-grained labels, on the other hand,
are easily accessible, which makes WSI classification an ideal use case for
multiple instance learning (MIL). In our work, we propose a novel
embedding-based Dual-Query MIL pipeline (DQ-MIL). We contribute to both the
embedding and aggregation steps. Since all-purpose visual feature
representations are not yet available, embedding models are currently limited
in terms of generalizability. With our work, we explore the potential of
dynamic meta-embedding based on cutting-edge self-supervised pre-trained models
in the context of MIL. Moreover, we propose a new MIL architecture capable of
combining MIL-attention with correlated self-attention. The Dual-Query
Perceiver design of our approach allows us to leverage the concept of
self-distillation and to combine the advantages of a small model in the context
of a low data regime with the rich feature representation of a larger model. We
demonstrate the superior performance of our approach on three histopathological
datasets, where we show improvements of up to 10% over state-of-the-art
approaches.
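The attention-based MIL pooling such pipelines build on can be sketched as follows; the dimensions are illustrative, and the dual-query and self-distillation design of DQ-MIL is not reproduced here.

# Attention-based MIL pooling: a bag of patch embeddings from one WSI is
# reduced to a single slide-level feature for classification.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 384, hidden: int = 128, classes: int = 2):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.head = nn.Linear(feat_dim, classes)

    def forward(self, bag):
        # bag: (num_patches, feat_dim) -- all patch embeddings of one slide.
        weights = torch.softmax(self.attn(bag), dim=0)   # (num_patches, 1)
        slide_feature = (weights * bag).sum(dim=0)       # (feat_dim,)
        return self.head(slide_feature), weights

model = AttentionMIL()
patches = torch.rand(500, 384)   # e.g. self-supervised patch features
logits, attention = model(patches)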
- …