7 research outputs found
Sample4Geo: Hard Negative Sampling For Cross-View Geo-Localisation
Cross-View Geo-Localisation is still a challenging task where additional
modules, specific pre-processing or zooming strategies are necessary to
determine accurate positions of images. Since different views have different
geometries, pre-processing like polar transformation helps to merge them.
However, this results in distorted images which then have to be rectified.
Adding hard negatives to the training batch could improve the overall
performance but with the default loss functions in geo-localisation it is
difficult to include them. In this article, we present a simplified but
effective architecture based on contrastive learning with symmetric InfoNCE
loss that outperforms current state-of-the-art results. Our framework consists
of a narrow training pipeline that eliminates the need of using aggregation
modules, avoids further pre-processing steps and even increases the
generalisation capability of the model to unknown regions. We introduce two
types of sampling strategies for hard negatives. The first explicitly exploits
geographically neighboring locations to provide a good starting point. The
second leverages the visual similarity between the image embeddings in order to
mine hard negative samples. Our work shows excellent performance on common
cross-view datasets like CVUSA, CVACT, University-1652 and VIGOR. A comparison
between cross-area and same-area settings demonstrate the good generalisation
capability of our model
Orientation-Guided Contrastive Learning for UAV-View Geo-Localisation
Retrieving relevant multimedia content is one of the main problems in a world
that is increasingly data-driven. With the proliferation of drones, high
quality aerial footage is now available to a wide audience for the first time.
Integrating this footage into applications can enable GPS-less geo-localisation
or location correction.
In this paper, we present an orientation-guided training framework for
UAV-view geo-localisation. Through hierarchical localisation orientations of
the UAV images are estimated in relation to the satellite imagery. We propose a
lightweight prediction module for these pseudo labels which predicts the
orientation between the different views based on the contrastive learned
embeddings. We experimentally demonstrate that this prediction supports the
training and outperforms previous approaches. The extracted pseudo-labels also
enable aligned rotation of the satellite image as augmentation to further
strengthen the generalisation. During inference, we no longer need this
orientation module, which means that no additional computations are required.
We achieve state-of-the-art results on both the University-1652 and
University-160k datasets
NeRFtrinsic Four: An End-To-End Trainable NeRF Jointly Optimizing Diverse Intrinsic and Extrinsic Camera Parameters
Novel view synthesis using neural radiance fields (NeRF) is the
state-of-the-art technique for generating high-quality images from novel
viewpoints. Existing methods require a priori knowledge about extrinsic and
intrinsic camera parameters. This limits their applicability to synthetic
scenes, or real-world scenarios with the necessity of a preprocessing step.
Current research on the joint optimization of camera parameters and NeRF
focuses on refining noisy extrinsic camera parameters and often relies on the
preprocessing of intrinsic camera parameters. Further approaches are limited to
cover only one single camera intrinsic. To address these limitations, we
propose a novel end-to-end trainable approach called NeRFtrinsic Four. We
utilize Gaussian Fourier features to estimate extrinsic camera parameters and
dynamically predict varying intrinsic camera parameters through the supervision
of the projection error. Our approach outperforms existing joint optimization
methods on LLFF and BLEFF. In addition to these existing datasets, we introduce
a new dataset called iFF with varying intrinsic camera parameters. NeRFtrinsic
Four is a step forward in joint optimization NeRF-based view synthesis and
enables more realistic and flexible rendering in real-world scenarios with
varying camera parameters
Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model
Current architectures for multi-modality tasks such as visual question
answering suffer from their high complexity. As a result, these architectures
are difficult to train and require high computational resources. To address
these problems we present a CLIP-based architecture that does not require any
fine-tuning of the feature extractors. A simple linear classifier is used on
the concatenated features of the image and text encoder. During training an
auxiliary loss is added which operates on the answer types. The resulting
classification is then used as an attention gate on the answer class selection.
On the VizWiz 2022 Visual Question Answering Challenge we achieve 60.15 %
accuracy on Task 1: Predict Answer to a Visual Question and AP score of 83.78 %
on Task 2: Predict Answerability of a Visual Question.Comment: VizWiz Grand Challenge: Describing Images and Videos Taken by Blind
People (CVPR Workshop 2022
SoccerNet 2023 challenges results
SoccerNet 2023 Challenges ResultsThe SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards are available on https://www.soccer-net.org. Baselines and development kits can be found on https://github.com/SoccerNet
Stratospheric ozone: An introduction to its study
info:eu-repo/semantics/publishe