D2S: Representing local descriptors and global scene coordinates for camera relocalization
State-of-the-art visual localization methods mostly rely on complex
procedures to match local descriptors and 3D point clouds. However, these
procedures can incur significant cost in terms of inference, storage, and
updates over time. In this study, we propose a direct learning-based approach
that utilizes a simple network named D2S to represent local descriptors and
their scene coordinates. Our method is characterized by its simplicity and
cost-effectiveness. It solely leverages a single RGB image for localization
during the testing phase and only requires a lightweight model to encode a
complex sparse scene. The proposed D2S employs a combination of a simple loss
function and graph attention to selectively focus on robust descriptors while
disregarding areas such as clouds, trees, and several dynamic objects. This
selective attention enables D2S to effectively perform a binary-semantic
classification for sparse descriptors. Additionally, we propose a new outdoor
dataset to evaluate the capabilities of visual localization methods in terms of
scene generalization and self-updating from unlabeled observations. Our
approach outperforms the state-of-the-art CNN-based methods in scene coordinate
regression in indoor and outdoor environments. It demonstrates the ability to
generalize beyond training data, including scenarios involving transitions from
day to night and adapting to domain shifts, even in the absence of the labeled
data sources. The source code, trained models, dataset, and demo videos are
available at the following link: https://thpjp.github.io/d2
Descriptor transition tables for object retrieval using unconstrained cluttered video acquired using a consumer level handheld mobile device
Visual recognition and vision-based retrieval of objects from large databases are tasks with a wide spectrum of potential applications. In this paper we propose a novel method for recognition from video sequences, suitable for retrieval from databases acquired in highly unconstrained conditions, e.g. using a mobile consumer-level device such as a phone. On the lowest level, we represent each sequence as a 3D mesh of densely packed local appearance descriptors. While image plane geometry is captured implicitly by a large overlap of neighbouring regions from which the descriptors are extracted, 3D information is extracted by means of a descriptor transition table, learnt from a single sequence for each known gallery object. These tables allow us to connect local descriptors along the 3rd dimension (which corresponds to viewpoint changes), resulting in a set of variable-length Markov chains for each video. The matching of two sets of such chains is formulated as a statistical hypothesis test, whereby a subset of each is chosen to maximize the likelihood that the corresponding video sequences show the same object. The effectiveness of the proposed algorithm is empirically evaluated on the Amsterdam Library of Object Images and a new, highly challenging video data set acquired using a mobile phone. On both data sets our method is shown to be successful in recognition in the presence of background clutter and large viewpoint changes.
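The idea of scoring a chain of descriptors against a learnt transition table can be illustrated with a much simpler model than the one the abstract describes. The sketch below assumes that descriptors have already been quantized into integer labels and uses a plain first-order Markov table with additive smoothing; the function names, the quantization step, and the smoothing constant are illustrative assumptions, not the authors' formulation, which uses variable-length chains and a hypothesis test.

```python
import math
from collections import defaultdict


def learn_transition_table(sequence, n_symbols, smoothing=1.0):
    """Estimate P(next label | current label) from one label sequence.

    Assumption: each descriptor has been quantized to a label in
    range(n_symbols). Additive smoothing keeps every probability non-zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1.0

    table = {}
    for a in range(n_symbols):
        total = sum(counts[a].values()) + smoothing * n_symbols
        table[a] = [(counts[a][b] + smoothing) / total for b in range(n_symbols)]
    return table


def chain_log_likelihood(chain, table):
    """Log-likelihood of the transitions in a chain under the learnt table."""
    return sum(math.log(table[a][b]) for a, b in zip(chain, chain[1:]))


# One gallery sequence that cycles through labels 0 -> 1 -> 2.
gallery_table = learn_transition_table([0, 1, 2, 0, 1, 2, 0, 1, 2], n_symbols=3)

# A chain that follows the learnt order scores higher than one that reverses it.
consistent = chain_log_likelihood([0, 1, 2], gallery_table)
reversed_order = chain_log_likelihood([2, 1, 0], gallery_table)
```

Comparing such log-likelihoods across gallery objects gives a crude ranking; a likelihood-ratio test between "same object" and "different object" models would be closer in spirit to the hypothesis test mentioned above.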
Teleoperated visual inspection and surveillance with unmanned ground and aerial vehicles
This paper introduces our robotic system named UGAV (Unmanned Ground-Air Vehicle), consisting of two semi-autonomous robot platforms: an Unmanned Ground Vehicle (UGV) and an Unmanned Aerial Vehicle (UAV). The paper focuses on three aspects of inspection with the combined UGV and UAV: (A) teleoperated control by means of cell or smart phones, with a new concept of automatic configuration of the smart phone based on an RKI-XML description of the vehicles' control capabilities; (B) the camera and vision system, with a focus on real-time feature extraction, e.g. for tracking the UAV; and (C) the architecture and hardware of the UAV.
Circulant temporal encoding for video retrieval and temporal alignment
We address the problem of specific video event retrieval. Given a query video
of a specific event, e.g., a concert of Madonna, the goal is to retrieve other
videos of the same event that temporally overlap with the query. Our approach
encodes the frame descriptors of a video to jointly represent their appearance
and temporal order. It exploits the properties of circulant matrices to
efficiently compare the videos in the frequency domain. This offers a
significant gain in complexity and accurately localizes the matching parts of
videos. The descriptors can be compressed in the frequency domain with a
product quantizer adapted to complex numbers. In this case, video retrieval is
performed without decompressing the descriptors. We also consider the temporal
alignment of a set of videos. We exploit the matching confidence and an
estimate of the temporal offset computed for all pairs of videos by our
retrieval approach. Our robust algorithm aligns the videos on a global timeline
by maximizing the set of temporally consistent matches. The global temporal
alignment enables synchronous playback of the videos of a given scene.
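The frequency-domain comparison can be illustrated with a plain circular cross-correlation computed through the FFT. This is a minimal sketch under stated assumptions: both videos are represented as equal-length arrays of real-valued frame descriptors, and the score for an offset is the unregularized sum of dot products between aligned frames. The function names are hypothetical, and the sketch leaves out the product quantization and any refinements of the published method.

```python
import numpy as np


def offset_scores(query, base):
    """Score every circular temporal offset between two descriptor sequences.

    query, base: real arrays of shape (n_frames, dim) with the same n_frames
    (assumption: shorter videos were zero-padded beforehand).
    Returns s with s[d] = sum_t query[t] . base[(t + d) % n_frames].
    """
    fq = np.fft.fft(query, axis=0)
    fb = np.fft.fft(base, axis=0)
    # Correlation theorem: multiply conj(F(query)) by F(base) per frequency,
    # sum over descriptor dimensions, then transform back to the time domain.
    return np.fft.ifft((np.conj(fq) * fb).sum(axis=1)).real


def best_offset(query, base):
    """Return the highest-scoring offset and its score."""
    scores = offset_scores(query, base)
    delta = int(np.argmax(scores))
    return delta, float(scores[delta])


# Synthetic check: the query is the base video shifted by 5 frames.
rng = np.random.default_rng(0)
base_video = rng.standard_normal((64, 8))
query_video = np.roll(base_video, -5, axis=0)  # query[t] == base[t + 5]
delta, score = best_offset(query_video, base_video)
```

Computing all offsets this way costs O(n log n) per descriptor dimension instead of the O(n^2) of a direct comparison, which is the kind of complexity gain the abstract refers to.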