MeshLoc: Mesh-Based Visual Localization
Visual localization, i.e., the problem of camera pose estimation, is a
central component of applications such as autonomous robots and augmented
reality systems. A dominant approach in the literature, shown to scale to large
scenes and to handle complex illumination and seasonal changes, is based on
local features extracted from images. The scene representation is a sparse
Structure-from-Motion point cloud that is tied to a specific local feature.
Switching to another feature type requires an expensive feature matching step
between the database images used to construct the point cloud. In this work, we
thus explore a more flexible alternative based on dense 3D meshes that does not
require feature matching between database images to build the scene
representation. We show that this approach can achieve state-of-the-art
results. We further show that surprisingly competitive results can be obtained
when extracting features on renderings of these meshes, without any neural
rendering stage, and even when rendering raw scene geometry without color or
texture. Our results show that dense 3D model-based representations are a
promising alternative to existing representations and point to interesting and
challenging directions for future research.
Comment: to be published in the proceedings of ECCV 2022, code repository:
https://github.com/tsattler/meshloc_releas
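The abstract above does not spell out how a dense mesh yields 3D points for 2D features. One plausible route, sketched here purely as an assumption, is to render a depth map from the mesh at a known database pose and back-project each keypoint through a pinhole camera. The function name `backproject`, the intrinsics `fx, fy, cx, cy`, and the world-to-camera convention `X_cam = R * X_world + t` are all illustrative choices, not taken from the paper or its code repository.

```python
def backproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with z-depth `depth` to a 3D point in world coordinates.

    Assumptions (hypothetical, for illustration only):
      - pinhole camera with focal lengths fx, fy and principal point (cx, cy)
      - R (3x3 nested list) and t (length-3 list) map world to camera:
        X_cam = R * X_world + t
      - `depth` is the z-coordinate in the camera frame, e.g. read from a
        depth map rendered from the mesh
    """
    # Point in the camera frame.
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Invert the rigid transform: X_world = R^T * (X_cam - t).
    d = [x_cam[i] - t[i] for i in range(3)]
    return [sum(R[r][c] * d[r] for r in range(3)) for c in range(3)]


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    # A keypoint at the principal point lies on the optical axis.
    print(backproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                      identity, [0.0, 0.0, 0.0]))
```

With such a lifting step, 2D-2D matches between a query image and a database image would become 2D-3D matches for pose estimation, independent of which local feature produced them.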
Scene extraction in motion pictures
This paper addresses the challenge of bridging the semantic gap between the rich meaning users desire when they query to locate and browse media and the shallowness of media descriptions that can be computed in today's content management systems. To facilitate high-level semantics-based content annotation and interpretation, we tackle the problem of automatic decomposition of motion pictures into meaningful story units, namely scenes. Since a scene is a complicated and subjective concept, we first propose guidelines from film production to determine when a scene change occurs. We then investigate different rules and conventions followed as part of Film Grammar that would guide and shape an algorithmic solution for determining a scene. Two different techniques using intershot analysis are proposed as solutions in this paper. In addition, we present different refinement mechanisms, such as film-punctuation detection founded on Film Grammar, to further improve the results. These refinement techniques demonstrate significant improvements in overall performance. Furthermore, we analyze errors in the context of film-production techniques, which offer useful insights into the limitations of our method.
Scene Coordinate Regression with Angle-Based Reprojection Loss for Camera Relocalization
Image-based camera relocalization is an important problem in computer vision
and robotics. Recent works utilize convolutional neural networks (CNNs) to
regress for pixels in a query image their corresponding 3D world coordinates in
the scene. The final pose is then solved via a RANSAC-based optimization scheme
using the predicted coordinates. Usually, the CNN is trained with ground truth
scene coordinates, but it has also been shown that the network can discover 3D
scene geometry automatically by minimizing single-view reprojection loss.
However, due to the deficiencies of the reprojection loss, the network needs to
be carefully initialized. In this paper, we present a new angle-based
reprojection loss, which resolves the issues of the original reprojection loss.
With this new loss function, the network can be trained without careful
initialization, and the system achieves more accurate results. The new loss
also enables us to utilize available multi-view constraints, which further
improve performance.
Comment: ECCV 2018 Workshop (Geometry Meets Deep Learning)
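The abstract names an angle-based reprojection loss but does not give its formula. A minimal sketch of one plausible form, stated here as an assumption rather than the paper's definition, measures the angle between the viewing ray through the observed pixel and the direction to the predicted scene coordinate in the camera frame. The function name `angle_loss`, the intrinsics tuple `K = (fx, fy, cx, cy)`, and the world-to-camera convention `X_cam = R * X_world + t` are hypothetical.

```python
import math


def angle_loss(pixel, scene_xyz, K, R, t):
    """Angle (radians) between the pixel's viewing ray and the predicted point.

    Assumptions (illustrative, not the paper's exact formulation):
      - K = (fx, fy, cx, cy) are pinhole intrinsics
      - R (3x3 nested list), t (length-3 list) map world to camera
      - scene_xyz is the predicted 3D world coordinate for `pixel`
      - the predicted point does not coincide with the camera center
    """
    fx, fy, cx, cy = K
    u, v = pixel
    # Viewing ray through the pixel, in the camera frame.
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Predicted scene coordinate transformed into the camera frame.
    p_cam = [sum(R[r][c] * scene_xyz[c] for c in range(3)) + t[r]
             for r in range(3)]
    dot = sum(a * b for a, b in zip(ray, p_cam))
    norm = math.sqrt(sum(a * a for a in ray)) * math.sqrt(sum(b * b for b in p_cam))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    K = (500.0, 500.0, 320.0, 240.0)
    # A prediction on the optical axis matches the principal-point pixel.
    print(angle_loss((320.0, 240.0), [0.0, 0.0, 5.0], K, identity, [0.0, 0.0, 0.0]))
```

In this form the value stays finite and bounded by pi even when the predicted point falls behind the camera, whereas a pixel-space reprojection error divides by the point's depth.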