Distributed bundle adjustment with block-based sparse matrix compression for super large scale datasets
We propose a distributed bundle adjustment (DBA) method using the exact
Levenberg-Marquardt (LM) algorithm for super large-scale datasets. Most of the
existing methods partition the global map into small submaps and conduct
bundle adjustment within them. To fit the parallel framework, they use
approximate solutions instead of the exact LM algorithm, which often
yields sub-optimal results. In contrast, we utilize the exact LM
algorithm to conduct global bundle adjustment where the formation of the
reduced camera system (RCS) is actually parallelized and executed in a
distributed way. To store the large RCS, we compress it with a block-based
sparse matrix compression format (BSMC), which fully exploits its block
feature. The BSMC format also enables the distributed storage and updating of
the global RCS. The proposed method is extensively evaluated and compared with
the state-of-the-art pipelines using both synthetic and real datasets.
Preliminary results demonstrate the efficient memory usage and vast scalability
of the proposed method compared with the baselines. For the first time, we
conducted parallel bundle adjustment using the LM algorithm on a real dataset
with 1.18 million images and a synthetic dataset with 10 million images (about
500 times the size handled by the state-of-the-art LM-based BA) on a
distributed computing system.
Comment: camera ready version for ICCV202
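The block-based sparse compression the abstract describes can be illustrated with a minimal sketch: only nonzero c-by-c camera blocks of the reduced camera system are stored, keyed by block coordinates, so memory scales with the number of interacting camera pairs rather than the full matrix. This is a simplified illustration, not the paper's BSMC format; the class and method names are hypothetical.

```python
import numpy as np

class BlockSparseMatrix:
    """Toy block-sparse storage: only nonzero c-by-c blocks are kept,
    keyed by their (row, column) block coordinates."""
    def __init__(self, n_blocks, block_size):
        self.n = n_blocks
        self.c = block_size
        self.blocks = {}  # (i, j) -> (c, c) ndarray

    def add_block(self, i, j, block):
        # accumulate: several point eliminations can touch the same camera pair
        key = (i, j)
        self.blocks[key] = self.blocks.get(key, np.zeros((self.c, self.c))) + block

    def to_dense(self):
        M = np.zeros((self.n * self.c, self.n * self.c))
        for (i, j), b in self.blocks.items():
            M[i*self.c:(i+1)*self.c, j*self.c:(j+1)*self.c] = b
        return M

    def matvec(self, x):
        # block-wise product; each row stripe could live on a different worker,
        # which is what makes distributed storage and updating natural
        y = np.zeros(self.n * self.c)
        for (i, j), b in self.blocks.items():
            y[i*self.c:(i+1)*self.c] += b @ x[j*self.c:(j+1)*self.c]
        return y
```

A distributed variant would partition `self.blocks` by row stripe across machines; the dictionary-of-blocks layout makes that partitioning trivial.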
RADA: Robust Adversarial Data Augmentation for Camera Localization in Challenging Conditions
Camera localization is a fundamental problem for many applications in computer vision, robotics, and autonomy. Despite recent deep learning-based approaches, a lack of robustness in challenging conditions persists due to appearance changes caused by texture-less planes, repeating structures, reflective surfaces, motion blur, and illumination changes. Data augmentation is an attractive solution, but standard image perturbation methods fail to improve localization robustness. To address this, we propose RADA, which concentrates on perturbing the most vulnerable pixels, generating fewer but more perplexing image perturbations for the network. Our method outperforms previous augmentation techniques, achieving up to twice the accuracy of state-of-the-art models even under ’unseen’ challenging weather conditions. Videos of our results can be found at https://youtu.be/niOv7-fJeCA. The source code for RADA is publicly available at https://github.com/jialuwang123321/RAD
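The idea of perturbing only the most vulnerable pixels can be sketched as follows: rank pixels by loss-gradient magnitude, keep the top fraction, and apply a signed step only there. This is a hypothetical simplification of RADA for illustration, not the authors' actual augmentation pipeline; `k_frac` and `eps` are made-up parameters.

```python
import numpy as np

def sparse_adversarial_perturbation(image, grad, k_frac=0.01, eps=0.05):
    """Perturb only the k_frac fraction of pixels with the largest
    loss-gradient magnitude, leaving the rest of the image untouched."""
    mag = np.abs(grad).reshape(-1)
    k = max(1, int(k_frac * mag.size))
    idx = np.argpartition(mag, -k)[-k:]      # indices of most vulnerable pixels
    mask = np.zeros(grad.size, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(grad.shape)
    # FGSM-style signed step, restricted to the selected pixels
    perturbed = image + eps * np.sign(grad) * mask
    return np.clip(perturbed, 0.0, 1.0)
```

Restricting the perturbation to a small pixel budget is what yields "relatively fewer" but still confusing augmentations.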
A Temporal Sequence Learning for Action Recognition and Prediction
In this work\footnote {This work was supported in part by the National
Science Foundation under grant IIS-1212948.}, we present a method to represent
a video with a sequence of words, and learn the temporal sequencing of such
words as the key information for predicting and recognizing human actions. We
leverage core concepts from the Natural Language Processing (NLP) literature
used in sentence classification to solve the problems of action prediction and
action recognition. Each frame is converted into a word that is represented as
a vector using the Bag of Visual Words (BoW) encoding method. The words are
then combined into a sentence that represents the video. The
sequence of words in different actions is learned with a simple but effective
Temporal Convolutional Neural Network (T-CNN) that captures the temporal
sequencing of information in a video sentence. We demonstrate that a key
characteristic of the proposed method is its low-latency, i.e. its ability to
predict an action accurately with a partial sequence (sentence). Experiments on
two datasets, \textit{UCF101} and \textit{HMDB51}, show that the method on
average reaches 95\% of its accuracy within half the video frames. Results
also demonstrate that our method achieves performance comparable to the state
of the art in action recognition (i.e. at the completion of the sentence), in
addition to action prediction.
Comment: 10 pages, 8 figures, 2018 IEEE Winter Conference on Applications of
Computer Vision (WACV
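The frame-to-word quantization and the temporal convolution over the resulting "sentence" can be sketched in a few lines. This is an illustrative stand-in, not the paper's T-CNN: nearest-centroid quantization plays the role of BoW encoding, and a single hand-rolled 1-D convolution stands in for the network; all names here are hypothetical.

```python
import numpy as np

def frames_to_sentence(frame_descriptors, codebook):
    """BoW-style quantization: map each frame descriptor to the id of its
    nearest codebook centroid, turning the video into a sentence of word ids."""
    d2 = ((frame_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def temporal_conv_scores(word_embeddings, sentence, kernel):
    """One 1-D temporal convolution over the embedded word sequence; returns
    one score per valid window position. A prediction made from a prefix of
    the sentence is what gives the method its low latency."""
    seq = word_embeddings[sentence]              # (T, d) embedded sentence
    T, _ = seq.shape
    kT = kernel.shape[0]
    return np.array([(seq[t:t+kT] * kernel).sum() for t in range(T - kT + 1)])
```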
Visual-LiDAR Odometry and Mapping with Monocular Scale Correction and Motion Compensation
This paper presents a novel visual-LiDAR odometry and mapping method with
low-drift characteristics. The proposed method is based on two popular
approaches, ORB-SLAM and A-LOAM, with monocular scale correction and
visual-assisted LiDAR motion compensation modifications. The scale corrector
calculates the proportion between the depth of image keypoints recovered by
triangulation and that provided by LiDAR, using an outlier rejection process
for accuracy improvement. Concerning LiDAR motion compensation, the visual
odometry approach gives the initial guesses of LiDAR motions for better
performance. This methodology is not only applicable to high-resolution LiDAR
but can also adapt to low-resolution LiDAR. To evaluate the proposed SLAM
system's robustness and accuracy, we conducted experiments on the KITTI
Odometry and S3E datasets. Experimental results illustrate that our method
significantly outperforms standalone ORB-SLAM2 and A-LOAM. Furthermore,
regarding the accuracy of visual odometry with scale correction, our method
performs similarly to stereo-mode ORB-SLAM2.
Comment: 7 pages, 7 figures, 31 references
Self-supervised Interest Point Detection and Description for Fisheye and Perspective Images
Keypoint detection and matching is a fundamental task in many computer vision
problems, from shape reconstruction, to structure from motion, to AR/VR
applications and robotics. It is a well-studied problem with remarkable
successes such as SIFT, and more recent deep learning approaches. While great
robustness is exhibited by these techniques with respect to noise, illumination
variation, and rigid motion transformations, less attention has been placed on
image distortion sensitivity. In this work, we focus on the case when this is
caused by the geometry of the cameras used for image acquisition, and consider
the keypoint detection and matching problem between the hybrid scenario of a
fisheye and a projective image. We build on a state-of-the-art approach and
derive a self-supervised procedure that enables training an interest point
detector and descriptor network. We also collected two new datasets for
additional training and testing in this unexplored scenario, and we demonstrate
that current approaches are suboptimal because they are designed to work in
traditional projective conditions, while the proposed approach turns out to be
the most effective.
Comment: CVPR Workshop on Omnidirectional Computer Vision, 202
CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Deriving reliable region-word alignment from image-text pairs is critical to
learn object-level vision-language representations for open-vocabulary object
detection. Existing methods typically rely on pre-trained or self-trained
vision-language models for alignment, which are prone to limitations in
localization accuracy or generalization capabilities. In this paper, we propose
CoDet, a novel approach that overcomes the reliance on pre-aligned
vision-language space by reformulating region-word alignment as a co-occurring
object discovery problem. Intuitively, by grouping images that mention a shared
concept in their captions, objects corresponding to the shared concept shall
exhibit high co-occurrence among the group. CoDet then leverages visual
similarities to discover the co-occurring objects and align them with the
shared concept. Extensive experiments demonstrate that CoDet has superior
performance and compelling scalability in open-vocabulary detection, e.g., by
scaling up the visual backbone, CoDet achieves 37.0 and 44.7 on OV-LVIS,
surpassing the previous SoTA by 4.2 and 9.8. Code is available at
https://github.com/CVMI-Lab/CoDet.
Comment: Accepted by NeurIPS 202
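The co-occurrence intuition — group images whose captions mention a shared concept, then find the regions that recur across the group — can be sketched with cosine similarity over region features. This is a toy illustration of the grouping-and-discovery idea, not CoDet's actual training objective; the function names are hypothetical.

```python
import numpy as np

def group_by_concept(captions, concept):
    """Collect indices of images whose captions mention the concept word."""
    return [i for i, c in enumerate(captions) if concept in c.lower().split()]

def co_occurring_regions(region_feats, group):
    """For each image in the group, pick the region whose feature is most
    similar (max cosine similarity, summed over the other images) to some
    region elsewhere in the group -- the 'co-occurring object'."""
    picks = {}
    for i in group:
        fi = region_feats[i]
        fi = fi / np.linalg.norm(fi, axis=1, keepdims=True)
        score = np.zeros(fi.shape[0])
        for j in group:
            if j == i:
                continue
            fj = region_feats[j]
            fj = fj / np.linalg.norm(fj, axis=1, keepdims=True)
            score += (fi @ fj.T).max(axis=1)   # best match in image j
        picks[i] = int(score.argmax())
    return picks
```

The picked regions can then be aligned with the shared caption concept, replacing a pre-aligned vision-language space.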