Learning Single-Image Depth from Videos using Quality Assessment Networks
Depth estimation from a single image in the wild remains a challenging
problem. One main obstacle is the lack of high-quality training data for images
in the wild. In this paper we propose a method to automatically generate such
data through Structure-from-Motion (SfM) on Internet videos. The core of this
method is a Quality Assessment Network that identifies high-quality
reconstructions obtained from SfM. Using this method, we collect single-view
depth training data from a large number of YouTube videos and construct a new
dataset called YouTube3D. Experiments show that YouTube3D is useful in training
depth estimation networks and advances the state of the art of single-view
depth estimation in the wild.
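The data-collection pipeline described above can be sketched as a filtering loop: run SfM on video frames, score each reconstruction with the quality model, and keep depth annotations only from high-scoring reconstructions. The scoring function, threshold, and data layout below are illustrative placeholders, not the paper's actual network or settings.

```python
# Minimal sketch of quality-filtered data collection from SfM reconstructions.
# `quality_score` stands in for the Quality Assessment Network; the threshold
# is an arbitrary illustrative value.

def collect_depth_data(reconstructions, quality_score, threshold=0.8):
    """Keep (image, depth) pairs whose reconstruction scores above the threshold."""
    dataset = []
    for rec in reconstructions:
        if quality_score(rec) >= threshold:
            dataset.extend(rec["depth_pairs"])
    return dataset

# Toy example: two reconstructions, one high quality, one low.
recs = [
    {"depth_pairs": [("frame_a.jpg", "depth_a")], "score": 0.95},
    {"depth_pairs": [("frame_b.jpg", "depth_b")], "score": 0.30},
]
data = collect_depth_data(recs, quality_score=lambda r: r["score"])
print(len(data))  # 1 -- only the high-quality reconstruction contributes
```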
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation
The images and sounds that we perceive undergo subtle but geometrically
consistent changes as we rotate our heads. In this paper, we use these cues to
solve a problem we call Sound Localization from Motion (SLfM): jointly
estimating camera rotation and localizing sound sources. We learn to solve
these tasks solely through self-supervision. A visual model predicts camera
rotation from a pair of images, while an audio model predicts the direction of
sound sources from binaural sounds. We train these models to generate
predictions that agree with one another. At test time, the models can be
deployed independently. To obtain a feature representation that is well-suited
to solving this challenging problem, we also propose a method for learning an
audio-visual representation through cross-view binauralization: estimating
binaural sound from one view, given images and sound from another. Our model
can successfully estimate accurate rotations on both real and synthetic scenes,
and localize sound sources with accuracy competitive with state-of-the-art
self-supervised approaches. Project site: https://ificl.github.io/SLfM/
Comment: ICCV 2023
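The cross-modal agreement objective at the heart of SLfM can be illustrated with a toy consistency check: for a static sound source, the change in its apparent direction between two views should cancel the camera's rotation, so the visual and audio predictions can supervise each other. The angles (in degrees) and the squared-error loss form below are illustrative placeholders, not the paper's architecture.

```python
# Toy sketch of the self-supervised agreement signal: the visual model's
# rotation estimate and the audio model's bearing estimates are trained to
# be mutually consistent, without any ground-truth labels.

def agreement_loss(pred_rotation, dir_view1, dir_view2):
    """Squared disagreement between predicted rotation and bearing change."""
    return (dir_view2 - (dir_view1 - pred_rotation)) ** 2

# Camera rotates +10 degrees, so the source's bearing shifts by -10 degrees.
loss_consistent = agreement_loss(pred_rotation=10.0, dir_view1=30.0, dir_view2=20.0)
loss_inconsistent = agreement_loss(pred_rotation=10.0, dir_view1=30.0, dir_view2=35.0)
print(loss_consistent, loss_inconsistent)  # 0.0 225.0
```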
AVSegFormer: Audio-Visual Segmentation with Transformer
The combination of audio and vision has long been a topic of interest in the
multi-modal community. Recently, a new audio-visual segmentation (AVS) task has
been introduced, aiming to locate and segment the sounding objects in a given
video. This task demands audio-driven pixel-level scene understanding for the
first time, posing significant challenges. In this paper, we propose
AVSegFormer, a novel framework for AVS tasks that leverages the transformer
architecture. Specifically, we introduce audio queries and learnable queries
into the transformer decoder, enabling the network to selectively attend to
visual features of interest. In addition, we present an audio-visual mixer, which
can dynamically adjust visual features by amplifying relevant and suppressing
irrelevant spatial channels. Additionally, we devise an intermediate mask loss
to enhance the supervision of the decoder, encouraging the network to produce
more accurate intermediate predictions. Extensive experiments demonstrate that
AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is
available at https://github.com/vvvb-github/AVSegFormer.
Comment: 9 pages, 7 figures
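The audio-visual mixer described above can be pictured as channel-wise gating: an audio embedding is projected to per-channel weights that amplify relevant and suppress irrelevant visual channels. The shapes, sigmoid gating, and linear projection below are a hedged sketch of that idea, not AVSegFormer's actual layers.

```python
import numpy as np

# Illustrative channel-attention "mixer": audio-derived gates in (0, 1)
# rescale each visual feature channel.

rng = np.random.default_rng(0)

def audio_visual_mix(visual, audio, proj):
    """visual: (C, H, W); audio: (D,); proj: (C, D). Returns gated features."""
    gates = 1.0 / (1.0 + np.exp(-proj @ audio))  # sigmoid, shape (C,)
    return visual * gates[:, None, None]

C, H, W, D = 4, 2, 2, 8
visual = rng.standard_normal((C, H, W))
audio = rng.standard_normal(D)
proj = rng.standard_normal((C, D))
mixed = audio_visual_mix(visual, audio, proj)
print(mixed.shape)  # (4, 2, 2)
```

Because each gate lies strictly between 0 and 1, the mixer can only attenuate channels here; a learned variant would typically follow this with further processing.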
Champion Solution for the WSDM2023 Toloka VQA Challenge
In this report, we present our champion solution to the WSDM2023 Toloka
Visual Question Answering (VQA) Challenge. Different from the common VQA and
visual grounding (VG) tasks, this challenge involves a more complex scenario,
i.e. inferring and locating the object implicitly specified by the given
interrogative question. For this task, we leverage ViT-Adapter, a
pre-training-free adapter network, to adapt multi-modal pre-trained
Uni-Perceiver for better cross-modal localization. Our method ranks first on
the leaderboard, achieving 77.5 and 76.347 IoU on public and private test sets,
respectively. It shows that ViT-Adapter is also an effective paradigm for
adapting the unified perception model to vision-language downstream tasks. Code
and models will be released at
https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023.
Comment: Technical report in WSDM Cup 2023
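The leaderboard scores quoted above are intersection-over-union values. For reference, here is the standard IoU computation for axis-aligned boxes `(x1, y1, x2, y2)`; this is the generic metric, not code from the winning system.

```python
# Standard intersection-over-union for two axis-aligned boxes.

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ~= 0.142857...
```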
Review of Calibration Methods for Scheimpflug Camera
The Scheimpflug camera offers a wide range of applications in typical close-range photogrammetry, particle image velocimetry, and digital image correlation, because its depth of field can be greatly extended according to the Scheimpflug condition. Yet conventional calibration methods are not applicable in this case, because the assumptions underlying classical calibration methodologies no longer hold for cameras satisfying the Scheimpflug condition. Various methods have therefore been investigated to solve this problem over the last few years. However, no comprehensive review exists that provides insight into recent calibration methods for Scheimpflug cameras. This paper presents a survey of recent calibration methods for Scheimpflug cameras with perspective lenses, including the general nonparametric imaging model, and analyzes in detail the advantages and drawbacks of the mainstream calibration models with respect to each other. Real-data experiments, including calibrations, reconstructions, and measurements, are performed to assess the performance of the models. The results reveal that the accuracies of the RMM, PLVM, PCIM, and GNIM are broadly comparable, with the GNIM slightly less accurate than the other three parametric models. Moreover, the experimental results reveal that the parameters of the tangential distortion are likely coupled with the tilt angle of the sensor in Scheimpflug calibration models. This work lays the foundation for further research on Scheimpflug cameras.
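The Scheimpflug condition itself can be verified numerically under a simple thin-lens model: when the sensor plane is tilted, the plane of sharp focus tilts so that the sensor plane, the lens plane, and the focus plane all intersect in a common line. The focal length, sensor distance, and tilt angle below are arbitrary illustrative values, and the 2D thin-lens model is a simplification of the calibration models surveyed in the paper.

```python
import math

# 2D thin-lens illustration of the Scheimpflug condition. The lens plane is
# x = 0; the tilted sensor passes through (v0, 0) with slope tan(alpha).

f = 0.05                     # focal length (m)
v0 = 0.06                    # sensor distance along the optical axis (m)
alpha = math.radians(10.0)   # sensor tilt
t = math.tan(alpha)

def object_point(y_img):
    """Back-project a sensor point at height y_img to its in-focus object point."""
    v = v0 + y_img * t       # distance of this sensor point from the lens plane
    u = f * v / (v - f)      # thin-lens equation: 1/u + 1/v = 1/f
    y_obj = -y_img * u / v   # lateral magnification -u/v (inverted image)
    return (-u, y_obj)       # object side at negative x

pts = [object_point(y) for y in (-0.01, 0.0, 0.01)]
(x0, y0), (x1, y1), (x2, y2) = pts

# The in-focus object points should be collinear (a tilted focus plane).
cross = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)
print(abs(cross) < 1e-9)  # True

# The focus plane should pass through the line where the tilted sensor meets
# the lens plane (x = 0), i.e. through y = -v0 / tan(alpha).
slope = (y2 - y0) / (x2 - x0)
y_at_lens = y0 + slope * (0.0 - x0)
print(abs(y_at_lens - (-v0 / t)) < 1e-6)  # True
```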
Subgraph Frequency Distribution Estimation using Graph Neural Networks
Small subgraphs (graphlets) are important features to describe fundamental
units of a large network. The calculation of the subgraph frequency
distributions has a wide application in multiple domains including biology and
engineering. Unfortunately, due to the inherent complexity of this task, most
existing methods are computationally intensive and inefficient. In this
work, we propose GNNS, a novel representation-learning framework that
utilizes graph neural networks to sample subgraphs efficiently for estimating
their frequency distribution. Our framework includes an inference model and a
generative model that learns hierarchical embeddings of nodes, subgraphs, and
graph types. With the learned model and embeddings, subgraphs are sampled in a
highly scalable and parallel way and the frequency distribution estimation is
then performed based on these sampled subgraphs. Our method achieves
comparable accuracy with a speedup of three orders of magnitude over existing
methods.
Comment: Accepted by the KDD 2022 Workshop on Deep Learning on Graphs
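To make concrete the quantity that GNNS estimates by sampling, here is an exact brute-force frequency count of connected 3-node subgraphs (paths vs. triangles) on a tiny graph. Enumerating all node triples is the kind of combinatorial baseline whose cost motivates learned sampling; this is not the paper's method.

```python
from itertools import combinations

# Exact 3-node graphlet frequency distribution by exhaustive enumeration.

def triad_distribution(nodes, edges):
    """Count connected 3-node subgraphs: paths (2 edges) vs. triangles (3 edges)."""
    edge_set = {frozenset(e) for e in edges}
    counts = {"path": 0, "triangle": 0}
    for trio in combinations(nodes, 3):
        k = sum(frozenset(p) in edge_set for p in combinations(trio, 2))
        if k == 2:
            counts["path"] += 1
        elif k == 3:
            counts["triangle"] += 1
    return counts

# A 4-cycle on nodes 0-1-2-3 plus the chord (0, 2).
print(triad_distribution(range(4), [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))
# {'path': 2, 'triangle': 2}
```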
OASIS: A Large-Scale Dataset for Single Image 3D in the Wild
Single-view 3D is the task of recovering 3D properties such as depth and
surface normals from a single image. We hypothesize that a major obstacle to
single-image 3D is data. We address this issue by presenting Open Annotations
of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild
consisting of annotations of detailed 3D geometry for 140,000 images. We train
and evaluate leading models on a variety of single-image 3D tasks. We expect
OASIS to be a useful resource for 3D vision research. Project site:
https://pvl.cs.princeton.edu/OASIS.
Comment: Accepted to CVPR 2020
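Depth and surface normals, the 3D properties mentioned above, are geometrically linked: a normal field can be derived from depth gradients. The toy sketch below recovers normals from a synthetic planar depth map under a simple orthographic assumption; real single-image 3D, and OASIS's human annotations, are far richer than this.

```python
import numpy as np

# Normals from a depth map: under an orthographic model, the (unnormalized)
# surface normal at each pixel is (-dz/dx, -dz/dy, 1).

def normals_from_depth(depth):
    """Per-pixel unit normals from a depth map (rows = y, columns = x)."""
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A plane z = 0.5 * x has a constant normal proportional to (-0.5, 0, 1).
x = np.arange(5, dtype=float)
depth = np.tile(0.5 * x, (5, 1))
n = normals_from_depth(depth)
print(np.allclose(n, n[0, 0]))  # True -- same normal everywhere on a plane
```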
Excitation of extraordinary modes inside the source of Saturn's kilometric radiation
The electron cyclotron maser instability (ECMI) of extraordinary mode waves
was investigated with the parameters observed in Saturn's kilometric radiation
(SKR) sources. Previous studies employed simplified dispersion relations, and
did not consider the excitation of the relativistic (R) mode. This mode is
introduced by considering the relativistic effect in plasmas consisting of both
cold and hot electrons. Using particle-in-cell simulations, we investigated the
excitation of R and X modes based on the measured data. Using the reported
value of the density ratio of energetic to total electrons, the most unstable
mode is the R mode. The escaping X-mode emissions are amplified only when the
energetic electrons are dominant. In these cases, only the X mode is excited
and the R mode disappears due to its strong
coupling. The results are well in line with the linear kinetic theory of ECMI.
The properties of both the R and X modes are consistent with the observed SKR
emissions. This raises questions about the nature of the measured electric
field fluctuations within "presumed" SKR sources. The study provides new
insights into the ECMI process relevant to SKR emission mechanisms.
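For context, the ECMI amplifies waves at the relativistic electron-cyclotron resonance; the textbook form of the resonance condition is (this is the standard relation, not a result specific to this paper's simulation setup):

```latex
% Relativistic cyclotron resonance condition (harmonic number s):
\omega - k_{\parallel} v_{\parallel} - \frac{s\,\Omega_{ce}}{\gamma} = 0,
\qquad \gamma = \left(1 - v^{2}/c^{2}\right)^{-1/2},
```

where the relativistic correction \(1/\gamma\) is what allows a population inversion in perpendicular velocity to drive wave growth.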
Boosting Neural Machine Translation with Dependency-Scaled Self-Attention Network
Syntactic knowledge is a powerful asset in neural machine translation (NMT).
Early NMT work assumed that syntactic details could be learned automatically
from large corpora via attention networks. However, subsequent research
pointed out that, limited by the uncontrolled nature of attention computation,
NMT models require external syntax to capture deep syntactic awareness.
Although existing syntax-aware NMT methods have borne great fruit in
incorporating syntax, the additional workloads they introduce render the
models heavy and slow. In particular, these efforts rarely address
Transformer-based NMT or modify its core self-attention network (SAN). To
this end, we propose a parameter-free, dependency-scaled self-attention
network (Deps-SAN) for syntax-aware Transformer-based NMT. A quantified
matrix of dependency closeness between tokens imposes explicit syntactic
constraints on the SAN, helping it learn syntactic details and dispelling the
dispersion of attention distributions. Two knowledge-sparsing techniques are
further integrated to keep the model from overfitting the dependency noise
introduced by the external parser. Experiments and analyses on the IWSLT14
German-to-English and WMT16 German-to-English benchmark NMT tasks verify the
effectiveness of our approach.
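The dependency-scaled attention idea can be sketched as biasing raw attention logits with a matrix of dependency distances, so that syntactically close tokens receive more attention mass. The Gaussian-style penalty below is an illustrative choice, not necessarily the paper's exact formulation, and the dependency distances are a made-up toy parse.

```python
import numpy as np

# Sketch of dependency-scaled self-attention: logits are penalized by the
# squared parse-tree distance between tokens before the softmax.

def deps_scaled_attention(Q, K, V, dep_dist, sigma=1.0):
    """dep_dist[i, j]: distance between tokens i and j in the dependency tree."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) - (dep_dist ** 2) / (2 * sigma ** 2)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# Toy dependency distances for a 4-token sentence.
dep_dist = np.array([[0, 1, 2, 2],
                     [1, 0, 1, 1],
                     [2, 1, 0, 2],
                     [2, 1, 2, 0]], dtype=float)
out, w = deps_scaled_attention(Q, K, V, dep_dist)
print(out.shape, np.allclose(w.sum(axis=-1), 1.0))  # (4, 8) True
```

Note that the bias is parameter-free, matching the paper's stated design goal of imposing syntax without extra learned weights.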