
    Learning Single-Image Depth from Videos using Quality Assessment Networks

    Depth estimation from a single image in the wild remains a challenging problem. One main obstacle is the lack of high-quality training data for images in the wild. In this paper, we propose a method to automatically generate such data through Structure-from-Motion (SfM) on Internet videos. The core of this method is a Quality Assessment Network that identifies high-quality reconstructions obtained from SfM. Using this method, we collect single-view depth training data from a large number of YouTube videos and construct a new dataset called YouTube3D. Experiments show that YouTube3D is useful in training depth estimation networks and advances the state of the art of single-view depth estimation in the wild.
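
    To make the filtering idea concrete, here is a minimal sketch (not the paper's architecture): a hypothetical quality network scores featurized SfM reconstructions, and only those above a threshold are kept as depth supervision. The feature dimension, network body, and 0.5 threshold are all illustrative assumptions.

```python
# Sketch: filter SfM reconstructions with a quality score before using them
# as single-view depth supervision. The network and threshold are hypothetical
# placeholders, not the paper's exact setup.
import torch
import torch.nn as nn

class QualityAssessmentNet(nn.Module):
    """Toy stand-in: maps per-reconstruction features to a quality score in [0, 1]."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, feats):                    # feats: (batch, feat_dim)
        return self.mlp(feats).squeeze(-1)

def filter_reconstructions(recon_feats, quality_net, threshold=0.5):
    """Keep only the reconstructions the network judges reliable."""
    with torch.no_grad():
        scores = quality_net(recon_feats)
    keep = scores >= threshold
    return keep, scores

# Usage: pretend we featurized 10 SfM reconstructions from video clips.
net = QualityAssessmentNet()
feats = torch.randn(10, 128)
keep, scores = filter_reconstructions(feats, net)
print(f"kept {int(keep.sum())} of {len(keep)} reconstructions")
```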

    Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

    The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from binaural sounds. We train these models to generate predictions that agree with one another. At test time, the models can be deployed independently. To obtain a feature representation that is well-suited to solving this challenging problem, we also propose a method for learning an audio-visual representation through cross-view binauralization: estimating binaural sound from one view, given images and sound from another. Our model can successfully estimate accurate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches. Project site: https://ificl.github.io/SLfM/. Comment: ICCV 2023.
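
    A minimal sketch of the agreement objective, assuming toy linear stand-ins for the two models: a fixed source should appear to move opposite to a head turn, so the change in the audio model's predicted direction should cancel the visual model's predicted rotation. The shapes and the simple MSE pairing are assumptions, not the paper's exact losses.

```python
# Sketch of the cross-modal agreement idea behind SLfM-style training.
import torch
import torch.nn as nn

# Toy stand-ins: real models would be CNNs over image pairs and binaural audio.
visual_net = nn.Linear(2 * 3 * 32 * 32, 1)   # flattened image pair -> rotation angle (rad)
audio_net = nn.Linear(2 * 256, 1)            # flattened binaural clip -> source angle (rad)

def agreement_loss(img_pair, audio_t1, audio_t2):
    rot = visual_net(img_pair)               # predicted camera rotation between the views
    dir1 = audio_net(audio_t1)               # source direction before the turn
    dir2 = audio_net(audio_t2)               # source direction after the turn
    # A fixed source appears to rotate by -rot, so this residual should vanish.
    return ((dir2 - dir1) + rot).pow(2).mean()

imgs = torch.randn(4, 2 * 3 * 32 * 32)       # a batch of 4 flattened image pairs
a1, a2 = torch.randn(4, 2 * 256), torch.randn(4, 2 * 256)
loss = agreement_loss(imgs, a1, a2)
loss.backward()                              # both models receive a self-supervised signal
```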

    AVSegFormer: Audio-Visual Segmentation with Transformer

    The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to visual features of interest. In addition, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant spatial channels and suppressing irrelevant ones. We also devise an intermediate mask loss to enhance the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer. Comment: 9 pages, 7 figures.
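
    A hedged sketch of the mixer idea: derive per-channel gates from an audio embedding and use them to reweight the visual feature channels. The sigmoid gating and the dimensions are illustrative choices, not AVSegFormer's exact design.

```python
# Sketch: audio-conditioned channel gating over visual features.
import torch
import torch.nn as nn

class AudioVisualMixer(nn.Module):
    def __init__(self, audio_dim=128, visual_channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(audio_dim, visual_channels),
            nn.Sigmoid(),                           # per-channel weight in (0, 1)
        )

    def forward(self, visual_feat, audio_emb):
        # visual_feat: (B, C, H, W); audio_emb: (B, audio_dim)
        w = self.gate(audio_emb)[:, :, None, None]  # (B, C, 1, 1)
        return visual_feat * w                      # amplify or suppress channels

mixer = AudioVisualMixer()
v = torch.randn(2, 256, 28, 28)
a = torch.randn(2, 128)
print(mixer(v, a).shape)  # torch.Size([2, 256, 28, 28])
```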

    Champion Solution for the WSDM2023 Toloka VQA Challenge

    In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Unlike common VQA and visual grounding (VG) tasks, this challenge involves a more complex scenario, i.e., inferring and locating the object implicitly specified by a given interrogative question. For this task, we leverage ViT-Adapter, a pre-training-free adapter network, to adapt the multi-modal pre-trained Uni-Perceiver for better cross-modal localization. Our method ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test sets, respectively. This shows that ViT-Adapter is also an effective paradigm for adapting a unified perception model to vision-language downstream tasks. Code and models will be released at https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023. Comment: technical report for WSDM Cup 2023.
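
    For reference, the leaderboard numbers above are plain intersection-over-union scores; a minimal box-IoU sketch follows. The (x1, y1, x2, y2) box format is an assumption about the challenge setup, not a documented detail.

```python
# Sketch: intersection-over-union between two axis-aligned boxes.
def box_iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2); returns IoU in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```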

    Review of Calibration Methods for Scheimpflug Camera

    The Scheimpflug camera offers a wide range of applications in close-range photogrammetry, particle image velocimetry, and digital image correlation, because its depth of field can be greatly extended according to the Scheimpflug condition. Yet conventional calibration methods are not applicable in this case, because the assumptions used by classical calibration methodologies no longer hold for cameras satisfying the Scheimpflug condition. Various methods have therefore been investigated to solve this problem over the last few years. However, no comprehensive review exists that provides insight into recent calibration methods for Scheimpflug cameras. This paper presents a survey of recent calibration methods for Scheimpflug cameras with perspective lenses, including the general nonparametric imaging model (GNIM), and analyzes in detail the advantages and drawbacks of the mainstream calibration models with respect to each other. Real-data experiments including calibration, reconstruction, and measurement are performed to assess the performance of the models. The results reveal that the accuracies of RMM, PLVM, and PCIM are basically equal, while the accuracy of GNIM is slightly lower than that of the three parametric models. Moreover, the experimental results reveal that the parameters of tangential distortion are likely coupled with the tilt angle of the sensor in Scheimpflug calibration models. This work lays the foundation for further research on Scheimpflug cameras.
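
    The Scheimpflug condition itself can be checked numerically under an idealized 2-D thin-lens model: if the object plane and the lens plane meet in a common line, the image points of that plane remain collinear, and the image plane passes through the same line. The sketch below is such a consistency check, not any of the surveyed calibration models; the focal length, tilt, and line offset are arbitrary values.

```python
# Numeric check of the Scheimpflug condition under an ideal 2-D thin-lens model.
import numpy as np

f = 50.0                              # focal length (mm); thin lens on the plane x = 0
h = 120.0                             # candidate Scheimpflug line at (0, -h)
alpha = np.deg2rad(30.0)              # object-plane tilt measured from the lens plane

ys = np.linspace(-20.0, 200.0, 5)     # object heights along the tilted plane
us = (ys + h) * np.tan(alpha)         # object distances so the plane passes through (0, -h)
vs = f * us / (us - f)                # thin-lens equation: 1/u + 1/v = 1/f
ys_img = -(vs / us) * ys              # inverted image with magnification v/u

# If Scheimpflug holds, every image point lies on one line through (0, -h):
slopes = (ys_img + h) / vs
print(np.allclose(slopes, slopes[0]))  # True: the image plane shares the common line
```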

    Subgraph Frequency Distribution Estimation using Graph Neural Networks

    Small subgraphs (graphlets) are important features for describing the fundamental units of a large network. Computing subgraph frequency distributions has wide application in multiple domains, including biology and engineering. Unfortunately, due to the inherent complexity of this task, most existing methods are computationally intensive and inefficient. In this work, we propose GNNS, a novel representation-learning framework that utilizes graph neural networks to sample subgraphs efficiently for estimating their frequency distribution. Our framework includes an inference model and a generative model that learn hierarchical embeddings of nodes, subgraphs, and graph types. With the learned model and embeddings, subgraphs are sampled in a highly scalable and parallel way, and the frequency distribution is then estimated from these sampled subgraphs. Our method achieves comparable accuracy with a speedup of three orders of magnitude over existing methods. Comment: accepted by the KDD 2022 Workshop on Deep Learning on Graphs.
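
    For context, here is the kind of brute-force estimator the learned sampler is meant to replace: a simple (and deliberately biased) edge-based sampler that estimates the 3-node graphlet distribution, wedges versus triangles, with networkx. It is a baseline sketch, not the GNNS method.

```python
# Baseline sketch: estimate the 3-node graphlet frequency distribution by sampling.
import random
import networkx as nx

def estimate_graphlet_dist(G, samples=10_000, seed=0):
    """Estimate the wedge/triangle split among sampled 3-node subgraphs."""
    rng = random.Random(seed)
    edges = list(G.edges())
    counts = {"wedge": 0, "triangle": 0}
    for _ in range(samples):
        u, v = rng.choice(edges)                  # pick a random edge (u, v)
        cands = [n for n in G.neighbors(u) if n != v]
        if not cands:                             # u has no third neighbor; skip
            continue
        w = rng.choice(cands)                     # extend to a 3-node subgraph
        counts["triangle" if G.has_edge(v, w) else "wedge"] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

G = nx.erdos_renyi_graph(200, 0.05, seed=1)
print(estimate_graphlet_dist(G))
```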

    OASIS: A Large-Scale Dataset for Single Image 3D in the Wild

    Single-view 3D is the task of recovering 3D properties, such as depth and surface normals, from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: https://pvl.cs.princeton.edu/OASIS. Comment: accepted to CVPR 2020.
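
    A common way to score models on surface-normal annotations like OASIS's is mean angular error; a minimal sketch follows, with random vectors standing in for real predictions and ground truth. The metric is a standard assumption, not a claim about OASIS's official evaluation protocol.

```python
# Sketch: mean angular error between predicted and annotated surface normals.
import numpy as np

def mean_angular_error(pred, gt):
    """pred, gt: (N, 3) normal vectors; returns the mean error in degrees."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

rng = np.random.default_rng(0)
pred, gt = rng.normal(size=(1000, 3)), rng.normal(size=(1000, 3))
print(f"{mean_angular_error(pred, gt):.1f} deg")  # ~90 for unrelated random vectors
```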

    Excitation of extraordinary modes inside the source of Saturn's kilometric radiation

    The electron cyclotron maser instability (ECMI) of extraordinary-mode waves was investigated with parameters observed in Saturn's kilometric radiation (SKR) sources. Previous studies employed simplified dispersion relations and did not consider the excitation of the relativistic (R) mode. This mode is introduced by considering the relativistic effect in plasmas consisting of both cold and hot electrons. Using particle-in-cell simulations, we investigated the excitation of the R and X modes based on the measured data. Using the reported value of the density ratio of energetic to total electrons, n_e/n_0 = 24%, the most unstable mode is the R mode. The escaping X-mode emissions are amplified only if the energetic electrons are dominant, with n_e/n_0 ≥ 90%. In these cases, only the X mode is excited and the R mode disappears due to its strong coupling. The results are well in line with the linear kinetic theory of ECMI. The properties of both the R and X modes are consistent with the observed SKR emissions. This raises questions about the nature of the measured electric field fluctuations within "presumed" SKR sources. The study provides new insights into the ECMI process relevant to SKR emission mechanisms.
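
    A back-of-the-envelope sketch of the relativistic effect the abstract invokes: a hot electron's cyclotron frequency is downshifted by its Lorentz factor, Omega = e*B / (gamma * m_e), which is what distinguishes the relativistic treatment from the cold-plasma picture. The magnetic field value and electron energy below are illustrative placeholders, not measured SKR source values.

```python
# Sketch: relativistic downshift of the electron cyclotron frequency.
import math

e, m_e = 1.602e-19, 9.109e-31           # electron charge (C) and mass (kg)
mec2_keV = 511.0                        # electron rest energy in keV

def cyclotron_freq_hz(B_tesla, kinetic_keV=0.0):
    """Relativistic electron cyclotron frequency f = e*B / (2*pi*gamma*m_e)."""
    gamma = 1.0 + kinetic_keV / mec2_keV
    return e * B_tesla / (gamma * m_e) / (2.0 * math.pi)

B = 1.0e-6                              # placeholder field strength (T)
f_cold = cyclotron_freq_hz(B)
f_hot = cyclotron_freq_hz(B, kinetic_keV=10.0)   # ~10 keV energetic electrons
print(f"cold: {f_cold / 1e3:.1f} kHz, 10 keV: {f_hot / 1e3:.1f} kHz "
      f"({100 * (1 - f_hot / f_cold):.1f}% downshift)")
```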

    Boosting Neural Machine Translation with Dependency-Scaled Self-Attention Network

    Syntactic knowledge is a powerful asset in neural machine translation (NMT). Early NMT work assumed that syntactic details could be learned automatically from large corpora via attention networks. However, subsequent research pointed out that, limited by the uncontrolled nature of attention computation, NMT models require external syntax to capture deep syntactic awareness. Although existing syntax-aware NMT methods have borne great fruit in incorporating syntax, the additional workloads they introduce render the models heavy and slow. Moreover, these efforts scarcely involve Transformer-based NMT or modify its core self-attention network (SAN). To this end, we propose a parameter-free, dependency-scaled self-attention network (Deps-SAN) for syntax-aware Transformer-based NMT. A quantified matrix of dependency closeness between tokens imposes explicit syntactic constraints on the SAN, helping it learn syntactic details and dispelling the dispersion of attention distributions. Two knowledge-sparsing techniques are further integrated so the model does not overfit the dependency noise introduced by the external parser. Experiments and analyses on the IWSLT14 German-to-English and WMT16 German-to-English NMT benchmarks verify the effectiveness of our approach.
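
    A minimal sketch of the dependency-scaling idea, assuming a Gaussian of pairwise dependency-tree distance as the closeness weight: token pairs far apart in the parse get their attention logits damped before the softmax. The Gaussian form, the sigma value, and the toy distance matrix are assumptions; the paper's exact Deps-SAN construction may differ.

```python
# Sketch: self-attention reweighted by dependency-tree distance.
import torch
import torch.nn.functional as F

def deps_scaled_attention(q, k, v, dep_dist, sigma=1.0):
    # q, k, v: (seq, d); dep_dist: (seq, seq) hop counts in the dependency tree
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5
    dep_scale = torch.exp(-dep_dist.pow(2) / (2 * sigma ** 2))  # 1 on the diagonal
    # Adding log-weights multiplies the softmax probabilities by dep_scale.
    attn = F.softmax(logits + torch.log(dep_scale + 1e-9), dim=-1)
    return attn @ v

seq, d = 4, 8
q = k = v = torch.randn(seq, d)
dep_dist = torch.tensor([[0., 1, 2, 3],
                         [1, 0, 1, 2],
                         [2, 1, 0, 1],
                         [3, 2, 1, 0]])   # a chain-shaped toy parse
out = deps_scaled_attention(q, k, v, dep_dist)
print(out.shape)  # torch.Size([4, 8])
```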