
    RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints

    We propose a Convolutional Neural Network (CNN)-based model, "RotationNet," which takes multi-view images of an object as input and jointly estimates its pose and object category. Unlike previous approaches that use known viewpoint labels for training, our method treats the viewpoint labels as latent variables, which are learned in an unsupervised manner during training on an unaligned object dataset. RotationNet is designed to use only a partial set of multi-view images for inference, which makes it useful in practical scenarios where only partial views are available. Moreover, our pose alignment strategy enables view-specific feature representations shared across classes, which is important for maintaining high accuracy in both object categorization and pose estimation. The effectiveness of RotationNet is demonstrated by its superior performance over state-of-the-art methods for 3D object classification on the 10- and 40-class ModelNet datasets. We also show that RotationNet, even when trained without known poses, achieves state-of-the-art performance on an object pose estimation dataset. The code is available at https://github.com/kanezaki/rotationnet. Comment: 24 pages, 23 figures. Accepted to CVPR 2018.
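
    A minimal sketch of the latent-viewpoint idea described above, assuming a hypothetical per-view CNN has already produced class scores for every predefined viewpoint position; only a simplified alignment search over circular viewpoint offsets is shown, not the actual RotationNet scoring or training scheme.

```python
import numpy as np

def latent_viewpoint_inference(view_scores):
    """Joint category / viewpoint selection with the viewpoint treated as latent.

    view_scores: (M, V, C) array of per-image class log-probabilities, where M
    is the number of captured views, V the number of predefined viewpoints on a
    circle, and C the number of object categories. Hypothetical input from a
    per-view CNN; only the latent alignment search is sketched here.
    """
    M, V, C = view_scores.shape
    best = (-np.inf, None, None)
    # Enumerate candidate alignments: assume the M captured images occupy M
    # consecutive positions on the viewpoint circle, shifted by an unknown offset.
    for offset in range(V):
        positions = [(offset + i) % V for i in range(M)]
        # Sum the per-view scores under this alignment for every category.
        joint = sum(view_scores[i, p, :] for i, p in enumerate(positions))
        c = int(np.argmax(joint))
        if joint[c] > best[0]:
            best = (joint[c], c, offset)
    score, category, pose_offset = best
    return category, pose_offset, score

# Example with random scores: 3 captured views, 12 viewpoints, 40 classes.
rng = np.random.default_rng(0)
scores = np.log(rng.dirichlet(np.ones(40), size=(3, 12)))
print(latent_viewpoint_inference(scores))
```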

    A multi-agent adaptive protocol for femto-satellite applications

    Femto-satellites are a very promising category of satellites that weigh less than 100 grams. A pico-rover is a self-contained robot that weighs less than 1 kilogram and moves by rolling its external enclosure, which also shields it from environmental threats. The main advantage of these small agents is the multiple points of view they provide when working as a swarm or as part of a larger constellation. The complexity of such sensor networks, combined with their low-power and small-size requirements, calls for a sound management strategy, which we present in this work. The management-on-agent paradigm consists of a single high-quality point of view and multiple low-quality points of view, where the selection of the active point of view is performed inside the network but decided either externally or by a basic law. This approach optimizes the bandwidth used by the network: instead of streaming every high-quality point of view, only one of them is streamed. At the same time, it allows a task distribution in which there is a single producer agent and a single consumer agent, while the remaining agents act as relay nodes. This work addresses, on the one hand, the design of a simple yet robust and adaptive protocol based on this paradigm and, on the other hand, its implementation on a low-performance platform such as the 8051 microcontroller architecture.
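
    A minimal sketch of the role-assignment rule implied by the paradigm above: one producer, one consumer, everyone else a relay, with the producer chosen either externally or by a basic law (here, the best self-reported view quality). The Agent fields, the quality metric, and the selection rule are illustrative assumptions, not the paper's actual protocol.

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    PRODUCER = "producer"   # streams its high-quality point of view
    CONSUMER = "consumer"   # receives the selected stream (e.g. ground link)
    RELAY = "relay"         # forwards frames, sends only low-quality telemetry

@dataclass
class Agent:
    agent_id: int
    view_quality: float     # hypothetical metric reported by each agent
    role: Role = Role.RELAY

def assign_roles(agents, consumer_id, external_choice=None):
    """Basic-law role assignment sketch: the consumer is fixed, the producer is
    either dictated externally or chosen as the agent reporting the best view.
    All other agents act as relays, so only one high-quality stream crosses
    the network at a time."""
    for a in agents:
        a.role = Role.RELAY
    consumer = next(a for a in agents if a.agent_id == consumer_id)
    consumer.role = Role.CONSUMER
    candidates = [a for a in agents if a.role is Role.RELAY]
    if external_choice is not None:
        producer = next(a for a in candidates if a.agent_id == external_choice)
    else:
        producer = max(candidates, key=lambda a: a.view_quality)
    producer.role = Role.PRODUCER
    return agents

swarm = [Agent(i, view_quality=q) for i, q in enumerate([0.2, 0.9, 0.5, 0.7])]
for a in assign_roles(swarm, consumer_id=0):
    print(a.agent_id, a.role.value)
```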

    Wireless Network Coding with Local Network Views: Coded Layer Scheduling

    One of the fundamental challenges in the design of distributed wireless networks is the large dynamic range of the network state. Since continuously tracking the global network state at all nodes is practically impossible, nodes can only acquire limited local views of the whole network to design their transmission strategies. In this paper, we study multi-layer wireless networks and assume that each node has only limited knowledge, namely a 1-local view, in which each S-D pair has enough information to perform optimally when other pairs do not interfere, along with connectivity information for the rest of the network. We investigate the information-theoretic limits of communication with such limited knowledge at the nodes. We develop a novel transmission strategy, Coded Layer Scheduling, that relies solely on the 1-local view at the nodes and incorporates three techniques: (1) per-layer interference avoidance, (2) repetition coding to allow overhearing of the interference, and (3) network coding to allow interference neutralization. We show that our proposed scheme can provide a significant throughput gain compared with conventional interference avoidance strategies. Furthermore, we show that our strategy maximizes the achievable normalized sum-rate for some classes of networks, hence characterizing the normalized sum-capacity of those networks with 1-local view. Comment: Technical report. A paper based on the results of this report will appear.
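
    A small sketch of only the first ingredient named above, per-layer interference avoidance, assuming each node knows which other nodes in its layer it conflicts with (a stand-in for the connectivity part of the 1-local view); the repetition-coding and network-coding ingredients of Coded Layer Scheduling are omitted.

```python
def per_layer_interference_avoidance(conflicts):
    """Greedy slot assignment sketch for one relay layer.

    conflicts: dict mapping a node id to the set of nodes it interferes with.
    Conflicting nodes are placed in different time slots, so each scheduled
    slot is interference-free within the layer.
    """
    slot_of = {}
    for node in sorted(conflicts):
        used = {slot_of[n] for n in conflicts[node] if n in slot_of}
        slot = 0
        while slot in used:
            slot += 1
        slot_of[node] = slot
    return slot_of

# Example: a 4-node layer where node 0 interferes with 1 and 2, and 1 with 3.
conflicts = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
print(per_layer_interference_avoidance(conflicts))  # {0: 0, 1: 1, 2: 1, 3: 0}
```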

    Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images

    We propose a novel transformer-based framework that reconstructs two high-fidelity hands from multi-view RGB images. Unlike existing hand pose estimation methods, which typically train a deep network to regress hand model parameters from a single RGB image, we consider a more challenging problem setting in which we directly regress the absolute root poses of two hands with extended forearms at high resolution from an egocentric view. As existing datasets are either infeasible for egocentric viewpoints or lack background variation, we create a large-scale synthetic dataset with diverse scenarios and collect a real dataset from a calibrated multi-camera setup to verify our proposed multi-view image feature fusion strategy. To make the reconstruction physically plausible, we propose two strategies: (i) a coarse-to-fine spectral graph convolution decoder to smooth the meshes during upsampling and (ii) an optimisation-based refinement stage at inference to prevent self-penetration. Through extensive quantitative and qualitative evaluations, we show that our framework produces realistic two-hand reconstructions and demonstrate the generalisation of synthetic-trained models to real data, as well as real-time AR/VR applications. Comment: Accepted to ICCV 2023.
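
    A minimal sketch of what a spectral graph convolution on a mesh graph can look like, assuming a toy adjacency and a random weight matrix; it illustrates the low-frequency filtering idea behind a coarse-to-fine spectral decoder, not the paper's actual decoder architecture.

```python
import numpy as np

def spectral_graph_conv(vertex_feats, adjacency, weight, k=32):
    """Minimal spectral graph convolution sketch on a mesh graph.

    vertex_feats: (N, F_in) per-vertex features; adjacency: (N, N) binary mesh
    adjacency; weight: (F_in, F_out) learnable matrix (random here). Features
    are projected onto the k lowest-frequency Laplacian eigenvectors, mixed,
    and projected back, acting as a smooth, coarse-to-fine mesh filter.
    """
    deg = np.diag(adjacency.sum(axis=1))
    laplacian = deg - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)    # ascending eigenvalues
    basis = eigvecs[:, :k]                          # low-frequency basis
    spectral = basis.T @ vertex_feats               # to spectral domain
    filtered = spectral @ weight                    # per-mode feature mixing
    return basis @ filtered                         # back to vertex domain

rng = np.random.default_rng(0)
N, F_in, F_out = 100, 16, 8
A = np.zeros((N, N))
for i in range(N - 1):                              # toy chain "mesh"
    A[i, i + 1] = A[i + 1, i] = 1
out = spectral_graph_conv(rng.normal(size=(N, F_in)), A,
                          rng.normal(size=(F_in, F_out)), k=16)
print(out.shape)  # (100, 8)
```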

    Rethinking Range View Representation for LiDAR Segmentation

    LiDAR segmentation is crucial for autonomous driving perception. Recent trends favor point- or voxel-based methods, as they often yield better performance than the traditional range view representation. In this work, we unveil several key factors in building powerful range view models. We observe that the "many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments to effective learning from range view projections. We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing, that better handles the learning and processing of LiDAR point clouds from the range view. We further introduce a Scalable Training from Range view (STR) strategy that trains on arbitrarily low-resolution 2D range images while still maintaining satisfactory 3D segmentation accuracy. We show that, for the first time, a range view method is able to surpass its point, voxel, and multi-view fusion counterparts on competing LiDAR semantic and panoptic segmentation benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI. Comment: ICCV 2023; 24 pages, 10 figures, 14 tables; Webpage at https://ldkong.com/RangeFormer
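
    A sketch of the standard spherical range view projection the abstract refers to, with illustrative sensor parameters (roughly a 64-beam setup); it also makes the "many-to-one" pixel collisions concrete. This is the generic projection, not RangeFormer's pipeline.

```python
import numpy as np

def range_view_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud into an H x W range image.

    Points falling into the same pixel overwrite each other, which is the
    'many-to-one' mapping identified as an impediment for range view learning.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(z / depth)                    # elevation angle
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W               # image column
    v = (1.0 - (pitch - fov_down_r) / (fov_up_r - fov_down_r)) * H  # image row
    u = np.clip(np.floor(u), 0, W - 1).astype(int)
    v = np.clip(np.floor(v), 0, H - 1).astype(int)
    range_image = np.full((H, W), -1.0)             # -1 marks empty pixels
    order = np.argsort(-depth)                      # draw far points first...
    range_image[v[order], u[order]] = depth[order]  # ...so near points win collisions
    return range_image

pts = np.random.default_rng(0).uniform(-50, 50, size=(100000, 3))
print(range_view_projection(pts).shape)  # (64, 2048)
```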

    Multi-scale stamps for real-time classification of alert streams

    In recent years, automatic classifiers of image cutouts (also called "stamps") have proven to be key for fast supernova discovery. The upcoming Vera C. Rubin Observatory will distribute about ten million alerts with their respective stamps each night, which is expected to enable the discovery of approximately one million supernovae each year. A growing source of confusion for these classifiers is the presence of satellite glints, sequences of point-like sources produced by rotating satellites or debris. The currently planned Rubin stamps will be smaller than the typical separation between these point sources, so a larger field-of-view stamp could enable the automatic identification of these sources. However, the distribution of larger field-of-view stamps would be limited by network bandwidth restrictions. We evaluate the impact of using image stamps of different angular sizes and resolutions for the fast classification of events (AGNs, asteroids, bogus alerts, satellites, SNe, and variable stars), using available data from the Zwicky Transient Facility survey. We compare four scenarios: three with the same number of pixels (a small field of view at high resolution, a large field of view at low resolution, and a proposed multi-scale strategy) and a scenario with the full ZTF stamp, which has both a larger field of view and higher resolution. Our multi-scale proposal outperforms all other scenarios, with a macro F1-score of 87.39. We encourage Rubin and its Science Collaborations to consider the benefits of implementing multi-scale stamps as a possible update to the alert specification. Comment: Submitted to ApJ
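
    A small sketch of how a multi-scale stamp can be assembled: concentric crops with increasing field of view, each downsampled to the same pixel count so the bandwidth cost matches a single small stamp. The crop sizes and block-average downsampling are illustrative assumptions, not the Rubin/ZTF cutout specification or the paper's exact recipe.

```python
import numpy as np

def multi_scale_stamp(image, center, sizes=(64, 128, 256)):
    """Build a multi-scale stamp: concentric crops around `center`, each
    block-averaged down to the smallest crop size, then stacked as channels.
    Assumes every size is a multiple of the smallest one."""
    cy, cx = center
    out_size = sizes[0]
    channels = []
    for s in sizes:
        half = s // 2
        crop = image[cy - half:cy + half, cx - half:cx + half]
        f = s // out_size                                    # downsampling factor
        pooled = crop.reshape(out_size, f, out_size, f).mean(axis=(1, 3))
        channels.append(pooled)
    return np.stack(channels, axis=-1)                       # (out, out, n_scales)

img = np.random.default_rng(0).normal(size=(1024, 1024))
stamp = multi_scale_stamp(img, center=(512, 512))
print(stamp.shape)  # (64, 64, 3): same pixel budget per channel, growing FOV
```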

    Continual Adaptation of Semantic Segmentation using Complementary 2D-3D Data Representations

    Semantic segmentation networks are usually pre-trained once and not updated during deployment. As a consequence, misclassifications commonly occur if the distribution of the training data deviates from the one encountered during the robot's operation. We propose to mitigate this problem by adapting the neural network to the robot's environment during deployment, without any need for external supervision. Leveraging complementary data representations, we generate a supervision signal by probabilistically accumulating consecutive 2D semantic predictions in a volumetric 3D map. We then train the network on renderings of the accumulated semantic map, effectively resolving ambiguities and enforcing multi-view consistency through the 3D representation. In contrast to scene adaptation methods, we aim to retain the previously learned knowledge and therefore employ a continual learning strategy based on experience replay to adapt the network. Through extensive experimental evaluation, we show successful adaptation to real-world indoor scenes both on the ScanNet dataset and on in-house data recorded with an RGB-D sensor. Our method increases segmentation accuracy on average by 9.9% compared with the fixed pre-trained network, while retaining knowledge from the pre-training dataset. Comment: Accepted for IEEE Robotics and Automation Letters (RA-L), 2022.
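
    A minimal sketch of the probabilistic accumulation step described above: per-pixel class probabilities from consecutive frames are fused into a voxel grid as summed log-probabilities, and the per-voxel argmax serves as a multi-view-consistent pseudo-label. The class interface, voxel size, and fusion rule are simplified assumptions rather than the paper's implementation.

```python
import numpy as np
from collections import defaultdict

class SemanticVoxelMap:
    """Minimal sketch of probabilistic semantic fusion into a 3D voxel map."""

    def __init__(self, num_classes, voxel_size=0.05):
        self.num_classes = num_classes
        self.voxel_size = voxel_size
        self.log_probs = defaultdict(lambda: np.zeros(num_classes))

    def integrate(self, points, class_probs):
        """points: (N, 3) world coordinates; class_probs: (N, C) softmax output
        of the 2D network, back-projected to 3D (back-projection omitted here)."""
        keys = np.floor(points / self.voxel_size).astype(int)
        log_p = np.log(np.clip(class_probs, 1e-6, 1.0))
        for key, lp in zip(map(tuple, keys), log_p):
            self.log_probs[key] += lp          # accumulate evidence per voxel

    def pseudo_labels(self):
        """Per-voxel argmax: the fused label used as a supervision signal."""
        return {k: int(np.argmax(v)) for k, v in self.log_probs.items()}

rng = np.random.default_rng(0)
vmap = SemanticVoxelMap(num_classes=13)
pts = rng.uniform(0, 1, size=(500, 3))
probs = rng.dirichlet(np.ones(13), size=500)
vmap.integrate(pts, probs)
print(len(vmap.pseudo_labels()))
```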

    LineMarkNet: Line Landmark Detection for Valet Parking

    We aim for accurate and efficient line landmark detection for valet parking, a long-standing yet unsolved problem in autonomous driving. To this end, we present a deep line landmark detection system in which the modules are carefully designed to be lightweight. Specifically, we first empirically design four general line landmarks, including three physical lines and one novel mental line; these four line landmarks are effective for valet parking. We then develop a deep network (LineMarkNet) to detect line landmarks from surround-view cameras: via pre-calibrated homographies, we fuse context from the four separate cameras into a unified bird's-eye-view (BEV) space, combining the surround-view features with the BEV features. A multi-task decoder then detects multiple line landmarks, applying a center-based strategy for the object detection task and a graph transformer that enhances the vision transformer with hierarchical graph reasoning for the semantic segmentation task. Finally, we parameterize the detected line landmarks (e.g., in intercept-slope form), and a novel filtering backend incorporates temporal and multi-view consistency to achieve smooth and stable detection. Moreover, we annotate a large-scale dataset to validate our method. Experimental results show that our framework achieves improved performance compared with several line detection methods, and that the multi-task network runs line landmark detection in real time on the Qualcomm 820A platform while maintaining superior accuracy. Comment: 29 pages, 12 figures
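
    A small sketch of the intercept-slope parameterization mentioned above, paired with a hypothetical temporal filter (plain exponential smoothing) as a stand-in for the paper's filtering backend; the least-squares fit and smoothing constant are illustrative assumptions.

```python
import numpy as np

def fit_intercept_slope(points):
    """Fit y = m*x + b to detected line points with least squares; a simple
    intercept-slope parameterization of a line landmark."""
    x, y = points[:, 0], points[:, 1]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return m, b

class LineSmoother:
    """Hypothetical temporal filter: exponential smoothing of (slope, intercept)
    across frames, standing in for a backend that enforces temporal consistency."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha      # weight given to the newest measurement
        self.state = None

    def update(self, m, b):
        measurement = np.array([m, b])
        if self.state is None:
            self.state = measurement
        else:
            self.state = (1 - self.alpha) * self.state + self.alpha * measurement
        return tuple(self.state)

rng = np.random.default_rng(0)
smoother = LineSmoother()
for _ in range(5):              # five noisy frames of the same lane line
    x = np.linspace(0, 10, 50)
    y = 0.5 * x + 2.0 + rng.normal(0, 0.1, size=50)
    print(smoother.update(*fit_intercept_slope(np.stack([x, y], axis=1))))
```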