RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints
We propose a Convolutional Neural Network (CNN)-based model "RotationNet,"
which takes multi-view images of an object as input and jointly estimates its
pose and object category. Unlike previous approaches that use known viewpoint
labels for training, our method treats the viewpoint labels as latent
variables, which are learned in an unsupervised manner during the training
using an unaligned object dataset. RotationNet is designed to use only a
partial set of multi-view images for inference, and this property makes it
useful in practical scenarios where only partial views are available. Moreover,
our pose alignment strategy enables one to obtain view-specific feature
representations shared across classes, which is important to maintain high
accuracy in both object categorization and pose estimation. Effectiveness of
RotationNet is demonstrated by its superior performance to the state-of-the-art
methods of 3D object classification on 10- and 40-class ModelNet datasets. We
also show that RotationNet, even trained without known poses, achieves the
state-of-the-art performance on an object pose estimation dataset. The code is
available at https://github.com/kanezaki/rotationnet
Comment: 24 pages, 23 figures. Accepted to CVPR 2018
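To make the aggregation step concrete, here is a minimal NumPy sketch (hypothetical function and variable names, not the released code, and omitting details such as the "incorrect view" class) of how per-view class probabilities can be combined over candidate viewpoint assignments, treating the alignment between observed images and viewpoints as a latent variable chosen to maximize the joint score.

```python
# Hedged sketch of RotationNet-style multi-view aggregation (not the authors' code).
# Assumes a circular M-viewpoint camera setup and a per-view CNN that, for each of
# the M candidate viewpoint labels, outputs a softmax over N object categories.
import numpy as np

def aggregate_views(per_view_probs):
    """per_view_probs: (V, M, N) array -- V observed images, M candidate viewpoint
    labels, N categories. Returns (best_class, best_offset): the category and the
    cyclic pose offset that jointly maximize the product of per-view probabilities."""
    V, M, N = per_view_probs.shape
    log_p = np.log(per_view_probs + 1e-12)
    best = (-np.inf, None, None)
    # Try every cyclic alignment of the V observed images to the M viewpoints.
    for offset in range(M):
        view_ids = (np.arange(V) + offset) % M      # latent viewpoint assignment
        score_per_class = log_p[np.arange(V), view_ids, :].sum(axis=0)
        c = int(score_per_class.argmax())
        if score_per_class[c] > best[0]:
            best = (score_per_class[c], c, offset)
    return best[1], best[2]

# Example: 3 observed views, 12 candidate viewpoints, 40 classes (ModelNet40-like).
probs = np.random.dirichlet(np.ones(40), size=(3, 12))
print(aggregate_views(probs))
```

Because the search runs over alignments rather than requiring all M views, the same aggregation also works when only a partial set of views is available.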
A multi-agent adaptive protocol for femto-satellite applications
Femto-satellites are a very promising category of satellites that weigh less than 100 grams. A Pico-Rover, in turn, is a self-contained robot that weighs less than 1 kilogram and moves by rolling its external enclosure, which shields it from environmental threats.
The main advantage of these small agents is the multiple points of view they provide when working as a swarm or as part of a larger constellation. The complexity of such sensor networks, combined with their low-power and small-size requirements, calls for the management strategy we present in this work.
The management-on-agent paradigm consists of a single high-quality point of view and multiple low-quality points of view, where the selection of the active point of view is performed inside the network but decided externally to it or by a basic rule. This approach optimizes the bandwidth used by the network: instead of streaming every high-quality point of view, only one of them is streamed. At the same time, it enables a task distribution across the network in which there is exactly one producer agent and one consumer agent, while the remaining agents act as relay nodes.
This work addresses, on the one hand, the design of a simple yet robust and adaptive protocol based on this paradigm and, on the other, its implementation on a low-performance platform such as the 8051 microcontroller architecture.
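As a rough illustration of the producer/consumer/relay split described above, the Python sketch below (hypothetical message format and field names, unrelated to the actual 8051 implementation) shows one selected agent streaming its high-quality view to a single consumer while every other agent simply forwards frames.

```python
# Minimal sketch of the role split in the management-on-agent paradigm
# (hypothetical frame layout; not the actual protocol).
from dataclasses import dataclass

@dataclass
class Frame:
    src: int          # producing agent id
    dst: int          # consuming agent id
    hop_limit: int    # simple loop protection
    payload: bytes    # one high-quality "point of view" sample

class Node:
    def __init__(self, node_id, producer_id, consumer_id):
        self.id = node_id
        self.producer_id = producer_id   # selected externally or by a basic rule
        self.consumer_id = consumer_id

    def produce(self, sensor_read, forward):
        # Only the selected agent streams its high-quality view.
        if self.id == self.producer_id:
            forward(Frame(self.id, self.consumer_id, hop_limit=8,
                          payload=sensor_read()))

    def handle(self, frame, deliver, forward):
        if self.id == frame.dst:
            deliver(frame.payload)       # the consumer hands the data off
        elif frame.hop_limit > 0:
            frame.hop_limit -= 1
            forward(frame)               # every other agent acts as a relay

# Example wiring with in-memory callbacks:
inbox = []
producer = Node(node_id=3, producer_id=3, consumer_id=0)
producer.produce(sensor_read=lambda: b"frame-0", forward=inbox.append)
```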
Wireless Network Coding with Local Network Views: Coded Layer Scheduling
One of the fundamental challenges in the design of distributed wireless
networks is the large dynamic range of network state. Since continuous tracking
of global network state at all nodes is practically impossible, nodes can only
acquire limited local views of the whole network to design their transmission
strategies. In this paper, we study multi-layer wireless networks and assume
that each node has only limited knowledge, namely a 1-local view, in which each
S-D pair has enough information to perform optimally when other pairs do not
interfere, along with connectivity information for the rest of the network. We
investigate the information-theoretic limits of communication with such limited
knowledge at the nodes. We develop a novel transmission strategy, namely Coded
Layer Scheduling, that solely relies on 1-local view at the nodes and
incorporates three different techniques: (1) per layer interference avoidance,
(2) repetition coding to allow overhearing of the interference, and (3) network
coding to allow interference neutralization. We show that our proposed scheme
can provide a significant throughput gain compared with the conventional
interference avoidance strategies. Furthermore, we show that our strategy
maximizes the achievable normalized sum-rate for some classes of networks,
hence, characterizing the normalized sum-capacity of those networks with
1-local view.
Comment: Technical report. A paper based on the results of this report will appear
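To illustrate the third ingredient, the toy example below shows interference neutralization via XOR network coding: a relay sends the XOR of two packets, and each destination cancels the packet it already overheard through repetition. This is only a schematic illustration of the principle, not the paper's achievability scheme.

```python
# Toy illustration of network coding for interference neutralization:
# one coded relay transmission serves two destinations, each of which cancels
# the packet it already overheard in an earlier (repeated) slot.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

pkt_1 = b"data for destination 1"
pkt_2 = b"data for destination 2"

relay_out = xor(pkt_1, pkt_2)          # single coded transmission from the relay

# Destination 1 overheard pkt_2 earlier, so it can neutralize the interference:
assert xor(relay_out, pkt_2) == pkt_1
# Symmetrically for destination 2:
assert xor(relay_out, pkt_1) == pkt_2
```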
Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images
We propose a novel transformer-based framework that reconstructs two
high-fidelity hands from multi-view RGB images. Unlike existing hand pose
estimation methods, which typically train a deep network to regress hand model
parameters from a single RGB image, we consider a more challenging problem
setting in which we directly regress the absolute root poses of two hands with
extended forearms at high resolution from an egocentric view. As existing datasets
are either infeasible for egocentric viewpoints or lack background variations,
we create a large-scale synthetic dataset with diverse scenarios and collect a
real dataset from a calibrated multi-camera setup to verify our proposed
multi-view image feature fusion strategy. To make the reconstruction physically
plausible, we propose two strategies: (i) a coarse-to-fine spectral graph
convolution decoder to smoothen the meshes during upsampling and (ii) an
optimisation-based refinement stage at inference to prevent self-penetrations.
Through extensive quantitative and qualitative evaluations, we show that our
framework is able to produce realistic two-hand reconstructions and demonstrate
the generalisation of synthetic-trained models to real data, as well as
real-time AR/VR applications.
Comment: Accepted to ICCV 2023
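As a rough sketch of the spectral idea behind strategy (i), the NumPy snippet below (hypothetical parameters, not the paper's decoder) smooths vertex positions by projecting them onto the low-frequency eigenvectors of the mesh graph Laplacian, which is the classical way a spectral graph filter suppresses high-frequency surface noise.

```python
# Minimal sketch of spectral smoothing on a mesh graph (illustrative only).
import numpy as np

def normalized_laplacian(adj):
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt

def spectral_smooth(vertices, adj, keep=32):
    """Project vertex coordinates onto the `keep` lowest-frequency Laplacian
    eigenvectors and reconstruct, discarding high-frequency detail."""
    lam, U = np.linalg.eigh(normalized_laplacian(adj))   # ascending eigenvalues
    basis = U[:, :keep]                                  # low-frequency basis
    return basis @ (basis.T @ vertices)

# Example on a random graph standing in for a coarse hand mesh:
n = 100
adj = (np.random.rand(n, n) < 0.05).astype(float)
adj = np.maximum(adj, adj.T); np.fill_diagonal(adj, 0)
verts = np.random.randn(n, 3)
print(spectral_smooth(verts, adj, keep=16).shape)        # (100, 3)
```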
Rethinking Range View Representation for LiDAR Segmentation
LiDAR segmentation is crucial for autonomous driving perception. Recent
trends favor point- or voxel-based methods as they often yield better
performance than the traditional range view representation. In this work, we
unveil several key factors in building powerful range view models. We observe
that the "many-to-one" mapping, semantic incoherence, and shape deformation are
possible impediments against effective learning from range view projections. We
present RangeFormer -- a full-cycle framework comprising novel designs across
network architecture, data augmentation, and post-processing -- that better
handles the learning and processing of LiDAR point clouds from the range view.
We further introduce a Scalable Training from Range view (STR) strategy that
trains on arbitrary low-resolution 2D range images, while still maintaining
satisfactory 3D segmentation accuracy. We show that, for the first time, a
range view method is able to surpass point-based, voxel-based, and multi-view
fusion counterparts on competitive LiDAR semantic and panoptic segmentation
benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI.
Comment: ICCV 2023; 24 pages, 10 figures, 14 tables; Webpage at https://ldkong.com/RangeForme
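For readers unfamiliar with the representation, the NumPy sketch below shows a standard spherical projection of a LiDAR point cloud onto a range image (resolution and field-of-view values are illustrative, not the paper's settings). The "many-to-one" mapping mentioned above arises when several points fall into the same cell.

```python
# Minimal sketch of a spherical (range view) projection of a point cloud.
import numpy as np

def range_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) xyz. Returns an (H, W) range image (0 where empty)."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up - fov_down
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-12
    yaw = np.arctan2(y, x)                      # [-pi, pi]
    pitch = np.arcsin(np.clip(z / r, -1, 1))
    u = ((1.0 - (yaw + np.pi) / (2 * np.pi)) * W).astype(int) % W
    v = ((fov_up - pitch) / fov * H).clip(0, H - 1).astype(int)
    img = np.zeros((H, W), dtype=np.float32)
    img[v, u] = r                               # later points overwrite earlier ones
    return img

pts = np.random.randn(10000, 3) * 10
print(range_projection(pts).shape)              # (64, 2048)
```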
Multi-scale stamps for real-time classification of alert streams
In recent years, automatic classifiers of image cutouts (also called
"stamps") have been shown to be key for fast supernova discovery. The upcoming
Vera C. Rubin Observatory will distribute about ten million alerts with their
respective stamps each night, which is expected to enable the discovery of
approximately one million supernovae each year. A growing source of confusion
for these classifiers is the presence of satellite glints, sequences of
point-like sources produced by rotating satellites or debris. The currently
planned Rubin stamps will have a size smaller than the typical separation
between these point sources. Thus, a larger field of view image stamp could
enable the automatic identification of these sources. However, the distribution
of larger field of view stamps would be limited by network bandwidth
restrictions. We evaluate the impact of using image stamps of different angular
sizes and resolutions for the fast classification of events (AGNs, asteroids,
bogus, satellites, SNe, and variable stars), using available data from the
Zwicky Transient Facility survey. We compare four scenarios: three with the
same number of pixels (small field of view with high resolution, large field of
view with low resolution, and a proposed multi-scale strategy) and a scenario
with the full ZTF stamp that has a larger field of view and higher resolution.
Our multi-scale proposal outperforms all other scenarios, achieving a macro
F1-score of 87.39. We encourage Rubin and its Science Collaborations to consider the
benefits of implementing multi-scale stamps as a possible update to the alert
specification.
Comment: Submitted to ApJ
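One plausible way to assemble such a multi-scale stamp is sketched below: a small central cutout at native resolution plus a wider, block-averaged cutout reduced to the same pixel count. The sizes are illustrative assumptions, not the specification evaluated in the paper.

```python
# Hedged sketch of a two-channel multi-scale stamp (illustrative sizes only).
import numpy as np

def multi_scale_stamp(full_image, center, small=63, large=189):
    """full_image: 2D array. Returns two (small, small) channels: a native-resolution
    central crop and a block-averaged wide-field crop (large should be small * k)."""
    cy, cx = center
    half_s, half_l = small // 2, large // 2
    crop_hi = full_image[cy - half_s: cy + half_s + 1, cx - half_s: cx + half_s + 1]
    crop_lo = full_image[cy - half_l: cy + half_l + 1, cx - half_l: cx + half_l + 1]
    k = large // small
    crop_lo = crop_lo[:small * k, :small * k].reshape(small, k, small, k).mean(axis=(1, 3))
    return np.stack([crop_hi, crop_lo])          # (2, small, small)

img = np.random.randn(512, 512)
print(multi_scale_stamp(img, center=(256, 256)).shape)   # (2, 63, 63)
```

Both channels carry the same number of pixels, so the wide-field context comes at no extra bandwidth cost relative to a single high-resolution stamp of that size.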
Continual Adaptation of Semantic Segmentation using Complementary 2D-3D Data Representations
Semantic segmentation networks are usually pre-trained once and not updated
during deployment. As a consequence, misclassifications commonly occur if the
distribution of the training data deviates from the one encountered during the
robot's operation. We propose to mitigate this problem by adapting the neural
network to the robot's environment during deployment, without any need for
external supervision. Leveraging complementary data representations, we
generate a supervision signal, by probabilistically accumulating consecutive 2D
semantic predictions in a volumetric 3D map. We then train the network on
renderings of the accumulated semantic map, effectively resolving ambiguities
and enforcing multi-view consistency through the 3D representation. In contrast
to scene adaptation methods, we aim to retain the previously-learned knowledge,
and therefore employ a continual learning experience replay strategy to adapt
the network. Through extensive experimental evaluation, we show successful
adaptation to real-world indoor scenes both on the ScanNet dataset and on
in-house data recorded with an RGB-D sensor. Our method increases the
segmentation accuracy on average by 9.9% compared to the fixed pre-trained
neural network, while retaining knowledge from the pre-training dataset.
Comment: Accepted for IEEE Robotics and Automation Letters (RA-L 2022)
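A minimal sketch of the accumulation step, assuming per-point class probabilities back-projected into world coordinates and a simple voxel hash map (hypothetical data layout, not the authors' mapping pipeline), might look like this.

```python
# Sketch: probabilistically accumulate per-frame semantic predictions in voxels
# and read back multi-view-consistent pseudo-labels for self-supervision.
import numpy as np
from collections import defaultdict

class SemanticVoxelMap:
    def __init__(self, num_classes, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.log_probs = defaultdict(lambda: np.zeros(num_classes))

    def integrate(self, points_world, class_probs):
        """points_world: (N, 3); class_probs: (N, C) per-point softmax from the
        current 2D prediction, back-projected with depth and the camera pose."""
        keys = np.floor(points_world / self.voxel_size).astype(int)
        for key, p in zip(map(tuple, keys), np.log(class_probs + 1e-12)):
            self.log_probs[key] += p          # fuse views as independent observations

    def pseudo_label(self, points_world):
        keys = np.floor(points_world / self.voxel_size).astype(int)
        return np.array([int(np.argmax(self.log_probs[tuple(k)])) for k in keys])

vmap = SemanticVoxelMap(num_classes=5)
pts = np.random.rand(100, 3)
vmap.integrate(pts, np.random.dirichlet(np.ones(5), size=100))
print(vmap.pseudo_label(pts[:10]))
```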
LineMarkNet: Line Landmark Detection for Valet Parking
We aim for accurate and efficient line landmark detection for valet parking,
which is a long-standing yet unsolved problem in autonomous driving. To this
end, we present a deep line landmark detection system where we carefully design
the modules to be lightweight. Specifically, we first empirically design four
general line landmarks including three physical lines and one novel mental
line. The four line landmarks are effective for valet parking. We then develop
a deep network (LineMarkNet) to detect line landmarks from surround-view
cameras. Via the pre-calibrated homographies, we fuse context from the four
separate cameras into a unified bird's-eye-view (BEV) space; specifically, we
fuse the surround-view features with the BEV features and then employ a
multi-task decoder to detect multiple line landmarks, applying a center-based
strategy for the object detection task and designing a graph transformer that
enhances the vision transformer with hierarchical graph reasoning for the
semantic segmentation task. Finally, we parameterize the detected line landmarks
(e.g., in intercept-slope form), whereby a novel filtering backend incorporates
temporal and multi-view consistency to achieve smooth and stable detection.
Moreover, we annotate a large-scale dataset to validate our method.
Experimental results show that our deep line landmark detection system
outperforms several line detection methods and that the multi-task network runs
in real time on the Qualcomm 820A platform while maintaining superior accuracy.
Comment: 29 pages, 12 figures
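As a rough sketch of the homography-based fusion into BEV space (hypothetical calibration matrices and grid size; OpenCV is used only for the perspective warp, not for the paper's feature-level fusion):

```python
# Hedged sketch: warp surround-view camera frames into a single BEV grid using
# pre-calibrated homographies, averaging overlapping regions.
import numpy as np
import cv2

def fuse_to_bev(images, homographies, bev_size=(512, 512)):
    """images: list of HxWx3 camera frames; homographies: list of 3x3 matrices
    mapping image pixels to BEV pixels. Overlaps are averaged."""
    acc = np.zeros((*bev_size, 3), dtype=np.float32)
    weight = np.zeros(bev_size, dtype=np.float32)
    for img, H in zip(images, homographies):
        warped = cv2.warpPerspective(img.astype(np.float32), H, bev_size[::-1])
        mask = cv2.warpPerspective(np.ones(img.shape[:2], np.float32), H, bev_size[::-1])
        acc += warped * mask[..., None]
        weight += mask
    return acc / np.maximum(weight[..., None], 1e-6)

cams = [np.random.rand(480, 640, 3).astype(np.float32) for _ in range(4)]
Hs = [np.eye(3) for _ in range(4)]          # placeholders for real calibration
print(fuse_to_bev(cams, Hs).shape)          # (512, 512, 3)
```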