68 research outputs found
SelectionConv: Convolutional Neural Networks for Non-rectilinear Image Data
Convolutional Neural Networks have revolutionized vision applications. There
are image domains and representations, however, that cannot be handled by
standard CNNs (e.g., spherical images, superpixels). Such data are usually
processed using networks and algorithms specialized for each type. In this
work, we show that it may not always be necessary to use specialized neural
networks to operate on such spaces. Instead, we introduce a new structured
graph convolution operator that can copy 2D convolution weights, transferring
the capabilities of already trained traditional CNNs to our new graph network.
This network can then operate on any data that can be represented as a
positional graph. By converting non-rectilinear data to a graph, we can apply
these convolutions on these irregular image domains without requiring training
on large domain-specific datasets. Results of transferring pre-trained image
networks for segmentation, stylization, and depth prediction are demonstrated
for a variety of such data forms. Comment: To be presented at ECCV 202
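The key mechanism, a graph convolution whose edges carry "selection" labels identifying which kernel tap they correspond to, can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; the function names and the scalar (single-channel) features are assumptions.

```python
# Minimal sketch of a selection-based graph convolution that reuses 2D
# convolution weights (illustrative; not the paper's implementation).
# Each directed edge carries a "selection" label identifying which of the
# nine 3x3 kernel taps it corresponds to (8 neighbors + the center).

def selection_conv(features, edges, weights):
    """features: {node: float}; edges: list of (dst, src, selection);
    weights: {selection: float}, e.g. a flattened 3x3 kernel."""
    out = {n: 0.0 for n in features}
    for dst, src, sel in edges:
        out[dst] += weights[sel] * features[src]
    return out

def grid_edges(h, w):
    """Build selection-labeled edges for a regular h-by-w grid graph,
    where the selection is determined by the relative offset."""
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),  (0, 0),  (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    edges = []
    for i in range(h):
        for j in range(w):
            for sel, (di, dj) in enumerate(offsets):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    edges.append((i * w + j, ni * w + nj, sel))
    return edges
```

On a grid graph built this way the operator reduces exactly to a standard 2D convolution, which is why pretrained CNN kernels can be copied over without retraining; on non-rectilinear data only the edge construction changes.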
GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation
Recently, graph-based and Transformer-based deep learning networks have
demonstrated excellent performances on various point cloud tasks. Most of the
existing graph methods are based on static graphs, which take a fixed input to
establish graph relations. Moreover, many graph methods aggregate neighboring
features by maximization or averaging, so that either only a single neighboring
point affects the centroid's feature or all neighboring points influence it
equally, ignoring the correlations and differences between points. Most
Transformer-based methods extract point cloud
features with global attention and lack feature learning on local neighbors. To
address the limitations of both types of models, we propose a new feature
extraction block named Graph Transformer and construct a 3D point cloud
learning network called GTNet to learn features of point clouds on local
and global patterns. Graph Transformer integrates the advantages of graph-based
and Transformer-based methods, and consists of Local Transformer and Global
Transformer modules. The Local Transformer uses a dynamic graph to calculate
weights for all neighboring points via intra-domain cross-attention with
dynamically updated graph relations, so that every neighboring point can affect
the centroid's features with its own weight; the Global Transformer enlarges
the receptive field of the Local Transformer through global self-attention. In addition,
to avoid vanishing gradients as the network deepens, we apply residual
connections to the centroid features in GTNet; we also use the features of the
centroid and its neighbors to generate local geometric descriptors in the Local
Transformer, strengthening the model's ability to learn local information.
Finally, we apply GTNet to shape classification, part segmentation, and
semantic segmentation tasks.
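The core of the Local Transformer idea, softmax attention over a centroid's neighbors so that each neighbor contributes with its own weight rather than via max- or average-pooling, can be sketched as follows. The scoring rule (negative squared difference) is a stand-in for the learned query/key projections and is purely illustrative.

```python
import math

# Sketch of neighbor-wise attention around a centroid (illustrative;
# real models use learned projections and vector-valued features).
def local_attention(centroid, neighbors):
    """centroid: float feature; neighbors: list of float features.
    Returns (attention-weighted aggregate, attention weights)."""
    # placeholder similarity: closer features score higher
    scores = [-(centroid - n) ** 2 for n in neighbors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    attn = [e / z for e in exps]              # weights sum to 1
    return sum(a * n for a, n in zip(attn, neighbors)), attn
```

Unlike max-pooling (one neighbor wins) or averaging (all neighbors equal), each neighbor here receives a distinct, data-dependent weight, which is the property the abstract emphasizes.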
Variational Relational Point Completion Network for Robust 3D Classification
Real-scanned point clouds are often incomplete due to viewpoint, occlusion,
and noise, which hampers 3D geometric modeling and perception. Existing point
cloud completion methods tend to generate global shape skeletons and hence lack
fine local details. Furthermore, they mostly learn a deterministic
partial-to-complete mapping, but overlook structural relations in man-made
objects. To tackle these challenges, this paper proposes a variational
framework, Variational Relational point Completion Network (VRCNet) with two
appealing properties: 1) Probabilistic Modeling. In particular, we propose a
dual-path architecture to enable principled probabilistic modeling across
partial and complete clouds. One path consumes complete point clouds for
reconstruction by learning a point VAE. The other path generates complete
shapes for partial point clouds, whose embedded distribution is guided by the
distribution obtained from the reconstruction path during training. 2)
Relational Enhancement. Specifically, we carefully design a point self-attention
kernel and a point selective kernel module to exploit relational point features,
which refines local shape details conditioned on the coarse completion. In
addition, we contribute multi-view partial point cloud datasets (MVP and MVP-40
dataset) containing over 200,000 high-quality scans, which render partial 3D
shapes from 26 uniformly distributed camera poses for each 3D CAD model.
Extensive experiments demonstrate that VRCNet outperforms state-of-the-art
methods on all standard point cloud completion benchmarks. Notably, VRCNet
shows great generalizability and robustness on real-world point cloud scans.
Moreover, we can achieve robust 3D classification for partial point clouds with
the help of VRCNet, which substantially increases classification accuracy. Comment: 12 pages, 10 figures, accepted by PAMI. Project webpage:
https://mvp-dataset.github.io/. arXiv admin note: substantial text overlap
with arXiv:2104.1015
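The "guided by the distribution obtained from the reconstruction path" step in dual-path training is typically realized as a KL divergence between the two embedded Gaussians. A minimal sketch of that term, assuming diagonal Gaussians parameterized by means and log-variances (the exact loss in VRCNet may differ):

```python
import math

# KL(q || p) between diagonal Gaussians, summed over dimensions.
# In a dual-path setup, q would be the partial-cloud path's embedding
# and p the reconstruction path's embedding (an assumption for illustration).
def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq
                     + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                     - 1.0)
    return kl
```

Minimizing this term pulls the partial path's latent distribution toward the one learned from complete shapes, which is the guidance mechanism the abstract describes.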
IoTNet: An Efficient and Accurate Convolutional Neural Network for IoT Devices
Two main approaches exist when deploying a Convolutional Neural Network (CNN) on resource-constrained IoT devices: either scale a large model down or use a small model designed specifically for resource-constrained environments. Small architectures typically trade accuracy for computational cost by performing convolutions as depth-wise convolutions rather than standard convolutions as in large networks. Large models focus primarily on state-of-the-art performance and often struggle to scale down sufficiently. We propose a new model, namely IoTNet, designed for resource-constrained environments, which achieves state-of-the-art performance within the domain of small efficient models. IoTNet trades off accuracy against computational cost differently from existing methods by factorizing standard 3 × 3 convolutions into pairs of 1 × 3 and 3 × 1 standard convolutions, rather than performing depth-wise convolutions. We benchmark IoTNet against state-of-the-art efficiency-focused models and scaled-down large architectures on data sets which best match the complexity of problems faced in resource-constrained environments. We compare model accuracy and the number of floating-point operations (FLOPs) performed as a measure of efficiency. We report state-of-the-art accuracy improvement over MobileNetV2 on CIFAR-10 of 13.43% with 39% fewer FLOPs, over ShuffleNet on Street View House Numbers (SVHN) of 6.49% with 31.8% fewer FLOPs, and over MobileNet on German Traffic Sign Recognition Benchmark (GTSRB) of 5% with 0.38% fewer FLOPs.
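The cost saving from the factorization described above is easy to verify with a back-of-the-envelope multiply-accumulate (MAC) count; the helper names below are illustrative.

```python
# MAC count for a dense convolution with 'same' padding and stride 1:
# every output position does c_in * c_out * kh * kw multiply-accumulates.
def conv_macs(h, w, c_in, c_out, kh, kw):
    return h * w * c_in * c_out * kh * kw

def factorized_ratio(h, w, c):
    """Cost of a 1x3 + 3x1 pair relative to one 3x3 convolution."""
    full = conv_macs(h, w, c, c, 3, 3)
    pair = conv_macs(h, w, c, c, 1, 3) + conv_macs(h, w, c, c, 3, 1)
    return pair / full
```

Per channel pair the factorization replaces 9 taps with 3 + 3 = 6, so the ratio is 2/3 regardless of resolution or channel count; this is a different trade-off from depth-wise convolution, which removes the cross-channel term entirely.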
Attribute Artifacts Removal for Geometry-based Point Cloud Compression
Geometry-based point cloud compression (G-PCC) can achieve remarkable
compression efficiency for point clouds. However, it still leads to serious
attribute compression artifacts, especially under low bitrate scenarios. In
this paper, we propose a Multi-Scale Graph Attention Network (MS-GAT) to remove
the artifacts of point cloud attributes compressed by G-PCC. We first construct
a graph based on point cloud geometry coordinates and then use Chebyshev graph
convolutions to extract features of point cloud attributes. Considering
that one point may be correlated with points both near and far away from it, we
propose a multi-scale scheme to capture the short- and long-range correlations
between the current point and its neighboring and distant points. To address
the problem that various points may have different degrees of artifacts caused
by adaptive quantization, we introduce the quantization step per point as an
extra input to the proposed network. We also incorporate a weighted graph
attentional layer into the network to pay special attention to the points with
more attribute artifacts. To the best of our knowledge, this is the first
attribute artifacts removal method for G-PCC. We validate the effectiveness of
our method over various point clouds. Objective comparison results show that
our proposed method achieves an average of 9.74% BD-rate reduction compared
with Predlift and 10.13% BD-rate reduction compared with RAHT. Subjective
comparison results show that visual artifacts such as color shifting,
blurring, and quantization noise are reduced.
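The Chebyshev graph convolution mentioned above filters a graph signal with polynomials T_k of a scaled Laplacian, using the recurrence T_0 x = x, T_1 x = Lx, T_k x = 2L T_{k-1} x - T_{k-2} x. A minimal single-channel sketch (the coefficients stand in for learned parameters):

```python
def matvec(m, v):
    """Dense matrix-vector product over plain lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def cheb_conv(lap, x, theta):
    """lap: scaled graph Laplacian (list of rows); x: node signal;
    theta: K Chebyshev coefficients (placeholders for learned weights)."""
    t_prev, t_curr = x[:], matvec(lap, x)   # T_0 x and T_1 x
    out = [theta[0] * a for a in t_prev]
    if len(theta) > 1:
        out = [o + theta[1] * b for o, b in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k x = 2 L T_{k-1} x - T_{k-2} x
        t_next = [2 * a - b for a, b in zip(matvec(lap, t_curr), t_prev)]
        out = [o + theta[k] * c for o, c in zip(out, t_next)]
        t_prev, t_curr = t_next, t_curr[:]
        t_prev, t_curr = t_curr, t_next
    return out
```

Because T_k involves k hops of the Laplacian, order-K filters aggregate information from K-hop neighborhoods, which pairs naturally with the multi-scale scheme the abstract describes.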
Mix-GEMM: An efficient HW-SW architecture for mixed-precision quantized deep neural networks inference on edge devices
Deep Neural Network (DNN) inference based on quantized narrow-precision integer data represents a promising research direction toward efficient deep learning computations on edge and mobile devices. On the one hand, recent progress in Quantization-Aware Training (QAT) frameworks aimed at improving the accuracy of extremely quantized DNNs allows achieving results close to Floating-Point 32 (FP32), and provides high flexibility in the selection of data sizes. Unfortunately, current Central Processing Unit (CPU) architectures and Instruction Set Architectures (ISAs) targeting resource-constrained devices present limitations on the range of data sizes supported to compute DNN kernels. This paper presents Mix-GEMM, a hardware-software co-designed architecture capable of efficiently computing quantized DNN convolutional kernels based on byte and sub-byte data sizes. Mix-GEMM accelerates General Matrix Multiplication (GEMM), the core kernel of DNNs, supporting all data size combinations from 8- to 2-bit, including mixed-precision computations, and featuring performance that scales as the computational data sizes decrease. Our experimental evaluation, performed on representative quantized Convolutional Neural Networks (CNNs), shows that a RISC-V based edge System-on-Chip (SoC) integrating Mix-GEMM achieves up to 1.3 TOPS/W in energy efficiency and up to 13.6 GOPS in throughput, gaining from 5.3× to 15.1× in performance over the OpenBLAS GEMM framework running on a commercial RISC-V based edge processor.
By performing synthesis and Place and Route (PnR) of the enhanced SoC in Global Foundries 22nm FDX technology, we show that Mix-GEMM only accounts for 1% of the overall area consumption. This research was supported by the ERDF Operational Program of Catalonia 2014-2020, with a grant from the Spanish State Research Agency [PID2019-107255GB] and with DRAC project [001-P-001723], by the grant [PID2019-107255G-C21] funded by MCIN/AEI/10.13039/501100011033, by the Generalitat de Catalunya [2017-SGR-1328], and by Lenovo-BSC Contract-Framework (2020). The Spanish Ministry of Economy, Industry and Competitiveness has partially supported M. Doblas through an FPU fellowship [FPU20-04076] and M. Moreto through a Ramon y Cajal fellowship [RYC-2016-21104]. Peer Reviewed. Postprint (author's final draft)
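The arithmetic pattern Mix-GEMM accelerates in hardware, narrow-integer multiplication with wide accumulation followed by rescaling, can be illustrated in software. This is a conceptual sketch only; the bit widths, symmetric quantization scheme, and function names are assumptions, not the paper's design.

```python
# Conceptual sketch of a mixed-precision quantized dot product:
# e.g. 8-bit activations times 4-bit weights, integer accumulate,
# then dequantize with the two scales.
def quantize(xs, bits):
    """Symmetric linear quantization to signed integers of 'bits' width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax or 1.0  # avoid zero scale
    return [round(x / scale) for x in xs], scale

def int_dot(a, b):
    return sum(x * y for x, y in zip(a, b))  # integer-only accumulate

def mixed_precision_dot(acts, weights, act_bits=8, w_bits=4):
    qa, sa = quantize(acts, act_bits)
    qw, sw = quantize(weights, w_bits)
    return int_dot(qa, qw) * sa * sw  # dequantized approximate result
```

The inner loop touches only small integers, which is exactly the part that narrow-precision hardware can execute with higher throughput per watt; the floating-point scales appear once per dot product rather than once per element.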
Deep learning methods for 360 monocular depth estimation and point cloud semantic segmentation
Monocular depth estimation and point cloud segmentation are essential tasks for 3D scene understanding in computer vision. Depth estimation for omnidirectional images is challenging due to the spherical distortion issue and the limited availability of large-scale labeled datasets. We propose two separate works for 360 monocular depth estimation tasks. In the first work, we propose a novel, model-agnostic, two-stage pipeline for omnidirectional monocular depth estimation. Our proposed framework, PanoDepth, takes one 360 image as input, produces one or more synthesized views in the first stage, and feeds the original image and the synthesized images into the subsequent stereo matching stage. Utilizing explicit stereo-based geometric constraints, PanoDepth can generate dense, high-quality depth. In the second work, we propose a 360 monocular depth estimation pipeline, OmniFusion, to tackle the spherical distortion issue. Our pipeline transforms a 360 image into less-distorted perspective patches (i.e., tangent images) to obtain patch-wise predictions via a CNN, and then merges the patch-wise results into the final output. To handle the discrepancy between patch-wise predictions, a major issue affecting merging quality, we propose a new framework with (i) a geometry-aware feature fusion mechanism that combines 3D geometric features with 2D image features, (ii) a self-attention-based transformer architecture to conduct global aggregation of patch-wise information, and (iii) an iterative depth refinement mechanism to further refine the estimated depth based on the more accurate geometric features. Experiments show that both PanoDepth and OmniFusion achieve state-of-the-art performance on several 360 monocular depth estimation benchmark datasets. For point cloud analysis, we mainly focus on defining effective local point convolution operators. We propose two approaches, SPNet and Point-Voxel CNN, respectively.
For the former, we propose a novel point convolution operator named Shell Point Convolution (SPConv) as the building block for shape encoding and local context learning. Specifically, SPConv splits the 3D neighborhood space into shells, aggregates local features on manually designed kernel points, and performs convolution on the shells. For the latter, we present a novel lightweight convolutional neural network which uses the point voxel convolution (PVC) layer as its building block. Each PVC layer has two parallel branches, namely the voxel branch and the point branch. For the voxel branch, we aggregate local features on non-empty voxel centers to reduce the geometric information loss caused by voxelization, then apply volumetric convolutions to enhance local neighborhood geometry encoding. For the point branch, we use a Multi-Layer Perceptron (MLP) to extract fine-detailed point-wise features. Outputs from these two branches are adaptively fused via a feature selection module. Experimental results show that SPConv and PVC layers are effective in local shape encoding, and our proposed networks perform well in semantic segmentation tasks.
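The voxel-branch aggregation described above, collecting per-point features at non-empty voxel centers, can be sketched with a simple hash-based voxelization. The averaging and scalar features here are simplifying assumptions for illustration.

```python
# Sketch of a voxel-branch aggregation step: hash points into voxels and
# average their features at non-empty voxels only (illustrative; the
# actual PVC layer uses learned aggregation and vector features).
def voxelize(points, feats, voxel_size):
    """points: list of (x, y, z); feats: one float per point.
    Returns {voxel_index_triple: mean feature} over non-empty voxels."""
    sums, counts = {}, {}
    for (x, y, z), f in zip(points, feats):
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        sums[key] = sums.get(key, 0.0) + f
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

Because only occupied voxels appear in the output dictionary, memory and compute scale with the point cloud's occupancy rather than the full dense grid, which is the usual motivation for this design.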
Causes of Catastrophic Forgetting in Class-Incremental Semantic Segmentation
Class-incremental learning for semantic segmentation (CiSS) is presently a
highly researched field which aims at updating a semantic segmentation model by
sequentially learning new semantic classes. A major challenge in CiSS is
overcoming the effects of catastrophic forgetting, which describes the sudden
drop of accuracy on previously learned classes after the model is trained on a
new set of classes. Despite the latest advances in mitigating catastrophic
forgetting, the underlying causes of forgetting specifically in CiSS are not
well understood. Therefore, in a set of experiments and representational
analyses, we demonstrate that the semantic shift of the background class and a
bias towards new classes are the major causes of forgetting in CiSS.
Furthermore, we show that both causes mostly manifest themselves in deeper
classification layers of the network, while the early layers of the model are
not affected. Finally, we demonstrate how both causes are effectively mitigated
utilizing the information contained in the background, with the help of
knowledge distillation and an unbiased cross-entropy loss. Comment: currently under review
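The unbiased cross-entropy idea addresses the background semantic shift directly: pixels labeled "background" in the new data may actually contain new-task objects, so the loss should not penalize probability mass placed on the newly added classes. A simplified, illustrative formulation (the exact loss used in the paper may differ):

```python
import math

def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Sketch of an unbiased cross-entropy: when the ground-truth label is
# background, treat it as "background OR any new class", so mass on new
# classes is not penalized (a simplified illustration of the idea).
def unbiased_ce(logits, label, new_classes, bg=0):
    probs = softmax(logits)
    if label == bg:
        p = probs[bg] + sum(probs[c] for c in new_classes)
    else:
        p = probs[label]
    return -math.log(p)
```

Compared with a standard cross-entropy, this removes the bias that pushes all background pixels away from the new classes, which the abstract identifies as one of the two major causes of forgetting.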
Geometric Feature Learning for 3D Meshes
Geometric feature learning for 3D meshes is central to computer graphics and
highly important for numerous vision applications. However, deep learning
currently lags in hierarchical modeling of heterogeneous 3D meshes due to the
lack of required operations and/or their efficient implementations. In this
paper, we propose a series of modular operations for effective geometric deep
learning over heterogeneous 3D meshes. These operations include mesh
convolutions, (un)pooling and efficient mesh decimation. We provide open source
implementation of these operations, collectively termed \textit{Picasso}. The
mesh decimation module of Picasso is GPU-accelerated, which can process a batch
of meshes on-the-fly for deep learning. Our (un)pooling operations compute
features for newly-created neurons across network layers of varying resolution.
Our mesh convolutions include facet2vertex, vertex2facet, and facet2facet
convolutions that exploit vMF mixture and Barycentric interpolation to
incorporate fuzzy modelling. Leveraging the modular operations of Picasso, we
contribute a novel hierarchical neural network, PicassoNet-II, to learn highly
discriminative features from 3D meshes. PicassoNet-II accepts primitive
geometrics and fine textures of mesh facets as input features, while processing
full scene meshes. Our network achieves highly competitive performance for
shape analysis and scene parsing on a variety of benchmarks. We release Picasso
and PicassoNet-II on GitHub: https://github.com/EnyaHermite/Picasso. Comment: Submitted to TPAMI
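The facet2vertex convolution named above moves features from mesh facets to their incident vertices. A bare-bones sketch using plain averaging (the paper's version uses vMF-mixture weighting for fuzzy modelling, which this placeholder omits):

```python
# Sketch of a facet2vertex step: each vertex gathers the features of its
# incident triangular facets by averaging (illustrative placeholder for
# the vMF-mixture-weighted version described in the abstract).
def facet2vertex(facets, facet_feats, num_vertices):
    """facets: list of (v0, v1, v2) vertex-index triples;
    facet_feats: one float per facet."""
    sums = [0.0] * num_vertices
    counts = [0] * num_vertices
    for tri, f in zip(facets, facet_feats):
        for v in tri:
            sums[v] += f
            counts[v] += 1
    # vertices incident to no facet keep a zero feature
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

The companion vertex2facet operation runs the same incidence map in the opposite direction, and alternating the two gives the heterogeneous-mesh analogue of stacked image convolutions.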