Masked Supervised Learning for Semantic Segmentation
Self-attention is of vital importance in semantic segmentation as it enables
modeling of long-range context, which translates into improved performance. We
argue that it is equally important to model short-range context, especially to
tackle cases where the regions of interest are small and ambiguous and where
the semantic classes are imbalanced. To this
end, we propose Masked Supervised Learning (MaskSup), an effective single-stage
learning paradigm that models both short- and long-range context, capturing the
contextual relationships between pixels via random masking. Experimental
results demonstrate the competitive performance of MaskSup against strong
baselines in both binary and multi-class segmentation tasks on three standard
benchmark datasets, particularly at handling ambiguous regions and retaining
better segmentation of minority classes with no added inference cost. In
addition to segmenting target regions even when large portions of the input are
masked, MaskSup is also generic and can be easily integrated into a variety of
semantic segmentation methods. We also show that the proposed method is
computationally efficient, improving the mean intersection-over-union (mIoU)
by 10% while requiring fewer learnable parameters.
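The random masking mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the mask is drawn, so the patch-based scheme below, the function name `random_patch_mask`, and its parameters `patch_size` and `mask_ratio` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def random_patch_mask(image, patch_size=4, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    image: array of shape (H, W) or (H, W, C); H and W must be
    divisible by patch_size (an assumption of this sketch).
    Returns the masked copy and a boolean (H, W) array that is True
    where pixels were kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    n_hidden = int(round(mask_ratio * n_patches))

    # Pick the patches to hide, without replacement.
    hidden = np.zeros(n_patches, dtype=bool)
    hidden[rng.choice(n_patches, size=n_hidden, replace=False)] = True

    # Expand the patch-level grid to pixel resolution.
    hidden = hidden.reshape(grid_h, grid_w)
    hidden = hidden.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    keep = ~hidden

    masked = image * (keep[..., None] if image.ndim == 3 else keep)
    return masked, keep
```

In a two-branch setup, the masked copy would be fed to the network alongside the original image during training only, which is consistent with the abstract's claim of no added inference cost.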
Iterative Graph Filtering Network for 3D Human Pose Estimation
Graph convolutional networks (GCNs) have proven to be an effective approach
for 3D human pose estimation. By naturally modeling the skeleton structure of
the human body as a graph, GCNs are able to capture the spatial relationships
between joints and learn an efficient representation of the underlying pose.
However, most GCN-based methods use a shared weight matrix, making it
challenging to accurately capture the different and complex relationships
between joints. In this paper, we introduce an iterative graph filtering
framework for 3D human pose estimation, which aims to predict the 3D joint
positions given a set of 2D joint locations in images. Our approach builds upon
the idea of iteratively solving graph filtering with Laplacian regularization
via the Gauss-Seidel iterative method. Motivated by this iterative solution, we
design a Gauss-Seidel network (GS-Net) architecture, which makes use of weight
and adjacency modulation, skip connection, and a pure convolutional block with
layer normalization. Adjacency modulation facilitates the learning of edges
that go beyond the inherent connections of body joints, resulting in an
adjusted graph structure that reflects the human skeleton, while skip
connections help maintain crucial information from the input layer's initial
features as the network depth increases. We evaluate our proposed model on two
standard benchmark datasets, and compare it with a comprehensive set of strong
baseline methods for 3D human pose estimation. Our experimental results
demonstrate that our approach outperforms the baseline methods on both
datasets, achieving state-of-the-art performance. Furthermore, we conduct
ablation studies to analyze the contributions of different components of our
model architecture and show that the skip connection and adjacency modulation
improve model performance.
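The Gauss-Seidel iteration that motivates the architecture can be sketched numerically. The sketch assumes the common formulation of graph filtering with Laplacian regularization, in which minimizing ||H - X||^2 + alpha * tr(H^T L H) leads to the linear system (I + alpha L) H = X. The combinatorial Laplacian L = D - A, the function name, and the parameter `alpha` are assumptions; the paper may use a different normalization, and the learnable weights, adjacency modulation, and nonlinearities of GS-Net are omitted.

```python
import numpy as np


def gauss_seidel_graph_filter(adj, x, alpha=1.0, n_iters=50):
    """Solve (I + alpha * L) H = X by Gauss-Seidel sweeps.

    adj: (n, n) symmetric adjacency matrix.
    x:   (n, d) input node features, e.g. 2D joint coordinates.
    """
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj          # assumed: L = D - A
    system = np.eye(adj.shape[0]) + alpha * lap   # strictly diagonally dominant

    h = x.copy()
    for _ in range(n_iters):
        for i in range(system.shape[0]):
            # Rows of h updated earlier in this sweep are reused at once;
            # that reuse is what separates Gauss-Seidel from Jacobi.
            off_diag = system[i] @ h - system[i, i] * h[i]
            h[i] = (x[i] - off_diag) / system[i, i]
    return h
```

Because the system matrix is strictly diagonally dominant for any alpha > 0, the sweeps converge to the exact filtered signal; a network built on this idea would instead unroll a small number of such updates as layers.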
Graph Fairing Convolutional Networks for Anomaly Detection
Graph convolution is a fundamental building block for many deep neural
networks on graph-structured data. In this paper, we introduce a simple, yet
very effective graph convolutional network with skip connections for
semi-supervised anomaly detection. The proposed layerwise propagation rule of
our model is theoretically motivated by the concept of implicit fairing in
geometry processing, and comprises a graph convolution module for aggregating
information from immediate node neighbors and a skip connection module for
combining layer-wise neighborhood representations. This propagation rule is
derived from the iterative solution of the implicit fairing equation via the
Jacobi method. In addition to capturing information from distant graph nodes
through skip connections between the network's layers, our approach exploits
both the graph structure and node features for learning discriminative node
representations. These skip connections are integrated by design in our
proposed network architecture. The effectiveness of our model is demonstrated
through extensive experiments on five benchmark datasets, achieving better or
comparable anomaly detection results against strong baseline methods. We also
demonstrate through an ablation study that skip connections help improve
model performance.
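The link between the Jacobi method and a propagation rule with skip connections can be made concrete with a small numerical sketch. It assumes the implicit fairing equation takes the form (I + s L) H = X with the normalized Laplacian L = I - D^{-1/2} A D^{-1/2} and a graph without self-loops or isolated nodes; the function name and the parameter `s` are hypothetical, and the trainable weights and activation of the actual network are left out.

```python
import numpy as np


def jacobi_implicit_fairing(adj, x, s=1.0, n_iters=100):
    """Solve (I + s * L) H = X by Jacobi iteration, L = I - A_hat.

    adj: (n, n) symmetric adjacency matrix, no self-loops, no isolated nodes.
    x:   (n, d) node feature matrix.
    """
    adj = np.asarray(adj, dtype=float)
    x = np.asarray(x, dtype=float)
    deg = adj.sum(axis=1)
    a_hat = adj / np.sqrt(np.outer(deg, deg))  # D^{-1/2} A D^{-1/2}

    # The system matrix is (1 + s) I - s A_hat, whose diagonal is (1 + s).
    # Each Jacobi step therefore mixes aggregated neighbor features
    # (a_hat @ h) with the original input x, which plays the role of
    # a skip connection.
    h = x.copy()
    for _ in range(n_iters):
        h = (s * a_hat @ h + x) / (1.0 + s)
    return h
```

Each update contracts the error by at most s / (1 + s), so the iteration converges for any s > 0.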
A Graph Encoder-Decoder Network for Unsupervised Anomaly Detection
A key component of many graph neural networks (GNNs) is the pooling
operation, which seeks to reduce the size of a graph while preserving important
structural information. However, most existing graph pooling strategies rely on
an assignment matrix obtained by employing a GNN layer, which is characterized
by trainable parameters, often leading to significant computational complexity
and a lack of interpretability in the pooling process. In this paper, we
propose an unsupervised graph encoder-decoder model to detect abnormal nodes
from graphs by learning an anomaly scoring function to rank nodes based on
their degree of abnormality. In the encoding stage, we design a novel pooling
mechanism, named LCPool, which leverages locality-constrained linear coding for
feature encoding to find a cluster assignment matrix by solving a least-squares
optimization problem with a locality regularization term. By enforcing locality
constraints during the coding process, LCPool is free from learnable
parameters, handles large graphs efficiently, and generates a coarser graph
representation while retaining the most significant structural
characteristics of the graph. In the decoding stage, we
propose an unpooling operation, called LCUnpool, to reconstruct both the
structure and nodal features of the original graph. We conduct empirical
evaluations of our method on six benchmark datasets using several evaluation
metrics, and the results demonstrate its superiority over state-of-the-art
anomaly detection approaches.
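The coding step behind the assignment matrix can be sketched with the standard closed-form solution of locality-constrained linear coding: a regularized least-squares problem per node, with codes constrained to sum to one. How LCPool chooses its codebook and turns the codes into a pooled graph is not stated in the abstract, so the `codebook` argument, the function name, and the parameters `lam` and `sigma` are assumptions for illustration only.

```python
import numpy as np


def llc_codes(x, codebook, lam=1e-2, sigma=1.0):
    """Locality-constrained linear coding, closed-form solution.

    x:        (n, d) node features to encode.
    codebook: (m, d) basis vectors (assumed given, e.g. cluster centers).
    Returns an (n, m) matrix whose rows sum to one.
    """
    x = np.asarray(x, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    n, m = x.shape[0], codebook.shape[0]
    codes = np.zeros((n, m))

    for i in range(n):
        diff = codebook - x[i]                    # (m, d) shifted basis
        cov = diff @ diff.T                       # (m, m) data covariance
        dist = np.linalg.norm(diff, axis=1)
        # Locality adaptor: far basis vectors are penalized more heavily.
        # Subtracting the maximum keeps the weights in (0, 1].
        locality = np.exp((dist - dist.max()) / sigma)
        c = np.linalg.solve(cov + lam * np.diag(locality ** 2), np.ones(m))
        codes[i] = c / c.sum()                    # enforce sum-to-one
    return codes
```

No parameters are trained here: each row is obtained by solving one small linear system, which is the property the abstract highlights.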
Learning to recognize occluded and small objects with partial inputs
Recognizing multiple objects in an image is challenging due to occlusions,
and becomes even more so when the objects are small. While promising, existing
multi-label image recognition models do not explicitly learn context-based
representations, and hence struggle to correctly recognize small and occluded
objects. Intuitively, recognizing occluded objects requires knowledge of
partial input, and hence context. Motivated by this intuition, we propose
Masked Supervised Learning (MSL), a single-stage, model-agnostic learning
paradigm for multi-label image recognition. The key idea is to learn
context-based representations using a masked branch and to model label
co-occurrence using label consistency. Experimental results demonstrate the
simplicity, applicability and more importantly the competitive performance of
MSL against previous state-of-the-art methods on standard multi-label image
recognition benchmarks. In addition, we show that MSL is robust to random
masking and demonstrate its effectiveness in recognizing non-masked objects.
Code and pretrained models are available on GitHub.
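One plausible reading of the masked branch with label consistency is a two-branch training loss. The abstract gives no formula, so the sketch below is a guess: it assumes binary cross-entropy on both branches plus a squared-difference consistency term between their predictions. The function names and the `weight` parameter are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(probs, labels, eps=1e-7):
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs)
                          + (1.0 - labels) * np.log(1.0 - probs)))


def two_branch_loss(logits_full, logits_masked, labels, weight=1.0):
    """Supervised loss on both branches plus a consistency penalty.

    logits_full:   (batch, n_labels) outputs for the unmasked images.
    logits_masked: (batch, n_labels) outputs for the masked images.
    labels:        (batch, n_labels) multi-hot ground truth.
    """
    p_full = sigmoid(np.asarray(logits_full, dtype=float))
    p_masked = sigmoid(np.asarray(logits_masked, dtype=float))
    labels = np.asarray(labels, dtype=float)

    supervised = (binary_cross_entropy(p_full, labels)
                  + binary_cross_entropy(p_masked, labels))
    # Assumed consistency term: both branches should agree on every label.
    consistency = float(np.mean((p_full - p_masked) ** 2))
    return supervised + weight * consistency
```

At inference time only the unmasked branch would be evaluated, so a loss of this shape changes training but not deployment.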