CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding
Learning representations through self-supervision on unlabeled data has
proven highly effective for understanding diverse images. However, remote
sensing images often contain complex, densely populated scenes with many land
objects and no clear foreground object. This high object density leads to
false positive pairs or missing contextual information in self-supervised
learning. To address these problems, we propose a context-enhanced masked
image modeling method (CtxMIM), a simple yet efficient MIM-based
self-supervised learning approach for remote sensing image
understanding. CtxMIM formulates original image patches as a reconstructive
template and employs a Siamese framework to operate on two sets of image
patches. A context-enhanced generative branch is introduced to provide
contextual information through context consistency constraints in the
reconstruction. With the simple and elegant design, CtxMIM encourages the
pre-training model to learn object-level or pixel-level features on a
large-scale dataset without specific temporal or geographical constraints.
Finally, extensive experiments show that features learned by CtxMIM outperform
fully supervised and state-of-the-art self-supervised learning methods on
various downstream tasks, including land cover classification, semantic
segmentation, object detection, and instance segmentation. These results
demonstrate that CtxMIM learns impressive remote sensing representations with
high generalization and transferability. Code and data will be made publicly
available.
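The masked-image-modeling setup that CtxMIM builds on can be illustrated with a minimal, hypothetical sketch of patch masking (NumPy only). The helpers `patchify` and `random_mask`, the 8x8 image, and the 75% mask ratio are illustrative assumptions, not details from the paper:

```python
import numpy as np

def patchify(img, p):
    # Split an (H, W) image into non-overlapping p x p patches,
    # returned as a (num_patches, p*p) matrix.
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def random_mask(patches, mask_ratio, rng):
    # Randomly hide a fraction of patches; the hidden originals become
    # the reconstruction targets, as in MIM-style pre-training.
    n = patches.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.permutation(n)
    masked_idx, visible_idx = idx[:n_mask], idx[n_mask:]
    return patches[visible_idx], patches[masked_idx], masked_idx

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
patches = patchify(img, 2)                     # 16 patches of 4 pixels each
visible, targets, masked_idx = random_mask(patches, 0.75, rng)
print(visible.shape, targets.shape)            # (4, 4) (12, 4)
```

In an actual MIM pipeline the encoder would see only `visible` and a decoder would predict `targets`; CtxMIM's contribution, per the abstract, is supplying extra contextual information to that reconstruction via a Siamese, context-enhanced branch.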
SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection
Recently, the pure camera-based Bird's-Eye-View (BEV) perception provides a
feasible solution for economical autonomous driving. However, the existing
BEV-based multi-view 3D detectors generally transform all image features into
BEV features, without accounting for the fact that the large proportion of
background information may submerge the object information. In this paper, we
propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out
background information according to the semantic segmentation of image features
and transform image features into semantic-aware BEV features. Accordingly, we
propose BEV-Paste, an effective data augmentation strategy that closely
matches semantic-aware BEV features. In addition, we design a Multi-Scale
Cross-Task (MSCT) head, which combines task-specific and cross-task
information to predict the depth distribution and semantic segmentation more
accurately, further improving the quality of semantic-aware BEV features.
Finally, we
integrate the above modules into a novel multi-view 3D object detection
framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves
state-of-the-art performance. Code is available at
https://github.com/mengtan00/SA-BEV.git
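The core idea of filtering image features by semantic segmentation before pooling can be sketched in toy form. This is a hedged illustration only: `semantic_aware_pool`, the 0.5 threshold, and the average pooling are assumptions for exposition, not the paper's actual BEV pooling operator:

```python
import numpy as np

def semantic_aware_pool(feats, seg_scores, thresh=0.5):
    # feats: (C, H, W) image features; seg_scores: (H, W) foreground
    # probabilities from a segmentation head. Keep only features at
    # pixels marked as foreground, then average-pool the survivors,
    # so background features cannot submerge the object features.
    fg = seg_scores > thresh                   # boolean foreground mask
    if not fg.any():
        return np.zeros(feats.shape[0])
    return feats[:, fg].mean(axis=1)           # (C,) pooled feature

feats = np.arange(12.0).reshape(3, 2, 2)       # 3 channels, 2x2 pixels
seg = np.array([[0.9, 0.1],
                [0.2, 0.8]])                   # two foreground pixels
pooled = semantic_aware_pool(feats, seg)
print(pooled)                                  # → [1.5 5.5 9.5]
```

The real SA-BEVPool additionally scatters the surviving features into BEV space using predicted depth; the sketch only shows the background-filtering step.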
Compressive Sequential Learning for Action Similarity Labeling
Human action recognition in videos has been extensively studied in recent years due to its wide range of applications. Instead of classifying video sequences into a number of action categories, in this paper we focus on the particular problem of action similarity labeling (ASLAN), which aims at verifying whether a pair of videos contains the same type of action or not. To address this challenge, a novel approach called compressive sequential learning (CSL) is proposed by leveraging compressive sensing theory and sequential learning. We first project data points to a low-dimensional space by exploiting an important property in compressive sensing: the restricted isometry property. In particular, a very sparse measurement matrix is adopted to reduce the dimensionality efficiently. We then learn an ensemble classifier for measuring similarities between pairwise videos by iteratively minimizing its empirical risk with the AdaBoost strategy on the training set. Unlike conventional AdaBoost, the weak learner for each iteration is not explicitly defined and its parameters are learned through greedy optimization. Furthermore, an alternative to CSL named compressive sequential encoding is developed as an encoding technique, followed by a linear classifier, to address the similarity-labeling problem. Our method has been systematically evaluated on four action datasets: ASLAN, KTH, HMDB51, and Hollywood2, and the results show the effectiveness and superiority of our method for ASLAN.
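A "very sparse measurement matrix" of the kind the abstract mentions can be sketched with an Achlioptas-style random projection whose entries are +√s, 0, or -√s. This is an illustrative stand-in under stated assumptions (s = 3, NumPy sampling), not necessarily the paper's exact construction; such matrices approximately preserve pairwise distances with high probability, which is the restricted-isometry-style property being exploited:

```python
import numpy as np

def sparse_measurement_matrix(d, k, s=3, rng=None):
    # Very sparse random projection: entries are +sqrt(s), 0, -sqrt(s)
    # with probabilities 1/(2s), 1 - 1/s, 1/(2s). With s = 3, two
    # thirds of the entries are exactly zero, so projecting a point
    # from d dims to k dims is cheap.
    rng = rng or np.random.default_rng()
    vals = rng.choice(np.array([np.sqrt(s), 0.0, -np.sqrt(s)]),
                      size=(k, d),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return vals / np.sqrt(k)                   # scale to preserve norms

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 1000))             # 5 points in 1000-D
R = sparse_measurement_matrix(1000, 64, rng=rng)
Y = X @ R.T                                    # same 5 points in 64-D
```

After this projection, the ensemble classifier described in the abstract would operate on the low-dimensional `Y` rather than the raw features.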
Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection
Anomaly detection (AD), aiming to find samples that deviate from the training
distribution, is essential in safety-critical applications. Though recent
self-supervised learning based attempts achieve promising results by creating
virtual outliers, their training objectives are less faithful to AD which
requires a concentrated inlier distribution as well as a dispersive outlier
distribution. In this paper, we propose Unilaterally Aggregated Contrastive
Learning with Hierarchical Augmentation (UniCon-HA), taking into account both
the requirements above. Specifically, we explicitly encourage the concentration
of inliers and the dispersion of virtual outliers via supervised and
unsupervised contrastive losses, respectively. Considering that standard
contrastive data augmentation for generating positive views may induce
outliers, we additionally introduce a soft mechanism to re-weight each
augmented inlier according to its deviation from the inlier distribution, to
ensure a purified concentration. Moreover, to prompt a higher concentration,
inspired by curriculum learning, we adopt an easy-to-hard hierarchical
augmentation strategy and perform contrastive aggregation at different depths
of the network based on the strengths of data augmentation. Our method is
evaluated under three AD settings including unlabeled one-class, unlabeled
multi-class, and labeled multi-class, demonstrating its consistent superiority
over other competitors. Comment: Accepted by ICCV'2023
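The soft re-weighting idea (down-weighting augmented inliers that deviate from the inlier distribution, so they do not pollute the concentration objective) might look roughly like the following toy sketch. The exponential weighting, the temperature `tau`, and the use of a single inlier center are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def inlier_weights(embeddings, center, tau=1.0):
    # Assign each augmented inlier view a weight that decays with its
    # distance from the inlier center; strongly augmented views that
    # drift outlier-ward contribute less to the aggregation loss.
    d = np.linalg.norm(embeddings - center, axis=1)
    w = np.exp(-d / tau)
    return w / w.sum()                         # normalized weights

emb = np.array([[0.0, 0.0],                    # view near the center
                [3.0, 4.0]])                   # view far from the center
w = inlier_weights(emb, np.zeros(2))
print(w[0] > w[1])                             # → True
```

In the actual method these weights would modulate per-sample terms of the supervised contrastive loss at several network depths, following the easy-to-hard hierarchical augmentation schedule.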