ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer
Deep learning (DL) has advanced the field of dense prediction, while
gradually dissolving the inherent barriers between different tasks. However,
most existing works focus on designing architectures and constructing visual
cues only for the specific task, which ignores the potential uniformity
introduced by the DL paradigm. In this paper, we attempt to construct a novel
\underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse
bi-source dense prediction tasks. Specifically, unlike existing methods that
over-specialize in a single task or a subset of tasks, ComPtr starts from the
more general concept of bi-source dense prediction. Building on the shared
reliance of these tasks on information complementarity, we propose consistency
enhancement and difference awareness components, with which ComPtr can
respectively extract and collect important visual semantic cues from different
image sources for diverse tasks. ComPtr treats different inputs equally and builds an efficient
dense interaction model in the form of sequence-to-sequence on top of the
transformer. This task-generic design provides a smooth foundation for
constructing the unified model that can simultaneously deal with various
bi-source information. In extensive experiments across several representative
vision tasks, i.e. remote sensing change detection, RGB-T crowd counting,
RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed
method consistently obtains favorable performance. The code will be available
at \url{https://github.com/lartpang/ComPtr}.
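To make the bi-source idea above more concrete, the following is a minimal PyTorch sketch that pairs a consistency-oriented branch (agreement between the two sources) with a difference-aware branch (modality-specific cues); the module names, fusion operators, and shapes are illustrative assumptions and do not reproduce ComPtr's actual architecture.

import torch
import torch.nn as nn

class BiSourceFusion(nn.Module):
    """Toy bi-source fusion: one branch amplifies agreement between the two
    sources, the other keeps their difference; both are then merged."""
    def __init__(self, dim=64):
        super().__init__()
        self.consistency = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.difference = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.merge = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, feat_a, feat_b):
        # agreement (element-wise product) and disagreement (absolute difference)
        shared = self.consistency(feat_a * feat_b)
        distinct = self.difference(torch.abs(feat_a - feat_b))
        return self.merge(torch.cat([shared, distinct], dim=1))

# usage: two 64-channel feature maps, e.g. from an RGB and a depth/thermal encoder
fused = BiSourceFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))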
Multi-scale Interactive Network for Salient Object Detection
Deep-learning-based salient object detection methods have made great progress.
However, the variable scale and unknown category of salient objects remain
persistent challenges, which are closely tied to the utilization of multi-level
and multi-scale features. In this paper, we propose aggregate
interaction modules to integrate the features from adjacent levels, in which
less noise is introduced because only small up-/down-sampling rates are used.
To obtain more efficient multi-scale features from the integrated features, the
self-interaction modules are embedded in each decoder unit. Besides, the class
imbalance issue caused by the scale variation weakens the effect of the binary
cross entropy loss and results in the spatial inconsistency of the predictions.
Therefore, we exploit the consistency-enhanced loss to highlight the
fore-/back-ground difference and preserve the intra-class consistency.
Experimental results on five benchmark datasets demonstrate that the proposed
method without any post-processing performs favorably against 23
state-of-the-art approaches. The source code will be publicly available at
https://github.com/lartpang/MINet. Comment: Accepted by CVPR 2020.
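As an illustration of a loss that highlights the fore-/back-ground difference at the region level, the sketch below penalizes the mismatched area relative to the matched area; it is a plausible stand-in written for this listing, and MINet's actual consistency-enhanced loss may be formulated differently.

import torch

def consistency_style_loss(pred, gt, eps=1e-6):
    """Region-level loss in the spirit of a consistency-enhanced term:
    it penalizes the mismatched area (false positives + false negatives)
    relative to the matched area, so errors are judged globally rather
    than pixel by pixel.  pred: probabilities in [0, 1]; gt: {0, 1}."""
    tp = (pred * gt).sum(dim=(1, 2, 3))
    fp = (pred * (1 - gt)).sum(dim=(1, 2, 3))
    fn = ((1 - pred) * gt).sum(dim=(1, 2, 3))
    return ((fp + fn) / (fp + fn + 2 * tp + eps)).mean()

# typically combined with binary cross entropy on the same prediction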
CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection
Most of the existing bi-modal (RGB-D and RGB-T) salient object detection
methods utilize the convolution operation and construct complex interweave
fusion structures to achieve cross-modal information integration. However, the
inherent local connectivity of the convolution operation places a ceiling on
the performance of such convolution-based methods. In this work, we rethink these
tasks from the perspective of global information alignment and transformation.
Specifically, the proposed \underline{c}ross-mod\underline{a}l
\underline{v}iew-mixed transform\underline{er} (CAVER) cascades several
cross-modal integration units to construct a top-down transformer-based
information propagation path. CAVER treats the multi-scale and multi-modal
feature integration as a sequence-to-sequence context propagation and update
process built on a novel view-mixed attention mechanism. Besides, considering
the quadratic complexity w.r.t. the number of input tokens, we design a
parameter-free patch-wise token re-embedding strategy to simplify operations.
Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that
such a simple two-stream encoder-decoder framework can surpass recent
state-of-the-art methods when it is equipped with the proposed components. Comment: Updated version with a more flexible structure and better performance.
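The parameter-free token-reduction idea mentioned in the CAVER abstract can be sketched as plain average pooling over spatial tokens before attention; the patch size and the place where the pooling is applied are assumptions, not CAVER's exact re-embedding strategy.

import torch
import torch.nn.functional as F

def patchwise_reembed(tokens, hw, patch=2):
    """Parameter-free token reduction: average neighbouring spatial tokens
    into patch-level tokens before attention, cutting the sequence length
    (and thus the quadratic attention cost) by patch**2.
    tokens: (B, N, C) with N = H * W."""
    b, n, c = tokens.shape
    h, w = hw
    x = tokens.transpose(1, 2).reshape(b, c, h, w)
    x = F.avg_pool2d(x, kernel_size=patch, stride=patch)   # (B, C, H/p, W/p)
    return x.flatten(2).transpose(1, 2)                    # (B, N/p^2, C)

# e.g. 4096 tokens from a 64x64 feature map become 1024 tokens with patch=2
coarse = patchwise_reembed(torch.randn(1, 64 * 64, 128), hw=(64, 64), patch=2)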
ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection
Recent camouflaged object detection (COD) attempts to segment objects
visually blended into their surroundings, which is extremely complex and
difficult in real-world scenarios. Apart from the high intrinsic similarity
between camouflaged objects and their background, objects are usually diverse
in scale, fuzzy in appearance, and even severely occluded. To this end, we
propose an effective unified collaborative pyramid network which mimics human
behavior when observing vague images and videos, \textit{i.e.}, zooming in and
out. Specifically, our approach employs the zooming strategy to learn
discriminative mixed-scale semantics by the multi-head scale integration and
rich granularity perception units, which are designed to fully explore
imperceptible clues between candidate objects and background surroundings. The
former's intrinsic multi-head aggregation provides more diverse visual
patterns. The latter's routing mechanism can effectively propagate inter-frame
differences in spatiotemporal scenarios and adaptively ignore static
representations. Together, they provide a solid foundation for realizing a unified
architecture for static and dynamic COD. Moreover, considering the uncertainty
and ambiguity derived from indistinguishable textures, we construct a simple
yet effective regularization, uncertainty awareness loss, to encourage
predictions with higher confidence in candidate regions. Our highly
task-friendly framework consistently outperforms existing state-of-the-art
methods in image and video COD benchmarks. The code will be available at
\url{https://github.com/lartpang/ZoomNeXt}. Comment: Extension of the conference version arXiv:2203.02688; fixed some word errors.
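A regularizer that encourages confident predictions in candidate regions, as described in the ZoomNeXt abstract, can be as simple as penalizing probabilities near 0.5; the quadratic form below is one plausible instantiation and may differ from the paper's exact uncertainty awareness loss.

import torch

def uncertainty_awareness_loss(pred):
    """Penalize ambiguous probabilities near 0.5 so the network commits to
    confident foreground/background decisions.  pred: probabilities in [0, 1]."""
    return (1.0 - (2.0 * pred - 1.0).pow(2)).mean()

# added to the segmentation loss with a (possibly ramped-up) weight, e.g.
# total = bce_loss + 0.5 * uncertainty_awareness_loss(pred)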
Adaptive Multi-source Predictor for Zero-shot Video Object Segmentation
Static and moving objects often occur in real-life videos. Most video object
segmentation methods only focus on extracting and exploiting motion cues to
perceive moving objects. Once faced with frames containing static objects, such
moving object predictors may produce unreliable results because of uncertain
motion information, such as low-quality optical flow maps. Besides, different sources
such as RGB, depth, optical flow and static saliency can provide useful
information about the objects. However, existing approaches consider only the
RGB source or the combination of RGB and optical flow. In this paper, we propose a novel
adaptive multi-source predictor for zero-shot video object segmentation (ZVOS).
In the static object predictor, the RGB source is converted to depth and static
saliency sources, simultaneously. In the moving object predictor, we propose
the multi-source fusion structure. First, the spatial importance of each source
is highlighted with the help of the interoceptive spatial attention module
(ISAM). Second, the motion-enhanced module (MEM) is designed to generate pure
foreground motion attention for improving the representation of static and
moving features in the decoder. Furthermore, we design a feature purification
module (FPM) to filter the inter-source incompatible features. By using the
ISAM, MEM and FPM, the multi-source features are effectively fused. In
addition, we put forward an adaptive predictor fusion network (APF) to evaluate
the quality of the optical flow map and fuse the predictions from the static
object predictor and the moving object predictor in order to prevent
over-reliance on the failed results caused by low-quality optical flow maps.
Experiments show that the proposed model outperforms the state-of-the-art
methods on three challenging ZVOS benchmarks. Moreover, the static object
predictor simultaneously produces a high-quality depth map and a static
saliency map. Comment: Accepted to IJCV 2024. Code is available at:
https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS. arXiv admin note:
substantial text overlap with arXiv:2108.0507
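The adaptive fusion of the two predictors can be sketched as a learned quality gate on the optical flow that weights the moving-object prediction against the static one; the gating network and its inputs below are illustrative assumptions, not the APF network from the paper.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Toy adaptive fusion: estimate a per-frame quality score from the optical
    flow and use it to weight the moving-object prediction against the static
    one, so low-quality flow cannot dominate the final mask."""
    def __init__(self, flow_channels=2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(flow_channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, flow, static_pred, moving_pred):
        q = self.gate(flow)                      # (B, 1, 1, 1), flow "quality"
        return q * moving_pred + (1 - q) * static_pred

fuser = AdaptiveFusion()
out = fuser(torch.randn(2, 2, 64, 64), torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))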
Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection
The recently emerged camouflaged object detection (COD) task attempts to segment
objects that are visually blended into their surroundings, which is extremely
complex and difficult in real-world scenarios. Apart from high intrinsic
similarity between the camouflaged objects and their background, the objects
are usually diverse in scale, fuzzy in appearance, and even severely occluded.
To deal with these problems, we propose a mixed-scale triplet network,
\textbf{ZoomNet}, which mimics the behavior of humans when observing vague
images, i.e., zooming in and out. Specifically, our ZoomNet employs the zoom
strategy to learn the discriminative mixed-scale semantics by the designed
scale integration unit and hierarchical mixed-scale unit, which fully explores
imperceptible clues between the candidate objects and background surroundings.
Moreover, considering the uncertainty and ambiguity derived from
indistinguishable textures, we construct a simple yet effective regularization
constraint, the uncertainty-aware loss, to encourage the model to produce
accurate predictions with higher confidence in candidate regions. Without bells
and whistles, our highly task-friendly model consistently surpasses 23 existing
state-of-the-art methods on four public datasets. Besides, the
superior performance over the recent cutting-edge models on the SOD task also
verifies the effectiveness and generality of our model. The code will be
available at \url{https://github.com/lartpang/ZoomNet}. Comment: Accepted by CVPR 2022. This is the arXiv version that contains the appendix section.
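The "zoom in and out" input strategy used by ZoomNet can be sketched as feeding the same image at several scales to a shared encoder; the scale factors below are assumptions rather than the paper's exact settings.

import torch
import torch.nn.functional as F

def zoom_pyramid(image, scales=(1.5, 1.0, 0.5)):
    """Mimic 'zooming in and out': resize the same image to several scales so
    a shared encoder can extract mixed-scale cues that are later aligned and
    merged."""
    h, w = image.shape[-2:]
    return [F.interpolate(image, size=(int(h * s), int(w * s)),
                          mode="bilinear", align_corners=False)
            for s in scales]

views = zoom_pyramid(torch.randn(1, 3, 384, 384))  # three zoomed views of one image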
Geological and geochemical data of 13634 source rock samples from 1286 exploration wells and 116489 porosity data from target layers in the six petroliferous basins of China
Geological and geochemical data of 13634 source rock samples from 1286 exploration wells in six representative petroliferous basins are examined to study their Active Source Rock Depth Limits (ASDL). Active source rocks, the discovered 21.6 billion tons of reserves in the six representative basins in China, and 52926 oil and gas reservoirs in 1186 basins worldwide are all found to be distributed above the ASDL, illustrating the universality of this kind of depth limit.