Region Refinement Network for Salient Object Detection
Although salient object detection has been intensively studied, false
predictions and unclear boundaries remain major issues. In this paper, we
propose a Region
Refinement Network (RRN), which recurrently filters redundant information and
explicitly models boundary information for saliency detection. Different from
existing refinement methods, we propose a Region Refinement Module (RRM) that
optimizes salient region prediction by incorporating supervised attention masks
in the intermediate refinement stages. The module only brings a minor increase
in model size and yet significantly reduces false predictions from the
background. To further refine boundary areas, we propose a Boundary Refinement
Loss (BRL) that adds extra supervision for better distinguishing foreground
from background. BRL is parameter-free and easy to train. We further observe
that BRL helps retain the integrity in prediction by refining the boundary.
Extensive experiments on saliency detection datasets show that our refinement
module and loss bring significant improvement to the baseline and can be easily
applied to different frameworks. We also demonstrate that our proposed model
generalizes well to portrait segmentation and shadow detection tasks.
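As a rough illustration of this design, the sketch below shows one way a
parameter-free boundary loss can be realized in PyTorch: a thin band around
the ground-truth boundary is derived with pooling-based dilation and erosion,
and the pixel-wise loss is up-weighted inside that band. The band width and
weighting scheme are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def boundary_band(mask: torch.Tensor, width: int = 3) -> torch.Tensor:
    """Derive a thin band around the boundary of a float {0,1} mask
    (B, 1, H, W) via max-pooling-based dilation and erosion."""
    pad = width // 2
    dilated = F.max_pool2d(mask, kernel_size=width, stride=1, padding=pad)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, kernel_size=width, stride=1,
                                padding=pad)
    return dilated - eroded  # 1 inside the band, 0 elsewhere

def boundary_refinement_loss(logits, target, boundary_weight=5.0):
    """Standard BCE everywhere, up-weighted on the boundary band.
    No learnable parameters are introduced."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    weights = 1.0 + boundary_weight * boundary_band(target)
    return (weights * bce).mean()
```

Because the band is computed from the ground truth with pooling alone, such a
loss adds no parameters and can be dropped into different frameworks.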
VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection
In recent years, transformer-based detectors have demonstrated remarkable
performance in 2D visual perception tasks. However, their performance in
multi-view 3D object detection remains inferior to that of state-of-the-art
(SOTA) convolutional-neural-network-based detectors. In this work, we investigate
this issue from the perspective of bird's-eye-view (BEV) feature generation.
Specifically, we examine the BEV feature generation method employed by the
transformer-based SOTA, BEVFormer, and identify its two limitations: (i) it
only generates attention weights from BEV, which precludes the use of lidar
points for supervision, and (ii) it aggregates camera view features to the BEV
through deformable sampling, which only selects a small subset of features and
fails to exploit all information. To overcome these limitations, we propose a
novel BEV feature generation method, dual-view attention, which generates
attention weights from both the BEV and camera view. This method encodes all
camera features into the BEV feature. By combining dual-view attention with the
BEVFormer architecture, we build a new detector named VoxelFormer. Extensive
experiments are conducted on the nuScenes benchmark to verify the superiority
of dual-view attention and VoxelFormer. We observe that even when adopting
only 3 encoders and 1 historical frame during training, VoxelFormer still
outperforms BEVFormer significantly. When trained in the same setting,
VoxelFormer surpasses BEVFormer by 4.9% NDS. Code is available at:
https://github.com/Lizhuoling/VoxelFormer-public.git
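To make the contrast with deformable sampling concrete, the following sketch
shows attention weights that depend on both the BEV queries and the camera
features (plain dot-product cross-attention), so every camera feature
contributes to the BEV feature. This is a minimal PyTorch stand-in; the
actual dual-view attention, including LiDAR-point supervision of the weights
and the voxel-based formulation, is more involved.

```python
import torch
import torch.nn as nn

class DualViewAttentionSketch(nn.Module):
    """Toy stand-in: weights come from query-key interaction, so they
    reflect both the BEV side and the camera-view side, and all camera
    tokens are aggregated rather than a sparsely sampled subset."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, bev_queries, cam_tokens):
        # bev_queries: (B, H_bev * W_bev, C); cam_tokens: (B, N, C)
        bev_feat, _ = self.attn(bev_queries, cam_tokens, cam_tokens)
        return bev_feat

# e.g. bev = DualViewAttentionSketch()(torch.randn(2, 2500, 256),
#                                      torch.randn(2, 7200, 256))
```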
LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields
We introduce a new task, novel view synthesis for LiDAR sensors. While
traditional model-based LiDAR simulators with style-transfer neural networks
can be applied to render novel views, they fall short of producing accurate and
realistic LiDAR patterns because their renderers rely on explicit 3D
reconstruction and game engines, which ignore important attributes of
LiDAR points. We address this challenge by formulating, to the best of our
knowledge, the first differentiable end-to-end LiDAR rendering framework,
LiDAR-NeRF, leveraging a neural radiance field (NeRF) to facilitate the joint
learning of geometry and the attributes of 3D points. However, simply employing
NeRF cannot achieve satisfactory results, as it only focuses on learning
individual pixels while ignoring local information, especially at low texture
areas, resulting in poor geometry. We therefore introduce a structural
regularization method to preserve local structural details. To evaluate the
effectiveness of our approach, we establish
an object-centric multi-view LiDAR dataset, dubbed NeRF-MVL. It contains
observations of objects from 9 categories seen from 360-degree viewpoints
captured with multiple LiDAR sensors. Our extensive experiments on the
scene-level KITTI-360 dataset, and on our object-level NeRF-MVL show that our
LiDAR-NeRF surpasses the model-based algorithms significantly.
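The abstract does not spell out the structural regularization; one plausible
form, sketched below in PyTorch, is a first-order smoothness penalty on the
rendered range image that encourages neighboring rays to agree in low-texture
regions. The exact term used by LiDAR-NeRF may differ.

```python
import torch

def structural_regularization(range_image: torch.Tensor) -> torch.Tensor:
    """Hypothetical local-structure term: penalize first-order
    differences of the rendered per-ray range map (B, H, W), pulling
    neighboring rays toward consistent local geometry."""
    dh = (range_image[:, 1:, :] - range_image[:, :-1, :]).abs()
    dw = (range_image[:, :, 1:] - range_image[:, :, :-1]).abs()
    return dh.mean() + dw.mean()
```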
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Most recent semantic segmentation methods adopt a fully-convolutional network
(FCN) with an encoder-decoder architecture. The encoder progressively reduces
the spatial resolution and learns more abstract/semantic visual concepts with
larger receptive fields. Since context modeling is critical for segmentation,
the latest efforts have been focused on increasing the receptive field, through
either dilated/atrous convolutions or inserting attention modules. However, the
encoder-decoder based FCN architecture remains unchanged. In this paper, we aim
to provide an alternative perspective by treating semantic segmentation as a
sequence-to-sequence prediction task. Specifically, we deploy a pure
transformer (i.e., without convolution and resolution reduction) to encode an
image as a sequence of patches. With the global context modeled in every layer
of the transformer, this encoder can be combined with a simple decoder to
provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
Extensive experiments show that SETR achieves new state of the art on ADE20K
(50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on
Cityscapes. Particularly, we achieve the first position in the highly
competitive ADE20K test server leaderboard on the day of submission.
Comment: CVPR 2021. Project page at https://fudan-zvg.github.io/SETR
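A minimal PyTorch sketch of this recipe: flatten the image into
non-overlapping patches, encode the patch sequence with a pure transformer
(linear patch projection, no convolution or resolution reduction in the
encoder), and decode with a simple linear head plus bilinear upsampling.
Sizes and depths are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySETR(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, classes=21):
        super().__init__()
        self.patch = patch
        n = (img_size // patch) ** 2
        self.embed = nn.Linear(3 * patch * patch, dim)   # linear patch projection
        self.pos = nn.Parameter(torch.zeros(1, n, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)              # simple decoder

    def forward(self, x):
        B, _, H, W = x.shape
        p = self.patch
        # sequence of non-overlapping patches: (B, N, 3*p*p)
        seq = F.unfold(x, kernel_size=p, stride=p).transpose(1, 2)
        feat = self.encoder(self.embed(seq) + self.pos)  # global context per layer
        logits = self.head(feat).transpose(1, 2).reshape(B, -1, H // p, W // p)
        return F.interpolate(logits, size=(H, W), mode="bilinear",
                             align_corners=False)

# e.g. TinySETR()(torch.randn(1, 3, 224, 224)).shape == (1, 21, 224, 224)
```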
Formation of Pseudomonas aeruginosa inhibition zone during Tobramycin disk diffusion is due to a transition from planktonic to biofilm mode of growth
Do Different Tracking Tasks Require Different Appearance Models?
Tracking objects of interest in a video is one of the most popular and widely
applicable problems in computer vision. However, over the years, a Cambrian
explosion of use cases and benchmarks has fragmented the problem into a
multitude of different experimental setups. As a consequence, the literature has
fragmented too, and now novel approaches proposed by the community are usually
specialised to fit only one specific setup. To understand to what extent this
specialisation is necessary, in this work we present UniTrack, a solution to
address five different tasks within the same framework. UniTrack consists of a
single and task-agnostic appearance model, which can be learned in a supervised
or self-supervised fashion, and multiple "heads" that address individual
tasks and do not require training. We show how most tracking tasks can be
solved within this framework, and that the same appearance model can be
successfully used to obtain results that are competitive against specialised
methods for most of the tasks considered. The framework also allows us to
analyse appearance models obtained with the most recent self-supervised
methods, thus extending their evaluation and comparison to a larger variety of
important problems.
Comment: To appear at NeurIPS 202
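To illustrate the shared-appearance-model design, the sketch below pairs an
arbitrary frozen encoder with one hypothetical training-free head that
propagates a mask between frames by cosine-similarity feature matching. It is
an assumption-laden PyTorch sketch in the spirit of UniTrack, not its actual
implementation.

```python
import torch
import torch.nn.functional as F

def propagation_head(feat_prev, feat_cur, mask_prev, topk=5):
    """Training-free head: for each current-frame location, softly copy
    the previous-frame mask values of its top-k most similar features.
    feat_*: (B, C, H, W) from any frozen appearance model;
    mask_prev: float mask (B, H, W)."""
    B, C, H, W = feat_prev.shape
    fp = F.normalize(feat_prev.flatten(2), dim=1)  # (B, C, HW)
    fc = F.normalize(feat_cur.flatten(2), dim=1)   # (B, C, HW)
    sim = torch.einsum("bci,bcj->bij", fp, fc)     # (B, HW_prev, HW_cur)
    val, idx = sim.topk(topk, dim=1)               # best prev matches per location
    w = torch.softmax(val, dim=1)                  # (B, topk, HW_cur)
    m = mask_prev.flatten(1).unsqueeze(-1).expand(-1, -1, H * W)
    return (w * torch.gather(m, 1, idx)).sum(1).view(B, H, W)
```

Since the head has no trainable weights, swapping in a different supervised
or self-supervised backbone requires no retraining.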
Role of m6A modification in immune microenvironment of digestive system tumors
Digestive system tumors are a major health problem worldwide, largely attributable to poor dietary choices. The role of RNA modifications in cancer development is an emerging field of research. RNA modifications are associated with the growth and development of various immune cells, which, in turn, regulate the immune response. The majority of RNA modifications are methylation modifications, and the most common type is the N6-methyladenosine (m6A) modification. Here, we review the molecular mechanisms of m6A in immune cells and the role of m6A in digestive system tumors. However, further studies are required to better understand the role of RNA methylation in human cancers for designing diagnostic and treatment strategies and predicting the prognosis of patients.
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation
Referring image segmentation segments an image region described by a language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introduce a simple yet effective alternative for progressively learning discriminative multi-modal features. The core idea of our approach is to leverage a continuously updated query as the representation of the target object and, at each iteration, strengthen multi-modal features strongly correlated to the query while weakening less related ones. As the query is initialized by language features and successively updated by object features, our algorithm gradually shifts from being localization-centric to segmentation-centric. This strategy enables the incremental recovery of missing object parts and/or removal of extraneous parts through iteration. Compared to its counterparts, our method is more versatile: it can be plugged into prior arts straightforwardly and consistently brings improvements. Experimental results on the challenging datasets of RefCOCO, RefCOCO+, and G-Ref demonstrate its advantage with respect to the state-of-the-art methods.
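The iterative scheme lends itself to a compact sketch: a query initialized
from language features repeatedly scores the multi-modal features, amplifies
the well-correlated ones, and is re-estimated from the reweighted features.
The specific update rule below is an illustrative PyTorch assumption, not the
paper's exact operator.

```python
import torch

def iterative_query_refinement(lang_feat, mm_feat, num_iters=3):
    """lang_feat: (B, C) pooled language embedding (query initialization);
    mm_feat: (B, N, C) fused vision-language features."""
    query = lang_feat  # localization-centric start
    for _ in range(num_iters):
        scores = torch.einsum("bc,bnc->bn", query, mm_feat)  # query correlation
        w = torch.softmax(scores, dim=1).unsqueeze(-1)       # (B, N, 1)
        mm_feat = mm_feat * (1.0 + w)   # strengthen related features,
                                        # relatively weaken the rest
        query = (w * mm_feat).sum(dim=1)  # object-feature update of the query
    return query, mm_feat
```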