Weakly Supervised Object Localization Using Things and Stuff Transfer
We propose to help weakly supervised object localization for classes where
location annotations are not available, by transferring things and stuff
knowledge from a source set with available annotations. The source and target
classes might share similar appearance (e.g. bear fur is similar to cat fur) or
appear against similar background (e.g. horse and sheep appear against grass).
To exploit this, we acquire three types of knowledge from the source set: a
segmentation model trained on both thing and stuff classes; similarity
relations between target and source classes; and co-occurrence relations
between thing and stuff classes in the source. The segmentation model is used
to generate thing and stuff segmentation maps on a target image, while the
class similarity and co-occurrence knowledge help refine them. We then
incorporate these maps as new cues into a multiple instance learning (MIL)
framework, propagating the transferred knowledge from the pixel level to the object
proposal level. In extensive experiments, we conduct our transfer from the
PASCAL Context dataset (source) to the ILSVRC, COCO and PASCAL VOC 2007
datasets (targets). We evaluate our transfer across widely different thing
classes, including some that are not similar in appearance, but appear against
similar background. The results demonstrate significant improvement over
standard MIL, and we outperform the state-of-the-art in the transfer setting.
Comment: ICCV 2017 camera-ready, including supplementary material
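As an illustration only (not the paper's released code), the following minimal Python sketch shows how transferred thing and stuff segmentation maps could be turned into proposal-level cues for the MIL re-localization step; the label map, similarity vector and box format are hypothetical placeholders.

import numpy as np

def proposal_cues(label_map, sim, proposals):
    """Score each proposal by how strongly its pixels belong to source
    classes that are similar to, or co-occur with, the target class."""
    pixel_score = sim[label_map]              # H x W map of per-pixel affinity
    scores = []
    for x1, y1, x2, y2 in proposals:
        inside = pixel_score[y1:y2, x1:x2]
        if inside.size == 0:
            scores.append(0.0)
            continue
        # high affinity inside the box relative to the image average suggests
        # the box covers the target object rather than background
        scores.append(float(inside.mean() / (pixel_score.mean() + 1e-6)))
    return np.asarray(scores)

# toy usage with a hypothetical 5-class source label map
label_map = np.random.randint(0, 5, size=(100, 100))
sim = np.array([0.0, 0.9, 0.1, 0.05, 0.3])    # per-source-class affinity
boxes = [(10, 10, 40, 40), (0, 0, 99, 99)]
print(proposal_cues(label_map, sim, boxes))
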
Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter
PURPOSE: In surgical image segmentation, a major challenge is the extensive time and resources required to gather large-scale annotated datasets. Given the scarcity of annotated data in this field, our work aims to develop a model that achieves competitive performance when trained on limited datasets, while also enhancing model robustness in various surgical scenarios.
METHODS: We propose a method that harnesses the strengths of pre-trained Vision Transformers (ViTs) and the data efficiency of convolutional neural networks (CNNs). Specifically, we demonstrate how a CNN segmentation model can be used as a lightweight adapter for a frozen ViT feature encoder. Our novel feature adapter uses cross-attention modules that merge the multi-scale features derived from the CNN encoder with feature embeddings from the ViT, ensuring integration of the global insights from the ViT along with local information from the CNN.
RESULTS: Extensive experiments demonstrate that our method outperforms current models in surgical instrument segmentation. Specifically, it achieves superior performance in binary segmentation on the Robust-MIS 2019 dataset, as well as in multi-class segmentation tasks on the EndoVis 2017 and EndoVis 2018 datasets. It also shows remarkable robustness in cross-dataset validation across these three datasets, along with the CholecSeg8k and AutoLaparo datasets. Ablation studies on these datasets confirm the efficacy of our novel adapter module.
CONCLUSION: In this study, we presented a novel approach integrating ViT and CNN. Our unique feature adapter successfully combines the global insights of ViT with the local, multi-scale spatial capabilities of CNN. This integration effectively overcomes data limitations in surgical instrument segmentation. The source code is available at https://github.com/weimengmeng1999/AdapterSIS.git.
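As a rough illustration of the adapter idea (shapes, dimensions and module names are assumptions, not the released AdapterSIS code), a cross-attention block that lets CNN features query frozen ViT tokens might look like this in PyTorch:

import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, cnn_dim=256, vit_dim=768, heads=8):
        super().__init__()
        self.proj = nn.Conv2d(cnn_dim, vit_dim, kernel_size=1)   # match dims
        self.attn = nn.MultiheadAttention(vit_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vit_dim)

    def forward(self, cnn_feat, vit_tokens):
        B, _, H, W = cnn_feat.shape
        q = self.proj(cnn_feat).flatten(2).transpose(1, 2)       # B x HW x D
        # CNN queries attend to frozen ViT tokens: local detail asks for global context
        fused, _ = self.attn(query=q, key=vit_tokens, value=vit_tokens)
        fused = self.norm(q + fused)                              # residual merge
        return fused.transpose(1, 2).reshape(B, -1, H, W)         # back to a feature map

vit_tokens = torch.randn(2, 196, 768)     # e.g. output of a frozen ViT encoder
cnn_feat = torch.randn(2, 256, 32, 32)    # e.g. one scale of a CNN encoder
print(CrossAttentionAdapter()(cnn_feat, vit_tokens).shape)  # torch.Size([2, 768, 32, 32])
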
Facial Video-based Remote Physiological Measurement via Self-supervised Learning
Facial video-based remote physiological measurement aims to estimate remote
photoplethysmography (rPPG) signals from human face videos and then measure
multiple vital signs (e.g. heart rate, respiration frequency) from rPPG
signals. Recent approaches achieve it by training deep neural networks, which
normally require abundant facial videos and synchronously recorded
photoplethysmography (PPG) signals for supervision. However, the collection of
these annotated corpora is not easy in practice. In this paper, we introduce a
novel frequency-inspired self-supervised framework that learns to estimate rPPG
signals from facial videos without the need of ground truth PPG signals. Given
a video sample, we first augment it into multiple positive/negative samples
which contain similar/dissimilar signal frequencies to the original one.
Specifically, positive samples are generated using spatial augmentation.
Negative samples are generated via a learnable frequency augmentation module,
which performs non-linear signal frequency transformation on the input without
excessively changing its visual appearance. Next, we introduce a local rPPG
expert aggregation module to estimate rPPG signals from augmented samples. It
encodes complementary pulsation information from different face regions and
aggregates them into one rPPG prediction. Finally, we propose a series of
frequency-inspired losses, i.e. frequency contrastive loss, frequency ratio
consistency loss, and cross-video frequency agreement loss, for the
optimization of estimated rPPG signals from multiple augmented video samples
and across temporally neighboring video samples. We conduct rPPG-based heart
rate, heart rate variability and respiration frequency estimation on four
standard benchmarks. The experimental results demonstrate that our method
improves the state of the art by a large margin.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
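A hedged sketch of what a frequency contrastive objective could look like (the paper's full set of losses differs; the signal lengths, distance and margin below are illustrative assumptions):

import torch
import torch.nn.functional as F

def power_spectrum(rppg):
    """Normalized power spectrum of a batch of 1-D rPPG signals (B, T)."""
    spec = torch.fft.rfft(rppg, dim=-1).abs() ** 2
    return spec / (spec.sum(dim=-1, keepdim=True) + 1e-8)

def frequency_contrastive_loss(anchor, positive, negative, margin=0.5):
    pa, pp, pn = map(power_spectrum, (anchor, positive, negative))
    d_pos = F.mse_loss(pa, pp)                 # positive pair: spectra should agree
    d_neg = F.mse_loss(pa, pn)                 # negative pair: spectra should differ
    return d_pos + F.relu(margin - d_neg)

# toy usage: three rPPG signals of 300 frames each
a, p, n = (torch.randn(4, 300) for _ in range(3))
print(frequency_contrastive_loss(a, p, n))
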
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image
understanding by simultaneously segmenting objects and predicting relations
among objects. However, the long-tail problem among relations leads to
unsatisfactory results in real-world applications. Prior methods predominantly
rely on vision information or utilize limited language information, such as
object or relation names, thereby overlooking the utility of language
information. Leveraging the recent progress in Large Language Models (LLMs), we
propose to use language information to assist relation prediction, particularly
for rare relations. To this end, we propose the Vision-Language Prompting
(VLPrompt) model, which acquires vision information from images and language
information from LLMs. Then, through a prompter network based on the attention
mechanism, it achieves precise relation prediction. Our extensive experiments
show that VLPrompt significantly outperforms previous state-of-the-art methods
on the PSG dataset, proving the effectiveness of incorporating language
information and alleviating the long-tail problem of relations.
Comment: 19 pages, 9 figures
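For illustration, an attention-based prompter that fuses a subject-object pair's vision feature with language prompt embeddings before relation classification might be sketched as follows; all dimensions, module names and the number of relation classes are assumptions rather than the paper's exact design:

import torch
import torch.nn as nn

class Prompter(nn.Module):
    def __init__(self, dim=512, num_relations=56, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_relations)

    def forward(self, pair_feat, lang_prompts):
        # pair_feat: B x dim vision feature of a subject-object pair
        # lang_prompts: B x P x dim language prompt embeddings derived from an LLM
        q = pair_feat.unsqueeze(1)                        # B x 1 x dim
        ctx, _ = self.attn(q, lang_prompts, lang_prompts) # pair attends over prompts
        fused = torch.cat([pair_feat, ctx.squeeze(1)], dim=-1)
        return self.classifier(fused)                     # relation logits

logits = Prompter()(torch.randn(3, 512), torch.randn(3, 10, 512))
print(logits.shape)  # torch.Size([3, 56])
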
TreeFormer: a Semi-Supervised Transformer-based Framework for Tree Counting from a Single High Resolution Image
Automatic tree density estimation and counting using single aerial and
satellite images is a challenging task in photogrammetry and remote sensing,
yet has an important role in forest management. In this paper, we propose the
first semi-supervised transformer-based framework for tree counting, which
reduces the need for expensive tree annotations on remote sensing images. Our
method, termed TreeFormer, first develops a pyramid tree representation module based
on transformer blocks to extract multi-scale features during the encoding
stage. Contextual attention-based feature fusion and tree density regressor
modules are further designed to utilize the robust features from the encoder to
estimate tree density maps in the decoder. Moreover, we propose a pyramid
learning strategy that includes local tree density consistency and local tree
count ranking losses to incorporate unlabeled images into the training process.
Finally, the tree counter token is introduced to regulate the network by
computing the global tree counts for both labeled and unlabeled images. Our
model was evaluated on two benchmark tree counting datasets, Jiangsu and
Yosemite, as well as a new dataset, KCL-London, that we created. Our
TreeFormer outperforms state-of-the-art semi-supervised methods under the
same setting and exceeds the fully-supervised methods using the same number of
labeled images. The codes and datasets are available at
https://github.com/HAAClassic/TreeFormer.
Comment: Accepted in IEEE Transactions on Geoscience and Remote Sensing
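A minimal sketch of one way a local tree count ranking loss can be realized for unlabeled images, assuming only that a crop contained in an image cannot hold more trees than the image itself (the stand-in model and margin below are placeholders, not the released code):

import torch
import torch.nn as nn
import torch.nn.functional as F

def count_ranking_loss(model, image, margin=0.0):
    """image: B x 3 x H x W unlabeled image; model outputs a B x 1 x h x w density map."""
    B, _, H, W = image.shape
    full = model(image).sum(dim=(1, 2, 3))                       # predicted count, full image
    crop = image[:, :, H // 4: 3 * H // 4, W // 4: 3 * W // 4]   # centered inner crop
    part = model(crop).sum(dim=(1, 2, 3))                        # predicted count, crop
    # penalize predictions where the contained crop holds more trees than the image
    return F.relu(part - full + margin).mean()

toy_model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in counting network
print(count_ranking_loss(toy_model, torch.rand(2, 3, 64, 64)))
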
Weakly Supervised Object Localization Using Size Estimates
We present a technique for weakly supervised object localization (WSOL),
building on the observation that WSOL algorithms usually work better on images
with bigger objects. Instead of training the object detector on the entire
training set at the same time, we propose a curriculum learning strategy to
feed training images into the WSOL learning loop in an order from images
containing bigger objects down to smaller ones. To automatically determine the
order, we train a regressor to estimate the size of the object given the whole
image as input. Furthermore, we use these size estimates to further improve the
re-localization step of WSOL by assigning weights to object proposals according
to how closely their size matches the estimated object size. We demonstrate the
effectiveness of using size order and size weighting on the challenging PASCAL
VOC 2007 dataset, where we achieve a significant improvement over existing
state-of-the-art WSOL techniques.
Comment: ECCV 2016 camera-ready
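To illustrate the two uses of size estimates, a toy sketch of curriculum ordering and size-based proposal weighting follows; the Gaussian-in-log-area weighting is an illustrative choice, not necessarily the paper's exact formulation:

import numpy as np

def curriculum_order(size_estimates):
    """Indices of training images sorted from largest to smallest estimated object."""
    return np.argsort(-np.asarray(size_estimates))

def proposal_weights(proposal_areas, estimated_area, sigma=0.5):
    """Down-weight proposals whose area disagrees with the regressor's size estimate."""
    log_ratio = np.log(np.asarray(proposal_areas, dtype=float) / estimated_area)
    return np.exp(-0.5 * (log_ratio / sigma) ** 2)

print(curriculum_order([0.12, 0.55, 0.30]))           # -> [1 2 0]: biggest objects first
print(proposal_weights([5000, 20000, 80000], 20000))  # matching-size proposal gets weight 1
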
Domain-General Crowd Counting in Unseen Scenarios
Domain shift across crowd data severely hinders crowd counting models from
generalizing to unseen scenarios. Although domain-adaptive crowd counting
approaches close this gap to a certain extent, they are still dependent on the
target domain data to adapt (e.g. finetune) their models to the specific
domain. In this paper, we aim to train a model based on a single source domain
which can generalize well on any unseen domain. This falls into the realm of
domain generalization that remains unexplored in crowd counting. We first
introduce a dynamic sub-domain division scheme which divides the source domain
into multiple sub-domains such that we can initiate a meta-learning framework
for domain generalization. The sub-domain division is dynamically refined
during the meta-learning. Next, in order to disentangle domain-invariant
information from domain-specific information in image features, we design the
domain-invariant and -specific crowd memory modules to re-encode image
features. Two types of losses, i.e. feature reconstruction and orthogonal
losses, are devised to enable this disentanglement. Extensive experiments on
several standard crowd counting benchmarks, i.e. SHA, SHB, QNRF, and NWPU, show
the strong generalizability of our method.
Comment: Accepted to AAAI 2023 as Oral Presentation
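A minimal sketch of the two disentanglement losses, assuming re-encoded domain-invariant and domain-specific features of matching shape (the additive reconstruction target and the squared-cosine orthogonality penalty are illustrative assumptions):

import torch
import torch.nn.functional as F

def disentangle_losses(f_img, f_inv, f_spe):
    # reconstruction: the two re-encoded parts together should explain the image feature
    rec = F.mse_loss(f_inv + f_spe, f_img)
    # orthogonality: invariant and specific parts should carry non-overlapping information
    inv = F.normalize(f_inv, dim=-1)
    spe = F.normalize(f_spe, dim=-1)
    orth = (inv * spe).sum(dim=-1).pow(2).mean()
    return rec, orth

f = torch.randn(4, 256)   # toy image features
print(disentangle_losses(f, torch.randn(4, 256), torch.randn(4, 256)))
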
HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation
Panoptic Scene Graph generation (PSG) is a recently proposed task in image
scene understanding that aims to segment the image and extract triplets of
subjects, objects and their relations to build a scene graph. This task is
particularly challenging for two reasons. First, it suffers from a long-tail
problem in its relation categories, making naive biased methods more inclined
to high-frequency relations. Existing unbiased methods tackle the long-tail
problem by data/loss rebalancing to favor low-frequency relations. Second, a
subject-object pair can have two or more semantically overlapping relations.
While existing methods favor one over the other, our proposed HiLo framework
lets different network branches specialize on low- and high-frequency relations,
enforces their consistency, and fuses the results. To the best of our knowledge,
we are the first to propose an explicitly unbiased PSG method. In extensive
experiments we show that our HiLo framework achieves state-of-the-art results
on the PSG task. We also apply our method to the Scene Graph Generation task
that predicts boxes instead of masks and see improvements over all baseline
methods.
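As a simplified, hedged reading of the branch-specialization idea (not the actual HiLo implementation), one branch can be trained with a standard loss that favors high-frequency relations and the other with a rebalanced loss that favors low-frequency ones, combined with a consistency term and a fused prediction:

import torch
import torch.nn.functional as F

def hilo_step(logits_hi, logits_lo, target, freq_weights):
    # freq_weights: per-relation weights, larger for rare (low-frequency) relations
    loss_hi = F.cross_entropy(logits_hi, target)                        # frequency-biased head
    loss_lo = F.cross_entropy(logits_lo, target, weight=freq_weights)   # rebalanced head
    # symmetric KL consistency between the two heads' predictive distributions
    p_hi = F.log_softmax(logits_hi, dim=-1)
    p_lo = F.log_softmax(logits_lo, dim=-1)
    consist = 0.5 * (F.kl_div(p_hi, p_lo, log_target=True, reduction="batchmean")
                     + F.kl_div(p_lo, p_hi, log_target=True, reduction="batchmean"))
    fused = (logits_hi.softmax(-1) + logits_lo.softmax(-1)) / 2         # inference-time fusion
    return loss_hi + loss_lo + consist, fused

w = torch.ones(56); w[30:] = 4.0   # toy per-relation weights
loss, fused = hilo_step(torch.randn(8, 56), torch.randn(8, 56),
                        torch.randint(0, 56, (8,)), w)
print(loss.item(), fused.shape)
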
Redesigning Multi-Scale Neural Network for Crowd Counting
Perspective distortions and crowd variations make crowd counting a
challenging task in computer vision. To tackle it, many previous works have
used multi-scale architecture in deep neural networks (DNNs). Multi-scale
branches can be either directly merged (e.g. by concatenation) or merged
through the guidance of proxies (e.g. attentions) in the DNNs. Despite their
prevalence, these combination methods are not sophisticated enough to deal with
the per-pixel performance discrepancy over multi-scale density maps. In this
work, we redesign the multi-scale neural network by introducing a hierarchical
mixture of density experts, which hierarchically merges multi-scale density
maps for crowd counting. Within the hierarchical structure, an expert
competition and collaboration scheme is presented to encourage contributions
from all scales; pixel-wise soft gating nets are introduced to provide
pixel-wise soft weights for scale combinations in different hierarchies. The
network is optimized using both the crowd density map and the local counting
map, where the latter is obtained by local integration on the former.
Optimizing both can be problematic because of their potential conflicts. We
introduce a new relative local counting loss based on relative count
differences among hard-predicted local regions in an image, which proves to be
complementary to the conventional absolute error loss on the density map.
Experiments show that our method achieves the state-of-the-art performance on
five public datasets, i.e. ShanghaiTech, UCF_CC_50, JHU-CROWD++, NWPU-Crowd and
Trancos.
Comment: IEEE Transactions on Image Processing
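For illustration, the local counting map can be obtained by integrating the density map over non-overlapping windows, and a relative count loss can then compare pairwise count differences between regions; the exact region pairing, region selection and normalization below are assumptions rather than the paper's formulation:

import torch
import torch.nn.functional as F

def local_count_map(density, window=32):
    # sum-pool the density map so that every cell holds a local people count
    return F.avg_pool2d(density, window) * window * window

def relative_count_loss(pred_density, gt_density, window=32, eps=1.0):
    cp = local_count_map(pred_density, window).flatten(1)    # B x R local counts
    cg = local_count_map(gt_density, window).flatten(1)
    # pairwise count differences between regions, matched between prediction and ground truth
    dp = cp.unsqueeze(2) - cp.unsqueeze(1)
    dg = cg.unsqueeze(2) - cg.unsqueeze(1)
    return (torch.abs(dp - dg) / (torch.abs(dg) + eps)).mean()

pred, gt = torch.rand(2, 1, 128, 128), torch.rand(2, 1, 128, 128)
print(relative_count_loss(pred, gt))
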
Enhancing Space-time Video Super-resolution via Spatial-temporal Feature Interaction
The target of space-time video super-resolution (STVSR) is to increase both
the frame rate (also referred to as the temporal resolution) and the spatial
resolution of a given video. Recent approaches solve STVSR with end-to-end deep
neural networks. A popular solution is to first increase the frame rate of the
video; then perform feature refinement among different frame features; and
lastly increase the spatial resolutions of these features. The temporal correlation
among features of different frames is carefully exploited in this process. The
spatial correlation among features of different (spatial) resolutions, despite
also being very important, is however not emphasized. In this paper, we propose
a spatial-temporal feature interaction network to enhance STVSR by exploiting
both spatial and temporal correlations among features of different frames and
spatial resolutions. Specifically, the spatial-temporal frame interpolation
module is introduced to interpolate low- and high-resolution intermediate frame
features simultaneously and interactively. The spatial-temporal local and
global refinement modules are respectively deployed afterwards to exploit the
spatial-temporal correlation among different features for their refinement.
Finally, a novel motion consistency loss is employed to enhance the motion
continuity among reconstructed frames. We conduct experiments on three standard
benchmarks, Vid4, Vimeo-90K and Adobe240, and the results demonstrate that our
method improves on state-of-the-art methods by a considerable margin. Our
code will be available at
https://github.com/yuezijie/STINet-Space-time-Video-Super-resolution
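As a hedged sketch, a motion consistency term can be written by requiring temporal differences of consecutive reconstructed frames to match those of the reference frames; using frame differences as a motion proxy is an illustrative assumption, not necessarily the paper's exact loss:

import torch
import torch.nn.functional as F

def motion_consistency_loss(pred_frames, gt_frames):
    """pred_frames, gt_frames: B x T x C x H x W reconstructed and reference videos."""
    pred_motion = pred_frames[:, 1:] - pred_frames[:, :-1]   # frame-to-frame changes
    gt_motion = gt_frames[:, 1:] - gt_frames[:, :-1]
    return F.l1_loss(pred_motion, gt_motion)

pred = torch.rand(1, 7, 3, 64, 64)
gt = torch.rand(1, 7, 3, 64, 64)
print(motion_consistency_loss(pred, gt))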