42 research outputs found
Pose Guided Human Image Synthesis with Partially Decoupled GAN
Pose Guided Human Image Synthesis (PGHIS) is a challenging task of
transforming a human image from the reference pose to a target pose while
preserving its style. Most existing methods encode the texture of the whole
reference human image into a latent space, and then utilize a decoder to
synthesize the image texture of the target pose. However, it is difficult to
recover the detailed texture of the whole human image. To alleviate this
problem, we propose a method by decoupling the human body into several parts
(e.g., hair, face, hands, feet, etc.) and then using each of these parts to
guide the synthesis of a realistic image of the person, which preserves the
detailed information of the generated images. In addition, we design a
multi-head attention-based module for PGHIS. Because most convolutional neural
network-based methods have difficulty modeling long-range dependencies due to
the convolutional operation, the long-range modeling capability of the
attention mechanism is better suited to the pose transfer task, especially
under sharp pose deformation. Extensive experiments on the Market-1501 and
DeepFashion datasets reveal that our method almost outperforms other existing
state-of-the-art methods in terms of both qualitative and quantitative metrics.
Comment: 16 pages, 14th Asian Conference on Machine Learning conference
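The long-range argument in the abstract above can be illustrated with a minimal sketch of scaled dot-product attention, the building block of multi-head attention. This is not the paper's module: it uses a single head for brevity, and the function names `softmax` and `attention` are hypothetical. It only shows that every output position is a weighted sum over all input positions, regardless of distance.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Single-head scaled dot-product attention (illustrative assumption,
    # not the paper's multi-head module).
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Each query is scored against every key, so distant positions
        # interact directly in one step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

A multi-head variant would run several such heads on different learned projections of the inputs and concatenate their outputs.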
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) aims to connect image and language via
outputting the corresponding object masks given a text description, which is a
fundamental vision-language task. Although many works have made considerable
progress on RIS, in this work we explore an essential question: "what if the
text description is wrong or misleading?" We term such a sentence a negative
sentence, and we find that existing works cannot handle such settings. To this
end, we propose a novel formulation
of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the
negative sentence inputs besides the regularly given text inputs. We present
three different datasets via augmenting the input negative sentences and a new
metric to unify both input types. Furthermore, we design a new
transformer-based model named RefSegformer, where we introduce a token-based
vision and language fusion module. This module can be easily extended to our
R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves
new state-of-the-art results on three regular RIS datasets and three R-RIS
datasets, and serves as a new solid baseline for further research. The
project page is at https://lxtgh.github.io/project/robust_ref_seg/.
Comment: technical report
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
In this work, we focus on open vocabulary instance segmentation to expand a
segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex
pipelines to establish one-to-one mappings between image regions and words in
captions. However, such methods build noisy supervision by matching non-visible
words to image regions, such as adjectives and verbs. Meanwhile, context words
are also important for inferring the existence of novel objects as they show
high inter-correlations with novel categories. To overcome these limitations,
we devise a joint Caption Grounding and Generation (CGG) framework,
which incorporates a novel grounding loss that only focuses on matching object
nouns to improve learning efficiency. We also introduce a caption generation
head that enables additional supervision and contextual modeling as a
complement to the grounding loss. Our analysis and results demonstrate
that grounding and generation components complement each other, significantly
enhancing the segmentation performance for novel classes. Experiments on the
COCO dataset in two settings, Open Vocabulary Instance Segmentation (OVIS)
and Open Set Panoptic Segmentation (OSPS), demonstrate the superiority of
CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel
classes without extra data on the OVIS task and a 15% PQ improvement for novel
classes on the OSPS benchmark.
Comment: ICCV-202
Towards Open Vocabulary Learning: A Survey
In the field of visual scene understanding, deep neural networks have made
impressive advancements in various core tasks like segmentation, tracking, and
detection. However, most approaches operate on the closed-set assumption,
meaning that the model can only identify pre-defined categories that are
present in the training set. Recently, open vocabulary settings were proposed
due to the rapid progress of vision language pre-training. These new approaches
seek to locate and recognize categories beyond the annotated label space. The
open vocabulary approach is more general, practical, and effective compared to
weakly supervised and zero-shot settings. This paper provides a thorough review
of open vocabulary learning, summarizing and analyzing recent developments in
the field. In particular, we begin by comparing it to related concepts such as
zero-shot learning, open-set recognition, and out-of-distribution detection.
Then, we review several closely related tasks in the case of segmentation and
detection, including long-tail problems, few-shot, and zero-shot settings. For
the method survey, we first present the basics of closed-set detection and
segmentation as preliminary knowledge. Next, we examine
various scenarios in which open vocabulary learning is used, identifying common
design elements and core ideas. Then, we compare the recent detection and
segmentation approaches in commonly used datasets and benchmarks. Finally, we
conclude with insights, issues, and discussions regarding future research
directions. To our knowledge, this is the first comprehensive literature review
of open vocabulary learning. We keep tracking related works at
https://github.com/jianzongwu/Awesome-Open-Vocabulary.
Comment: Project page at https://github.com/jianzongwu/Awesome-Open-Vocabular
Augmentation-induced Consistency Regularization for Classification
Deep neural networks have become popular in many supervised learning tasks,
but they may suffer from overfitting when the training dataset is limited. To
mitigate this, many researchers use data augmentation, which is a widely used
and effective method for increasing the variety of datasets. However, the
randomness introduced by data augmentation causes inevitable inconsistency
between training and inference, which limits the improvement. In this paper,
we propose a consistency regularization framework based on data augmentation,
called CR-Aug, which forces the output distributions of different sub-models
generated by data augmentation to be consistent with each other. Specifically,
CR-Aug evaluates the discrepancy between the output distributions of two
augmented versions of each sample, and it utilizes a stop-gradient operation to
minimize the consistency loss. We apply CR-Aug to image and audio
classification tasks and conduct extensive experiments to verify its
effectiveness in improving the generalization ability of classifiers. Our
CR-Aug framework is ready to use and can be easily adapted to many
state-of-the-art network architectures. Our empirical results show that CR-Aug
outperforms baseline methods by a significant margin.
Comment: This paper is accepted by IJCNN202
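The consistency idea described in the abstract above can be sketched roughly as follows. The abstract does not say which discrepancy measure CR-Aug uses, so the symmetric KL divergence here is an assumption, as are the function names `softmax`, `kl_divergence`, and `consistency_loss`. The stop-gradient step only has meaning inside an autograd framework, so it appears here as a comment rather than as code.

```python
import math

def softmax(logits):
    # Numerically stable softmax turning logits into a distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_a, logits_b):
    # logits_a and logits_b are the model outputs for two augmented
    # versions of the same sample.
    p = softmax(logits_a)
    q = softmax(logits_b)
    # Assumption: symmetric KL as the discrepancy. In an autograd
    # framework, the target side of each term would be wrapped in a
    # stop-gradient so that it acts as a fixed target.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In training, a term like this would typically be added to the usual classification loss with a weighting coefficient; the weighting scheme is not stated in the abstract.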
Improving Human Image Synthesis with Residual Fast Fourier Transformation and Wasserstein Distance
With the rapid development of the Metaverse, virtual humans have emerged, and
human image synthesis and editing techniques, such as pose transfer, have
recently become popular. Most of the existing techniques rely on GANs, which
can generate good human images even with large variations and occlusions.
However, to the best of our knowledge, the existing state-of-the-art method
still has two problems: first, the rendering of the synthetic image is not
realistic, with some regions poorly rendered; second, GAN training is unstable
and slow to converge, and can suffer from mode collapse. We propose several
methods to address these two problems. To improve the rendering effect, we use
the Residual Fast Fourier Transform Block to replace the traditional Residual
Block. Then, spectral normalization and Wasserstein distance are used to
improve the speed and stability of GAN training. Experiments demonstrate that
the methods we offer are effective at solving the problems listed above, and we
obtain state-of-the-art scores in LPIPS and PSNR.
Comment: This paper is accepted by IJCNN202
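The two training stabilizers named in the abstract above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the paper's implementation: spectral normalization is shown as power iteration on a small weight matrix, the Wasserstein losses follow the standard WGAN form, and all function names are hypothetical.

```python
import math

def _matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def spectral_norm(W, iters=50):
    # Estimate the largest singular value of W by power iteration.
    # Assumption: a fixed all-ones start vector; practical code usually
    # starts from a random vector to avoid an unlucky orthogonal start.
    Wt = [list(col) for col in zip(*W)]
    u = _normalize([1.0] * len(W))
    for _ in range(iters):
        v = _normalize(_matvec(Wt, u))
        u = _normalize(_matvec(W, v))
    return sum(ui * x for ui, x in zip(u, _matvec(W, v)))

def spectrally_normalize(W, iters=50):
    # Divide W by its spectral norm (a zero matrix is not handled here).
    sigma = spectral_norm(W, iters)
    return [[w / sigma for w in row] for row in W]

def critic_loss(real_scores, fake_scores):
    # Standard WGAN critic objective: mean score on fakes minus mean on reals.
    return sum(fake_scores) / len(fake_scores) - sum(real_scores) / len(real_scores)

def generator_loss(fake_scores):
    # Standard WGAN generator objective: raise the critic's score on fakes.
    return -sum(fake_scores) / len(fake_scores)
```

Spectral normalization constrains each layer's Lipschitz constant, which is one common way to keep the critic well behaved when training with a Wasserstein objective.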
Discovery of a Potential HER2 Inhibitor from Natural Products for the Treatment of HER2-Positive Breast Cancer
Breast cancer is one of the most lethal types of cancer in women worldwide due to late-stage detection and resistance to traditional chemotherapy. The human epidermal growth factor receptor 2 (HER2) is considered a validated target in breast cancer therapy. Even though a substantial effort has been made to develop HER2 inhibitors, only lapatinib has been approved by the U.S. Food and Drug Administration (FDA). Side effects were observed in a majority of the patients within one year of treatment initiation. Here, we took advantage of bioinformatics tools to identify novel effective HER2 inhibitors. Structure-based virtual screening combined with ADMET (absorption, distribution, metabolism, excretion, and toxicity) prediction was explored. In total, 11,247 natural compounds were screened. The top hits were evaluated by an in vitro HER2 kinase inhibition assay. The cell proliferation inhibition effect of the identified inhibitors was evaluated in HER2-overexpressing SKBR3 and BT474 cell lines. We found that ZINC15122021 showed favorable ADMET properties and attained high binding affinity against HER2. Moreover, ZINC15122021 showed high kinase inhibition activity against HER2 and presented outstanding cell proliferation inhibition activity against both SKBR3 and BT474 cell lines. The results reveal that ZINC15122021 can be a potential HER2 inhibitor.