Cross-relation Cross-bag Attention for Distantly-supervised Relation Extraction
Distant supervision leverages knowledge bases to automatically label
instances, allowing us to train relation extractors without human
annotations. However, the generated training data typically contain massive
noise and may result in poor performance with vanilla supervised
learning. In this paper, we propose to conduct multi-instance learning with a
novel Cross-relation Cross-bag Selective Attention (CSA), which leads to
noise-robust training for distantly supervised relation extractors. Specifically,
we employ sentence-level selective attention to reduce the effect of noisy
or mismatched sentences, while the correlation among relations is captured to
improve the quality of attention weights. Moreover, instead of treating all
entity pairs equally, we pay more attention to entity pairs of higher
quality, again via the selective attention mechanism. Experiments with two
types of relation extractors demonstrate the superiority of the proposed
approach over the state-of-the-art, while further ablation studies verify our
intuitions and demonstrate the effectiveness of the two proposed techniques.
Comment: AAAI 201
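The two levels of selective attention the abstract describes can be sketched in a few lines of numpy. This is an illustrative reading, not the authors' exact formulation: the dot-product scoring, function names, and the way bag representations are re-weighted against the relation query are our assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def selective_attention(sentence_embs, relation_query):
    """Sentence-level selective attention: weight each sentence in a bag by
    its match to a relation query, then return the weighted bag vector."""
    scores = sentence_embs @ relation_query          # (n_sentences,)
    weights = softmax(scores)                        # down-weights noisy sentences
    return weights @ sentence_embs, weights          # (dim,), (n_sentences,)

def cross_bag_attention(bags, relation_query):
    """Cross-bag attention: aggregate several entity-pair bags for the same
    relation, paying more attention to higher-quality bags."""
    bag_reps = np.stack([selective_attention(b, relation_query)[0] for b in bags])
    bag_weights = softmax(bag_reps @ relation_query)
    return bag_weights @ bag_reps                    # (dim,)
```

The same softmax-over-similarity mechanism serves both levels, which is the symmetry the abstract points to ("Similarly, we adopt the selective attention mechanism").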
Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model
The rising demand for creating lifelike avatars in the digital realm has led
to an increased need for generating high-quality human videos guided by textual
descriptions and poses. We propose Dancing Avatar, designed to fabricate human
motion videos driven by poses and textual cues. Our approach employs a
pretrained T2I diffusion model to generate each video frame in an
autoregressive fashion. The crux of innovation lies in our adept utilization of
the T2I diffusion model for producing video frames successively while
preserving contextual relevance. We surmount the hurdles posed by maintaining
human character and clothing consistency across varying poses, along with
upholding the background's continuity amidst diverse human movements. To ensure
consistent human appearances across the entire video, we devise an intra-frame
alignment module. This module assimilates text-guided synthesized human
character knowledge into the pretrained T2I diffusion model, synergizing
insights from ChatGPT. For preserving background continuity, we put forth a
background alignment pipeline, amalgamating insights from segment anything and
image inpainting techniques. Furthermore, we propose an inter-frame alignment
module that draws inspiration from an auto-regressive pipeline to augment
temporal consistency between adjacent frames, where the preceding frame guides
the synthesis process of the current frame. Comparisons with state-of-the-art
methods demonstrate that Dancing Avatar generates human videos of markedly
superior quality, in terms of human fidelity, background fidelity, and
temporal coherence.
Comment: 11 pages, 3 figure
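The autoregressive control flow described above, where each frame is generated conditioned on the previous one, can be sketched as follows. The `generate_frame` body here is a toy stand-in (it just blends the previous frame with noise), not the paper's diffusion model; only the loop structure reflects the abstract.

```python
import numpy as np

def generate_frame(prompt, pose, prev_frame, rng):
    """Toy stand-in for a pose- and text-conditioned T2I diffusion step.
    Conditioning on prev_frame is what carries appearance and background
    forward between frames (the inter-frame alignment idea)."""
    noise = rng.normal(size=prev_frame.shape)
    return 0.8 * prev_frame + 0.2 * noise

def synthesize_video(prompt, poses, frame_shape=(8, 8, 3), seed=0):
    """Generate one frame per pose, each guided by the preceding frame."""
    rng = np.random.default_rng(seed)
    frames = []
    prev = np.zeros(frame_shape)        # blank canvas before the first frame
    for pose in poses:                  # one generation pass per pose cue
        prev = generate_frame(prompt, pose, prev, rng)
        frames.append(prev)
    return frames
```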
Consensus Graph Representation Learning for Better Grounded Image Captioning
Contemporary visual captioning models frequently hallucinate objects that
are not actually in a scene, due to visual misclassification or
over-reliance on priors, resulting in semantic inconsistency between
the visual information and the target lexical words. The most common remedy is to
encourage the captioning model to dynamically link generated object words or
phrases to appropriate regions of the image, i.e., grounded image
captioning (GIC). However, GIC relies on an auxiliary task (grounding objects)
that does not solve the key issue behind object hallucination, i.e., the semantic
inconsistency. In this paper, we take a novel perspective on the issue:
exploiting the semantic coherency between the visual and language modalities.
Specifically, we propose the Consensus Graph Representation Learning framework
(CGRL) for GIC, which incorporates a consensus representation into the grounded
captioning pipeline. The consensus is learned by aligning the visual graph
(e.g., scene graph) to the language graph, considering both the nodes and
edges of each graph. With the aligned consensus, the captioning model can capture
both the correct linguistic characteristics and visual relevance, and further
ground appropriate image regions. We validate the effectiveness of
our model, observing a significant decline in object hallucination (-9% CHAIRi) on
the Flickr30k Entities dataset. We also evaluate CGRL with several
automatic metrics and human evaluation; the results indicate that the proposed
approach simultaneously improves image captioning (+2.9 CIDEr) and grounding
(+2.3 F1LOC).
Comment: 9 pages, 5 figures, AAAI 202
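One way to picture the node-and-edge alignment the abstract mentions is the sketch below: soft-assign visual-graph nodes to language-graph nodes by embedding similarity, then score how well edges agree under that assignment. The cosine scoring and product-form edge consensus are our illustrative assumptions, not CGRL's actual objective.

```python
import numpy as np

def cosine_sim(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def align_graphs(visual_nodes, language_nodes, visual_edges, language_edges):
    """Soft-align visual nodes to language nodes, then score edge agreement
    under that alignment. Edges are (src_index, dst_index) pairs."""
    sim = cosine_sim(visual_nodes, language_nodes)
    # softmax over language nodes -> soft assignment per visual node
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    assign = e / e.sum(axis=1, keepdims=True)
    # a visual edge (i, j) agrees with a language edge (a, b) in proportion
    # to how strongly i maps to a and j maps to b
    score = sum(assign[i, a] * assign[j, b]
                for (i, j) in visual_edges for (a, b) in language_edges)
    return assign, score
```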
Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
Scene Graph Generation (SGG) aims to extract
relationships in images for vision understanding. Although recent works have
made steady progress on SGG, they still suffer from long-tail distribution
issues: tail predicates are more costly to train and harder to distinguish due
to the small amount of annotated data available compared to frequent
predicates. Existing
re-balancing strategies try to handle it via prior rules but are still confined
to pre-defined conditions, which are not scalable for various models and
datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao)
framework, where a visually-prompted language model is learned to generate
diverse fine-grained predicates in a low-resource way. The proposed CaCao can
be applied in a plug-and-play fashion and automatically strengthens existing
SGG models to tackle the long-tailed problem. Based on that, we further
introduce a novel
Entangled cross-modal prompt approach for open-world predicate scene graph
generation (Epic), where models can generalize to unseen predicates in a
zero-shot manner. Comprehensive experiments on three benchmark datasets show
that CaCao consistently boosts the performance of multiple scene graph
generation models in a model-agnostic way. Moreover, our Epic achieves
competitive performance on open-world predicate prediction. The data and code
for this paper are publicly available.
Comment: Accepted by ICCV 202
Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion
Owing to the unrestricted nature of the content in the training data, large
text-to-image diffusion models, such as Stable Diffusion (SD), are capable of
generating images with potentially copyrighted or dangerous content based on
the corresponding textual concepts. This includes specific intellectual
property (IP), human faces, and various artistic styles. However, Negative
Prompt, a widely used method for content removal, frequently fails to conceal
this content due to inherent limitations in its inference logic. In this work,
we propose a novel strategy named Degeneration-Tuning (DT) to shield
contents of unwanted concepts from SD weights. By utilizing Scrambled Grid to
reconstruct the correlation between undesired concepts and their corresponding
image domain, we guide SD to generate meaningless content when such textual
concepts are provided as input. As this adaptation occurs at the level of the
model's weights, the SD, after DT, can be grafted onto other conditional
diffusion frameworks like ControlNet to shield unwanted concepts. In addition
to qualitatively showcasing the effectiveness of our DT method in protecting
various types of concepts, a quantitative comparison of the SD before and after
DT indicates that the DT method does not significantly impact the generative
quality of other contents. The FID and IS scores of the model on COCO-30K
exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and
38.25, respectively, which clearly outperforms previous methods.
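A scrambled grid of the kind DT tunes against can be constructed as below: cut an image into patches and shuffle them, destroying the global structure associated with a concept while keeping local statistics intact. The exact grid size, permutation scheme, and function name are illustrative assumptions; the paper's procedure may differ.

```python
import numpy as np

def scramble_grid(image, grid=4, seed=0):
    """Cut an HxWxC image into grid x grid patches and shuffle them with a
    fixed permutation. Assumes H and W are divisible by `grid`."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[r*ph:(r+1)*ph, c*pw:(c+1)*pw]
               for r in range(grid) for c in range(grid)]
    order = np.random.default_rng(seed).permutation(len(patches))
    # reassemble the shuffled patches row by row
    rows = [np.concatenate([patches[order[r*grid + c]] for c in range(grid)],
                           axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)
```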
Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels
Conventional multi-label classification (MLC) methods assume that all samples
are fully labeled and identically distributed. Unfortunately, this assumption
is unrealistic in large-scale MLC data that has long-tailed (LT) distribution
and partial labels (PL). To address the problem, we introduce a novel task,
Partial labeling and Long-Tailed Multi-Label Classification (PLT-MLC), to
jointly consider the above two imperfect learning environments. Not
surprisingly, we find that most LT-MLC and PL-MLC approaches fail to solve the
PLT-MLC, resulting in significant performance degradation on the two proposed
PLT-MLC benchmarks. Therefore, we propose an end-to-end learning framework:
COrrection ModificatIon balanCe, abbreviated as COMIC. Our bootstrapping
philosophy is to simultaneously correct missing labels (Correction) when
prediction confidence exceeds a class-aware threshold, and to learn from
these recalled labels during training. We next propose a novel multi-focal
modifier loss that simultaneously addresses head-tail imbalance and
positive-negative imbalance to adaptively modify the attention to different
samples (Modification) under the LT class distribution. In addition, we develop
a balanced training strategy by distilling the model's learning effect from
head and tail samples, and thus design a balanced classifier (Balance)
conditioned on the head and tail learning effect to maintain stable performance
for all samples. Our experimental study shows that the proposed COMIC
significantly outperforms general MLC, LT-MLC, and PL-MLC methods in terms of
effectiveness and robustness on our newly created PLT-MLC datasets.
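The Correction step described above (recalling missing labels whose prediction confidence clears a class-aware threshold) can be sketched as follows. The -1 encoding for unknown labels, the per-class threshold vector, and the choice to treat unrecalled unknowns as negatives are our illustrative assumptions.

```python
import numpy as np

def correct_missing_labels(probs, labels, class_thresholds):
    """Recall labels for unknown entries (-1): mark positive when the model's
    confidence clears that class's threshold, negative otherwise. Known
    labels (0/1) are left untouched.

    probs: (n_samples, n_classes) predicted probabilities
    labels: (n_samples, n_classes) in {-1, 0, 1}, -1 = unknown
    class_thresholds: (n_classes,) per-class confidence thresholds
    """
    corrected = labels.copy()
    unknown = labels == -1
    recalled = unknown & (probs >= class_thresholds)  # broadcasts over classes
    corrected[recalled] = 1
    corrected[unknown & ~recalled] = 0
    return corrected
```

Making the threshold class-aware matters under a long-tailed distribution: a single global cutoff would rarely recall tail-class labels, whose predicted confidences are systematically lower.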