Inductive Relation Prediction from Relational Paths and Context with Hierarchical Transformers
Relation prediction on knowledge graphs (KGs) is a key research topic.
Dominant embedding-based methods mainly focus on the transductive setting and
lack the inductive ability to generalize to new entities for inference.
Existing methods for inductive reasoning mostly mine the connections between
entities, i.e., relational paths, without considering the nature of head and
tail entities contained in the relational context. This paper proposes a novel
method that captures both connections between entities and the intrinsic nature
of entities, by simultaneously aggregating RElational Paths and cOntext with a
unified hieRarchical Transformer framework, namely REPORT. REPORT relies solely
on relation semantics and can naturally generalize to the fully-inductive
setting, where KGs for training and inference have no common entities. In the
experiments, REPORT performs consistently better than all baselines on almost
all the eight version subsets of two fully-inductive datasets. Moreover. REPORT
is interpretable by providing each element's contribution to the prediction
results.Comment: Accepted by ICASSP 2023 (Oral
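As a rough illustration of the hierarchical idea described above, the sketch below encodes each relational path with a bottom-level Transformer and then lets a top-level Transformer jointly attend over path representations, the head/tail relational context, and the candidate relation. All module names, dimensions, and the pooling choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical two-level Transformer that scores a candidate relation from
# relational paths and relation-only context (illustrative sketch only).
import torch
import torch.nn as nn

class HierarchicalPathContextScorer(nn.Module):
    def __init__(self, num_relations, dim=64, heads=4):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, dim)
        path_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        agg_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.path_encoder = nn.TransformerEncoder(path_layer, num_layers=2)  # bottom level: each relational path
        self.aggregator = nn.TransformerEncoder(agg_layer, num_layers=2)     # top level: paths + context + target
        self.score = nn.Linear(dim, 1)

    def forward(self, paths, head_context, tail_context, target_relation):
        # paths: (num_paths, path_len) relation ids; contexts: (k,) relation ids
        path_repr = self.path_encoder(self.rel_emb(paths)).mean(dim=1)       # one vector per path
        ctx_repr = self.rel_emb(torch.cat([head_context, tail_context]))     # relation-only entity context
        target = self.rel_emb(target_relation).unsqueeze(0)                  # candidate relation as a query token
        tokens = torch.cat([target, path_repr, ctx_repr], dim=0).unsqueeze(0)
        fused = self.aggregator(tokens)                                      # joint attention over all elements
        return self.score(fused[0, 0])                                       # plausibility of the candidate

# toy usage
model = HierarchicalPathContextScorer(num_relations=10)
paths = torch.randint(0, 10, (3, 4))       # 3 relational paths of length 4
head_ctx = torch.randint(0, 10, (5,))      # relations incident to the head entity
tail_ctx = torch.randint(0, 10, (5,))
print(model(paths, head_ctx, tail_ctx, torch.tensor(2)))
```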
Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation
Existing autoregressive models follow the two-stage generation paradigm that
first learns a codebook in the latent space for image reconstruction and then
completes the image generation autoregressively based on the learned codebook.
However, existing codebook learning simply models all local region information
of images without distinguishing their different perceptual importance. The
resulting redundancy in the learned codebook not only limits the ability of the
next-stage autoregressive model to capture important structure but also leads
to high training cost and slow generation speed. In this study, we borrow the idea
of importance perception from classical image coding theory and propose a novel
two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) and
Stackformer, to relieve the model from modeling redundancy. Specifically,
MQ-VAE incorporates an adaptive mask module for masking redundant region
features before quantization and an adaptive de-mask module for recovering the
original grid image feature map to faithfully reconstruct the original images
after quantization. Then, Stackformer learns to predict the combination of the
next code and its position in the feature map. Comprehensive experiments on
various image generation tasks validate the effectiveness and efficiency of our
approach. Code will be released at
https://github.com/CrossmodalGroup/MaskedVectorQuantization.
Comment: Accepted by CVPR 2023
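The mask-then-quantize idea can be sketched as follows: score each region feature, quantize only the top-k most important ones against a codebook, and scatter them back onto the full grid with a placeholder embedding so a decoder can still reconstruct the image. Shapes, the scoring rule, and the placeholder handling here are assumptions for illustration, not the MQ-VAE code.

```python
# Minimal sketch of masking redundant regions before vector quantization.
import torch
import torch.nn as nn

class MaskedQuantizer(nn.Module):
    def __init__(self, dim=32, codebook_size=128, keep_ratio=0.5):
        super().__init__()
        self.importance = nn.Linear(dim, 1)               # adaptive mask module: scores each region
        self.codebook = nn.Embedding(codebook_size, dim)  # shared latent codebook
        self.mask_token = nn.Parameter(torch.zeros(dim))  # filler for masked (unquantized) regions
        self.keep_ratio = keep_ratio

    def forward(self, feats):                             # feats: (num_regions, dim)
        n_keep = max(1, int(self.keep_ratio * feats.size(0)))
        scores = self.importance(feats).squeeze(-1)
        keep_idx = scores.topk(n_keep).indices            # most important regions survive
        kept = feats[keep_idx]
        dists = torch.cdist(kept, self.codebook.weight)   # nearest-codeword quantization of kept regions
        codes = dists.argmin(dim=-1)
        quantized = self.codebook(codes)
        # "de-mask": rebuild the full grid so a decoder can reconstruct the image
        grid = self.mask_token.expand(feats.size(0), -1).clone()
        grid[keep_idx] = quantized
        return grid, codes, keep_idx

feats = torch.randn(16, 32)                               # e.g. a 4x4 feature map, flattened
grid, codes, kept = MaskedQuantizer()(feats)
print(grid.shape, codes.shape, kept.shape)
```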
E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation
Achieving empathy is a crucial step toward humanized dialogue systems.
Current approaches to empathetic dialogue generation mainly predict an emotion
label and generate an empathetic response conditioned on it. This treats
emotions independently and ignores the intrinsic emotion correlation in
dialogues, resulting in inaccurate emotion perception and unsuitable response
generation. In this paper, we propose a novel emotion correlation enhanced
empathetic dialogue generation framework, which comprehensively realizes
emotion correlation learning, utilization, and supervision. Specifically, a
multi-resolution emotion graph is devised to
capture context-based emotion interactions from different resolutions, further
modeling emotion correlation. Then we propose an emotion correlation enhanced
decoder, with a novel correlation-aware aggregation and soft/hard strategy,
respectively improving the emotion perception and response generation.
Experimental results on the benchmark dataset demonstrate the superiority of
our model in both empathetic perception and expression.
Comment: 19 pages, 6 figures
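One way to picture correlation-enhanced emotion perception is the toy computation below: per-utterance emotion distributions induce an emotion-emotion correlation matrix, and one propagation step along that matrix refines the dialogue-level emotion scores. This is only a loose illustration of the idea, not the paper's multi-resolution graph or decoder.

```python
# Illustrative sketch: refine emotion scores using an estimated emotion correlation matrix.
import torch

def correlation_aware_emotion_scores(utterance_probs):
    """utterance_probs: (num_utterances, num_emotions) softmax outputs per utterance."""
    # emotion-emotion correlation estimated from co-activation across the dialogue context
    corr = utterance_probs.T @ utterance_probs          # (num_emotions, num_emotions)
    corr = corr / corr.sum(dim=-1, keepdim=True)        # row-normalised adjacency
    context_scores = utterance_probs.mean(dim=0)        # naive context-level emotion estimate
    refined = corr @ context_scores                     # propagate scores along correlated emotions
    return 0.5 * context_scores + 0.5 * refined         # blend direct and correlation-enhanced signals

probs = torch.softmax(torch.randn(4, 6), dim=-1)        # 4 utterances, 6 emotion classes
print(correlation_aware_emotion_scores(probs))
```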
Image Captioning with Context-Aware Auxiliary Guidance
Image captioning is a challenging computer vision task, which aims to
generate a natural language description of an image. Most recent studies follow
the encoder-decoder framework, which depends heavily on previously generated
words for the current prediction. Such methods cannot effectively take
advantage of future predicted information to learn complete semantics.
In this paper, we propose Context-Aware Auxiliary Guidance (CAAG) mechanism
that can guide the captioning model to perceive global contexts. Upon the
captioning model, CAAG performs semantic attention that selectively
concentrates on useful information of the global predictions to reproduce the
current generation. To validate the adaptability of the method, we apply CAAG
to three popular captioners, and our proposal achieves competitive performance
on the challenging Microsoft COCO image captioning benchmark, e.g., a 132.2
CIDEr-D score on the Karpathy split and a 130.7 CIDEr-D (c40) score on the
official online evaluation server.
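The guidance mechanism described above can be sketched as a small attention module: the decoder state at the current step attends over embeddings of a full draft (global) prediction and re-predicts the current word from that context. Module names, shapes, and the single-head-of-use design are assumptions, not the authors' exact architecture.

```python
# Hedged sketch of attending over global draft predictions to guide the current step.
import torch
import torch.nn as nn

class AuxiliaryGuidance(nn.Module):
    def __init__(self, dim=64, vocab=1000):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, step_state, global_pred_emb):
        # step_state: (batch, 1, dim) decoder state at the current step
        # global_pred_emb: (batch, T, dim) embeddings of the full draft caption
        ctx, _ = self.attn(step_state, global_pred_emb, global_pred_emb)
        return self.out(ctx.squeeze(1))      # auxiliary logits for re-predicting the current word

guide = AuxiliaryGuidance()
logits = guide(torch.randn(2, 1, 64), torch.randn(2, 12, 64))
print(logits.shape)                          # (2, 1000)
```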
On the Calibration of Large Language Models and Alignment
As large language models attract increasing attention and find widespread
application, challenges of reliability arise concurrently.
Confidence calibration, an effective analysis method for gauging the
reliability of deep models, serves as a crucial tool for assessing and
improving their reliability. However, such investigation has been comparatively
underexplored. In this work, we conduct a systematic examination of the
calibration of aligned language models throughout the entire construction
process, including pretraining and alignment training. At each stage, we
investigate how different training settings, such as parameter scales and
training data, affect model calibration. To thoroughly assess model
calibration, we evaluate models on three aspects of primary concern: generation,
factuality, and understanding. Our work sheds light on whether popular LLMs are
well-calibrated and how the training process influences model calibration.
Comment: To be published in Findings of EMNLP 2023
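Confidence calibration studies of this kind typically report expected calibration error (ECE), which compares average confidence to empirical accuracy within confidence bins. Below is a plain numpy sketch with equal-width bins; the binning scheme and inputs are generic rather than taken from the paper.

```python
# Expected calibration error (ECE) with equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probability of the chosen answer; correct: 0/1 outcomes."""
    confidences, correct = np.asarray(confidences, float), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap       # weight each bin by its share of samples
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1]))
```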
Random Entity Quantization for Parameter-Efficient Compositional Knowledge Graph Representation
Representation Learning on Knowledge Graphs (KGs) is essential for downstream
tasks. The dominant approach, KG Embedding (KGE), represents entities with
independent vectors and faces the scalability challenge. Recent studies propose
an alternative way for parameter efficiency, which represents entities by
composing entity-corresponding codewords matched from predefined small-scale
codebooks. We refer to the process of obtaining corresponding codewords of each
entity as entity quantization, for which previous works have designed
complicated strategies. Surprisingly, this paper shows that simple random
entity quantization can achieve similar results to current strategies. We
analyze this phenomenon and reveal that entity codes, the quantization outcomes
for expressing entities, have higher entropy at the code level and higher
Jaccard distance at the codeword level under random entity quantization. Therefore,
different entities become more easily distinguished, facilitating effective KG
representation. The above results show that current quantization strategies are
not critical for KG representation, and there is still room for improvement in
entity distinguishability beyond current strategies. The code to reproduce our
results is available at https://github.com/JiaangL/RandomQuantization.
Comment: Accepted to EMNLP 2023
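A minimal sketch of random entity quantization is given below: each entity is tied to a random set of codewords from a small codebook, its representation is the composition (here, the mean) of those codewords, and the Jaccard distance between code sets gives a sense of how distinguishable entities are. Sizes and the composition operator are illustrative assumptions, not the released code.

```python
# Random entity quantization: compose entities from randomly sampled codewords.
import numpy as np

rng = np.random.default_rng(0)
num_entities, codebook_size, codewords_per_entity, dim = 1000, 50, 8, 32

codebook = rng.normal(size=(codebook_size, dim))
# sample each entity's codeword indices uniformly at random, without replacement
entity_codes = np.stack([rng.choice(codebook_size, codewords_per_entity, replace=False)
                         for _ in range(num_entities)])
entity_vecs = codebook[entity_codes].mean(axis=1)       # compose codewords into entity vectors

def jaccard_distance(a, b):
    a, b = set(a.tolist()), set(b.tolist())
    return 1 - len(a & b) / len(a | b)

# higher pairwise Jaccard distance between entity code sets ~ easier to tell entities apart
print(entity_vecs.shape, jaccard_distance(entity_codes[0], entity_codes[1]))
```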
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
Text-to-image customization, which aims to synthesize text-driven images for
the given subjects, has recently revolutionized content creation. Existing
works follow the pseudo-word paradigm, i.e., represent the given subjects as
pseudo-words and then compose them with the given text. However, the inherent
entangled influence scope of pseudo-words with the given text results in a
dual-optimum paradox, i.e., the similarity of the given subjects and the
controllability of the given text could not be optimal simultaneously. We
present RealCustom that, for the first time, disentangles similarity from
controllability by precisely limiting subject influence to relevant parts only,
achieved by gradually narrowing the real text word from its general connotation
to the specific subject and using its cross-attention to distinguish relevance.
Specifically, RealCustom introduces a novel "train-inference" decoupled
framework: (1) during training, RealCustom learns general alignment between
visual conditions and original textual conditions via a novel adaptive scoring
module to adaptively modulate influence quantity; (2) during inference, a novel
adaptive mask guidance strategy is proposed to iteratively update the influence
scope and influence quantity of the given subjects to gradually narrow the
generation of the real text word. Comprehensive experiments demonstrate the
superior real-time customization ability of RealCustom in the open domain,
achieving both unprecedented similarity of the given subjects and
controllability of the given text for the first time. The project page is
https://corleone-huang.github.io/realcustom/.
Comment: Accepted by CVPR 2024
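The influence-scope idea can be pictured with the toy function below: the cross-attention map of the real text word is thresholded into a mask, and subject-conditioned features are injected only inside that mask while the rest stays under text control. The tensors, the top-ratio threshold rule, and the blending are illustrative assumptions, not the RealCustom implementation.

```python
# Hedged sketch of mask guidance from a text word's cross-attention map.
import torch

def apply_influence_scope(text_features, subject_features, word_attn, top_ratio=0.25):
    # word_attn: (H, W) cross-attention of the narrowed real text word over image latents
    flat = word_attn.flatten()
    k = max(1, int(top_ratio * flat.numel()))
    thresh = flat.topk(k).values.min()                   # keep the most attended regions
    mask = (word_attn >= thresh).float().unsqueeze(-1)   # (H, W, 1) influence scope
    # inject subject information only where the real word attends; keep text control elsewhere
    return mask * subject_features + (1 - mask) * text_features

text_feat = torch.randn(16, 16, 64)
subj_feat = torch.randn(16, 16, 64)
attn = torch.rand(16, 16)
print(apply_influence_scope(text_feat, subj_feat, attn).shape)
```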
Improving Image Captioning via Predicting Structured Concepts
Faced with the difficulty of bridging the semantic gap between images and texts
in image captioning, conventional studies in this area have treated semantic
concepts as a bridge between the two modalities and improved captioning
performance accordingly. Although promising results on concept prediction were
obtained, these studies normally ignore the relationships among concepts, which
depend not only on objects in the image but also on word dependencies in the
text, and thus offer considerable potential for improving the generation of
good descriptions. In this paper, we
propose a structured concept predictor (SCP) to predict concepts and their
structures, then we integrate them into captioning, so as to enhance the
contribution of visual signals in this task via concepts and further use their
relations to distinguish cross-modal semantics for better description
generation. Particularly, we design weighted graph convolutional networks
(W-GCN) to depict concept relations driven by word dependencies, and then learn
differentiated contributions from these concepts for the following decoding
process. Therefore, our approach captures potential relations among concepts
and discriminatively learns different concepts, thereby effectively facilitating
image captioning with inherited information across modalities. Extensive
experiments and their results demonstrate the effectiveness of our approach as
well as each proposed module in this work.
Comment: Accepted by EMNLP 2023 (Main Conference, Oral)
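In the same spirit as the W-GCN described above, a single weighted graph convolution over concept nodes can be sketched as follows; the weighting scheme, normalisation, and shapes are assumptions for illustration rather than the authors' exact design.

```python
# Sketch of one weighted graph-convolution layer over predicted concept nodes.
import torch
import torch.nn as nn

class WeightedGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, concept_feats, adj_weights):
        # concept_feats: (N, dim); adj_weights: (N, N) edge weights from word dependencies
        adj = adj_weights + torch.eye(adj_weights.size(0))   # add self-loops
        adj = adj / adj.sum(dim=-1, keepdim=True)            # row-normalise weighted adjacency
        return torch.relu(adj @ self.proj(concept_feats))    # weighted neighbourhood aggregation

layer = WeightedGCNLayer(dim=64)
concepts = torch.randn(5, 64)                                # 5 predicted concepts
weights = torch.rand(5, 5)                                   # dependency-driven edge weights
print(layer(concepts, weights).shape)
```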