Learning Distinct and Representative Modes for Image Captioning
Over the years, state-of-the-art (SoTA) image captioning methods have
achieved promising results on some evaluation metrics (e.g., CIDEr). However,
recent findings show that the captions generated by these methods tend to be
biased toward the "average" caption that only captures the most general mode
(a.k.a. language pattern) in the training corpus, i.e., the so-called mode
collapse problem. As a result, the generated captions are limited in
diversity and usually less informative than natural image descriptions made by
humans. In this paper, we seek to avoid this problem by proposing a Discrete
Mode Learning (DML) paradigm for image captioning. Our innovative idea is to
explore the rich modes in the training caption corpus to learn a set of "mode
embeddings", and further use them to control the mode of the generated captions
for existing image captioning models. Specifically, the proposed DML optimizes
a dual architecture that consists of an image-conditioned discrete variational
autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC)
branch. The CdVAE branch maps each image caption to one of the mode embeddings
stored in a learned codebook, and is trained with a pure non-autoregressive
generation objective to make the modes distinct and representative. The MIC
branch can be obtained by simply modifying an existing image captioning model, where
the mode embedding is added to the original word embeddings as the control
signal. In the experiments, we apply the proposed DML to two widely used image
captioning models, Transformer and AoANet. The results show that the learned
mode embeddings successfully enable these models to generate high-quality
image captions with different modes, leading to better performance in both
diversity and quality on the MSCOCO dataset.
Comment: To appear in NeurIPS 202
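The abstract above describes a mode embedding, looked up from a learned codebook, being added to the word embeddings of an existing captioning model as a control signal. Below is a minimal PyTorch sketch of that conditioning step only (the CdVAE branch and training objective are omitted); the vocabulary size, model width, and number of modes are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' released code) of the control-signal step:
# a discrete codebook of mode embeddings, one of which is added to every word
# embedding. vocab_size, d_model, and num_modes are illustrative assumptions.
import torch
import torch.nn as nn

class ModeConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size: int = 10000, d_model: int = 512, num_modes: int = 64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # Learned codebook: one embedding vector per discrete mode (language pattern).
        self.mode_codebook = nn.Embedding(num_modes, d_model)

    def forward(self, token_ids: torch.Tensor, mode_id: torch.Tensor) -> torch.Tensor:
        words = self.word_emb(token_ids)                 # (batch, seq_len, d_model)
        mode = self.mode_codebook(mode_id).unsqueeze(1)  # (batch, 1, d_model)
        return words + mode                              # mode broadcast over the sequence

emb = ModeConditionedEmbedding()
tokens = torch.randint(0, 10000, (2, 12))
modes = torch.tensor([3, 17])      # choosing different modes steers the generated caption
print(emb(tokens, modes).shape)    # torch.Size([2, 12, 512])
```

Per the abstract, replacing the input embeddings in this way is the main modification the MIC branch needs on top of an existing captioner such as Transformer or AoANet.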
ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora
Generating visually grounded image captions with specific linguistic styles
using unpaired stylistic corpora is a challenging task, especially since we
expect stylized captions with a wide variety of stylistic patterns. In this
paper, we propose a novel framework to generate Accurate and Diverse Stylized
Captions (ADS-Cap). Our ADS-Cap first uses a contrastive learning module to
align the image and text features, which unifies paired factual and unpaired
stylistic corpora during the training process. A conditional variational
auto-encoder is then used to automatically memorize diverse stylistic patterns
in latent space and enhance diversity through sampling. We also design a simple
but effective recheck module to boost style accuracy by filtering
style-specific captions. Experimental results on two widely used stylized image
captioning datasets show that ADS-Cap achieves outstanding performance compared to
various baselines in terms of consistency with the image, style accuracy, and
diversity. We finally conduct extensive analyses to understand the
effectiveness of our method. Our code is available at
https://github.com/njucckevin/ADS-Cap.
Comment: Accepted at Natural Language Processing and Chinese Computing (NLPCC) 202
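As a rough illustration of the diversity mechanism described above, the sketch below shows a conditional-VAE latent module: a posterior is inferred from caption and image features during training, and several latent codes are sampled at inference time to produce stylistically diverse captions. Feature and latent dimensions are assumptions; the contrastive alignment and recheck modules are not shown.

```python
# Minimal sketch (illustrative dimensions, not the ADS-Cap implementation) of a
# conditional-VAE latent module used to obtain diversity by sampling.
import torch
import torch.nn as nn

class StyleLatentModule(nn.Module):
    def __init__(self, feat_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def posterior(self, caption_feat: torch.Tensor):
        return self.to_mu(caption_feat), self.to_logvar(caption_feat)

    @staticmethod
    def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

module = StyleLatentModule()
feat = torch.randn(4, 512)                    # stand-in for encoded caption+image features
mu, logvar = module.posterior(feat)
z_train = module.reparameterize(mu, logvar)   # training: z ~ q(z | caption, image)
z_samples = [torch.randn(4, 64) for _ in range(5)]  # inference: sample the prior for diverse styles
```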
AMENet: Attentive Maps Encoder Network for Trajectory Prediction
Trajectory prediction is critical for planning safe future movements, and it
remains challenging even for horizons of a few seconds in urban mixed
traffic. How an agent moves is affected by the various behaviors of its
neighboring agents in different environments. To predict movements, we propose
an end-to-end generative model named Attentive Maps Encoder Network (AMENet)
that encodes the agent's motion and interaction information for accurate and
realistic multi-path trajectory prediction. A conditional variational
auto-encoder module is trained to learn the latent space of possible future
paths based on attentive dynamic maps for interaction modeling and then is used
to predict multiple plausible future trajectories conditioned on the observed
past trajectories. The efficacy of AMENet is validated using two public
trajectory prediction benchmarks, Trajnet and InD.
Comment: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing
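To make the multi-path idea concrete, here is a small sketch of how a CVAE-style decoder can emit several plausible futures from one encoded past. The encoder output size, latent size, prediction horizon, and number of samples are illustrative assumptions, and the attentive dynamic maps are represented only by a placeholder encoding.

```python
# Minimal sketch (illustrative sizes, not the AMENet implementation): sampling
# several plausible future trajectories conditioned on an encoding of the
# observed past (which, in AMENet, also summarizes attentive dynamic maps).
import torch
import torch.nn as nn

class TrajectorySampler(nn.Module):
    def __init__(self, enc_dim: int = 128, latent_dim: int = 32, pred_len: int = 12):
        super().__init__()
        self.latent_dim = latent_dim
        self.pred_len = pred_len
        self.decoder = nn.Sequential(
            nn.Linear(enc_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, pred_len * 2),  # (x, y) offset per future time step
        )

    def forward(self, past_encoding: torch.Tensor, num_samples: int = 20) -> torch.Tensor:
        batch = past_encoding.size(0)
        futures = []
        for _ in range(num_samples):
            z = torch.randn(batch, self.latent_dim, device=past_encoding.device)
            out = self.decoder(torch.cat([past_encoding, z], dim=-1))
            futures.append(out.view(batch, self.pred_len, 2))
        return torch.stack(futures, dim=1)  # (batch, num_samples, pred_len, 2)

sampler = TrajectorySampler()
enc = torch.randn(8, 128)       # stand-in for the attentive-maps encoder output
print(sampler(enc).shape)       # torch.Size([8, 20, 12, 2])
```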
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
In the rapidly advancing field of multi-modal machine learning (MMML), the
convergence of multiple data modalities has the potential to reshape various
applications. This paper presents a comprehensive overview of the current
state, advancements, and challenges of MMML within the sphere of engineering
design. The review begins with a deep dive into five fundamental concepts of
MMML: multi-modal information representation, fusion, alignment, translation,
and co-learning. Following this, we explore the cutting-edge applications of
MMML, placing a particular emphasis on tasks pertinent to engineering design,
such as cross-modal synthesis, multi-modal prediction, and cross-modal
information retrieval. Through this comprehensive overview, we highlight the
inherent challenges in adopting MMML in engineering design, and proffer
potential directions for future research. To spur on the continued evolution of
MMML in engineering design, we advocate for concentrated efforts to construct
extensive multi-modal design datasets, develop effective data-driven MMML
techniques tailored to design applications, and enhance the scalability and
interpretability of MMML models. As the next generation of intelligent design
tools, MMML models hold great promise to shape how products are designed.
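As a deliberately simplified illustration of one of the five concepts the review covers, the sketch below shows late fusion of an image feature and a text feature into a shared design representation. All dimensions and module names are assumptions made for illustration, not anything prescribed by the paper.

```python
# Minimal sketch of multi-modal fusion (one of the review's five concepts):
# concatenate image and text features and project to a shared representation.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, img_dim: int = 2048, txt_dim: int = 768, fused_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim + txt_dim, fused_dim),
            nn.ReLU(),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([img_feat, txt_feat], dim=-1))

fusion = LateFusion()
print(fusion(torch.randn(2, 2048), torch.randn(2, 768)).shape)  # torch.Size([2, 512])
```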