Tell Me What Is Good About This Property: Leveraging Reviews For Segment-Personalized Image Collection Summarization
Image collection summarization techniques aim to present a compact
representation of an image gallery through a carefully selected subset of
images that captures its semantic content. When it comes to web content,
however, the ideal selection can vary based on the user's specific intentions
and preferences. This is particularly relevant at Booking.com, where presenting
properties and their visual summaries that align with users' expectations is
crucial. To address this challenge, we consider user intentions in the
summarization of property visuals by analyzing property reviews and extracting
the most significant aspects mentioned by users. By incorporating the insights
from reviews in our visual summaries, we enhance the summaries by presenting
the relevant content to a user. Moreover, we achieve it without the need for
costly annotations. Our experiments, including human perceptual studies,
demonstrate the superiority of our cross-modal approach, which we term
CrossSummarizer, over the no-personalization and image-based clustering
baselines.
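The selection step described above can be illustrated as a greedy aspect-coverage procedure. This is a minimal sketch, assuming image and review-aspect embeddings live in a shared embedding space; the `summarize_gallery` helper and its greedy objective are hypothetical illustrations, not the paper's actual method:

```python
import numpy as np

def summarize_gallery(image_embs, aspect_embs, k):
    """Greedily select k images whose embeddings best cover the
    aspects mined from reviews (e.g. "pool", "breakfast").

    image_embs:  (n_images, d) array of image embeddings
    aspect_embs: (n_aspects, d) array of review-aspect embeddings
    Returns the indices of the k selected images.
    """
    # Normalize so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    asp = aspect_embs / np.linalg.norm(aspect_embs, axis=1, keepdims=True)
    sim = img @ asp.T                       # (n_images, n_aspects)
    chosen, covered = [], np.zeros(asp.shape[0])
    for _ in range(k):
        # Greedy step: pick the image adding the most aspect coverage.
        gain = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        gain[chosen] = -np.inf              # never pick an image twice
        best = int(np.argmax(gain))
        chosen.append(best)
        covered = np.maximum(covered, sim[best])
    return chosen
```

Greedy maximization of a coverage objective of this kind is a standard heuristic for subset selection; the paper's reported pipeline additionally derives the aspects from review text rather than assuming them given.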
Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Most current multi-modal summarization methods follow a cascaded manner,
where an off-the-shelf object detector is first used to extract visual
features, then these features are fused with language representations to
generate the summary with an encoder-decoder model. The cascaded way cannot
capture the semantic alignments between images and paragraphs, which are
crucial to a precise summary. In this paper, we propose ViL-Sum to jointly
model paragraph-level Vision-Language Semantic Alignment and
Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal
encoder with two well-designed tasks, image reordering and image selection. The
joint multi-modal encoder captures the interactions between modalities, where
the reordering task guides the model to learn paragraph-level semantic
alignment and the selection task guides the model to select summary-related
images in the final summary. Experimental results show that our proposed
ViL-Sum significantly outperforms current state-of-the-art methods. In further
analysis, we find that the two well-designed tasks and the joint multi-modal
encoder effectively guide the model to learn reasonable paragraph-image and
summary-image relations.
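The image-selection task mentioned above can be illustrated, at inference time, as a simple thresholding step over per-image scores. This is a hedged sketch that assumes the scores have already been produced by the joint multi-modal encoder; the function name and threshold are hypothetical, not details from the paper:

```python
import numpy as np

def select_summary_images(image_logits, threshold=0.5):
    """Given one logit per candidate image (assumed to come from a
    joint multi-modal encoder), keep the images whose sigmoid
    probability of being summary-related exceeds the threshold."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(image_logits)))
    return [i for i, p in enumerate(probs) if p > threshold]
```

Training such a head is typically done with a binary cross-entropy loss against summary/non-summary labels; the reordering task would supply a separate permutation-prediction objective.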
Meetings and Meeting Modeling in Smart Environments
In this paper we survey our research on smart meeting rooms and its relevance for augmented reality meeting support and virtual reality generation of meetings in real time or offline. The research reported here forms part of the European 5th and 6th framework programme projects multi-modal meeting manager (M4) and augmented multi-party interaction (AMI). Both projects aim at building a smart meeting environment that is able to collect multimodal captures of the activities and discussions in a meeting room, with the aim of using this information as input to tools that allow real-time support, browsing, retrieval and summarization of meetings. Our aim is to research (semantic) representations of what takes place during meetings in order to allow generation, e.g. in virtual reality, of meeting activities (discussions, presentations, voting, etc.). Being able to do so also allows us to look at tools that provide support during a meeting and at tools that allow those unable to be physically present during a meeting to take part in a virtual way. This may lead to situations where the differences between real meeting participants, human-controlled virtual participants and (semi-)autonomous virtual participants disappear.
Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
Multimedia summarization with multimodal output (MSMO) is a recently explored
application in language grounding. It plays an essential role in real-world
applications, e.g., automatically generating cover images and titles for news
articles or providing introductions to online videos. However, existing methods
extract features from the whole video and article and use fusion methods to
select the representative one, thus usually ignoring the critical structure and
varying semantics. In this work, we propose a Semantics-Consistent Cross-domain
Summarization (SCCS) model based on optimal transport alignment with visual and
textual segmentation. Specifically, our method first decomposes both the video
and the article into segments to capture their structural semantics. SCCS then
follows a cross-domain alignment objective with optimal
transport distance, which leverages multimodal interaction to match and select
the visual and textual summary. We evaluated our method on three recent
multimodal datasets and demonstrated the effectiveness of our method in
producing high-quality multimodal summaries.
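The optimal transport alignment at the heart of SCCS can be illustrated with a plain entropy-regularized Sinkhorn solver over segment-to-segment costs. This is a minimal NumPy sketch assuming segment embeddings are given; the `sinkhorn` helper and its regularization value are illustrative, not the paper's implementation:

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport between histograms a and b
    (e.g. uniform weights over video and text segments) under the given
    cost matrix; returns the transport plan."""
    K = np.exp(-cost / reg)          # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns to match marginal b
        u = a / (K @ v)              # scale rows to match marginal a
    return u[:, None] * K * v[None, :]

def ot_distance(cost, a, b):
    """Alignment score: expected cost under the transport plan."""
    return float((sinkhorn(cost, a, b) * cost).sum())
```

In a cross-domain setting the cost entry for a (video segment, paragraph) pair would typically be one minus their cosine similarity, so low OT distance indicates well-matched structure between the two modalities.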
Exploiting Pseudo Image Captions for Multimodal Summarization
Cross-modal contrastive learning in vision language pretraining (VLP) faces
the challenge of (partial) false negatives. In this paper, we study this
problem from the perspective of Mutual Information (MI) optimization. It is
common sense that InfoNCE loss used in contrastive learning will maximize the
lower bound of MI between anchors and their positives, while we theoretically
prove that MI involving negatives also matters when noises commonly exist.
Guided by a more general lower bound form for optimization, we propose a
contrastive learning strategy regulated by progressively refined cross-modal
similarity, to more accurately optimize MI between an image/text anchor and its
negative texts/images instead of improperly minimizing it. Our method performs
competitively on four downstream cross-modal tasks and systematically balances
the beneficial and harmful effects of (partial) false negative samples under
theoretical guidance.
Comment: Accepted at ACL 2023 (Findings).
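As a reminder of the baseline the paper refines, the standard InfoNCE loss over a batch of paired image/text embeddings can be sketched as follows. This is a minimal NumPy illustration of the common contrastive objective, not the paper's refined one; `info_nce` and the temperature value are illustrative names:

```python
import numpy as np

def info_nce(img, txt, tau=0.07):
    """Standard InfoNCE over a batch of paired embeddings: each image's
    positive is its own caption, and every other caption in the batch is
    treated as a negative (which is exactly where partial false
    negatives arise)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                     # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # cross-entropy to diagonal
```

Minimizing this loss maximizes a lower bound on the mutual information between paired views; the paper's point is that when some "negatives" are actually matching pairs, this uniform treatment pushes MI with them down improperly.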
Query-controllable Video Summarization
When video collections become huge, how to explore both within and across
videos efficiently is challenging. Video summarization is one of the ways to
tackle this issue. Traditional summarization approaches limit the effectiveness
of video exploration because they only generate one fixed video summary for a
given input video independent of the information need of the user. In this
work, we introduce a method which takes a text-based query as input and
generates a video summary corresponding to it. We do so by modeling video
summarization as a supervised learning problem and propose an end-to-end deep
learning based method for query-controllable video summarization to generate a
query-dependent video summary. Our proposed method consists of a video summary
controller, video summary generator, and video summary output module. To foster
the research of query-controllable video summarization and conduct our
experiments, we introduce a dataset that contains frame-based relevance score
labels. Our experimental results show that the text-based query helps control
the video summary and also improves our model's performance. Our code and
dataset are available at
https://github.com/Jhhuangkay/Query-controllable-Video-Summarization.
Comment: This paper is accepted by the ACM International Conference on Multimedia Retrieval (ICMR), 202
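The query-dependent selection that the method learns end-to-end can be illustrated, in its simplest non-learned form, as ranking frames by similarity to a query embedding. The `query_summary` helper below is a hypothetical sketch; the actual model uses a trained controller and generator rather than raw cosine scores:

```python
import numpy as np

def query_summary(frame_embs, query_emb, k):
    """Rank frames by cosine similarity to a text-query embedding and
    keep the top-k as the query-dependent summary, in temporal order.

    frame_embs: (n_frames, d) array of frame embeddings
    query_emb:  (d,) embedding of the text query
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = f @ q                         # per-frame relevance scores
    order = np.argsort(-scores)[:k]        # top-k most relevant frames
    return sorted(order.tolist())          # restore temporal order
```

A learned variant would replace the raw cosine score with the output of a supervised relevance model trained on the frame-level labels the paper's dataset provides.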