91 research outputs found
Multimodal Sentiment Analysis: Perceived vs Induced Sentiments
Social media has created a global network where people can easily access and
exchange vast amounts of information. This information gives rise to a variety of
opinions, reflecting both positive and negative viewpoints. GIFs stand out as a
multimedia format that offers a visually engaging way for users to communicate. In
this research, we propose a multimodal framework that integrates visual and
textual features to predict GIF sentiment. It also incorporates attributes
such as detected face emotions and OCR-generated captions to capture the
semantic aspects of the GIF. The developed classifier achieves an accuracy of
82.7% on Twitter GIFs, an improvement over state-of-the-art models.
Moreover, we base our research on the ReactionGIF dataset, analysing the
variance between the sentiment perceived by the author and the sentiment induced
in the reader.
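As a rough illustration of the late-fusion design this abstract describes, the sketch below combines visual, tweet-text, face-emotion, and OCR-caption features into a single sentiment prediction. The module name, feature dimensions, and concatenation-based fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical late-fusion GIF sentiment classifier (a sketch, not the paper's model).
import torch
import torch.nn as nn

class GifSentimentFusion(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, face_dim=7, ocr_dim=768,
                 hidden=256, num_classes=2):
        super().__init__()
        # Project each modality (GIF frames, tweet text, face-emotion scores,
        # OCR caption embedding) into a shared hidden space before fusing.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.face_proj = nn.Linear(face_dim, hidden)
        self.ocr_proj = nn.Linear(ocr_dim, hidden)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(4 * hidden, num_classes),  # label set assumed: positive vs. negative
        )

    def forward(self, vis_feat, txt_feat, face_probs, ocr_feat):
        fused = torch.cat([
            self.vis_proj(vis_feat),
            self.txt_proj(txt_feat),
            self.face_proj(face_probs),
            self.ocr_proj(ocr_feat),
        ], dim=-1)
        return self.classifier(fused)
```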
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Multimodality Representation Learning, as a technique of learning to embed
information from different modalities and their correlations, has achieved
remarkable success on a variety of applications, such as Visual Question
Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision
Language Retrieval (VLR). Among these applications, cross-modal interaction and
complementary information from different modalities are crucial for advanced
models to perform any multimodal task optimally, e.g., to understand, recognize,
retrieve, or generate. Researchers have proposed diverse methods to address
these tasks, and different variants of transformer-based architectures have
performed extraordinarily well across multiple modalities. This survey presents a
comprehensive review of the evolution and enhancement of deep learning
multimodal architectures that deal with textual, visual and audio features for
diverse cross-modal and modern multimodal tasks. This study summarizes (i)
recent task-specific deep learning methodologies, (ii) pretraining types
and multimodal pretraining objectives, (iii) the progression from state-of-the-art
pretrained multimodal approaches to unifying architectures, and (iv) multimodal task
categories and possible future improvements for better multimodal learning.
Moreover, we prepare a dataset section for new researchers
that covers most of the benchmarks for pretraining and finetuning. Finally,
major challenges, gaps, and potential research topics are explored. A
constantly updated paper list related to this survey is maintained at
https://github.com/marslanm/multimodality-representation-learning
The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements
Truly real-life data presents a strong but exciting challenge for sentiment
and emotion research. The high variety of possible 'in-the-wild' properties
makes large datasets such as these indispensable for building robust machine
learning models. A sufficient quantity of data that covers a wide variety of
challenges in each modality and forces exploratory analysis of the interplay of
all modalities has not yet been made available in this context. In this
contribution, we present MuSe-CaR, a first-of-its-kind multimodal dataset. The
data are publicly available, having recently served as the testing bed for the
1st Multimodal Sentiment Analysis Challenge, which focused on the tasks of
emotion, emotion-target engagement, and trustworthiness recognition by
comprehensively integrating the audio-visual and language modalities.
Furthermore, we give a thorough overview of the dataset in terms of collection
and annotation, including annotation tiers not used in this year's MuSe 2020.
In addition, for one of the sub-challenges - predicting the level of
trustworthiness - no participant outperformed the baseline model, so we propose
a simple but highly efficient Multi-Head-Attention network that, using
multimodal fusion, exceeds the baseline by around 0.2 CCC (almost 50%
improvement). (Comment: accepted version.)
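The improvement above is reported in CCC (Concordance Correlation Coefficient), the standard agreement metric for continuous emotion prediction. For reference, a minimal implementation of the standard definition is given below; it is independent of the MuSe-CaR codebase.

```python
# Concordance Correlation Coefficient (CCC): agreement between predicted and
# gold continuous annotations, penalizing both correlation loss and mean/variance shift.
import numpy as np

def ccc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```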
Learning a Narrative Text Generator from Visual Stories via Latent Embedding
Ph.D. dissertation, Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, February 2019 (advisor: Byoung-Tak Zhang).
The ability to understand stories is an essential part of what makes humans distinct from other primates and animals, and the capability of story understanding is crucial for AI agents that are to live with people in everyday life and understand their context. However, most research on story AI focuses on automated story generation over manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques applied to story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. Since the remarkable success of deep learning in computer vision has increased interest in bridging vision and language, vision-grounded story data can potentially improve the performance of story understanding and narrative text generation.
Assume that AI agents are placed in an environment in which sensing information arrives through a camera. Such agents observe their surroundings, translate the observations into a story in natural language, and predict the following event, or several events in sequence. This dissertation studies the related problems: learning stories and generating narrative text from image streams or videos.
The first problem is to generate a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-Local Attention Cascading Network), which translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting. It uses convolutional neural networks to extract information from images and recurrent neural networks to generate text. We introduce visual cue encoders built from stacked bidirectional LSTMs, aggregating the outputs of every layer into contextualized image vectors that capture visual clues. The coherence of the generated text is further improved by cascading the information of the previous sentence to the next sentence serially in the decoders. Evaluated on the Visual Storytelling (VIST) dataset, the model outperforms other state-of-the-art results and achieves the best total score, as well as the best score in all six aspects, in the human evaluation of the visual storytelling challenge.
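A minimal sketch of the global-local encoding and cascading decoding described above follows; the dimensions, layer counts, and exact wiring are assumptions for illustration, not the dissertation's reference implementation.

```python
# Sketch of the GLAC Net idea: stacked bidirectional LSTMs contextualize per-image
# CNN features ("global"), each is concatenated with a projection of its own image
# feature ("local"), and the decoder cascades the previous sentence's final state
# into the next sentence. Dimensions and wiring are assumptions.
import torch
import torch.nn as nn

class GLACEncoder(nn.Module):
    def __init__(self, img_dim=2048, hidden=512, layers=2):
        super().__init__()
        self.bilstm = nn.LSTM(img_dim, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)
        self.local_proj = nn.Linear(img_dim, hidden)

    def forward(self, img_feats):                      # (batch, seq_len, img_dim)
        global_ctx, _ = self.bilstm(img_feats)         # (batch, seq_len, 2*hidden)
        local = self.local_proj(img_feats)             # (batch, seq_len, hidden)
        return torch.cat([global_ctx, local], dim=-1)  # contextualized image vectors

class CascadingDecoder(nn.Module):
    def __init__(self, ctx_dim=512 * 3, vocab=10000, emb=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb + ctx_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, ctx_vec, tokens, prev_state=None):
        # ctx_vec: one contextualized image vector, repeated at every time step;
        # prev_state carries (cascades) the state left by the previous sentence.
        emb = self.embed(tokens)
        ctx = ctx_vec.unsqueeze(1).expand(-1, emb.size(1), -1)
        out, state = self.lstm(torch.cat([emb, ctx], dim=-1), prev_state)
        return self.out(out), state
```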
The second problem is to predict the following events or narrative sentences given the earlier part of a story, where prediction must be possible at any step and for stories of arbitrary length. We propose recurrent event retrieval models as a solution. They train a context accumulation function and two embedding functions that bring the cumulative context at the current time close to the next probable events in a latent space. The cumulative context is updated with each new input event through bilinear operations, and candidates for the next event are found with the updated cumulative context. Evaluated on the Story Cloze Test, the models show competitive performance and are the best among methods that support open-ended generation. We also demonstrate working examples in an interactive setting.
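The following sketch illustrates the bilinear context accumulation and shared embedding space described above; the layer choices and the scoring scheme are assumptions for illustration.

```python
# Sketch of the recurrent event retrieval idea: a bilinear update accumulates context,
# two encoders map contexts and candidate events into a shared latent space, and the
# next event is retrieved by nearest neighbour. All layer choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentEventRetrieval(nn.Module):
    def __init__(self, event_dim=512, ctx_dim=512, latent=256):
        super().__init__()
        self.update = nn.Bilinear(ctx_dim, event_dim, ctx_dim)   # context accumulation
        self.ctx_enc = nn.Linear(ctx_dim, latent)                # embedding function 1
        self.evt_enc = nn.Linear(event_dim, latent)              # embedding function 2

    def accumulate(self, ctx, event):
        # Fold a newly observed event into the cumulative context.
        return torch.tanh(self.update(ctx, event))

    def score(self, ctx, candidate_events):
        # Cosine similarity between the accumulated context and candidate next events.
        q = F.normalize(self.ctx_enc(ctx), dim=-1)               # (batch, latent)
        k = F.normalize(self.evt_enc(candidate_events), dim=-1)  # (n, latent)
        return q @ k.t()

# Training would push score(context, true next event) above scores of sampled
# negatives (e.g., with a margin or cross-entropy objective), so retrieval is an
# argmax over the candidate pool.
```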
The third problem concerns composite representation learning of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in a latent space and propose ViStoryNet to regenerate video stories from these embeddings (a story completion task). Event sentences are converted to thought vectors, and functions are trained so that successive events are embedded close to each other, forming episodes as trajectories. Bidirectional LSTMs are trained as sequence models, and GRU decoders generate event sentences. Experiments on the PororoQA dataset show that most episodes take the form of trajectories; when the models are used to complete the blocked part of stories, the results are not perfect but broadly similar to the originals.
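One possible form of the successive-event order-embedding objective is sketched below; the margin, the negative-sampling scheme, and the helper name are assumptions, not the dissertation's exact loss.

```python
# Sketch of a successive-event order-embedding loss: event sentences are first
# mapped to "thought vectors" by any sentence encoder, and the loss pulls each
# event toward its temporal neighbour while pushing it away from events sampled
# from other episodes, so an episode traces a trajectory in the latent space.
import torch
import torch.nn.functional as F

def successive_order_loss(episode, negatives, margin=0.5):
    """episode: (T, d) thought vectors in temporal order; negatives: (N, d)."""
    anchors, positives = episode[:-1], episode[1:]            # consecutive event pairs
    pos_dist = F.pairwise_distance(anchors, positives)        # (T-1,)
    neg_dist = torch.cdist(anchors, negatives).min(dim=1).values
    return F.relu(pos_dist - neg_dist + margin).mean()

# A bidirectional LSTM over the embedded events can then model the episode as a
# sequence, and a GRU decoder can map any event embedding back to a sentence.
```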
These results can be applied to AI agents that sense their living area through cameras: such agents can describe the situation as a story, infer unobserved parts, and predict the future story.
Table of contents:
Abstract
Chapter 1 Introduction
1.1 Story of Everyday Lives in Videos and Story Understanding
1.2 Problems to be Addressed
1.3 Approach and Contribution
1.4 Organization of Dissertation
Chapter 2 Background and Related Work
2.1 Why We Study Stories
2.2 Latent Embedding
2.3 Order Embedding and Ordinal Embedding
2.4 Comparison to Story Understanding
2.5 Story Generation
2.5.1 Abstract Event Representations
2.5.2 Seq-to-seq Attentional Models
2.5.3 Story Generation from Images
Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks
3.1 Introduction
3.2 Evaluation for Visual Storytelling
3.3 Global-local Attention Cascading Networks (GLAC Net)
3.3.1 Encoder: Contextualized Image Vector Extractor
3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism
3.4 Experimental Results
3.4.1 VIST Dataset
3.4.2 Experiment Settings
3.4.3 Network Training Details
3.4.4 Qualitative Analysis
3.4.5 Quantitative Analysis
3.5 Summary
Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models
4.1 Overview
4.2 Problems of Context Accumulation
4.3 Recurrent Event Retrieval Models for Next Event Prediction
4.4 Experimental Results
4.4.1 Preliminaries
4.4.2 Story Cloze Test
4.4.3 Open-ended Story Generation
4.5 Summary
Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration
5.1 Introduction
5.2 Order Embedding with Triple Learning
5.2.1 Embedding Ordered Objects in Sequences
5.3 Problems and Contextual Events
5.3.1 Problem Definition
5.3.2 Contextual Event Vectors from Kids Videos
5.4 Architectures for the Story Regeneration Task
5.4.1 Two Sentence Generators as Decoders
5.4.2 Successive Event Order Embedding (SEOE)
5.4.3 Sequence Models of the Event Space
5.5 Experimental Results
5.5.1 Experimental Setup
5.5.2 Quantitative Analysis
5.5.3 Qualitative Analysis
5.6 Summary
Chapter 6 Concluding Remarks
6.1 Summary of Methods and Contributions
6.2 Limitation and Outlook
6.3 Suggestions for Future Research
Abstract (in Korean)
Large-scale Affective Computing for Visual Multimedia
In recent years, Affective Computing has arisen as a prolific interdisciplinary field for engineering systems that integrate human affections. While human-computer relationships have long revolved around cognitive interactions, it is becoming increasingly important to account for human affect, or feelings or emotions, to avert user experience frustration, provide disability services, predict virality of social media content, etc. In this thesis, we specifically focus on Affective Computing as it applies to large-scale visual multimedia, and in particular, still images, animated image sequences and video streams, above and beyond the traditional approaches of face expression and gesture recognition. By taking a principled psychology-grounded approach, we seek to paint a more holistic and colorful view of computational affect in the context of visual multimedia. For example, should emotions like 'surprise' and `fear' be assumed to be orthogonal output dimensions? Or does a 'positive' image in one culture's view elicit the same feelings of positivity in another culture? We study affect frameworks and ontologies to define, organize and develop machine learning models with such questions in mind to automatically detect affective visual concepts.
In the push for what we call "Big Affective Computing," we focus on two dimensions of scale for affect -- scaling up and scaling out -- which we propose are both imperative if we are to scale the Affective Computing problem successfully. Intuitively, simply increasing the number of data points corresponds to "scaling up." Less intuitively, problems like Affective Computing can also "scale out," or diversify. We show that this latter dimension of introducing data variety, alongside the former of introducing data volume, can yield particular insights, since human affections naturally depart from traditional Machine Learning and Computer Vision problems in which there is an objectively truthful target. While no one would debate that a picture of a 'dog' should be tagged as a 'dog,' not everyone may agree that it looks 'ugly.' We present extensive discussions on why scaling out is critical and how it can be accomplished in the context of large-volume visual data.
At a high-level, the main contributions of this thesis include:
Multiplicity of Affect Oracles:
Prior to the work in this thesis, little consideration had been paid to the affective label generating mechanism when learning functional mappings between inputs and labels. Throughout this thesis, but first in Chapter 2 (starting in Section 2.1.2), we make a case for a conceptual partitioning of the affect oracle governing the label generation process in Affective Computing problems, resulting in a multiplicity of oracles, whereas prior works assumed a single universal oracle. In Chapter 3, the differences between intended, expressed, induced and perceived emotion are discussed, where we argue that perceived emotion is particularly well-suited for scaling up because it reduces label variance due to its more objective nature compared to other affect states. In Chapters 4 and 5, a division of the affect oracle along cultural lines, with manifestations along both language and geography, is explored. We accomplish all this without sacrificing the 'scale up' dimension, and tackle significantly larger-volume problems than prior comparable visual affective computing research.
Content-driven Visual Affect Detection:
Traditionally, in most Affective Computing work, prediction tasks use psycho-physiological signals from subjects viewing the stimuli of interest, e.g., a video advertisement, as the system inputs. In essence, this means that the machine learns to label a proxy signal rather than the stimuli itself. In this thesis, with the rise of strong Computer Vision and Multimedia techniques, we focus on learning to label the stimuli directly, without a biometric proxy signal provided by a human subject (except in the unique circumstances of Chapter 7). This shift toward learning from the stimuli directly is important because it allows us to scale up with much greater ease, given that biometric measurement acquisition is both low-throughput and somewhat invasive while stimuli are often readily available. In addition, learning directly from the stimuli will allow researchers to precisely determine which low-level features in the stimuli are actually coupled with affect states, e.g., which set of frames caused viewer discomfort rather than a broad sense that a video was discomforting. In Part I of this thesis, we illustrate an emotion prediction task with a psychology-grounded affect representation. In particular, in Chapter 3, we develop a prediction task over semantic emotional classes, e.g., 'sad,' 'happy' and 'angry,' using animated image sequences given annotations from over 2.5 million users. Subsequently, in Part II, we develop visual sentiment and adjective-based semantics models from million-scale digital imagery mined from a social multimedia platform.
Mid-level Representations for Visual Affect:
While discrete semantic emotions and sentiment are classical representations of affect with decades of psychology grounding, the interdisciplinary nature of Affective Computing, now only about two decades old, allows for new avenues of representation. Mid-level representations have been proposed in numerous Computer Vision and Multimedia problems as an intermediary, and often more computable, step toward bridging the semantic gap between low-level system inputs and high-level label semantic abstractions. In Part II, inspired by this work, we adapt it for vision-based Affective Computing and adopt a semantic construct called adjective-noun pairs. Specifically, in Chapter 4, we explore the use of such adjective-noun pairs in the context of a social multimedia platform and develop a multilingual visual sentiment ontology with over 15,000 affective mid-level visual concepts across 12 languages, associated with over 7.3 million images and representations from over 235 countries, resulting in the largest affective digital image corpus in both depth and breadth to date. In Chapter 5, we develop computational methods to predict such adjective-noun pairs and also explore their usefulness in traditional sentiment analysis, but from a previously unexplored cross-lingual perspective. In Chapter 6, we propose a new learning setting called 'cross-residual learning,' building on recent successes in deep neural networks, specifically residual learning; we show that cross-residual learning can be used effectively to jointly learn across multiple related tasks in object detection (nouns), more traditional affect modeling (adjectives), and affective mid-level representations (adjective-noun pairs), giving us a framework for better grounding the adjective-noun pair bridge in both vision and affect simultaneously.
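To make the cross-residual idea concrete, the sketch below wires three task branches (nouns, adjectives, adjective-noun pairs) over a shared backbone feature and lets each head draw a residual contribution from the other branches. The class counts and exact wiring are illustrative assumptions, not the thesis architecture.

```python
# Hedged sketch of cross-residual heads: each task's classifier adds residual
# contributions from the other tasks' branch features before predicting.
import torch
import torch.nn as nn

class CrossResidualHeads(nn.Module):
    def __init__(self, feat_dim=2048, n_nouns=1000, n_adjs=300, n_pairs=3000):
        super().__init__()
        self.noun_branch = nn.Linear(feat_dim, feat_dim)
        self.adj_branch = nn.Linear(feat_dim, feat_dim)
        self.pair_branch = nn.Linear(feat_dim, feat_dim)
        self.noun_head = nn.Linear(feat_dim, n_nouns)
        self.adj_head = nn.Linear(feat_dim, n_adjs)
        self.pair_head = nn.Linear(feat_dim, n_pairs)

    def forward(self, shared_feat):
        n = torch.relu(self.noun_branch(shared_feat))
        a = torch.relu(self.adj_branch(shared_feat))
        p = torch.relu(self.pair_branch(shared_feat))
        # Cross-residual connections: each task reuses the other tasks' features.
        return (self.noun_head(n + p),
                self.adj_head(a + p),
                self.pair_head(p + n + a))
```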
Selecting Stickers in Open-Domain Dialogue through Multitask Learning
With the increasing popularity of online chatting, stickers are becoming
important in our online communication. Selecting appropriate stickers in
open-domain dialogue requires a comprehensive understanding of both dialogues
and stickers, as well as the relationship between the two types of modalities.
To tackle these challenges, we propose a multitask learning method comprising
three auxiliary tasks to enhance the understanding of dialogue history, emotion
and semantic meaning of stickers. Extensive experiments conducted on a recent
challenging dataset show that our model can better combine the multimodal
information and achieve significantly higher accuracy over strong baselines.
An ablation study further verifies the effectiveness of each auxiliary task. Our
code is available at https://github.com/nonstopfor/Sticker-Selection. (Comment: ACL 2022 Findings, camera-ready version.)
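A minimal sketch of how such a multitask objective is typically combined is shown below; the auxiliary-task names and weights are assumptions for illustration, and the released code at the URL above defines the actual tasks.

```python
# Illustrative multitask objective: main sticker-selection loss plus three weighted
# auxiliary losses (dialogue-history modelling, sticker emotion, sticker semantics).
def multitask_loss(losses, weights=(1.0, 0.3, 0.3, 0.3)):
    """losses = (selection, dialogue_aux, emotion_aux, semantic_aux) scalars."""
    return sum(w * l for w, l in zip(weights, losses))
```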
APSE: Attention-aware polarity-sensitive embedding for emotion-based image retrieval
With the popularity of social media, an increasing number of people are accustomed to expressing their feelings and emotions online using images and videos. An emotion-based image retrieval (EBIR) system is useful for obtaining visual content with desired emotions from a massive repository. Existing EBIR methods mainly focus on modeling the global characteristics of visual content without considering the crucial role of informative regions of interest in conveying emotions. Further, they ignore the hierarchical relationships between coarse polarities and fine categories of emotions. In this paper, we design an attention-aware polarity-sensitive embedding (APSE) network to address these issues. First, we develop a hierarchical attention mechanism to automatically discover and model the informative regions of interest. Specifically, both polarity- and emotion-specific attended representations are aggregated for discriminative feature embedding. Second, we propose a generated emotion-pair (GEP) loss to simultaneously consider the inter- and intra-polarity relationships of the emotion labels. Moreover, we adaptively generate negative examples of different hardness levels in the feature space, guided by the attention module, to further improve the performance of feature embedding. Extensive experiments on four popular benchmark datasets demonstrate that the proposed APSE method outperforms state-of-the-art EBIR approaches by a large margin.
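To illustrate how polarity sensitivity can enter an embedding loss, the sketch below penalizes opposite-polarity negatives with a larger margin than same-polarity negatives. The margins and the function name are assumptions; the paper's generated emotion-pair (GEP) loss differs in its details.

```python
# Hedged sketch of a polarity-sensitive triplet-style loss in the spirit of APSE:
# negatives from the opposite polarity are pushed away with a larger margin than
# negatives from a different emotion within the same polarity.
import torch
import torch.nn.functional as F

def polarity_sensitive_triplet(anchor, positive, neg_same_pol, neg_opp_pol,
                               m_intra=0.2, m_inter=0.5):
    d_pos = F.pairwise_distance(anchor, positive)
    d_intra = F.pairwise_distance(anchor, neg_same_pol)   # same polarity, other emotion
    d_inter = F.pairwise_distance(anchor, neg_opp_pol)    # opposite polarity
    intra = F.relu(d_pos - d_intra + m_intra)
    inter = F.relu(d_pos - d_inter + m_inter)
    return (intra + inter).mean()
```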