Ten Research Questions for Scalable Multimedia Analytics
The scale and complexity of multimedia collections are ever increasing, as is the desire to harvest useful insight from them. To optimally support the complex quest for insight, multimedia analytics has emerged as a new research area that combines concepts and techniques from multimedia analysis and visual analytics into a single framework. State-of-the-art multimedia analytics solutions are highly interactive and give users freedom in how they perform their analytics task, but they do not scale well. State-of-the-art scalable database management solutions, on the other hand, are not yet designed for multimedia analytics workloads. In this position paper we therefore argue the need for research on scalable multimedia analytics, a new research area built on the three pillars of visual analytics, multimedia analysis, and database management. We propose a specific goal for scalable multimedia analytics and present several important research questions that we believe must be addressed in order to achieve that goal.
A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods
Opinion and sentiment analysis is a vital task for characterizing subjective
information in social media posts. In this paper, we present a comprehensive
experimental evaluation and comparison of six state-of-the-art methods, one of
which we have re-implemented. In addition, we investigate different
textual and visual feature embeddings that cover different aspects of the
content, as well as the recently introduced multimodal CLIP embeddings.
Experimental results are presented for two different publicly available
benchmark datasets of tweets and corresponding images. In contrast to the
evaluation methodology of previous work, we introduce a reproducible and fair
evaluation scheme to make results comparable. Finally, we conduct an error
analysis to outline the limitations of the methods and possibilities for
future work.
Comment: Accepted at the Workshop on Multi-Modal Pre-Training for Multimedia Understanding (MMPT 2021), co-located with ICMR 2021.
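As a rough illustration of how the multimodal CLIP embeddings mentioned above can serve as features, here is a minimal sketch using the Hugging Face transformers CLIP interface; the checkpoint, the concatenation scheme, and the linear probe are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Sketch: extract CLIP features for a tweet (text + image) and feed the
# concatenated embedding to a simple sentiment classifier. The checkpoint
# and the classifier head are illustrative, not the paper's exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def tweet_features(text: str, image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Concatenate both modalities into one feature vector (2 x 512 dims).
    return torch.cat([text_emb, image_emb], dim=-1).squeeze(0)

# A linear probe on the frozen features, e.g. negative/neutral/positive:
classifier = torch.nn.Linear(1024, 3)
logits = classifier(tweet_features("what a day...", "tweet.jpg"))
```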
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Social media daily creates massive multimedia content with paired images and
text, presenting a pressing need to automate vision-and-language
understanding for various multimodal classification tasks. Compared to the
commonly researched visual-lingual data, social media posts tend to exhibit
more implicit image-text relations. To better bridge the cross-modal semantics
therein, we capture hinting features from user comments, which are retrieved
via jointly leveraging visual and lingual similarity. Afterwards, the
classification tasks are explored via self-training in a teacher-student
framework, motivated by the usually limited labeled data scales in existing
benchmarks. Substantial experiments are conducted on four multimodal social
media benchmarks for image text relation classification, sarcasm detection,
sentiment classification, and hate speech detection. The results show that our
method further advances the performance of previous state-of-the-art models,
which do not employ comment modeling or self-training.
Comment: Accepted at EMNLP 2022.
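A minimal sketch of the teacher-student self-training loop described above; the model interfaces, the confidence threshold, and the data handling are illustrative assumptions, and the paper's comment-aware retrieval features are not modeled here.

```python
# Sketch of teacher-student self-training for a multimodal classifier.
# `teacher`/`student`, the loaders, and the threshold are illustrative
# assumptions; each x is a single post (batch of size 1).
import torch
import torch.nn.functional as F

def self_train(teacher, student, labeled, unlabeled, optimizer,
               threshold: float = 0.9, epochs: int = 3):
    teacher.eval()
    for _ in range(epochs):
        # 1) Teacher produces pseudo-labels on unlabeled posts.
        pseudo = []
        with torch.no_grad():
            for x in unlabeled:
                probs = F.softmax(teacher(x), dim=-1)
                conf, label = probs.max(dim=-1)
                if conf.item() >= threshold:   # keep confident predictions only
                    pseudo.append((x, label))
        # 2) Student trains on gold labels plus confident pseudo-labels.
        student.train()
        for x, y in list(labeled) + pseudo:
            optimizer.zero_grad()
            loss = F.cross_entropy(student(x), y)
            loss.backward()
            optimizer.step()
    return student
```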
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Large language models have demonstrated robust performance on various
language tasks using zero-shot or few-shot learning paradigms. While being
actively researched, multimodal models that can additionally handle images as
input have yet to catch up in size and generality with language-only models. In
this work, we ask whether language-only models can be utilised for tasks that
require visual input -- but also, as we argue, often require a strong reasoning
component. Similar to some recent related work, we make visual information
accessible to the language model using separate verbalisation models.
Specifically, we investigate the performance of open-source, open-access
language models against GPT-3 on five vision-language tasks when given
textually-encoded visual information. Our results suggest that language models
are effective for solving vision-language tasks even with limited samples. This
approach also enhances the interpretability of a model's output by providing a
means of tracing the output back through the verbalised image content.
Comment: Accepted at ACL 2023 Findings.
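In the spirit of this verbalisation approach, one can caption an image with an off-the-shelf model and pass the resulting text to a language-only model; the checkpoints and prompt template below are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: verbalise an image with a captioning model and hand the text to a
# language-only model. Checkpoints and prompt format are illustrative.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="gpt2")  # stand-in for a large LM

def answer_about_image(image_path: str, question: str) -> str:
    caption = captioner(image_path)[0]["generated_text"]
    prompt = (f"Image description: {caption}\n"
              f"Question: {question}\n"
              f"Answer:")
    out = llm(prompt, max_new_tokens=20, do_sample=False)
    # Strip the prompt; keep only the newly generated answer text.
    return out[0]["generated_text"][len(prompt):].strip()

print(answer_about_image("photo.jpg", "What is the person holding?"))
```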
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Effectively leveraging multimodal information from social media posts is
essential to various downstream tasks such as sentiment analysis, sarcasm
detection and hate speech classification. However, combining text and image
information is challenging because of the idiosyncratic cross-modal semantics
with hidden or complementary information present in matching image-text pairs.
In this work, we aim to directly model this by proposing the use of two
auxiliary losses jointly with the main task when fine-tuning any pre-trained
multimodal model. Image-Text Contrastive (ITC) brings image-text
representations of a post closer together and separates them from different
posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates
the understanding of semantic correspondence between images and text by
penalizing unrelated pairs. We combine these objectives with five multimodal
models, demonstrating consistent improvements across four popular social media
datasets. Furthermore, through detailed analysis, we shed light on the specific
scenarios and cases where each auxiliary task proves most effective.
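The two auxiliary objectives can be sketched roughly as follows, assuming a batch of aligned image and text embeddings with in-batch negatives; details such as hard-negative mining and the fusion module vary per model.

```python
# Minimal sketches of the two auxiliary losses, assuming a batch of aligned
# image/text embeddings (in-batch negatives; no hard-negative mining).
import torch
import torch.nn.functional as F

def itc_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Image-Text Contrastive: pull matching pairs together, push others apart."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def itm_loss(pair_emb, itm_head, matched):
    """Image-Text Matching: binary classification of (mis)matched pairs.
    `pair_emb` is a fused image-text representation, `itm_head` a small
    classifier, `matched` a 0/1 label per pair (negatives built by shuffling)."""
    logits = itm_head(pair_emb)                   # (B, 2)
    return F.cross_entropy(logits, matched)
```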
A Closer Look into Recent Video-based Learning Research: A Comprehensive Review of Video Characteristics, Tools, Technologies, and Learning Effectiveness
People increasingly use videos on the Web as a source for learning. To
support this way of learning, researchers and developers are continuously
developing tools, proposing guidelines, analyzing data, and conducting
experiments. However, it is still not clear what characteristics a video should
have to be an effective learning medium. In this paper, we present a
comprehensive review of 257 articles on video-based learning for the period
from 2016 to 2021. One of the aims of the review is to identify the video
characteristics that have been explored by previous work. Based on our
analysis, we suggest a taxonomy which organizes the video characteristics and
contextual aspects into eight categories: (1) audio features, (2) visual
features, (3) textual features, (4) instructor behavior, (5) learners'
activities, (6) interactive features (quizzes, etc.), (7) production style, and
(8) instructional design. Also, we identify four representative research
directions: (1) proposals of tools to support video-based learning, (2) studies
with controlled experiments, (3) data analysis studies, and (4) proposals of
design guidelines for learning videos. We find that the most explored
characteristics are textual features followed by visual features, learner
activities, and interactive features. Text of transcripts, video frames, and
images (figures and illustrations) are most frequently used by tools that
support learning through videos. The learner activity is heavily explored
through log files in data analysis studies, and interactive features have been
frequently scrutinized in controlled experiments. We complement our review by
contrasting research findings that investigate the impact of video
characteristics on the learning effectiveness, report on tasks and technologies
used to develop tools that support learning, and summarize trends of design
guidelines for producing learning videos.
MultiMedia Modeling: 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part II
The two-volume set LNCS 9516 and 9517 constitutes the thoroughly refereed proceedings of the 22nd International Conference on Multimedia Modeling, MMM 2016, held in Miami, FL, USA, in January 2016. The 32 revised full papers and 52 poster papers were carefully reviewed and selected from 117 submissions. In addition, 20 papers were accepted for five special sessions out of 38 submissions, as well as 7 demonstrations (from 11 submissions) and 9 video showcase papers. The papers are organized in topical sections on video content analysis, social media analysis, object recognition and system, multimedia retrieval and ranking, multimedia representation, machine learning in multimedia, and interaction and mobile. The special sessions are: good practices in multimedia modeling; semantics discovery from multimedia big data; perception, aesthetics, and emotion in multimedia quality modeling; multimodal learning and computing for human activity understanding; and perspectives on multimedia analytics.
Machine Learning Architectures for Video Annotation and Retrieval
In this thesis we design machine learning methodologies for the problem
of video annotation and retrieval using either pre-defined semantic concepts or ad-hoc
queries. Concept-based video annotation refers to the annotation of video fragments
with one or more semantic concepts (e.g. hand, sky, running), chosen from a predefined concept list. Ad-hoc queries refer to textual descriptions that may contain
objects, activities, locations etc., and combinations of the former. Our contributions
are: i) A thorough analysis of extending and using different local descriptors towards
improved concept-based video annotation, together with a stacking architecture that, in
its first layer, uses concept classifiers trained on local descriptors and, in the last
layer of the stack, improves their prediction accuracy by implicitly capturing concept relations. ii)
A cascade architecture that orders and combines many classifiers, trained on different
visual descriptors, for the same concept. iii) A deep learning architecture that exploits
concept relations at two different levels. At the first level, we build on ideas from
multi-task learning, and propose an approach to learn concept-specific representations
that are sparse, linear combinations of representations of latent concepts. At a second
level, we build on ideas from structured output learning, and propose the introduction,
at training time, of a new cost term that explicitly models the correlations between
the concepts. By doing so, we explicitly model the structure in the output space
(i.e., the concept labels). iv) A fully-automatic ad-hoc video search architecture that
combines concept-based video annotation and textual query analysis, and transforms
concept-based keyframe and query representations into a common semantic embedding
space. Our architectures have been extensively evaluated on the TRECVID SIN 2013,
the TRECVID AVS 2016, and other large-scale datasets, demonstrating their effectiveness
compared to similar approaches.
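One way to read the structured-output cost term of contribution iii) is as a regulariser that pushes predicted concept co-activations toward the label correlations observed in training; the sketch below is a plausible rendering under that assumption, not the thesis's exact formulation.

```python
# Sketch of a multi-label loss with an added concept-correlation cost term,
# in the spirit of contribution iii); the exact formulation may differ.
import torch
import torch.nn.functional as F

def correlation_regularized_loss(logits, labels, label_corr, lam: float = 0.1):
    """logits: (B, C) concept scores; labels: (B, C) 0/1 concept annotations;
    label_corr: (C, C) empirical concept co-occurrence matrix from training."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    probs = torch.sigmoid(logits)
    # Batch estimate of pairwise co-activation between predicted concepts.
    pred_corr = (probs.t() @ probs) / probs.size(0)
    # Penalise deviation of predicted co-activations from training statistics.
    structure = F.mse_loss(pred_corr, label_corr)
    return bce + lam * structure
```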