37 research outputs found
Efficient Privacy Preserving Viola-Jones Type Object Detection via Random Base Image Representation
A cloud server invests substantial time, energy, and money to train a Viola-Jones
type object detector with high accuracy. Clients can upload their photos to the
cloud server to find objects, but a client does not want the content of his/her
photos to be leaked. Meanwhile, the cloud server is also reluctant to leak any
parameters of the trained object detector. Ten years ago, Avidan & Butman
introduced Blind Vision, a method for securely evaluating a Viola-Jones type
object detector. Blind Vision uses standard cryptographic tools and is painfully
slow to compute, taking a couple of hours to scan a single image. The purpose of
this work is to explore an efficient method that speeds up the process. We
propose the Random Base Image (RBI) representation: the original image is
divided into random base images, and only these base images are submitted, in
random order, to the cloud server, so the content of the image cannot be leaked.
Meanwhile, a random vector and the secure Millionaire protocol are leveraged to
protect the parameters of the trained object detector. The RBI makes the
integral image usable again, which yields a great acceleration. The experimental
results reveal that our method retains the detection accuracy of the plain
vision algorithm and is significantly faster than traditional Blind Vision, with
only a theoretically very low probability of information leakage.
Comment: 6 pages, 3 figures. To appear in the proceedings of the IEEE
International Conference on Multimedia and Expo (ICME), Jul 10 - Jul 14, 2017,
Hong Kong, Hong Kong
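The acceleration hinges on the fact that the integral image (summed-area table) is a linear operation: it can be computed on each random base image separately and the results summed to recover the integral image of the original. A minimal NumPy sketch of this decomposition (our own illustration of the idea, not the authors' code; the share count `n=3` is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_base_images(img, n=3):
    """Split img into n random base images that sum to the original.
    Each base image alone looks like noise, but their sum recovers img."""
    bases = [rng.integers(-128, 128, size=img.shape).astype(np.int64)
             for _ in range(n - 1)]
    bases.append(img.astype(np.int64) - sum(bases))
    return bases

def integral_image(a):
    """Standard summed-area table: cumulative sums along both axes."""
    return a.cumsum(axis=0).cumsum(axis=1)

img = rng.integers(0, 256, size=(8, 8))
bases = random_base_images(img)

# By linearity, the sum of the base images' integral images equals the
# integral image of the original -- this is what lets the server keep
# using the fast integral-image trick on the randomized shares.
assert np.array_equal(sum(bases), img)
assert np.array_equal(sum(integral_image(b) for b in bases),
                      integral_image(img))
```

Because each base image is statistically noise-like on its own, the server can evaluate rectangle-sum features per share without ever seeing the plain image.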
Ten Research Questions for Scalable Multimedia Analytics
The scale and complexity of multimedia collections is ever increasing, as is the desire to harvest useful insight from them. To optimally support the complex quest for insight, multimedia analytics has emerged as a new research area that combines concepts and techniques from multimedia analysis and visual analytics into a single framework. State-of-the-art multimedia analytics solutions are highly interactive and give users freedom in how they perform their analytics task, but they do not scale well. State-of-the-art scalable database management solutions, on the other hand, are not yet designed for multimedia analytics workloads. In this position paper we therefore argue the need for research on scalable multimedia analytics, a new research area built on the three pillars of visual analytics, multimedia analysis, and database management. We propose a specific goal for scalable multimedia analytics and present several important research questions that we believe must be addressed in order to achieve that goal.
A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods
Opinion and sentiment analysis is a vital task to characterize subjective
information in social media posts. In this paper, we present a comprehensive
experimental evaluation and comparison of six state-of-the-art methods, one of
which we have re-implemented. In addition, we investigate different textual and
visual feature embeddings that cover different aspects of the content, as well
as the recently introduced multimodal CLIP embeddings. Experimental results are
presented for two different publicly available benchmark datasets of tweets and
corresponding images. In contrast to the evaluation methodology of previous
work, we introduce a reproducible and fair evaluation scheme to make results
comparable. Finally, we conduct an error analysis to outline the limitations of
the methods and possibilities for future work.
Comment: Accepted at the Workshop on Multi-Modal Pre-Training for Multimedia
Understanding (MMPT 2021), co-located with ICMR 2021
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Social media creates massive amounts of multimedia content with paired images
and text every day, presenting a pressing need to automate vision and language
understanding for various multimodal classification tasks. Compared to commonly
researched visual-lingual data, social media posts tend to exhibit more implicit
image-text relations. To better glue the cross-modal semantics therein, we
capture hinting features from user comments, which are retrieved by jointly
leveraging visual and lingual similarity. Afterwards, the classification tasks
are explored via self-training in a teacher-student framework, motivated by the
usually limited labeled data scales in existing benchmarks. Substantial
experiments are conducted on four multimodal social media benchmarks for
image-text relation classification, sarcasm detection, sentiment classification,
and hate speech detection. The results show that our method further advances the
performance of previous state-of-the-art models, which do not employ comment
modeling or self-training.
Comment: Accepted to EMNLP 202
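The teacher-student self-training loop described here follows a standard pattern: train a teacher on the small labeled set, pseudo-label the unlabeled pool, then train a student on both. A toy sketch with a nearest-centroid classifier standing in for the paper's multimodal models (the data, model, and single-round setup are all our own simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

# Two well-separated Gaussian blobs: a small labeled set, a larger unlabeled pool.
X_lab = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(5, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

teacher = fit_centroids(X_lab, y_lab)      # 1: train teacher on labeled data
pseudo = predict(teacher, X_unl)           # 2: pseudo-label the unlabeled pool
student = fit_centroids(                   # 3: train student on both sets
    np.vstack([X_lab, X_unl]), np.concatenate([y_lab, pseudo]))

y_true = np.array([0] * 50 + [1] * 50)
acc = (predict(student, X_unl) == y_true).mean()
print(f"student accuracy: {acc:.2f}")
```

In the paper's setting the pseudo-labels additionally benefit from comment-derived hinting features; this sketch only shows the bare teacher-student mechanics.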
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Large language models have demonstrated robust performance on various
language tasks using zero-shot or few-shot learning paradigms. While being
actively researched, multimodal models that can additionally handle images as
input have yet to catch up in size and generality with language-only models. In
this work, we ask whether language-only models can be utilised for tasks that
require visual input -- but also, as we argue, often require a strong reasoning
component. Similar to some recent related work, we make visual information
accessible to the language model using separate verbalisation models.
Specifically, we investigate the performance of open-source, open-access
language models against GPT-3 on five vision-language tasks when given
textually-encoded visual information. Our results suggest that language models
are effective for solving vision-language tasks even with limited samples. This
approach also enhances the interpretability of a model's output by providing a
means of tracing the output back through the verbalised image content.
Comment: Accepted at ACL 2023 Findings
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Effectively leveraging multimodal information from social media posts is
essential to various downstream tasks such as sentiment analysis, sarcasm
detection and hate speech classification. However, combining text and image
information is challenging because of the idiosyncratic cross-modal semantics
with hidden or complementary information present in matching image-text pairs.
In this work, we aim to directly model this by proposing the use of two
auxiliary losses jointly with the main task when fine-tuning any pre-trained
multimodal model. Image-Text Contrastive (ITC) brings image-text
representations of a post closer together and separates them from different
posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates
the understanding of semantic correspondence between images and text by
penalizing unrelated pairs. We combine these objectives with five multimodal
models, demonstrating consistent improvements across four popular social media
datasets. Furthermore, through detailed analysis, we shed light on the specific
scenarios and cases where each auxiliary task proves to be most effective.
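The Image-Text Contrastive objective described above is an InfoNCE-style loss: matched image-text pairs sit on the diagonal of a batch similarity matrix and are pulled together, while off-diagonal pairs are pushed apart. A minimal NumPy sketch (our own illustration, not the paper's implementation; the temperature value is an assumption):

```python
import numpy as np

def itc_loss(img_emb, txt_emb, temperature=0.07):
    """Image-Text Contrastive loss over a batch of paired embeddings.
    Row i of img_emb is assumed to match row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (batch, batch) similarities
    idx = np.arange(len(logits))                 # matched pairs on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, idx].mean()           # cross-entropy toward diagonal

rng = np.random.default_rng(2)
emb = rng.normal(size=(4, 8))
# Loss is near zero when image and text embeddings already agree,
# and larger for random, unrelated pairs.
aligned = itc_loss(emb, emb)
random_pairs = itc_loss(emb, rng.normal(size=(4, 8)))
assert aligned < random_pairs
```

Image-Text Matching (ITM) is typically implemented alongside this as a binary classifier over matched versus mismatched pairs, penalizing unrelated ones.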
A Closer Look into Recent Video-based Learning Research: A Comprehensive Review of Video Characteristics, Tools, Technologies, and Learning Effectiveness
People increasingly use videos on the Web as a source for learning. To
support this way of learning, researchers and developers are continuously
developing tools, proposing guidelines, analyzing data, and conducting
experiments. However, it is still not clear what characteristics a video should
have to be an effective learning medium. In this paper, we present a
comprehensive review of 257 articles on video-based learning for the period
from 2016 to 2021. One of the aims of the review is to identify the video
characteristics that have been explored by previous work. Based on our
analysis, we suggest a taxonomy which organizes the video characteristics and
contextual aspects into eight categories: (1) audio features, (2) visual
features, (3) textual features, (4) instructor behavior, (5) learners'
activities, (6) interactive features (quizzes, etc.), (7) production style, and
(8) instructional design. Also, we identify four representative research
directions: (1) proposals of tools to support video-based learning, (2) studies
with controlled experiments, (3) data analysis studies, and (4) proposals of
design guidelines for learning videos. We find that the most explored
characteristics are textual features followed by visual features, learner
activities, and interactive features. Text of transcripts, video frames, and
images (figures and illustrations) are most frequently used by tools that
support learning through videos. The learner activity is heavily explored
through log files in data analysis studies, and interactive features have been
frequently scrutinized in controlled experiments. We complement our review by
contrasting research findings that investigate the impact of video
characteristics on learning effectiveness, report on tasks and technologies
used to develop tools that support learning, and summarize trends in design
guidelines for producing learning videos.