37 research outputs found

    Efficient Privacy Preserving Viola-Jones Type Object Detection via Random Base Image Representation

    Full text link
    A cloud server spent a lot of time, energy and money to train a Viola-Jones type object detector with high accuracy. Clients can upload their photos to the cloud server to find objects. However, the client does not want the leakage of the content of his/her photos. In the meanwhile, the cloud server is also reluctant to leak any parameters of the trained object detectors. 10 years ago, Avidan & Butman introduced Blind Vision, which is a method for securely evaluating a Viola-Jones type object detector. Blind Vision uses standard cryptographic tools and is painfully slow to compute, taking a couple of hours to scan a single image. The purpose of this work is to explore an efficient method that can speed up the process. We propose the Random Base Image (RBI) Representation. The original image is divided into random base images. Only the base images are submitted randomly to the cloud server. Thus, the content of the image can not be leaked. In the meanwhile, a random vector and the secure Millionaire protocol are leveraged to protect the parameters of the trained object detector. The RBI makes the integral-image enable again for the great acceleration. The experimental results reveal that our method can retain the detection accuracy of that of the plain vision algorithm and is significantly faster than the traditional blind vision, with only a very low probability of the information leakage theoretically.Comment: 6 pages, 3 figures, To appear in the proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Jul 10, 2017 - Jul 14, 2017, Hong Kong, Hong Kon

    Ten Research Questions for Scalable Multimedia Analytics

    Get PDF
    International audienceThe scale and complexity of multimedia collections is ever increasing, as is the desire to harvest useful insight from the collections. To optimally support the complex quest for insight, multimedia ana-lytics has emerged as a new research area that combines concepts and techniques from multimedia analysis and visual analytics into a single framework. State of the art multimedia analytics solutions are highly interactive and give users freedom in how they perform their analytics task, but they do not scale well. State of the art scalable database management solutions, on the other hand, are not yet designed for multimedia analytics workloads. In this position paper we therefore argue the need for research on scalable multimedia analytics, a new research area built on the three pillars of visual analytics, multimedia analysis and database management. We propose a specific goal for scalable multimedia analyt-ics and present several important research questions that we believe must be addressed in order to achieve that goal

    A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods

    Get PDF
    Opinion and sentiment analysis is a vital task to characterize subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison with six state-of-the-art methods, from which we have re-implemented one of them. In addition, we investigate different textual and visual feature embeddings that cover different aspects of the content, as well as the recently introduced multimodal CLIP embeddings. Experimental results are presented for two different publicly available benchmark datasets of tweets and corresponding images. In contrast to the evaluation methodology of previous work, we introduce a reproducible and fair evaluation scheme to make results comparable. Finally, we conduct an error analysis to outline the limitations of the methods and possibilities for the future work.Comment: Accepted in Workshop on Multi-ModalPre-Training for Multimedia Understanding (MMPT 2021), co-located with ICMR 202

    Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification

    Full text link
    Social media is daily creating massive multimedia content with paired image and text, presenting the pressing need to automate the vision and language understanding for various multimodal classification tasks. Compared to the commonly researched visual-lingual data, social media posts tend to exhibit more implicit image-text relations. To better glue the cross-modal semantics therein, we capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity. Afterwards, the classification tasks are explored via self-training in a teacher-student framework, motivated by the usually limited labeled data scales in existing benchmarks. Substantial experiments are conducted on four multimodal social media benchmarks for image text relation classification, sarcasm detection, sentiment classification, and hate speech detection. The results show that our method further advances the performance of previous state-of-the-art models, which do not employ comment modeling or self-training.Comment: accepted to EMNLP 202

    Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

    Full text link
    Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input -- but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model's output by providing a means of tracing the output back through the verbalised image content.Comment: Accepted at ACL 2023 Finding

    Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary tasks

    Full text link
    Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection and hate speech classification. However, combining text and image information is challenging because of the idiosyncratic cross-modal semantics with hidden or complementary information present in matching image-text pairs. In this work, we aim to directly model this by proposing the use of two auxiliary losses jointly with the main task when fine-tuning any pre-trained multimodal model. Image-Text Contrastive (ITC) brings image-text representations of a post closer together and separates them from different posts, capturing underlying dependencies. Image-Text Matching (ITM) facilitates the understanding of semantic correspondence between images and text by penalizing unrelated pairs. We combine these objectives with five multimodal models, demonstrating consistent improvements across four popular social media datasets. Furthermore, through detailed analysis, we shed light on the specific scenarios and cases where each auxiliary task proves to be most effective

    A Closer Look into Recent Video-based Learning Research: A Comprehensive Review of Video Characteristics, Tools, Technologies, and Learning Effectiveness

    Full text link
    People increasingly use videos on the Web as a source for learning. To support this way of learning, researchers and developers are continuously developing tools, proposing guidelines, analyzing data, and conducting experiments. However, it is still not clear what characteristics a video should have to be an effective learning medium. In this paper, we present a comprehensive review of 257 articles on video-based learning for the period from 2016 to 2021. One of the aims of the review is to identify the video characteristics that have been explored by previous work. Based on our analysis, we suggest a taxonomy which organizes the video characteristics and contextual aspects into eight categories: (1) audio features, (2) visual features, (3) textual features, (4) instructor behavior, (5) learners activities, (6) interactive features (quizzes, etc.), (7) production style, and (8) instructional design. Also, we identify four representative research directions: (1) proposals of tools to support video-based learning, (2) studies with controlled experiments, (3) data analysis studies, and (4) proposals of design guidelines for learning videos. We find that the most explored characteristics are textual features followed by visual features, learner activities, and interactive features. Text of transcripts, video frames, and images (figures and illustrations) are most frequently used by tools that support learning through videos. The learner activity is heavily explored through log files in data analysis studies, and interactive features have been frequently scrutinized in controlled experiments. We complement our review by contrasting research findings that investigate the impact of video characteristics on the learning effectiveness, report on tasks and technologies used to develop tools that support learning, and summarize trends of design guidelines to produce learning video

    Blue Channel and Fusion for Sandstorm Image Enhancement

    Get PDF
    corecore