
171 research outputs found

Automatic Understanding of Image and Video Advertisements

Author: Agha Zuha
Hussain Zaeem
Kovashka Adriana
Ong Nathan
Thomas Christopher
Ye Keren
Zhang Mingda
Zhang Xiaozhong
Publication date: 10/07/2017

There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action. We propose the novel problem of automatic advertisement understanding. To enable research on this problem, we create two datasets: an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. Our data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it?"), and symbolic references ads make (e.g. a dove symbolizes peace). We also analyze the most common persuasive strategies ads use, and the capabilities that computer vision systems should have to understand these strategies. We present baseline classification results for several prediction tasks, including automatically answering questions about the messages of the ads. Comment: To appear in CVPR 2017; data available on http://cs.pitt.edu/~kovashka/ad

arXiv.org e-Print Archive

Crossref

Domain Robustness in Multi-modality Learning and Visual Question Answering

Author: Zhang Mingda
Publication date: 17/01/2022

Humans perceive the world via multiple modalities, as information from a single modality is usually partial and incomplete. This observation motivates the development of machine learning algorithms capable of handling multi-modal data and performing intelligent reasoning. The recent resurgence of deep learning brings both opportunities and challenges to multi-modal reasoning. On the one hand, its strong representation learning capability provides a unified approach to represent information across multiple modalities. On the other hand, properly training such models typically requires enormous data, which is not always feasible, especially in the multi-modal setting. One promising direction to mitigate the lack of data for deep learning models is to transfer knowledge (e.g., gained from solving related problems) to low-resource domains. This procedure is known as transfer learning or domain adaptation, and it has demonstrated great success in various visual and linguistic applications. However, how to effectively transfer knowledge in a multi-modality setting remains a research question. In this thesis, we choose multi-modal reasoning as our target task and aim at improving the performance of deep neural networks on low-resource domains via domain adaptation. We first briefly discuss our prior work on advertisement understanding (as a typical multi-modal reasoning problem) and share our experience from addressing the data-availability challenge. Next, we turn to visual question answering, a more general problem that involves more complicated reasoning. We evaluate mainstream VQA models and classic single-modal domain adaptation strategies and show that existing methods usually suffer significant performance degradation when directly applied to the multi-modal setting. We measure the domain gaps in different modalities and design an effective strategy to manually control domain shifts on individual modalities, which helps better understand the problem.
Lastly, we present a systematic study across real datasets to answer a few fundamental questions regarding knowledge transfer in VQA, such as the sensitivity of various models to different types of supervision (i.e., unsupervised, self-supervised, semi-supervised, and fully supervised). We conclude by sharing the limitations and our vision for future research directions.

D-Scholarship@Pitt

Optimization Analysis of the Structural Design and Stability Parameters of a Rehabilitation Robot

Author: Gao Xueshan
Miao Mingda
Zhang Pengfei
Zhao Peng
Publication venue: University of Zagreb Faculty of Mechanical Engineering and Naval Architecture
Publication date: 01/01/2024

In this paper, a lower limb rehabilitation robot, suitable for stroke patients, is designed to meet the needs of the lower limb training in a later stage of rehabilitation. The rehabilitation robot is composed of a gantry structure, a driving system, a weight support system, and a human-computer interaction system. Such a robot can assist the patients to stand and walk on the ground. Because of the weakness of the lower limbs on the affected side, stroke patients find it difficult to maintain their own body balance. The patients may fall due to a change in body posture caused by insufficient body function. Therefore, it is necessary to evaluate the stability of the rehabilitation robot after being impacted by the patient's fall during use. This paper presents a method for the analysis of robot stability and develops an approximate mathematical model of the rehabilitation robot stability based on the response surface method. Optimal structural design parameters for the rehabilitation robot under impact are determined based on the response surface mathematical model. Finally, a stability experiment of the rehabilitation robot under the optimal structural parameters is performed. The experimental results demonstrate that the universal wheel maintains a close force contact with the ground, which proves the reliable stability of the robot.

HRČAK - Portal of Croatian Scientific and Professional Journals

Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples

Author: Wei Shaokui
Wu Baoyuan
Zha Hongyuan
Zhang Mingda
Publication date: 19/07/2023

Backdoor attacks are serious security threats to machine learning models where an adversary can inject poisoned samples into the training set, yielding a backdoored model that predicts poisoned samples with particular triggers as particular target classes, while behaving normally on benign samples. In this paper, we explore the task of purifying a backdoored model using a small clean dataset. By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk, which mainly captures the risk on the shared adversarial examples (SAEs) between the backdoored model and the purified model. This upper bound further suggests a novel bi-level optimization problem for mitigating backdoors using adversarial training techniques. To solve it, we propose Shared Adversarial Unlearning (SAU). Specifically, SAU first generates SAEs and then unlearns the generated SAEs such that they are either correctly classified by the purified model or classified differently by the two models, so that the backdoor effect in the backdoored model is mitigated in the purified model. Experiments on various benchmark datasets and network architectures show that our proposed method achieves state-of-the-art performance for backdoor defense.
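
The unlearning condition stated in this abstract can be written as a small predicate. The sketch below is only a literal reading of the abstract's wording, not the paper's formal definition; the function names and the assumption that predictions are plain class labels are hypothetical.

```python
def unlearning_goal_met(label, pred_backdoored, pred_purified):
    """Goal as worded in the abstract: the sample is correctly classified by
    the purified model, or the two models classify it differently."""
    return pred_purified == label or pred_purified != pred_backdoored


def is_shared_adversarial(label, pred_backdoored, pred_purified):
    """Assumed complement of the goal: the purified model is wrong and
    both models agree on the same (wrong) class."""
    return pred_purified != label and pred_purified == pred_backdoored


# Both models map the perturbed sample (true class 0) to class 1: still shared.
print(is_shared_adversarial(0, 1, 1))
# The purified model now predicts class 2 while the backdoored one predicts 1.
print(unlearning_goal_met(0, 1, 2))
```

Under this reading the two predicates are exact complements, so driving a sample out of the shared set is the same as meeting the unlearning goal for it.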

arXiv.org e-Print Archive

VDC: Versatile Data Cleanser for Detecting Dirty Samples via Visual-Linguistic Inconsistency

Author: Wei Shaokui
Wu Baoyuan
Wu Bingzhe
Zhang Mingda
Zhu Zihao
Publication date: 28/09/2023

The role of data in building AI systems has recently been emphasized by the emerging concept of data-centric AI. Unfortunately, in the real world, datasets may contain dirty samples, such as poisoned samples from backdoor attacks, noisy labels from crowdsourcing, and even hybrids of them. The presence of such dirty samples makes DNNs vulnerable and unreliable. Hence, it is critical to detect dirty samples to improve the quality and reliability of datasets. Existing detectors focus only on detecting poisoned samples or noisy labels, and are often prone to weak generalization when dealing with dirty samples from other domains. In this paper, we find that a commonality of various dirty samples is visual-linguistic inconsistency between images and associated labels. To capture the semantic inconsistency between modalities, we propose the versatile data cleanser (VDC), leveraging the surpassing capabilities of multimodal large language models (MLLM) in cross-modal alignment and reasoning. It consists of three consecutive modules: the visual question generation module to generate insightful questions about the image; the visual question answering module to acquire the semantics of the visual content by answering the questions with MLLM; followed by the visual answer evaluation module to evaluate the inconsistency. Extensive experiments demonstrate its superior performance and generalization to various categories and types of dirty samples. Comment: 22 pages, 5 figures, 17 tables
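
The three-module flow described in this abstract (question generation, question answering, answer evaluation) can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: the question template, the `toy_answer` function that replaces the MLLM, the representation of an image as a set of object names, and the 0.5 threshold are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Set


def generate_questions(label: str) -> List[str]:
    # Stand-in for the question generation module (assumed template).
    return [f"Is there a {label} in the image?"]


def toy_answer(image_objects: Set[str], question: str) -> str:
    # Stand-in for the MLLM-based answering module: the "image" is just
    # a set of object names, and the answer is a string match.
    return "yes" if any(obj in question for obj in image_objects) else "no"


def is_inconsistent(
    image_objects: Set[str],
    label: str,
    answer_fn: Callable[[Set[str], str], str],
    threshold: float = 0.5,
) -> bool:
    # Stand-in for the answer evaluation module: flag the sample when the
    # share of non-affirming answers exceeds the (assumed) threshold.
    questions = generate_questions(label)
    answers = [answer_fn(image_objects, q) for q in questions]
    mismatches = sum(1 for a in answers if a.strip().lower() != "yes")
    return mismatches / len(questions) > threshold


# An image containing a cat but labeled "dog" is flagged; a matching label is not.
print(is_inconsistent({"cat"}, "dog", toy_answer))
print(is_inconsistent({"cat"}, "cat", toy_answer))
```

Passing the answering function as a parameter keeps the evaluation step independent of whichever model answers the questions.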

arXiv.org e-Print Archive