73 research outputs found

    Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

    Visual commonsense understanding requires Vision-Language (VL) models not only to understand images and text but also to cross-reference between them, so as to fully integrate and comprehend the visual scene being described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, due to limited evaluation data resources, it is unclear whether the models truly understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, the text, and related knowledge. We then take a step further and show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not vice versa; (2) visual information is generally under-utilized compared with text. Comment: Accepted to EMNLP 2022 (Long Paper)
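
    For intuition only, here is a hypothetical, template-based sketch of how question-answer pairs might be generated from scene annotations; the function name, input format, and templates are illustrative assumptions, not the paper's actual ME pipeline:

```python
# Hypothetical sketch: template-based QA generation from scene annotations.
# The ME paper's pipeline is more sophisticated; this only illustrates the idea
# of probing both low-level (existence) and higher-level (relational) understanding.

def generate_qa_pairs(objects, relations):
    """objects: list of object labels, e.g. ["person", "umbrella"];
    relations: list of (subject, predicate, object) triples,
    e.g. [("person", "holding", "umbrella")]."""
    qa = []
    # Low-level visual questions: object existence
    for obj in objects:
        qa.append((f"Is there a {obj} in the image?", "yes"))
    # Higher-level questions: relations between objects
    for subj, pred, obj in relations:
        qa.append((f"What is the {subj} {pred}?", obj))
    return qa

print(generate_qa_pairs(["person", "umbrella"],
                        [("person", "holding", "umbrella")]))
```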

    MMBench: Is Your Multi-modal Model an All-around Player?

    Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline comprising two main elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT, which converts free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench
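
    As a rough illustration of the CircularEval idea, the sketch below counts a multiple-choice question as solved only if the model picks the ground-truth option under every circular shift of the choices; the `model` interface is an assumption, and MMBench additionally uses ChatGPT to map free-form answers onto the given choices:

```python
# Minimal sketch of a CircularEval-style loop. `model` is assumed to be a
# callable that takes (question, choices) and returns the index of its pick.

def circular_eval(model, question: str, choices: list[str], answer_idx: int) -> bool:
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        target = (answer_idx - shift) % n   # where the ground truth lands after rotation
        if model(question, rotated) != target:
            return False                    # one wrong rotation fails the question
    return True
```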

    Emotion-aware cross-modal domain adaptation in video sequences

    Representation learning on heterogeneous spatiotemporal networks

    “The problem of learning latent representations of heterogeneous networks with spatial and temporal attributes has been gaining traction in recent years, given its myriad of real-world applications. Most systems with applications in the fields of transportation, urban economics, medical information, online e-commerce, etc., handle big data that can be structured into Spatiotemporal Heterogeneous Networks (SHNs), making efficient analysis of these networks extremely vital. In recent years, representation learning models have proven quite efficient at capturing effective lower-dimensional representations of data. However, capturing efficient representations of SHNs continues to pose a challenge for the following reasons: (i) spatiotemporal data structured as SHNs encapsulates complex spatial and temporal relationships among real-world objects, rendering traditional feature-engineering approaches inefficient and compute-intensive; (ii) due to the unique nature of SHNs, existing representation learning techniques cannot be directly adopted to capture their representations. To address the problem of learning representations of SHNs, four novel frameworks that focus on their unique spatial and temporal characteristics are introduced: (i) collective representation learning, which quantifies the importance of each latent feature using Laplacian scores; (ii) modality-aware representation learning, which learns from complex user mobility patterns; (iii) distributed representation learning, which learns human mobility patterns by leveraging Natural Language Processing algorithms; and (iv) representation learning with node-sense disambiguation, which learns contrastive senses of nodes in SHNs. The developed frameworks can help capture higher-order spatial and temporal interactions of real-world SHNs. Through data-driven simulations, machine learning and deep learning models trained on the representations learned from the developed frameworks are shown to be much more efficient and effective.”--Abstract, page iii
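
    To make the Laplacian-score idea concrete, here is a minimal sketch of the classic Laplacian score (He et al., 2005) computed over a precomputed affinity matrix; how the thesis actually builds and weights the SHN graph is an assumption not shown here:

```python
import numpy as np

def laplacian_scores(X, W):
    """X: (n_samples, n_features) feature matrix; W: (n, n) affinity matrix.
    Lower scores indicate features that better preserve the local structure
    of the graph -- a proxy for feature importance."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                         # unnormalized graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ d) / d.sum()     # remove the D-weighted mean (shift-invariance)
        scores[r] = (f @ L @ f) / (f @ D @ f)
    return scores
```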

    Semantic discovery and reuse of business process patterns

    Patterns currently play an important role in modern information systems (IS) development, but their use has mainly been restricted to the design and implementation phases of the development lifecycle. Given the increasing significance of business modelling in IS development, patterns have the potential to provide a viable solution for promoting the reusability of recurrent generalized models in the very early stages of development. As a statement of research in progress, this paper focuses on business process patterns and proposes an initial methodological framework for the discovery and reuse of business process patterns within the IS development lifecycle. The framework borrows ideas from the domain engineering literature and proposes the use of semantics to drive both the discovery of patterns and their reuse.

    WiFi-Based Human Activity Recognition Using Attention-Based BiLSTM

    Recently, significant efforts have been made to explore human activity recognition (HAR) techniques that use information gathered by existing indoor wireless infrastructures through WiFi signals, without requiring the monitored subject to carry a dedicated device. The key intuition is that different activities introduce different multi-path effects in WiFi signals and generate different patterns in the time series of channel state information (CSI). In this paper, we propose and evaluate a full pipeline for a CSI-based human activity recognition framework covering 12 activities in three different spatial environments, using two deep learning models: ABiLSTM and CNN-ABiLSTM. Evaluation experiments have demonstrated that the proposed models outperform state-of-the-art models. The experiments also show that the proposed models can be applied to other environments with different configurations, albeit with some caveats. The proposed ABiLSTM model achieves overall accuracies of 94.03%, 91.96%, and 92.59% across the three target environments, while the proposed CNN-ABiLSTM model reaches accuracies of 98.54%, 94.25%, and 95.09% across those same environments.
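
    A minimal PyTorch sketch of an attention-based BiLSTM over CSI time series is shown below; the layer sizes, the attention form, and the input shape are assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ABiLSTM(nn.Module):
    """Sketch of an attention-based BiLSTM classifier for CSI sequences.
    Input x: (batch, time, subcarriers)."""
    def __init__(self, n_subcarriers=90, hidden=128, n_classes=12):
        super().__init__()
        self.bilstm = nn.LSTM(n_subcarriers, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # scalar score per time step
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        h, _ = self.bilstm(x)                   # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)                # attention-weighted context vector
        return self.fc(ctx)                     # class logits

logits = ABiLSTM()(torch.randn(4, 200, 90))    # 4 clips, 200 time steps
```

    A CNN-ABiLSTM variant would, presumably, prepend convolutional layers to extract local spectro-temporal features before the recurrent stage.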

    Towards Generalizable Deep Image Matting: Decomposition, Interaction, and Merging

    Image matting refers to extracting precise alpha mattes from images and plays a critical role in many downstream applications. Despite extensive attention, key challenges persist and motivate the research presented in this thesis. One major challenge is the reliance on auxiliary inputs in previous methods, which hinders real-time practicality. To address this, we introduce fully automatic image matting by decomposing the task into high-level semantic segmentation and low-level details matting. We then incorporate plug-in modules to enhance the interaction between the sub-tasks through feature integration. Furthermore, we propose an attention-based mechanism to guide the matting process through collaborative merging. Another challenge lies in limited matting datasets, which result in reliance on composite images and inferior performance on images in the wild. In response, our research proposes a composition route to mitigate the discrepancies and achieve remarkable generalization ability. Additionally, we construct numerous large datasets of high-quality real-world images with manually labeled alpha mattes, providing a solid foundation for training and evaluation. Moreover, our research uncovers new observations that warrant further investigation. Firstly, we systematically analyze and address privacy issues that have been neglected in previous portrait matting research. Secondly, we explore the adaptation of automatic matting methods to non-salient or transparent categories beyond salient ones. Furthermore, we collaborate with the language modality to achieve a more controllable matting process, enabling specific target selection at low cost. To validate our studies, we conduct extensive experiments and provide all code and datasets at https://github.com/JizhiziLi/. We believe that the analyses, methods, and datasets presented in this thesis will offer valuable insights for future research endeavors in the field of image matting.
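
    As a hand-written analogue of the decomposition-and-merging idea (the thesis learns this fusion with attention, so the rule below is only an illustrative assumption), the sketch keeps the semantic branch where it is confident and defers to the detail matte in the uncertain transition region:

```python
import numpy as np

def merge_alpha(semantic, detail_alpha):
    """semantic: (H, W) map with 0 = background, 1 = transition, 2 = foreground;
    detail_alpha: (H, W) low-level matte in [0, 1]. Returns a fused alpha matte."""
    fg = (semantic == 2).astype(np.float32)     # confident foreground -> alpha 1
    trans = (semantic == 1).astype(np.float32)  # boundary -> trust the detail matte
    return fg + trans * detail_alpha            # confident background stays 0
```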

    Learning by correlation for computer vision applications: from Kernel methods to deep learning

    Learning to spot analogies and differences within and across visual categories is arguably a powerful approach in machine learning and pattern recognition, one directly inspired by human cognition. In this thesis, we investigate a variety of approaches that are primarily driven by correlation and tackle several computer vision applications.
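
    As one concrete instance of correlation-driven learning with kernel methods, the sketch below defines a Pearson-correlation kernel that can be plugged into an SVM; this is an illustrative assumption, not the thesis's specific formulation:

```python
import numpy as np
from sklearn.svm import SVC

def correlation_kernel(A, B):
    """Pearson correlation between every pair of row-vector samples."""
    A = (A - A.mean(1, keepdims=True)) / (A.std(1, keepdims=True) + 1e-8)
    B = (B - B.mean(1, keepdims=True)) / (B.std(1, keepdims=True) + 1e-8)
    return (A @ B.T) / A.shape[1]   # Gram matrix of correlations

# Toy usage: scikit-learn's SVC accepts a callable kernel.
X = np.random.randn(20, 50)                  # 20 samples, 50-dim features
y = np.random.randint(0, 2, size=20)
clf = SVC(kernel=correlation_kernel).fit(X, y)
```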