3,621 research outputs found

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Full text link
    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, substantial local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics across application scenarios and paradigms. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. The categorization of local mechanisms in each field is summarized, and the advantages and disadvantages of each category are analyzed in depth, leaving room for further exploration. Finally, future research directions for local mechanisms that may benefit future work are discussed. To the best of our knowledge, this is the first survey of local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.

    Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

    Full text link
    This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of their responses to reference answers provided by computer vision experts. Model selection is predicated on an analysis of how transformers are utilized in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers in a driving context. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of querying VQA models in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
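
    A minimal sketch of the similarity-based scoring described in this abstract, assuming a generic sentence-embedding model; the embedding model name and the example answers are illustrative placeholders, not details taken from the paper.

        # Hypothetical sketch: score a VQA model's answer against an expert
        # reference answer by embedding both and measuring cosine similarity.
        from sentence_transformers import SentenceTransformer, util

        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

        def answer_similarity(model_answer: str, reference_answer: str) -> float:
            """Cosine similarity between a model answer and a reference answer."""
            embeddings = encoder.encode([model_answer, reference_answer], convert_to_tensor=True)
            return util.cos_sim(embeddings[0], embeddings[1]).item()

        # Example: comparing a candidate answer to an expert reference.
        score = answer_similarity(
            "The traffic light ahead is red, so the car should stop.",
            "The vehicle must stop because the signal is red.",
        )
        print(f"similarity = {score:.3f}")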

    Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition

    Full text link
    Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multi-modal fusion, the combination of metrics from multiple data sources, has been shown to be a sound method for improving recognition performance. However, while promising results have been reported by recent multi-modal-based models, they generally fail to leverage sophisticated fusion strategies that would model sufficient cross-modal interactions when producing the fusion representation; instead, current methods rely on lengthy and inconsistent data preprocessing and feature crafting. To address this limitation, we propose an end-to-end multi-modal transformer framework for multi-modal human state recognition called Husformer. Specifically, we propose to use cross-modal transformers, which encourage one modality to reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Using these two attention mechanisms enables effective and adaptive adjustment to noise and interruptions in multi-modal signals during the fusion process and with respect to high-level features. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad) demonstrate that, in human state recognition, our Husformer outperforms both state-of-the-art multi-modal baselines and the use of a single modality by a large margin, especially when dealing with raw multi-modal signals. We also conduct an ablation study to show the benefits of each component in Husformer.
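
    A minimal illustration (not the authors' code) of the cross-modal attention idea the abstract describes: one modality's features act as queries and attend directly to another modality's features, followed by self-attention over the fused sequence. Dimensions are placeholders.

        import torch
        import torch.nn as nn

        class CrossModalFusion(nn.Module):
            def __init__(self, dim: int = 64, heads: int = 4):
                super().__init__()
                self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, modality_a: torch.Tensor, modality_b: torch.Tensor) -> torch.Tensor:
                # Modality A queries modality B: A is reinforced by relevant parts of B.
                fused, _ = self.cross_attn(query=modality_a, key=modality_b, value=modality_b)
                # Self-attention then prioritizes contextual information in the fusion.
                out, _ = self.self_attn(fused, fused, fused)
                return out

        # Example: fuse two signal streams with different sequence lengths.
        a = torch.randn(2, 50, 64)   # e.g. one physiological signal's features
        b = torch.randn(2, 120, 64)  # e.g. another sensor's features
        print(CrossModalFusion()(a, b).shape)  # torch.Size([2, 50, 64])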

    Sign language recognition with transformer networks

    Get PDF
    Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction using OpenPose for human keypoint estimation and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art in sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation.
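
    A hedged sketch (not the paper's implementation) of classifying an isolated sign from a sequence of OpenPose keypoints with a transformer encoder; the keypoint count, model sizes, and pooling choice are illustrative assumptions.

        import torch
        import torch.nn as nn

        class KeypointSignClassifier(nn.Module):
            def __init__(self, num_keypoint_coords: int = 2 * 137,  # (x, y) per OpenPose keypoint
                         d_model: int = 128, num_classes: int = 100):
                super().__init__()
                self.embed = nn.Linear(num_keypoint_coords, d_model)
                layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=4)
                self.head = nn.Linear(d_model, num_classes)

            def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
                # keypoints: (batch, frames, num_keypoint_coords) extracted by OpenPose
                x = self.encoder(self.embed(keypoints))
                return self.head(x.mean(dim=1))  # pool over frames, predict the sign class

        clips = torch.randn(4, 60, 2 * 137)  # 4 clips, 60 frames each
        print(KeypointSignClassifier()(clips).shape)  # torch.Size([4, 100])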

    Guide Your Agent with Adaptive Multimodal Rewards

    Full text link
    Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization, resulting in superior generalization performance even when faced with unseen text instructions, compared to existing text-conditioned policies. To improve the quality of the rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp.
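
    A minimal sketch of the multimodal-reward idea, assuming CLIP accessed through the Hugging Face transformers library; this is not the authors' implementation, and the model checkpoint and example instruction are placeholders.

        import torch
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        @torch.no_grad()
        def multimodal_reward(frame, instruction: str) -> float:
            """Cosine similarity between a visual observation and a text instruction."""
            inputs = processor(text=[instruction], images=frame, return_tensors="pt", padding=True)
            image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
            image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
            text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
            return (image_emb * text_emb).sum().item()

        # Per-timestep rewards like this can label expert demonstrations for
        # training a return-conditioned policy.
        frame = Image.new("RGB", (224, 224))  # stand-in for an environment observation
        print(multimodal_reward(frame, "open the drawer"))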

    A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System

    Full text link
    Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system: the former understands the user by capturing the necessary information in the form of slots, and the latter generates an appropriate response in accordance with the extracted information. Recently, dialogue systems integrated with complementary information such as images, audio, or video have gained immense popularity. In this work, we propose an end-to-end framework that extracts the necessary slot values from an utterance and generates a coherent response, thereby assisting the user in achieving their desired goals in a multimodal dialogue system containing both textual and visual information. The task of extracting the necessary information depends not only on the text but also on the visual cues present in the dialogue. Similarly, for generation, the previous dialogue context comprising multimodal information is significant for providing coherent and informative responses. We employ a multimodal hierarchical encoder using pre-trained DialoGPT and also exploit a knowledge base (KB) to provide stronger context for both tasks. We further design a slot attention mechanism to focus on the necessary information in a given utterance. Finally, a decoder generates the corresponding response for the given dialogue context and the extracted slot values. Experimental results on the Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms the baseline approaches on both tasks. The code is available at https://github.com/avinashsai/slot-gpt. (Published in Multimedia Tools and Applications.)
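
    An illustrative sketch (not the paper's code) of a slot attention mechanism in the spirit of the abstract: learnable slot queries attend over the encoder's token representations to pull out slot-relevant information. The hidden size and slot count are assumptions.

        import torch
        import torch.nn as nn

        class SlotAttention(nn.Module):
            def __init__(self, hidden: int = 256, num_slots: int = 8):
                super().__init__()
                self.slot_queries = nn.Parameter(torch.randn(num_slots, hidden))
                self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

            def forward(self, token_states: torch.Tensor) -> torch.Tensor:
                # token_states: (batch, seq_len, hidden) from the multimodal encoder
                queries = self.slot_queries.unsqueeze(0).expand(token_states.size(0), -1, -1)
                slot_repr, _ = self.attn(queries, token_states, token_states)
                return slot_repr  # (batch, num_slots, hidden), fed to slot-value prediction

        states = torch.randn(2, 40, 256)  # encoded dialogue utterance
        print(SlotAttention()(states).shape)  # torch.Size([2, 8, 256])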

    Content Representation Learning for Cold-Start Video Recommendation Systems

    Get PDF
    Master's thesis -- Seoul National University, Graduate School of Data Science, Department of Data Science, February 2023. Advisor: Joonseok Lee (이쀀석). Cold-start item recommendation is a long-standing challenge in recommendation systems. A common approach to tackling the cold-start problem is to use content-based methods, but in movie recommendation the rich information available in raw video contents or textual descriptions has not been fully utilized. In this paper, we propose a general cold-start recommendation framework that learns multimodal content representations from the rich information in raw videos and text, directly optimized over user-item interactions instead of using embeddings pretrained on a proxy pretext task. In addition, we further exploit multimodal alignment of the item contents in a self-supervised manner, revealing great potential in content representation learning. Through extensive experiments on public benchmarks, we verify the effectiveness of our method, achieving state-of-the-art performance on cold-start movie recommendation.
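
    A hedged sketch, not the thesis implementation, of optimizing a content-based item representation directly against user-item interactions: video and text embeddings are fused into an item vector and trained with a pairwise (BPR-style) ranking loss. All encoders, dimensions, and the loss choice are illustrative assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ContentColdStartModel(nn.Module):
            def __init__(self, num_users: int, video_dim: int = 512,
                         text_dim: int = 384, dim: int = 128):
                super().__init__()
                self.user_emb = nn.Embedding(num_users, dim)
                self.item_proj = nn.Sequential(
                    nn.Linear(video_dim + text_dim, dim), nn.ReLU(), nn.Linear(dim, dim)
                )

            def score(self, users, video_feat, text_feat):
                # Fuse raw-content features into an item vector; score by dot product.
                item = self.item_proj(torch.cat([video_feat, text_feat], dim=-1))
                return (self.user_emb(users) * item).sum(-1)

        def bpr_loss(pos_scores, neg_scores):
            # Interacted (positive) items should rank above sampled negatives.
            return -F.logsigmoid(pos_scores - neg_scores).mean()

        model = ContentColdStartModel(num_users=1000)
        users = torch.randint(0, 1000, (32,))
        pos = model.score(users, torch.randn(32, 512), torch.randn(32, 384))
        neg = model.score(users, torch.randn(32, 512), torch.randn(32, 384))
        print(bpr_loss(pos, neg))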