3,621 research outputs found

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Full text link
    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, substantial local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics across application scenarios and paradigms. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. The categorization of local mechanisms in each field is summarized, and the advantages and disadvantages of each category are analyzed in depth, leaving room for further exploration. Finally, future research directions for local mechanisms that may benefit future work are discussed. To the best of our knowledge, this is the first survey of local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.

    Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

    Full text link
    This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of their responses to reference answers provided by computer vision experts. Model selection is predicated on an analysis of how transformers are utilized in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers in a driving context. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of querying VQA models in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
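
    A minimal sketch of the similarity-based scoring described in this abstract, assuming a generic sentence-embedding model; the embedding model name and the example answers are illustrative placeholders, not details taken from the paper.

        # Hypothetical sketch: score a VQA model's answer against an expert
        # reference answer by embedding both and measuring cosine similarity.
        from sentence_transformers import SentenceTransformer, util

        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

        def answer_similarity(model_answer: str, reference_answer: str) -> float:
            """Cosine similarity between a model answer and a reference answer."""
            embeddings = encoder.encode([model_answer, reference_answer], convert_to_tensor=True)
            return util.cos_sim(embeddings[0], embeddings[1]).item()

        # Example: comparing a candidate answer to an expert reference.
        score = answer_similarity(
            "The traffic light ahead is red, so the car should stop.",
            "The vehicle must stop because the signal is red.",
        )
        print(f"similarity = {score:.3f}")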

    Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition

    Full text link
    Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multi-modal fusion, the combination of metrics from multiple data sources, has been shown to be a sound method for improving recognition performance. However, while promising results have been reported by recent multi-modal-based models, they generally fail to leverage sophisticated fusion strategies that would model sufficient cross-modal interactions when producing the fusion representation; instead, current methods rely on lengthy and inconsistent data preprocessing and feature crafting. To address this limitation, we propose an end-to-end multi-modal transformer framework for multi-modal human state recognition called Husformer. Specifically, we propose to use cross-modal transformers, which encourage one modality to reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Using these two attention mechanisms enables effective and adaptive adjustment to noise and interruptions in multi-modal signals during the fusion process and with respect to high-level features. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad) demonstrate that, in human state recognition, our Husformer outperforms both state-of-the-art multi-modal baselines and the use of a single modality by a large margin, especially when dealing with raw multi-modal signals. We also conduct an ablation study to show the benefits of each component in Husformer.
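
    A minimal illustration (not the authors' code) of the cross-modal attention idea the abstract describes: one modality's features act as queries and attend directly to another modality's features, followed by self-attention over the fused sequence. Dimensions are placeholders.

        import torch
        import torch.nn as nn

        class CrossModalFusion(nn.Module):
            def __init__(self, dim: int = 64, heads: int = 4):
                super().__init__()
                self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, modality_a: torch.Tensor, modality_b: torch.Tensor) -> torch.Tensor:
                # Modality A queries modality B: A is reinforced by relevant parts of B.
                fused, _ = self.cross_attn(query=modality_a, key=modality_b, value=modality_b)
                # Self-attention then prioritizes contextual information in the fusion.
                out, _ = self.self_attn(fused, fused, fused)
                return out

        # Example: fuse two signal streams with different sequence lengths.
        a = torch.randn(2, 50, 64)   # e.g. one physiological signal's features
        b = torch.randn(2, 120, 64)  # e.g. another sensor's features
        print(CrossModalFusion()(a, b).shape)  # torch.Size([2, 50, 64])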

    Sign language recognition with transformer networks

    Get PDF
    Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction using OpenPose for human keypoint estimation and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art in sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation.
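
    A hedged sketch (not the paper's implementation) of classifying an isolated sign from a sequence of OpenPose keypoints with a transformer encoder; the keypoint count, model sizes, and pooling choice are illustrative assumptions.

        import torch
        import torch.nn as nn

        class KeypointSignClassifier(nn.Module):
            def __init__(self, num_keypoint_coords: int = 2 * 137,  # (x, y) per OpenPose keypoint
                         d_model: int = 128, num_classes: int = 100):
                super().__init__()
                self.embed = nn.Linear(num_keypoint_coords, d_model)
                layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=4)
                self.head = nn.Linear(d_model, num_classes)

            def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
                # keypoints: (batch, frames, num_keypoint_coords) extracted by OpenPose
                x = self.encoder(self.embed(keypoints))
                return self.head(x.mean(dim=1))  # pool over frames, predict the sign class

        clips = torch.randn(4, 60, 2 * 137)  # 4 clips, 60 frames each
        print(KeypointSignClassifier()(clips).shape)  # torch.Size([4, 100])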

    Guide Your Agent with Adaptive Multimodal Rewards

    Full text link
    Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization, resulting in superior generalization performance even when faced with unseen text instructions, compared to existing text-conditioned policies. To improve the quality of the rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp.
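
    A minimal sketch of the multimodal-reward idea, assuming CLIP accessed through the Hugging Face transformers library; this is not the authors' implementation, and the model checkpoint and example instruction are placeholders.

        import torch
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        @torch.no_grad()
        def multimodal_reward(frame, instruction: str) -> float:
            """Cosine similarity between a visual observation and a text instruction."""
            inputs = processor(text=[instruction], images=frame, return_tensors="pt", padding=True)
            image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
            image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
            text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
            return (image_emb * text_emb).sum().item()

        # Per-timestep rewards like this can label expert demonstrations for
        # training a return-conditioned policy.
        frame = Image.new("RGB", (224, 224))  # stand-in for an environment observation
        print(multimodal_reward(frame, "open the drawer"))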

    A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System

    Full text link
    Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system: the former understands the user by capturing the necessary information in the form of slots, and the latter generates an appropriate response in accordance with the extracted information. Recently, dialogue systems integrated with complementary information such as images, audio, or video have gained immense popularity. In this work, we propose an end-to-end framework that extracts the necessary slot values from an utterance and generates a coherent response, thereby assisting the user in achieving their desired goals in a multimodal dialogue system containing both textual and visual information. The task of extracting the necessary information depends not only on the text but also on the visual cues present in the dialogue. Similarly, for generation, the previous dialogue context comprising multimodal information is significant for providing coherent and informative responses. We employ a multimodal hierarchical encoder using pre-trained DialoGPT and also exploit a knowledge base (KB) to provide stronger context for both tasks. We further design a slot attention mechanism to focus on the necessary information in a given utterance. Finally, a decoder generates the corresponding response for the given dialogue context and the extracted slot values. Experimental results on the Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms the baseline approaches on both tasks. The code is available at https://github.com/avinashsai/slot-gpt. (Published in Multimedia Tools and Applications.)
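
    An illustrative sketch (not the paper's code) of a slot attention mechanism in the spirit of the abstract: learnable slot queries attend over the encoder's token representations to pull out slot-relevant information. The hidden size and slot count are assumptions.

        import torch
        import torch.nn as nn

        class SlotAttention(nn.Module):
            def __init__(self, hidden: int = 256, num_slots: int = 8):
                super().__init__()
                self.slot_queries = nn.Parameter(torch.randn(num_slots, hidden))
                self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

            def forward(self, token_states: torch.Tensor) -> torch.Tensor:
                # token_states: (batch, seq_len, hidden) from the multimodal encoder
                queries = self.slot_queries.unsqueeze(0).expand(token_states.size(0), -1, -1)
                slot_repr, _ = self.attn(queries, token_states, token_states)
                return slot_repr  # (batch, num_slots, hidden), fed to slot-value prediction

        states = torch.randn(2, 40, 256)  # encoded dialogue utterance
        print(SlotAttention()(states).shape)  # torch.Size([2, 8, 256])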

    Content Representation Learning for Cold-Start Video Recommendation Systems

    Get PDF
    Master's thesis -- Seoul National University, Graduate School of Data Science, Department of Data Science, February 2023. Advisor: Joonseok Lee (이쀀석). Cold-start item recommendation is a long-standing challenge in recommendation systems. A common approach to tackling the cold-start problem is to use content-based methods, but in movie recommendation the rich information available in raw video contents or textual descriptions has not been fully utilized. In this paper, we propose a general cold-start recommendation framework that learns multimodal content representations from the rich information in raw videos and text, directly optimized over user-item interactions instead of using embeddings pretrained on a proxy pretext task. In addition, we further exploit multimodal alignment of the item contents in a self-supervised manner, revealing great potential in content representation learning. Through extensive experiments on public benchmarks, we verify the effectiveness of our method, achieving state-of-the-art performance on cold-start movie recommendation.
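
    A hedged sketch, not the thesis implementation, of optimizing a content-based item representation directly against user-item interactions: video and text embeddings are fused into an item vector and trained with a pairwise (BPR-style) ranking loss. All encoders, dimensions, and the loss choice are illustrative assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ContentColdStartModel(nn.Module):
            def __init__(self, num_users: int, video_dim: int = 512,
                         text_dim: int = 384, dim: int = 128):
                super().__init__()
                self.user_emb = nn.Embedding(num_users, dim)
                self.item_proj = nn.Sequential(
                    nn.Linear(video_dim + text_dim, dim), nn.ReLU(), nn.Linear(dim, dim)
                )

            def score(self, users, video_feat, text_feat):
                # Fuse raw-content features into an item vector; score by dot product.
                item = self.item_proj(torch.cat([video_feat, text_feat], dim=-1))
                return (self.user_emb(users) * item).sum(-1)

        def bpr_loss(pos_scores, neg_scores):
            # Interacted (positive) items should rank above sampled negatives.
            return -F.logsigmoid(pos_scores - neg_scores).mean()

        model = ContentColdStartModel(num_users=1000)
        users = torch.randint(0, 1000, (32,))
        pos = model.score(users, torch.randn(32, 512), torch.randn(32, 384))
        neg = model.score(users, torch.randn(32, 512), torch.randn(32, 384))
        print(bpr_loss(pos, neg))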