3,621 research outputs found
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that the human brain can emphasize discriminative parts of
the input and suppress irrelevant ones, numerous local mechanisms have been
designed to boost the development of computer vision. They can not only focus
on target parts to learn discriminative local representations, but also process
information selectively to improve efficiency. Local mechanisms have different
characteristics depending on the application scenario and paradigm. In this
survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on. The
categorization of local mechanisms in each field is summarized, and the
advantages and disadvantages of every category are analyzed in depth, leaving
room for further exploration. Finally, future research directions for local
mechanisms that may benefit future work are also discussed. To the best of our
knowledge, this is the first survey of local mechanisms in computer vision. We
hope that it can shed light on future research in the field.
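As a concrete illustration of the kind of local mechanism this survey covers, here is a minimal windowed self-attention sketch in PyTorch (our own illustration, not code from the survey): each position attends only to a fixed-size neighborhood, focusing on nearby parts and suppressing the rest.

    # Minimal sketch of a local mechanism: windowed self-attention.
    # Illustrative only; the window size and shapes are arbitrary choices.
    import torch
    import torch.nn.functional as F

    def local_self_attention(x, window=3):
        """x: (seq_len, dim). Each position attends to +/- `window` neighbors."""
        seq_len, dim = x.shape
        scores = x @ x.T / dim ** 0.5                    # full pairwise scores
        idx = torch.arange(seq_len)
        non_local = (idx[None, :] - idx[:, None]).abs() > window
        scores = scores.masked_fill(non_local, float("-inf"))  # suppress far pairs
        return F.softmax(scores, dim=-1) @ x             # (seq_len, dim)

    out = local_self_attention(torch.randn(10, 16))      # toy input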
Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving
This short paper presents a preliminary analysis of three popular Visual
Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the
context of answering questions relating to driving scenarios. The performance
of these models is evaluated by comparing the similarity of responses to
reference answers provided by computer vision experts. Model selection is
predicated on the analysis of transformer utilization in multimodal
architectures. The results indicate that models incorporating cross-modal
attention and late fusion techniques exhibit promising potential for generating
improved answers in a driving context. This initial analysis serves as a
launchpad for a forthcoming comprehensive comparative study involving nine
VQA models and sets the scene for further investigation into the effectiveness
of VQA model queries in self-driving scenarios. Supplementary material is
available at
https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving
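The abstract does not spell out the similarity metric, so here is a minimal sketch of the evaluation idea, assuming cosine similarity over sentence embeddings; the sentence-transformers library and checkpoint name are our choices, not the paper's.

    # Score a VQA model's free-form answer against an expert reference answer.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

    def answer_similarity(model_answer: str, reference: str) -> float:
        emb = encoder.encode([model_answer, reference], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()     # 1.0 = identical meaning

    print(answer_similarity("The car should stop at the red light.",
                            "The vehicle must stop because the light is red."))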
Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition
Human state recognition is a critical topic with pervasive and important
applications in human-machine systems. Multi-modal fusion, the combination of
metrics from multiple data sources, has been shown to be a sound method for
improving recognition performance. However, while promising results have
been reported by recent multi-modal-based models, they generally fail to
leverage the sophisticated fusion strategies that would model sufficient
cross-modal interactions when producing the fusion representation; instead,
current methods rely on lengthy and inconsistent data preprocessing and feature
crafting. To address this limitation, we propose an end-to-end multi-modal
transformer framework for multi-modal human state recognition called
Husformer. Specifically, we propose to use cross-modal transformers, which
allow one modality to reinforce itself by directly attending to the latent
relevance revealed in other modalities, to fuse different modalities while
ensuring sufficient awareness of the cross-modal interactions introduced.
Subsequently, we utilize a self-attention transformer to further prioritize
contextual information in the fusion representation. Using two such attention
mechanisms enables effective and adaptive adjustments to noise and
interruptions in multi-modal signals during the fusion process and in relation
to high-level features. Extensive experiments on two human emotion corpora
(DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad)
demonstrate that in the recognition of human state, our Husformer outperforms
both state-of-the-art multi-modal baselines and the use of a single modality by
a large margin, especially when dealing with raw multi-modal signals. We also
conducted an ablation study to show the benefits of each component in
Husformer.
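A minimal sketch of the two attention stages described above, using stock PyTorch layers (an illustration of the idea, not the authors' Husformer code; the modality names and sizes are invented):

    import torch
    import torch.nn as nn

    dim, heads = 64, 4
    cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    eeg = torch.randn(2, 50, dim)   # modality A: (batch, time, dim), dummy data
    ecg = torch.randn(2, 80, dim)   # modality B, different sequence length

    # Cross-modal stage: modality A reinforces itself by attending to B.
    fused, _ = cross_attn(query=eeg, key=ecg, value=ecg)
    # Self-attention stage: prioritize contextual information in the fusion.
    fused, _ = self_attn(fused, fused, fused)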
Sign language recognition with transformer networks
Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid research into sign languages and sign language recognition. Previous research has approached sign language recognition in various ways, using feature extraction techniques or end-to-end deep learning. In this work, we apply a combination of feature extraction using OpenPose for human keypoint estimation and end-to-end feature learning with Convolutional Neural Networks. The proven multi-head attention mechanism used in transformers is applied to recognize isolated signs in the Flemish Sign Language corpus. Our proposed method significantly outperforms the previous state of the art in sign language recognition on the Flemish Sign Language corpus: we obtain an accuracy of 74.7% on a vocabulary of 100 classes. Our results will be implemented as a suggestion system for sign language corpus annotation.
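A rough sketch of the pipeline this abstract describes: a sequence of 2D pose keypoints (e.g., from OpenPose) classified with transformer-style multi-head attention. The layer sizes and 25-keypoint layout are our assumptions; only the 100-class vocabulary comes from the paper.

    import torch
    import torch.nn as nn

    n_keypoints, n_classes, dim = 25, 100, 128   # 100 classes as in the paper

    class SignClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Linear(n_keypoints * 2, dim)   # (x, y) per keypoint
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, n_classes)

        def forward(self, kp):               # kp: (batch, frames, n_keypoints*2)
            h = self.encoder(self.embed(kp))
            return self.head(h.mean(dim=1))  # pool over frames, then classify

    logits = SignClassifier()(torch.randn(4, 64, n_keypoints * 2))  # dummy clip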
Guide Your Agent with Adaptive Multimodal Rewards
Developing an agent capable of adapting to unseen environments remains a
difficult challenge in imitation learning. In this work, we present Adaptive
Return-conditioned Policy (ARP), an efficient framework designed to enhance the
agent's generalization ability using natural language task descriptions and
pre-trained multimodal encoders. Our key idea is to calculate a similarity
between visual observations and natural language instructions in the
pre-trained multimodal embedding space (such as CLIP) and use it as a reward
signal. We then train a return-conditioned policy using expert demonstrations
labeled with multimodal rewards. Because the multimodal rewards provide
adaptive signals at each timestep, ARP effectively mitigates goal
misgeneralization. This results in superior generalization performance, even
when faced with unseen text instructions, compared to existing text-conditioned
policies. To improve the quality of rewards, we also introduce a fine-tuning
method for pre-trained multimodal encoders, further enhancing the performance.
Video demonstrations and source code are available on the project website:
https://sites.google.com/view/2023arp
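The reward computation lends itself to a short sketch: cosine similarity between the current observation and the instruction in a pre-trained multimodal embedding space. Hugging Face's CLIP is our stand-in here; the paper's exact encoder and checkpoint may differ.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def multimodal_reward(observation: Image.Image, instruction: str) -> float:
        """Per-timestep reward: image-text cosine similarity in CLIP space."""
        inputs = processor(text=[instruction], images=observation,
                           return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return (img * txt).sum().item()

    reward = multimodal_reward(Image.new("RGB", (224, 224)), "reach the goal")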
A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System
Natural Language Understanding (NLU) and Natural Language Generation (NLG)
are the two critical components of every conversational system, handling the
tasks of understanding the user by capturing the necessary information in the
form of slots and of generating an appropriate response in accordance with the
extracted information. Recently, dialogue systems integrated with complementary
information such as images, audio, or video have gained immense popularity. In
this work, we propose an end-to-end framework with the capability to extract
necessary slot values from the utterance and generate a coherent response,
thereby assisting the user to achieve their desired goals in a multimodal
dialogue system having both textual and visual information. The task of
extracting the necessary information is dependent not only on the text but also
on the visual cues present in the dialogue. Similarly, for the generation, the
previous dialog context comprising multimodal information is significant for
providing coherent and informative responses. We employ a multimodal
hierarchical encoder using pre-trained DialoGPT and also exploit the knowledge
base (KB) to provide stronger context for both tasks. We then design a slot
attention mechanism to focus on the necessary information in a given
utterance. Finally, a decoder generates the corresponding response for the
given dialogue context and the extracted slot values. Experimental results on the
Multimodal Dialogue Dataset (MMD) show that the proposed framework outperforms
the baseline approaches on both tasks. The code is available at
https://github.com/avinashsai/slot-gpt. Published in the journal Multimedia Tools and Applications.
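As a rough illustration of a slot attention mechanism of the kind described above (our sketch, not the released slot-gpt code): one learned query per slot attends over the encoder's token states to pool the information relevant to that slot.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SlotAttention(nn.Module):
        def __init__(self, n_slots: int, dim: int):
            super().__init__()
            self.slot_queries = nn.Parameter(torch.randn(n_slots, dim))

        def forward(self, tokens):            # tokens: (batch, seq, dim)
            scores = torch.einsum("sd,btd->bst", self.slot_queries, tokens)
            weights = F.softmax(scores / tokens.size(-1) ** 0.5, dim=-1)
            return torch.einsum("bst,btd->bsd", weights, tokens)  # per-slot vectors

    slot_values = SlotAttention(n_slots=5, dim=64)(torch.randn(2, 30, 64))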
Content Representation Learning for Cold-Start Video Recommendation Systems
Master's thesis -- Graduate School of Data Science, Seoul National University, February 2023. Advisor: Joonseok Lee.
Cold-start item recommendation is a long-standing challenge in recommendation
systems. A common approach to tackling the cold-start problem is content-based,
but in movie recommendation the rich information available in raw video
contents or textual descriptions has not been fully utilized. In this paper, we
propose a general cold-start recommendation framework that learns multimodal
content representations from the rich information in raw videos and text,
optimized directly over user-item interactions instead of using embeddings
pretrained on a proxy pretext task. In addition, we further exploit multimodal
alignment of the item contents in a self-supervised manner, revealing great
potential in content representation learning. Through extensive experiments on
public benchmarks, we verify the effectiveness of our method, achieving
state-of-the-art performance on cold-start movie recommendation.
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Problem Formulation and Notations
Chapter 4. Preliminary
Chapter 5. The Proposed Method
Chapter 6. Experimental Settings
Chapter 7. Results and Discussion
Chapter 8. Summary and Future Work
Bibliography
Abstract in Korean
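A minimal sketch of the thesis's core idea as stated in the abstract: score users directly against an item's multimodal content embedding, trained on user-item interactions (a BPR-style pairwise loss here), so cold-start items need no interaction history. All dimensions and names are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    dim = 64
    user_emb = nn.Embedding(1000, dim)      # learned per-user vectors
    content_proj = nn.Linear(512, dim)      # maps fused video+text features

    def score(user_ids, item_content):      # item_content: (batch, 512)
        return (user_emb(user_ids) * content_proj(item_content)).sum(-1)

    # Optimize directly over interactions rather than a proxy pretext task.
    users = torch.randint(0, 1000, (8,))
    pos, neg = torch.randn(8, 512), torch.randn(8, 512)   # dummy item features
    loss = -F.logsigmoid(score(users, pos) - score(users, neg)).mean()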
- …