Video-based Cross-modal Auxiliary Network for Multimodal Sentiment Analysis

Abstract

Multimodal sentiment analysis has a wide range of applications due to the information complementarity in multimodal interactions. Previous works focus mainly on investigating efficient joint representations, but they rarely consider insufficient unimodal feature extraction and the data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, which is comprised of an audio features map module and a cross-modal selection module. The first module is designed to substantially increase feature diversity in audio feature extraction, aiming to improve classification accuracy by providing more comprehensive acoustic representations. To empower the model to handle redundant visual features, the second module efficiently filters redundant visual frames when integrating audiovisual data. Moreover, a classifier group consisting of several image classification networks is introduced to predict sentiment polarities and emotion categories. Extensive experimental results on the RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN significantly outperforms state-of-the-art methods in improving the classification accuracy of multimodal sentiment analysis.
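The sketch below illustrates the pipeline the abstract describes: an audio features map module, a cross-modal selection module that drops redundant visual frames before fusion, and a classifier group producing the final prediction. It is a minimal, hedged reconstruction for orientation only; all layer choices, dimensions, the top-k frame-scoring heuristic, and the ensemble averaging are assumptions, not the paper's actual architecture.

```python
# Hedged sketch of the VCAN pipeline described in the abstract. Every module
# name, dimension, and the frame-scoring heuristic below is an assumption for
# illustration; the paper's actual layers and fusion strategy may differ.
import torch
import torch.nn as nn


class AudioFeatureMap(nn.Module):
    """Maps raw acoustic features to a richer representation (assumed design)."""

    def __init__(self, in_dim: int = 40, out_channels: int = 16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(in_dim, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, in_dim, time) -> (batch, out_channels, time)
        return self.proj(audio)


class CrossModalSelection(nn.Module):
    """Scores visual frames against an audio summary and keeps the top-k,
    i.e. filters redundant frames before audiovisual fusion (assumed scoring)."""

    def __init__(self, visual_dim: int = 512, audio_dim: int = 16, keep: int = 8):
        super().__init__()
        self.keep = keep
        self.score = nn.Linear(visual_dim + audio_dim, 1)

    def forward(self, visual: torch.Tensor, audio_map: torch.Tensor) -> torch.Tensor:
        # visual: (batch, frames, visual_dim); audio_map: (batch, channels, time)
        audio_summary = audio_map.mean(dim=-1)                    # (batch, channels)
        expanded = audio_summary.unsqueeze(1).expand(-1, visual.size(1), -1)
        scores = self.score(torch.cat([visual, expanded], dim=-1)).squeeze(-1)
        idx = scores.topk(self.keep, dim=1).indices               # kept frame indices
        return torch.gather(
            visual, 1, idx.unsqueeze(-1).expand(-1, -1, visual.size(-1))
        )


class ClassifierGroup(nn.Module):
    """Small ensemble standing in for the group of image classification networks."""

    def __init__(self, in_dim: int, num_classes: int, members: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(in_dim, num_classes) for _ in range(members))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # Average the member logits to obtain the final prediction.
        return torch.stack([h(fused) for h in self.heads]).mean(dim=0)


if __name__ == "__main__":
    batch, frames = 4, 20
    audio = torch.randn(batch, 40, 100)        # e.g. 40 mel bands over 100 steps
    visual = torch.randn(batch, frames, 512)   # e.g. per-frame CNN features

    audio_map = AudioFeatureMap()(audio)
    selected = CrossModalSelection()(visual, audio_map)
    fused = torch.cat([selected.mean(dim=1), audio_map.mean(dim=-1)], dim=-1)
    logits = ClassifierGroup(in_dim=fused.size(-1), num_classes=8)(fused)
    print(logits.shape)                        # (4, 8) sentiment/emotion logits
```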
