Multimodal sentiment analysis has a wide range of applications due to the complementary information available in multimodal interactions. Previous works focus on learning efficient joint representations, but they rarely consider insufficient unimodal feature extraction or the data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, comprising an audio features map module and a cross-modal selection module. The first module is designed to substantially increase feature diversity in audio feature extraction, aiming to improve classification accuracy by providing more comprehensive acoustic representations. To enable the model to handle redundant visual features, the second module efficiently filters out redundant visual frames when integrating audiovisual data. Moreover, a classifier group consisting of several image classification networks is introduced to predict sentiment polarities and emotion categories. Extensive experimental results on the RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN significantly outperforms state-of-the-art methods in the classification accuracy of multimodal sentiment analysis.
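To make the redundant-frame filtering idea concrete, the following is a minimal sketch, assuming a cosine-similarity criterion between consecutive visual feature vectors; the abstract does not specify VCAN's actual selection rule, so the `select_frames` helper, the similarity measure, and the `threshold` parameter are illustrative assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def select_frames(frames: torch.Tensor, threshold: float = 0.95) -> torch.Tensor:
    """Keep a frame only if it differs enough from the last kept frame.

    frames: (T, D) tensor of per-frame visual features.
    Returns the indices of the retained frames.
    """
    kept = [0]  # always keep the first frame
    for t in range(1, frames.size(0)):
        # Cosine similarity between this frame and the most recently kept one
        sim = F.cosine_similarity(frames[t], frames[kept[-1]], dim=0)
        if sim < threshold:  # frame adds new information; keep it
            kept.append(t)
    return torch.tensor(kept)

# Usage: 20 frames of 512-d features; near-duplicate frames are dropped,
# and the reduced sequence is what would enter audiovisual fusion.
feats = torch.randn(20, 512)
filtered = feats[select_frames(feats)]
```

A threshold-based pass like this keeps the visual sequence length adaptive to content: static scenes collapse to a few representative frames, while dynamic ones retain more, which is one plausible way to reduce redundancy before fusion.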