Vision-audio multimodal object recognition using hybrid and tensor fusion techniques

Ahmed, Md. Redwan; Haque, Rezaul; Rahman, S.M. Arafat; Reza, Ahmed Wasif; Siddique, Nazmul; Wang, Hui

research article

oai:pure.qub.ac.uk/portal:openaire/4a80b99c-aa8a-4ff2-8d24-2fef3082c880

Publication date: 1 February 2026

Abstract

Traditional fusion methods often encounter challenges related to temporal misalignment and signal variability, resulting in suboptimal performance. This study proposes a novel hybrid fusion model that integrates early and late fusion strategies to capture low-level feature interactions and high-level modality-specific abstractions. Advanced feature extraction techniques are employed to ensure robust multimodal representation: visual features are extracted using Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP), while audio features are processed using Spectral Centroid and Pitch-Synchronous Speech Features. Additionally, the Ridgelet Transform enhances spatial–temporal representation. A preprocessing pipeline further reduces data noise, applying Different Resolution Total Variation (DRTV) for visual noise suppression and Mel Frequency Cepstral Coefficients (MFCCs) for audio feature extraction. Furthermore, we incorporate an xLSTM-based hierarchical multi-scale temporal encoder in the audio branch and implement an attention-based fusion stream with Feature-wise Linear Modulation (FiLM) to align the modalities dynamically. Class imbalance is addressed by applying SMOTE in the latent feature space and using class-weighted cross-entropy loss to improve model sensitivity to minority classes. Evaluated on a collected dataset of 22,133 audio-visual samples across 21 object categories, our proposed fusion model achieves an F1 score of 97.89% and a PR AUC of 98.02%. The attention-based fusion variant converged in 14 epochs but required more resources, totaling 19.1M parameters, 9.26G FLOPs, and 14.6 ms of inference latency. In contrast, hybrid fusion with LSTM provided a more efficient option with 12.0M parameters, 4.73G FLOPs, and 9.0 ms latency, making it well suited to low-resource edge applications. These results demonstrate the proposed model's flexibility in real-time multimodal applications such as autonomous systems, surveillance, and recycling automation.
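As an illustration of the Feature-wise Linear Modulation (FiLM) step named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the common FiLM formulation, in which a conditioning vector (here a toy audio embedding) is mapped by two linear layers to a per-channel scale `gamma` and shift `beta` that are then applied to the other modality's features. The function names, the toy dimensions, and the choice of audio as the conditioning modality are all hypothetical; the abstract does not specify them.

```python
def linear(weights, bias, vec):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each feature channel using parameters
    predicted from the conditioning vector."""
    gamma = linear(w_gamma, b_gamma, condition)  # per-channel scale
    beta = linear(w_beta, b_beta, condition)     # per-channel shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Toy example (all values are illustrative assumptions).
visual = [1.0, 2.0, 3.0]   # three visual feature channels
audio = [0.5, -1.0]        # two-dimensional audio conditioning vector

w_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
b_beta = [0.0, 0.0, 0.5]

modulated = film(visual, audio, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [1.5, 1.0, 1.0]
```

In this toy case the audio vector yields `gamma = [1.5, 0.0, 0.5]` and `beta = [0.0, 1.0, -0.5]`, so the second visual channel is suppressed entirely and replaced by its shift. In a trained model the weights would be learned rather than set by hand.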

Last time updated on 15/09/2025

This paper was published in Queen's University Belfast Research Portal.

Licence: http://creativecommons.org/licenses/by/4.0/