Traditional fusion methods often struggle with temporal misalignment and signal variability, which degrades performance. This study proposes a novel hybrid fusion model that integrates early and late fusion strategies to capture both low-level feature interactions and high-level modality-specific abstractions. Advanced feature extraction techniques are employed to ensure robust multimodal representation: visual features are extracted using Scale-Invariant Feature Transform (SIFT) and Local Binary Patterns (LBP), while audio features are processed using Spectral Centroid and Pitch-Synchronous Speech Features. Additionally, the Ridgelet Transform enhances spatial–temporal representation. A preprocessing pipeline applies Different Resolution Total Variation (DRTV) to suppress visual noise and computes Mel Frequency Cepstral Coefficients (MFCCs) for audio feature extraction. Furthermore, we incorporate an xLSTM-based hierarchical multi-scale temporal encoder in the audio branch and implement an attention-based fusion stream with Feature-wise Linear Modulation (FiLM) for dynamic alignment across modalities. Class imbalance is addressed by applying SMOTE in the latent feature space and using class-weighted cross-entropy loss to improve model sensitivity to minority classes. Evaluated on a collected dataset of 22,133 audio-visual samples across 21 object categories, the proposed fusion model achieves an F1 score of 97.89% and a PR AUC of 98.02%. The attention-based fusion variant converged in 14 epochs but required more resources: 19.1M parameters, 9.26G FLOPs, and 14.6 ms of inference latency. In contrast, hybrid fusion with LSTM was more efficient, with 12.0M parameters, 4.73G FLOPs, and 9.0 ms latency, making it better suited to low-resource edge applications.
These results demonstrate the proposed model’s flexibility in real-time multimodal applications such as autonomous systems, surveillance, and recycling automation.
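As an illustration of two components named above, the following minimal sketch shows FiLM conditioning and a class-weighted cross-entropy loss in plain Python. It is not the authors' implementation: the function names, the choice of audio as the conditioning modality, the linear projections that produce the FiLM parameters, and the inverse-frequency weighting scheme are all assumptions made for the example.

```python
import math


def linear(weight, bias, x):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film_modulate(visual, audio, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each visual channel with parameters
    predicted from the audio vector (assumed conditioning direction)."""
    gamma = linear(w_gamma, b_gamma, audio)  # per-channel scale
    beta = linear(w_beta, b_beta, audio)     # per-channel shift
    return [g * v + b for v, g, b in zip(visual, gamma, beta)]


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting scheme: w_c = N / (K * n_c), so rare classes
    receive larger weights. Classes with no samples get weight 0."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of -log softmax(logits)[y], normalized by the sum
    of the per-sample class weights."""
    total, norm = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        total += weights[y] * (log_sum_exp - z[y])
        norm += weights[y]
    return total / norm
```

With labels `[0, 0, 0, 1]` and two classes, `inverse_frequency_weights` gives the minority class three times the weight of the majority class, so a misclassified minority sample contributes proportionally more to the loss.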