Towards Multi-Modal Data Classification

Abstract

A feature-fusion multi-modal neural network (MMN) is a network that combines different modalities at the feature level to perform a specific task. In this paper, we study the problem of training the fusion procedure of an MMN. A recent study found that training a multi-modal network with late fusion produces a network that has not learned proper parameters for feature extraction: such late-fusion models perform very well during training but fall short of their single-modality counterparts at test time. We hypothesize that jointly trained MMNs have a weight space that is too large for effective training. To remedy this problem, we design a set of procedures that systematically narrows the search space so that the optimizer considers only weights that are known to generalize well. As part of this narrowing procedure, we enforce a constraint on the weights between the pre-fusion and fusion layers. Because of this constraint, standard training methods cannot optimize the network without violating it. We therefore introduce a simplex projection module that is applied after each standard training step and re-projects the weights so that the constraint is enforced. The resulting framework, which we call the Projection Feature Mixture Model, outperforms both its single-modality counterparts and a standard jointly trained MMN. We also provide a theoretical analysis showing the advantages of utilizing MMNs.
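The abstract describes projecting constrained weights back onto the feasible set after each optimizer update. Below is a minimal sketch of that idea, assuming the constraint is that each fusion neuron's incoming weights lie on the probability simplex (non-negative, summing to one) and using the well-known sorting-based Euclidean projection of Duchi et al. (2008); the names `fusion_layer` and `constrained_step` are hypothetical placeholders, not the paper's actual implementation.

```python
import torch

def project_to_simplex(v: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of v onto the probability simplex
    {w : w_i >= 0, sum_i w_i = 1}, via the sorting method of
    Duchi et al. (2008)."""
    u, _ = torch.sort(v, descending=True)          # sort entries descending
    css = torch.cumsum(u, dim=0)                   # running sums of sorted entries
    k = torch.arange(1, v.numel() + 1, dtype=v.dtype, device=v.device)
    # rho = largest index j with u_j + (1 - sum_{i<=j} u_i) / j > 0
    rho = int((u + (1.0 - css) / k > 0).nonzero()[-1]) + 1
    lam = (1.0 - css[rho - 1]) / rho               # shift so weights sum to 1
    return torch.clamp(v + lam, min=0.0)           # clip negatives to zero

# Hypothetical usage: after a standard optimizer step, project the
# incoming weights of each fusion neuron back onto the simplex, so the
# constraint holds even though the optimizer itself ignores it.
def constrained_step(optimizer: torch.optim.Optimizer,
                     fusion_layer: torch.nn.Linear) -> None:
    optimizer.step()
    with torch.no_grad():
        W = fusion_layer.weight                    # (out_features, in_features)
        for i in range(W.size(0)):
            W[i] = project_to_simplex(W[i])
```

Alternating an unconstrained optimizer step with this projection is a standard projected-gradient pattern, which matches the abstract's description of re-enforcing the weight constraint after applying modern training frameworks.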
