In many signal processing applications, metadata may be advantageously used
in conjunction with a high dimensional signal to produce a desired output. In
the case of classical Sound Source Localization (SSL) algorithms, information
from high-dimensional, multichannel audio signals received by multiple
distributed microphones is combined with information describing the acoustic
properties of the scene, such as the microphones' spatial coordinates, to
estimate the position of a sound source. We introduce Dual Input Neural
Networks (DI-NNs) as a simple and effective way to model these two data types
in a neural network. We train and evaluate our proposed DI-NN on scenarios of
varying difficulty and realism and compare it against an alternative
architecture, a classical Least-Squares (LS) method, and a Convolutional
Recurrent Neural Network (CRNN). Our results show that the DI-NN
significantly outperforms the baselines, achieving a localization error five
times lower than that of the LS method and two times lower than that of the
CRNN on a test dataset of real recordings.
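To illustrate the dual-input idea only (this is not the authors' implementation, and all layer sizes, weights, and names below are hypothetical), the two data types can be sketched as separate branches whose outputs are fused before the final position estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # Fully connected layer with ReLU activation.
    return np.maximum(0.0, x @ w + b)

# Hypothetical dimensions: flattened multichannel signal features (256 values)
# and scene metadata, e.g. 4 microphones x 3 spatial coordinates (12 values).
signal_dim, meta_dim, hidden, out_dim = 256, 12, 32, 2

# Randomly initialised weights stand in for trained parameters.
w_sig = rng.standard_normal((signal_dim, hidden)) * 0.05
w_meta = rng.standard_normal((meta_dim, hidden)) * 0.05
w_out = rng.standard_normal((2 * hidden, out_dim)) * 0.05
b_sig, b_meta, b_out = np.zeros(hidden), np.zeros(hidden), np.zeros(out_dim)

def di_nn(signal_features, metadata):
    # Each input type is encoded by its own branch...
    h_sig = dense(signal_features, w_sig, b_sig)
    h_meta = dense(metadata, w_meta, b_meta)
    # ...then the branch outputs are concatenated and mapped to a
    # 2-D source-position estimate.
    fused = np.concatenate([h_sig, h_meta], axis=-1)
    return fused @ w_out + b_out

pos = di_nn(rng.standard_normal(signal_dim), rng.standard_normal(meta_dim))
print(pos.shape)  # (2,)
```

The key design choice sketched here is that the low-dimensional metadata is not appended to the raw signal but processed by its own branch, so each input type receives a representation suited to its dimensionality before fusion.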