    Exploiting temporal context in CNN based multisource DOA estimation

    Supervised learning methods are a powerful tool for direction of arrival (DOA) estimation because they can cope with adverse conditions where simplified models fail. In this work, we consider a previously proposed convolutional neural network (CNN) approach that estimates the DOAs for multiple sources from the phase spectra of the microphones. For speech specifically, the approach was shown to work well even when trained entirely on synthetically generated data. However, as each frame is processed separately, temporal context cannot be taken into account. This prevents the exploitation of interframe signal correlations and of the fact that DOAs do not change arbitrarily over time. We therefore consider two extensions of the CNN: the integration of either a long short-term memory (LSTM) layer or a temporal convolutional network (TCN). To incorporate temporal context, the training data generation framework needs to be adjusted. To obtain an easily parameterizable model, we propose to employ Markov chains to realize a gradual evolution of the source activity across time, frequency, and direction throughout a training sequence. A thorough evaluation demonstrates that the proposed configuration for generating training data is suitable for both single- and multi-talker localization. In particular, we note that with temporal context, it is important to use speech, or realistic signals in general, for the sources. Experiments with recorded impulse responses and noise reveal that the CNN with the LSTM extension outperforms all other considered approaches, including the plain CNN and the TCN extension.
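
    A minimal sketch, in Python, of the Markov-chain training-data idea described in the abstract: a two-state (active/inactive) chain per direction/frequency cell whose slow on/off dynamics produce a gradual evolution of source activity over a training sequence. The grid sizes and transition probabilities below are illustrative assumptions, not values from the paper.

    import numpy as np

    # Illustrative sizes and transition probabilities (assumptions, not the
    # paper's parameterization).
    N_FRAMES = 200   # frames in one training sequence
    N_DIRS = 37      # candidate DOAs, e.g. 0..180 degrees in 5-degree steps
    N_BANDS = 8      # coarse frequency bands
    P_ON = 0.05      # probability an inactive (dir, band) cell becomes active
    P_OFF = 0.02     # probability an active (dir, band) cell becomes inactive

    rng = np.random.default_rng(0)

    def evolve_activity(n_frames=N_FRAMES):
        """Two-state Markov chain per (direction, band) cell.

        Returns a boolean array of shape (n_frames, N_DIRS, N_BANDS) whose
        entries switch on and off gradually, mimicking sources that appear,
        persist for a while, and vanish, rather than changing arbitrarily
        from frame to frame.
        """
        state = np.zeros((N_DIRS, N_BANDS), dtype=bool)
        out = np.empty((n_frames, N_DIRS, N_BANDS), dtype=bool)
        for t in range(n_frames):
            u = rng.random(state.shape)
            turn_on = ~state & (u < P_ON)
            turn_off = state & (u < P_OFF)
            state = (state | turn_on) & ~turn_off
            out[t] = state
        return out

    activity = evolve_activity()
    print("mean fraction of active cells:", activity.mean())

    Small values of P_ON and P_OFF give long dwell times in each state, so the source activity, and hence the DOA labels, evolve gradually rather than jumping between frames.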

    Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking

    The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise, so the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking in Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. Two issues remain: the appropriate type of TF mask for this task is not obvious, and the DNN should estimate the TDoA values, yet existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame by frame. Combined with recurrent layers, it exploits the sequential progression of speaker-related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries at inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.
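
    As a rough illustration of how TF masking interacts with classical TDoA estimation, the sketch below weights the PHAT-normalized cross-power spectrum of a microphone pair by a per-frame mask before the inverse transform and peak picking. In the paper the mask and the TDoA estimates are learned jointly inside the DNN; here the mask is an external input, and the function name and STFT parameters are illustrative assumptions.

    import numpy as np

    def masked_gcc_phat(x1, x2, mask, n_fft=512, hop=256):
        """Frame-wise GCC-PHAT with a TF mask on the cross-spectrum.

        x1, x2 : equal-length microphone signals; must cover at least
                 (mask.shape[0] - 1) * hop + n_fft samples.
        mask   : (n_frames, n_fft // 2 + 1) weights in [0, 1]; in the paper
                 this would be produced by the DNN.
        Returns one TDoA estimate per frame, in samples.
        """
        win = np.hanning(n_fft)
        taus = np.empty(mask.shape[0])
        for t in range(mask.shape[0]):
            s = t * hop
            f1 = np.fft.rfft(win * x1[s:s + n_fft])
            f2 = np.fft.rfft(win * x2[s:s + n_fft])
            cross = f1 * np.conj(f2)
            # PHAT weighting keeps phase only; the mask then down-weights
            # TF bins dominated by interference or noise.
            cross = mask[t] * cross / (np.abs(cross) + 1e-12)
            cc = np.fft.irfft(cross, n=n_fft)
            # Reorder so that lag 0 sits in the middle, then pick the peak.
            cc = np.concatenate((cc[-(n_fft // 2):], cc[:n_fft // 2 + 1]))
            taus[t] = np.argmax(np.abs(cc)) - n_fft // 2
        return taus

    Down-weighting non-speech TF bins before the inverse transform suppresses spurious peaks in the correlation function, which is the effect the proposed integrated formulation aims to learn end to end.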