Search CORE

1 research outputs found

Machine learning and inferencing for the decomposition of speech mixtures

Author: Zhang Wenyang
Publication venue
Publication date: 29/01/2020
Field of study

In this dissertation, we present and evaluate a novel approach for incorporating machine learning and inferencing into the time-frequency decomposition of speech signals in the context of speaker-independent multi-speaker pitch tracking. The pitch tracking performance of the resulting algorithm is comparable to that of a state-of-the-art machine-learning algorithm for multi-pitch tracking while being significantly more computationally efficient and requiring much less training data. Multi-pitch tracking is a time-frequency signal processing problem in which mutual interferences of the harmonics from different speakers make it challenging to design an algorithm to reliably estimate the fundamental frequency trajectories of the individual speakers. The current state-of-the-art in speaker-independent multi-pitch tracking utilizes 1) a deep neural network for producing spectrograms of individual speakers and 2) another deep neural network that acts upon the individual spectrograms and the original audio’s spectrogram to produce estimates of the pitch tracks of the individual speakers. However, the implementation of this Multi-Spectrogram Machine- Learning (MS-ML) algorithm could be computationally intensive and make it impractical for hardware platforms such as embedded devices where the computational power is limited. Instead of utilizing deep neural networks to estimate the pitch values directly, we have derived and evaluated a fault recognition and diagnosis (FRD) framework that utilizes machine learning and inferencing techniques to recognize potential faults in the pitch tracks produced by a traditional multi-pitch tracking algorithm. The result of this fault-recognition phase is then used to trigger a fault-diagnosis phase aimed at resolving the recognized fault(s) through adaptive adjustment of the time-frequency analysis of the input signal. The pitch estimates produced by the resulting FRD-ML algorithm are found to be comparable in accuracy to those produced via the MS-ML algorithm. However, our evaluation of the FRD-ML algorithm shows it to have significant advantages over the MS-ML algorithm. Specifically, the number of multiplications per second in FRD-ML is found to be two orders of magnitude less while the number of additions per second is about the same as in the MS-ML algorithm. Furthermore, the required amount of training data to achieve optimal performance is found to be two orders of magnitude less for the FRD-ML algorithm in comparison to the MS-ML algorithm. The reduction in the number of multiplications per second means it is more feasible to implement the MPT solution on hardware platforms with limited computational power such as embedded devices rather than relying on Graphics Processing Units (GPUs) or cloud computing. The reduction in training data size makes the algorithm more flexible in terms of configuring for different application scenarios such as training for different languages where there may not be a large amount of training data

Boston University Institutional Repository (OpenBU)