47 research outputs found

    UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD

    Get PDF
    Performance of automatic speaker verification (ASV) systems is very sensitive to mismatch between training (source) and testing (target) domains. The best way to address domain mismatch is to perform matched condition training – gather sufficient labeled samples from the target domain and use them in training. However, in many cases this is too expensive or impractical. Usually, gaining access to unlabeled target domain data, e.g., from open source online media, and labeled data from other domains is more feasible. This work focuses on making ASV systems robust to uncontrolled (‘wild’) conditions, with the help of some unlabeled data acquired from such conditions. Given acoustic features from both domains, we propose learning a mapping function – a deep convolutional neural network (CNN) with an encoder-decoder architecture – between features of both the domains. We explore training the network in two different scenarios: training on paired speech samples from both domains and training on unpaired data. In the former case, where the paired data is usually obtained via simulation, the CNN is treated as a nonii ABSTRACT linear regression function and is trained to minimize L2 loss between original and predicted features from target domain. We provide empirical evidence that this approach introduces distortions that affect verification performance. To address this, we explore training the CNN using adversarial loss (along with L2), which makes the predicted features indistinguishable from the original ones, and thus, improve verification performance. The above framework using simulated paired data, though effective, cannot be used to train the network on unpaired data obtained by independently sampling speech from both domains. In this case, we first train a CNN using adversarial loss to map features from target to source. We, then, map the predicted features back to the target domain using an auxiliary network, and minimize a cycle-consistency loss between the original and reconstructed target features. Our unsupervised adaptation approach complements its supervised counterpart, where adaptation is done using labeled data from both domains. We focus on three domain mismatch scenarios: (1) sampling frequency mismatch between the domains, (2) channel mismatch, and (3) robustness to far-field and noisy speech acquired from wild conditions

    Artificial Bandwidth Extension of Speech Signals using Neural Networks

    Get PDF
    Although mobile wideband telephony has been standardized for over 15 years, many countries still do not have a nationwide network with good coverage. As a result, many cellphone calls are still downgraded to narrowband telephony. The resulting loss of quality can be reduced by artificial bandwidth extension. There has been great progress in bandwidth extension in recent years due to the use of neural networks. The topic of this thesis is the enhancement of artificial bandwidth extension using neural networks. A special focus is given to hands-free calls in a car, where the risk is high that the wideband connection is lost due to the fast movement. The bandwidth of narrowband transmission is not only reduced towards higher frequencies above 3.5 kHz but also towards lower frequencies below 300 Hz. There are already methods that estimate the low-frequency components quite well, which will therefore not be covered in this thesis. In most bandwidth extension algorithms, the narrowband signal is initially separated into a spectral envelope and an excitation signal. Both parts are then extended separately in order to finally combine both parts again. While the extension of the excitation can be implemented using simple methods without reducing the speech quality compared to wideband speech, the estimation of the spectral envelope for frequencies above 3.5 kHz is not yet solved satisfyingly. Current bandwidth extension algorithms are just able to reduce the quality loss due to narrowband transmission by a maximum of 50% in most evaluations. In this work, a modification for an existing method for excitation extension is proposed which achieves slight improvements while not generating additional computational complexity. In order to enhance the wideband envelope estimation with neural networks, two modifications of the training process are proposed. On the one hand, the loss function is extended with a discriminative part to address the different characteristics of phoneme classes. On the other hand, by using a GAN (generative adversarial network) for the training phase, a second network is added temporarily to evaluate the quality of the estimation. The neural networks that were trained are compared in subjective and objective evaluations. A final listening test addressed the scenario of a hands-free call in a car, which was simulated acoustically. The quality loss caused by the missing high frequency components could be reduced by 60% with the proposed approach.Obwohl die mobile Breitbandtelefonie bereits seit über 15 Jahren standardisiert ist, gibt es oftmals noch kein flächendeckendes Netz mit einer guten Abdeckung. Das führt dazu, dass weiterhin viele Mobilfunkgespräche auf Schmalbandtelefonie heruntergestuft werden. Der damit einhergehende Qualitätsverlust kann mit künstlicher Bandbreitenerweiterung reduziert werden. Das Thema dieser Arbeit sind Methoden zur weiteren Verbesserungen der Qualität des erweiterten Sprachsignals mithilfe neuronaler Netze. Ein besonderer Fokus liegt auf der Freisprech-Telefonie im Auto, da dabei das Risiko besonders hoch ist, dass durch die schnelle Fortbewegung die Breitbandverbindung verloren geht. Bei der Schmalbandübertragung fehlen neben den hochfrequenten Anteilen (etwa 3.5–7 kHz) auch tiefe Frequenzen unterhalb von etwa 300 Hz. Diese tieffrequenten Anteile können mit bereits vorhandenen Methoden gut geschätzt werden und sind somit nicht Teil dieser Arbeit. In vielen Algorithmen zur Bandbreitenerweiterung wird das Schmalbandsignal zu Beginn in eine spektrale Einhüllende und ein Anregungssignal aufgeteilt. Beide Anteile werden dann separat erweitert und schließlich wieder zusammengeführt. Während die Erweiterung der Anregung nahezu ohne Qualitätsverlust durch einfache Methoden umgesetzt werden kann ist die Schätzung der spektralen Einhüllenden für Frequenzen über 3.5 kHz noch nicht zufriedenstellend gelöst. Mit aktuellen Methoden können im besten Fall nur etwa 50% der durch Schmalbandübertragung reduzierten Qualität zurückgewonnen werden. Für die Anregungserweiterung wird in dieser Arbeit eine Variation vorgestellt, die leichte Verbesserungen erzielt ohne dabei einen Mehraufwand in der Berechnung zu erzeugen. Für die Schätzung der Einhüllenden des Breitbandsignals mithilfe neuronaler Netze werden zwei Änderungen am Trainingsprozess vorgeschlagen. Einerseits wird die Kostenfunktion um einen diskriminativen Anteil erweitert, der das Netz besser zwischen verschiedenen Phonemen unterscheiden lässt. Andererseits wird als Architektur ein GAN (Generative adversarial network) verwendet, wofür in der Trainingsphase ein zweites Netz verwendet wird, das die Qualität der Schätzung bewertet. Die trainierten neuronale Netze wurden in subjektiven und objektiven Tests verglichen. Ein abschließender Hörtest diente zur Evaluierung des Freisprechens im Auto, welches akustisch simuliert wurde. Der Qualitätsverlust durch Wegfallen der hohen Frequenzanteile konnte dabei mit dem vorgeschlagenen Ansatz um etwa 60% reduziert werden

    Robust Phase-based Speech Signal Processing From Source-Filter Separation to Model-Based Robust ASR

    Get PDF
    The Fourier analysis plays a key role in speech signal processing. As a complex quantity, it can be expressed in the polar form using the magnitude and phase spectra. The magnitude spectrum is widely used in almost every corner of speech processing. However, the phase spectrum is not an obviously appealing start point for processing the speech signal. In contrast to the magnitude spectrum whose fine and coarse structures have a clear relation to speech perception, the phase spectrum is difficult to interpret and manipulate. In fact, there is not a meaningful trend or extrema which may facilitate the modelling process. Nonetheless, the speech phase spectrum has recently gained renewed attention. An expanding body of work is showing that it can be usefully employed in a multitude of speech processing applications. Now that the potential for the phase-based speech processing has been established, there is a need for a fundamental model to help understand the way in which phase encodes speech information. In this thesis a novel phase-domain source-filter model is proposed that allows for deconvolution of the speech vocal tract (filter) and excitation (source) components through phase processing. This model utilises the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain and provides a framework for efficiently segregating the source and filter components through phase manipulation. To investigate the efficacy of the suggested approach, a set of features is extracted from the phase filter part for automatic speech recognition (ASR) and the source part of the phase is utilised for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed. In addition, the proposed approach is improved by replacing the log with the generalised logarithmic function in the Hilbert transform and also by computing the group delay via regression filter. Furthermore, statistical distribution of the phase spectrum and its representations along the feature extraction pipeline are studied. It is illustrated that the phase spectrum has a bell-shaped distribution. Some statistical normalisation methods such as mean-variance normalisation, Laplacianisation, Gaussianisation and Histogram equalisation are successfully applied to the phase-based features and lead to a significant robustness improvement. The robustness gain achieved through using statistical normalisation and generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as vector Taylor Series (VTS). VTS in its original formulation assumes usage of the log function for compression. In order to simultaneously take advantage of the VTS and generalised logarithmic function, a new formulation is first developed to merge both into a unified framework called generalised VTS (gVTS). Also in order to leverage the gVTS framework, a novel channel noise estimation method is developed. The extensions of the gVTS framework and the proposed channel estimation to the group delay domain are then explored. The problems it presents are analysed and discussed, some solutions are proposed and finally the corresponding formulae are derived. Moreover, the effect of additive noise and channel distortion in the phase and group delay domains are scrutinised and the results are utilised in deriving the gVTS equations. Experimental results in the Aurora-4 ASR task in an HMM/GMM set up along with a DNN-based bottleneck system in the clean and multi-style training modes confirmed the efficacy of the proposed approach in dealing with both additive and channel noise
    corecore