489 research outputs found

    Artificial Bandwidth Extension of Speech Signals using Neural Networks

    Although mobile wideband telephony has been standardized for over 15 years, many countries still do not have a nationwide network with good coverage. As a result, many cellphone calls are still downgraded to narrowband telephony. The resulting loss of quality can be reduced by artificial bandwidth extension. There has been great progress in bandwidth extension in recent years due to the use of neural networks. The topic of this thesis is the enhancement of artificial bandwidth extension using neural networks. A special focus is given to hands-free calls in a car, where the risk of losing the wideband connection is high because of the fast movement. Narrowband transmission limits the bandwidth not only at high frequencies above 3.5 kHz (up to about 7 kHz) but also at low frequencies below 300 Hz. Methods already exist that estimate the low-frequency components quite well, so they are not covered in this thesis. In most bandwidth extension algorithms, the narrowband signal is first separated into a spectral envelope and an excitation signal. Both parts are then extended separately and finally recombined. While the excitation can be extended with simple methods without reducing the speech quality compared to wideband speech, the estimation of the spectral envelope for frequencies above 3.5 kHz has not yet been solved satisfactorily. In most evaluations, current bandwidth extension algorithms reduce the quality loss due to narrowband transmission by at most 50%. In this work, a modification of an existing method for excitation extension is proposed which achieves slight improvements without adding computational complexity. To enhance the wideband envelope estimation with neural networks, two modifications of the training process are proposed. First, the loss function is extended with a discriminative part to address the different characteristics of phoneme classes. Second, a GAN (generative adversarial network) is used in the training phase, which temporarily adds a second network that evaluates the quality of the estimation. The trained neural networks are compared in subjective and objective evaluations. A final listening test addressed the scenario of a hands-free call in a car, which was simulated acoustically. With the proposed approach, the quality loss caused by the missing high-frequency components could be reduced by about 60%.
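    The envelope/excitation split described in this abstract can be sketched with linear prediction. This is only an illustration under assumptions: the thesis does not state here which analysis method it uses, and the function names, the toy frame, and the prediction order of 4 are hypothetical choices.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation r[0..max_lag] of one frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for the prediction-error filter a (a[0] = 1), which models the envelope."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def excitation(x, a):
    """Inverse-filter the frame: e[n] = sum_j a[j] * x[n - j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

def synthesize(e, a):
    """Recombine excitation and envelope: x[n] = e[n] - sum_{j>=1} a[j] * x[n - j]."""
    x = []
    for n in range(len(e)):
        x.append(e[n] - sum(a[j] * x[n - j]
                            for j in range(1, len(a)) if n - j >= 0))
    return x

# Toy frame (assumption): two sinusoids plus a little noise.
rng = random.Random(0)
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + 0.05 * rng.gauss(0, 1)
         for n in range(200)]
coeffs, _ = levinson_durbin(autocorrelation(frame, 4), 4)
exc = excitation(frame, coeffs)
rebuilt = synthesize(exc, coeffs)
```

    In a bandwidth extension system, `coeffs` and `exc` would each be extended to the wideband range before the synthesis step; here they are recombined unchanged, so `rebuilt` matches `frame` up to rounding.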

    Speech Enhancement Exploiting the Source-Filter Model

    Everyday life is hardly imaginable today without mobile telephony. Calls are made in every conceivable situation and environment. Hence, the microphone picks up not only the user’s speech but also sound from the surroundings, which is likely to impede the understanding of the conversational partner. Modern speech enhancement systems are able to mitigate such effects, and most users are not even aware of their existence. In this thesis the development of a modern single-channel speech enhancement approach is presented, which uses the divide-and-conquer principle to combat environmental noise in microphone signals. Though initially motivated by mobile telephony applications, this approach can be applied whenever speech is to be retrieved from a corrupted signal. The approach uses the so-called source-filter model to divide the problem into two subproblems, which are then conquered by enhancing the source (the excitation signal) and the filter (the spectral envelope) separately. Both enhanced signals are then used to denoise the corrupted signal. The estimation of spectral envelopes has a long history, and some approaches already exist for speech enhancement. However, they typically neglect the excitation signal, which makes them unable to enhance the spectral fine structure properly. Both individual enhancement approaches exploit benefits of the cepstral domain, which offers, e.g., advantageous mathematical properties and straightforward synthesis of excitation-like signals. We investigate traditional model-based schemes such as Gaussian mixture models (GMMs), classical signal processing-based approaches, and modern deep neural network (DNN)-based approaches in this thesis. The enhanced signals are not used directly to enhance the corrupted signal (e.g., to synthesize a clean speech signal) but as a so-called a priori signal-to-noise ratio (SNR) estimate in a traditional statistical speech enhancement system. Such a traditional system consists of a noise power estimator, an a priori SNR estimator, and a spectral weighting rule that is usually driven by the results of these two estimators and is then applied to retrieve the clean speech estimate from the noisy observation. As a result, the new approach obtains significantly higher noise attenuation than current state-of-the-art systems while maintaining comparable speech component quality and speech intelligibility. Consequently, the overall quality of the enhanced speech signal is superior to that of state-of-the-art speech enhancement approaches.
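    The chain of noise power estimate, a priori SNR estimate, and spectral weighting rule named in this abstract can be sketched as follows. The Wiener rule used here is an assumption chosen for brevity; the abstract does not say which weighting rule the thesis employs, and all names and numbers are hypothetical.

```python
def a_priori_snr(clean_psd_est, noise_psd_est, xi_min=1e-3):
    """Per-bin a priori SNR: estimated clean speech power over estimated noise power."""
    return [max(s / max(n, 1e-12), xi_min)
            for s, n in zip(clean_psd_est, noise_psd_est)]

def wiener_gain(xi):
    """Wiener spectral weighting rule (assumed here): G = xi / (1 + xi)."""
    return xi / (1.0 + xi)

def enhance_frame(noisy_spectrum, clean_psd_est, noise_psd_est):
    """Apply the gain to each noisy spectral bin to estimate the clean spectrum."""
    xi = a_priori_snr(clean_psd_est, noise_psd_est)
    return [wiener_gain(x) * y for x, y in zip(xi, noisy_spectrum)]

# Two toy bins: one dominated by speech, one dominated by noise.
noisy = [complex(2.0, 0.0), complex(0.0, 1.0)]
clean_psd = [4.0, 0.01]   # would come from the enhanced envelope and excitation
noise_psd = [0.04, 1.0]   # would come from the noise power estimator
enhanced = enhance_frame(noisy, clean_psd, noise_psd)
```

    The speech-dominated bin passes almost unchanged, while the noise-dominated bin is strongly attenuated; the floor `xi_min` only keeps the gain from collapsing to zero.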

    Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

    Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, in this paper we propose MP-SENet, a novel Speech Enhancement Network which explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by time-frequency Transformers along both time and frequency dimensions. The encoder aims to encode time-frequency representations derived from the input distorted magnitude and phase spectra. The decoder comprises dual-stream magnitude and phase decoders, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude estimation architecture and a phase parallel estimation architecture, respectively. To train the MP-SENet model effectively, we define multi-level loss functions, including mean square error and perceptual metric loss of magnitude spectra, anti-wrapping loss of phase spectra, as well as mean square error and consistency loss of short-time complex spectra. Experimental results demonstrate that our proposed MP-SENet excels in high-quality speech enhancement across multiple tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it successfully avoids the bidirectional compensation effect between the magnitude and phase, leading to a better harmonic restoration. Notably, for the speech denoising task, the MP-SENet yields a state-of-the-art performance with a PESQ of 3.60 on the public VoiceBank+DEMAND dataset. Comment: Submitted to IEEE Transactions on Audio, Speech and Language Processing
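    The idea behind a wrapping-aware phase loss can be illustrated with a short sketch. The function below is one plausible form of an anti-wrapping distance and is an assumption, not necessarily the exact loss defined in the paper; `anti_wrap` and `phase_loss` are hypothetical names.

```python
import math

def anti_wrap(x):
    """Map a phase difference to its distance from the nearest multiple of 2*pi."""
    return abs(x - 2.0 * math.pi * round(x / (2.0 * math.pi)))

def phase_loss(pred, target):
    """Mean wrapping-aware distance between predicted and target phases."""
    return sum(anti_wrap(p - t) for p, t in zip(pred, target)) / len(pred)

# Phases 3.1 and -3.1 rad are close on the unit circle although they differ by 6.2.
close_on_circle = phase_loss([3.1], [-3.1])
```

    A plain absolute difference would report 6.2 for this pair, whereas the wrapping-aware distance reports about 0.083, which is why naive regression losses on wrapped phase can be misleading.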

    Generating intelligible audio speech from visual speech

    This work is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model (AAM) visual features. Two further methods are then developed to incorporate temporal information into the prediction - a feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimised through objective tests before applying subjective intelligibility tests that determine a word accuracy of 85% from a set of human listeners on the GRID audio-visual speech database. This compares favourably with a previous regression-based system that serves as a baseline which achieved a word accuracy of 33%

    UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD

    Performance of automatic speaker verification (ASV) systems is very sensitive to mismatch between training (source) and testing (target) domains. The best way to address domain mismatch is to perform matched-condition training: gather sufficient labeled samples from the target domain and use them in training. However, in many cases this is too expensive or impractical. Usually, gaining access to unlabeled target domain data, e.g., from open source online media, and labeled data from other domains is more feasible. This work focuses on making ASV systems robust to uncontrolled (‘wild’) conditions, with the help of some unlabeled data acquired from such conditions. Given acoustic features from both domains, we propose learning a mapping function, a deep convolutional neural network (CNN) with an encoder-decoder architecture, between the features of the two domains. We explore training the network in two different scenarios: training on paired speech samples from both domains and training on unpaired data. In the former case, where the paired data is usually obtained via simulation, the CNN is treated as a non-linear regression function and is trained to minimize the L2 loss between original and predicted features from the target domain. We provide empirical evidence that this approach introduces distortions that affect verification performance. To address this, we explore training the CNN using an adversarial loss (along with L2), which makes the predicted features indistinguishable from the original ones and thus improves verification performance. The above framework using simulated paired data, though effective, cannot be used to train the network on unpaired data obtained by independently sampling speech from both domains. In this case, we first train a CNN using an adversarial loss to map features from target to source. We then map the predicted features back to the target domain using an auxiliary network, and minimize a cycle-consistency loss between the original and reconstructed target features. Our unsupervised adaptation approach complements its supervised counterpart, where adaptation is done using labeled data from both domains. We focus on three domain mismatch scenarios: (1) sampling frequency mismatch between the domains, (2) channel mismatch, and (3) robustness to far-field and noisy speech acquired from wild conditions.
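    The cycle-consistency term described in this abstract can be sketched with toy mappings. Everything below is an assumption for illustration: the real mappings are CNNs trained with an adversarial loss, the abstract does not state which norm is used, and the affine functions here are hypothetical stand-ins.

```python
def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def cycle_consistency_loss(target_feats, to_source, to_target):
    """Average error between target features and their target->source->target reconstruction."""
    return sum(mse(x, to_target(to_source(x))) for x in target_feats) / len(target_feats)

# Hypothetical stand-ins for the two networks.
def to_source(x):
    return [2.0 * v + 1.0 for v in x]

def exact_inverse(y):
    return [(v - 1.0) / 2.0 for v in y]

def poor_inverse(y):
    return [v / 2.0 for v in y]

feats = [[0.0, 1.0], [2.0, -1.0]]
loss_exact = cycle_consistency_loss(feats, to_source, exact_inverse)
loss_poor = cycle_consistency_loss(feats, to_source, poor_inverse)
```

    The loss is zero only when the auxiliary mapping undoes the first one, which is what constrains the target-to-source mapping when no paired data is available.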

    Single-Microphone Speech Enhancement and Separation Using Deep Learning


    Neural networks for optical channel equalization in high speed communication systems

    Future demand for data bandwidth will surpass the capabilities of current optical communication systems, which are approaching their limits due to the electrical bandwidth limitations of the transmitter components. Inter-symbol interference (ISI) due to this band limitation is the major degradation factor in achieving high data rates. In this thesis, we investigate several neural network (NN) techniques to combat the physical limits of transmitter components driven at high data rates and exploiting advanced modulation formats with coherent detection. Our main focus with NNs as ISI channel equalizers is to overcome the limitations of conventional optimal receivers by providing lower, scalable complexity and a near-optimal solution. We propose a novel deep bidirectional long short-term memory (BiLSTM) architecture that is effective in mitigating severe ISI caused by bandlimited components. For the first time, we demonstrate via simulation that our proposed deep BiLSTM achieves the same bit error rate (BER) performance as an optimal maximum likelihood sequence estimator (MLSE) for QPSK modulation. Because NNs are data-driven models, their performance depends strongly on input data quality. We demonstrate how the achievable deep BiLSTM performance degrades as the modulation order increases. We also examine the impact of ISI severity and channel memory length on deep BiLSTM performance. We investigate the performance of various synthetic band-limited channels along with a measured optical channel at 100 Gbaud using a 35 GHz silicon photonic (SiP) modulator. The ISI severity of these channels is quantified with a new graphical view of performance based on the baseline performance gaps between conventional linear and nonlinear optimal solutions. At QAM orders above QPSK, we quantify the deep BiLSTM performance deviation from the optimal MLSE as ISI severity increases. While the deep BiLSTM approaches the optimal MLSE performance at 8QAM and 16QAM with a penalty, it is able to greatly surpass the linear optimal solution at 32QAM. More importantly, the advantage of using self-learning models like NNs is their ability to learn the channel during training, while the optimal MLSE requires accurate channel state information.
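    The effect of ISI on a band-limited channel can be illustrated with a small noise-free simulation. This is a sketch under assumptions: the three channel taps are hypothetical and not taken from the thesis, and the receiver is a plain symbol-by-symbol detector rather than an MLSE or BiLSTM equalizer.

```python
import random

SCALE = 2 ** -0.5  # unit-energy QPSK

def qpsk_symbols(n, rng):
    """Draw n random QPSK symbols."""
    return [complex(rng.choice((-SCALE, SCALE)), rng.choice((-SCALE, SCALE)))
            for _ in range(n)]

def apply_channel(symbols, taps):
    """Convolve the symbol stream with the channel impulse response (causal FIR)."""
    return [sum(taps[k] * symbols[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(symbols))]

def hard_decision(y):
    """Symbol-by-symbol detector that ignores channel memory."""
    return complex(SCALE if y.real >= 0 else -SCALE,
                   SCALE if y.imag >= 0 else -SCALE)

def symbol_error_rate(tx, rx):
    return sum(hard_decision(r) != t for t, r in zip(tx, rx)) / len(tx)

rng = random.Random(0)
tx = qpsk_symbols(2000, rng)
ser_clean = symbol_error_rate(tx, apply_channel(tx, [1.0]))
ser_isi = symbol_error_rate(tx, apply_channel(tx, [1.0, 0.6, 0.6]))
```

    Even without noise, the memoryless detector makes many errors on the ISI channel because the two trailing taps together outweigh the main tap. Sequence-based equalizers such as the MLSE or BiLSTM named in the abstract are meant to close this gap by exploiting channel memory.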