63 research outputs found

    Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

    Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. MP-SENet adopts a codec architecture in which the encoder and decoder are bridged by time-frequency Transformers along both the time and frequency dimensions. The encoder encodes time-frequency representations derived from the input distorted magnitude and phase spectra. The decoder comprises dual-stream magnitude and phase decoders, which directly enhance the magnitude and wrapped phase spectra by incorporating a magnitude estimation architecture and a phase parallel estimation architecture, respectively. To train the MP-SENet model effectively, we define multi-level loss functions, including mean square error and perceptual metric loss of magnitude spectra, anti-wrapping loss of phase spectra, as well as mean square error and consistency loss of short-time complex spectra. Experimental results demonstrate that MP-SENet excels in high-quality speech enhancement across multiple tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it successfully avoids the bidirectional compensation effect between magnitude and phase, leading to better harmonic restoration. Notably, for the speech denoising task, MP-SENet yields state-of-the-art performance with a PESQ of 3.60 on the public VoiceBank+DEMAND dataset. Comment: Submitted to IEEE Transactions on Audio, Speech and Language Processing
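
The abstract mentions an anti-wrapping loss for phase spectra but does not give its formula. The sketch below shows one common way to make an L1 phase error insensitive to 2π wrapping; the function names and the restriction to an instantaneous-phase term are assumptions for illustration, not the authors' confirmed definition.

```python
import numpy as np

def anti_wrap(x):
    """Fold a phase difference (radians) onto its principal-value magnitude in [0, pi]."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def instantaneous_phase_loss(phase_est, phase_ref):
    """Mean anti-wrapped error between two wrapped phase spectra (hypothetical helper)."""
    return float(np.mean(anti_wrap(phase_est - phase_ref)))

# Phases near +pi and -pi are almost the same angle on the unit circle.
ref = np.array([[3.10, -3.10], [0.00, 1.00]])
est = np.array([[-3.10, 3.10], [0.00, 1.00]])

naive = float(np.mean(np.abs(est - ref)))      # plain L1 penalises the wrap heavily
wrapped = instantaneous_phase_loss(est, ref)   # anti-wrapped L1 stays small
```

Here the plain L1 error is large because two bins differ by almost 2π, while the anti-wrapped error treats them as nearly identical angles.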

    Artificial Bandwidth Extension of Speech Signals using Neural Networks

    Although mobile wideband telephony has been standardized for over 15 years, many countries still do not have a nationwide network with good coverage. As a result, many cellphone calls are still downgraded to narrowband telephony. The resulting loss of quality can be reduced by artificial bandwidth extension. There has been great progress in bandwidth extension in recent years due to the use of neural networks. The topic of this thesis is the enhancement of artificial bandwidth extension using neural networks. A special focus is given to hands-free calls in a car, where the risk is high that the wideband connection is lost due to the fast movement. The bandwidth of narrowband transmission is reduced not only at higher frequencies above 3.5 kHz but also at lower frequencies below 300 Hz. There are already methods that estimate the low-frequency components quite well, so these are not covered in this thesis. In most bandwidth extension algorithms, the narrowband signal is initially separated into a spectral envelope and an excitation signal. Both parts are then extended separately and finally recombined. While the extension of the excitation can be implemented using simple methods without reducing the speech quality compared to wideband speech, the estimation of the spectral envelope for frequencies above 3.5 kHz has not yet been solved satisfactorily. Current bandwidth extension algorithms reduce the quality loss due to narrowband transmission by at most 50% in most evaluations. In this work, a modification of an existing method for excitation extension is proposed which achieves slight improvements without additional computational complexity. In order to enhance the wideband envelope estimation with neural networks, two modifications of the training process are proposed.
On the one hand, the loss function is extended with a discriminative part to address the different characteristics of phoneme classes. On the other hand, by using a GAN (generative adversarial network) for the training phase, a second network is added temporarily to evaluate the quality of the estimation. The neural networks that were trained are compared in subjective and objective evaluations. A final listening test addressed the scenario of a hands-free call in a car, which was simulated acoustically. The quality loss caused by the missing high frequency components could be reduced by 60% with the proposed approach.
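
The abstract states that the excitation can be extended with simple methods but does not name one. As an illustration only, the sketch below uses spectral folding (zero insertion), a classic technique from the bandwidth extension literature; whether the thesis builds on this method is an assumption, and the function name and test signal are hypothetical.

```python
import numpy as np

def spectral_folding(excitation_nb):
    """Double the sampling rate by zero insertion, which mirrors the spectrum
    around the old Nyquist frequency (hypothetical helper)."""
    wideband = np.zeros(2 * len(excitation_nb))
    wideband[::2] = excitation_nb
    return wideband

fs_nb = 8000                                  # assumed narrowband sampling rate
n = 256
t = np.arange(n) / fs_nb
tone = np.cos(2 * np.pi * 1000 * t)           # 1 kHz narrowband component

wide = spectral_folding(tone)                 # now interpreted at 16 kHz
spectrum = np.abs(np.fft.rfft(wide))
freqs = np.fft.rfftfreq(len(wide), d=1.0 / (2 * fs_nb))

# The two strongest bins: the original tone and its mirrored image.
peaks = sorted(freqs[np.argsort(spectrum)[-2:]])
```

The 1 kHz component reappears as a mirror image at 7 kHz, so the upper band is filled with excitation energy that a separately estimated spectral envelope can then shape.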

    Cellular, Wide-Area, and Non-Terrestrial IoT: A Survey on 5G Advances and the Road Towards 6G

    The next wave of wireless technologies is proliferating in connecting things among themselves as well as to humans. In the era of the Internet of things (IoT), billions of sensors, machines, vehicles, drones, and robots will be connected, making the world around us smarter. The IoT will encompass devices that must wirelessly communicate a diverse set of data gathered from the environment for myriad new applications. The ultimate goal is to extract insights from this data and develop solutions that improve quality of life and generate new revenue. Providing large-scale, long-lasting, reliable, and near real-time connectivity is the major challenge in enabling a smart connected world. This paper provides a comprehensive survey on existing and emerging communication solutions for serving IoT applications in the context of cellular, wide-area, as well as non-terrestrial networks. Specifically, wireless technology enhancements for providing IoT access in fifth-generation (5G) and beyond cellular networks, and communication networks over the unlicensed spectrum are presented. Aligned with the main key performance indicators of 5G and beyond 5G networks, we investigate solutions and standards that enable energy efficiency, reliability, low latency, and scalability (connection density) of current and future IoT networks. The solutions include grant-free access and channel coding for short-packet communications, non-orthogonal multiple access, and on-device intelligence. Further, a vision of new paradigm shifts in communication networks in the 2030s is provided, and the integration of the associated new technologies like artificial intelligence, non-terrestrial networks, and new spectra is elaborated. Finally, future research directions toward beyond 5G IoT networks are pointed out. Comment: Submitted for review to IEEE CS&

    Generative models for music using transformer architectures

    This thesis focuses on the growth and impact of Transformer architectures, which are mainly used for natural language processing tasks, in audio generation. We argue that music, with its notes, chords, and volumes, is a language, and that a symbolic representation of music can be treated like human language. A brief history of sound synthesis, which provides the foundation for modern AI-generated music models, is given. The most recent work in AI-generated audio is studied carefully, and instances of AI-generated music are described in many contexts. Deep learning models and their applications to real-world problems are among the key subjects covered. The main areas of interest include transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions. Numerous studies on the various representations of music are explained, including how neural networks and deep learning techniques can be applied to symbolic melodies, musical arrangements, style transfer, and sound production. This thesis largely focuses on transformer models, but it also recognises the importance of other AI-based generative models, including GANs. Overall, this thesis enhances generative models for music composition and provides a complete understanding of transformer design. It shows the possibilities of AI-generated sound synthesis by emphasising the most current developments.

    Objective Estimation of Tracheoesophageal Speech Quality

    Speech quality estimation for pathological voices is becoming an increasingly important research topic. Assessing the quality and the degree of severity of disordered speech is important to the clinical treatment and rehabilitation of patients. In particular, patients who have undergone total laryngectomy (larynx removal) produce tracheoesophageal (TE) speech. In this thesis, we study the problem of TE speech quality estimation using advanced signal processing approaches. Since it is not possible to have a reference (clean) signal corresponding to a given disordered TE speech signal, we investigate in particular non-intrusive techniques (also called single-ended or blind approaches) that do not require a reference signal to deduce the speech quality level. First, we develop a novel TE speech quality estimator based on existing double-ended (intrusive) speech quality evaluation techniques such as the Perceptual Evaluation of Speech Quality (PESQ) and the Hearing Aid Speech Quality Index (HASQI). The matching pursuit algorithm (MPA) is used to generate a quasi-clean speech signal from a given disordered TE speech signal. Then, by adequately choosing the parameters of the MPA (atoms, number of iterations, etc.) and using the resulting signal as the reference signal in the intrusive algorithm, we show that the resulting algorithm correlates well with the subjective scores of two TE speech databases. Second, we investigate the extraction of low-complexity auditory features for the evaluation of speech quality. An 18th-order Linear Prediction (LP) analysis is performed on each voiced frame of the speech signal. Two evaluation features are extracted, corresponding to higher-order statistics of the LP coefficients and of the vocal tract model parameters (cross-sectional tube areas).
Using a set of 35 TE speech samples, we perform forward stepwise regression as well as K-fold cross-validation to select the best sets of features for each of the regression models. The selected features are then fitted to different support vector regression models, yielding high correlations with subjective scores. Third, we investigate a new approach for estimating the quality of TE speech using deep neural networks (DNNs). A synthetic dataset of 2173 samples was used to train a DNN model that was shown to predict TE voice quality. The synthetic dataset was formed by mixing 53 normal speech samples with modulated noise signals that had a similar envelope to the speech samples, at different speech-to-modulation noise ratios. A validated instrumental speech quality predictor was used to quantify the perceived quality of the speech samples in this database, and these objective quality scores were used for training the DNN model. The DNN model comprised an input layer that accepted sixty relevant features extracted through filterbank and linear prediction analyses of the input speech signal, two hidden layers with 15 neurons each, and an output layer that produced the predicted speech quality score. The DNN trained on the synthetic dataset was subsequently applied to four different databases that contained speech samples collected from TE speakers. The DNN-estimated quality scores exhibited a strong correlation with the subjective ratings of the TE samples in all four databases, showing strong robustness compared with the other speech quality metrics developed in this thesis and those from the literature.
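
The layer sizes given in the abstract (sixty input features, two hidden layers of 15 neurons, one output score) can be sketched as a small feed-forward network. The tanh activation, the random initialisation, and all names below are assumptions for illustration; the abstract does not specify them, and the weights here are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

# 60 input features -> 15 -> 15 -> 1 predicted quality score, as in the abstract.
layers = [init_layer(60, 15), init_layer(15, 15), init_layer(15, 1)]

def predict_quality(features):
    """Forward pass; tanh hidden activations are an assumption."""
    h = features
    for w, b in layers[:-1]:
        h = np.tanh(h @ w + b)
    w, b = layers[-1]
    return (h @ w + b).squeeze(-1)    # linear output, one score per sample

batch = rng.normal(size=(4, 60))      # four hypothetical feature vectors
scores = predict_quality(batch)
n_params = sum(w.size + b.size for w, b in layers)
```

With these sizes the network has only 1171 trainable parameters, which is consistent with training on a dataset of a few thousand samples.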

    Novel Architectures and Optimization Algorithms for Training Neural Networks and Applications

    The two main areas of Deep Learning are Unsupervised and Supervised Learning. Unsupervised Learning studies a class of data processing problems in which only descriptions of objects are known, without label information. Generative Adversarial Networks (GANs) have become among the most widely used unsupervised neural net models. A GAN combines two neural nets, generative and discriminative, that work simultaneously. We introduce a new family of discriminator loss functions that adopts a weighted sum of real and fake parts, which we call adaptive weighted loss functions. Using the gradient information, we can adaptively choose weights to train a discriminator in the direction that benefits the GAN's stability. We also propose several improvements to GAN training schemes. One is self-correcting optimization for training a GAN discriminator on Speech Enhancement tasks, which helps avoid "harmful" training directions for parts of the discriminator loss. The other improvement is a consistency loss, which targets the inconsistency between the time and time-frequency domains caused by Fourier Transforms. In contrast to Unsupervised Learning, Supervised Learning uses labels for each object, and the task is to find the relationship between objects and labels. Building computing methods to interpret and represent human language automatically is known as Natural Language Processing, which includes tasks such as word prediction and machine translation. In this area, we propose a novel Neumann-Cayley Gated Recurrent Unit (NC-GRU) architecture based on a Neumann series-based Scaled Cayley transformation. The NC-GRU uses orthogonal matrices to prevent exploding gradient problems and enhance long-term memory on various prediction tasks. In addition, we propose using our newly introduced NC-GRU unit inside a neural network model to create neural molecular fingerprints.
Integrating the novel NC-GRU fingerprints with Multi-Task Deep Neural Network schematics helps to improve performance on several molecular-related tasks. We also introduce a new normalization method, Assorted-Time Normalization, that helps to preserve information from multiple consecutive time steps and normalizes using them in recurrent-network-like architectures. Finally, we propose a Symmetry Structured Convolutional Neural Network (SCNN), an architecture with 2D structured symmetric features over spatial dimensions, that generates and preserves the symmetry structure in the network's convolutional layers.
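
To illustrate how a Neumann series can stand in for the matrix inverse in a scaled Cayley transform, the sketch below assumes the common form W = (I + A)^{-1} (I - A) D, with A skew-symmetric and D diagonal with entries ±1. That the NC-GRU uses exactly this form is an assumption; the abstract only names the technique, and all function names here are hypothetical.

```python
import numpy as np

def neumann_inverse(a, terms):
    """Approximate (I + a)^-1 by the truncated Neumann series sum_k (-a)^k.
    The series converges only if the spectral radius of a is below 1."""
    n = a.shape[0]
    total = np.eye(n)
    power = np.eye(n)
    for _ in range(terms):
        power = power @ (-a)
        total = total + power
    return total

def scaled_cayley(a, d, terms=30):
    """Orthogonal matrix from skew-symmetric a and a +/-1 diagonal scaling d."""
    n = a.shape[0]
    return neumann_inverse(a, terms) @ (np.eye(n) - a) @ d

rng = np.random.default_rng(0)
m = rng.normal(size=(4, 4))
a = m - m.T                               # skew-symmetric by construction
a = 0.5 * a / np.linalg.norm(a, 2)        # rescale to spectral norm 0.5
d = np.diag([1.0, 1.0, -1.0, -1.0])

w = scaled_cayley(a, d)
exact = np.linalg.solve(np.eye(4) + a, (np.eye(4) - a) @ d)
orth_error = np.linalg.norm(w.T @ w - np.eye(4))
```

Because the series needs only matrix products, it avoids an explicit inverse; the price is the norm constraint on A, which this example enforces by rescaling.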