
    Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses

    This paper presents a novel speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra using neural networks. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is composed of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses, defined between the predicted wrapped phase spectra and the natural ones, by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. Experimental results show that our proposed neural speech phase prediction model outperforms the iterative Griffin-Lim algorithm and other neural network-based methods in terms of both reconstructed speech quality and generation speed. Comment: Accepted by ICASSP 2023. Code is available.
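
    A minimal PyTorch sketch of the two ideas summarized above, as we read them from the abstract: a parallel estimation head whose two convolutional branches act as pseudo real and imaginary parts and are mapped to a wrapped phase (here atan2 stands in for the paper's phase calculation formula), and training errors activated by an anti-wrapping function. Layer shapes, names and the exact anti-wrapping form are our assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class ParallelPhaseEstimator(nn.Module):
    """Two parallel 1x1 convolutions produce pseudo real/imaginary parts;
    atan2 maps them to a wrapped phase strictly inside the principal interval."""
    def __init__(self, channels: int, n_bins: int):
        super().__init__()
        self.real_conv = nn.Conv1d(channels, n_bins, kernel_size=1)
        self.imag_conv = nn.Conv1d(channels, n_bins, kernel_size=1)

    def forward(self, h):                      # h: (batch, channels, frames)
        r = self.real_conv(h)
        i = self.imag_conv(h)
        return torch.atan2(i, r)               # (batch, n_bins, frames), in (-pi, pi]

def anti_wrapping(x):
    """Assumed anti-wrapping activation: |x - 2*pi*round(x / (2*pi))|,
    i.e. the magnitude of a phase error after wrapping it back to (-pi, pi]."""
    two_pi = 2 * torch.pi
    return torch.abs(x - two_pi * torch.round(x / two_pi))

def anti_wrapping_losses(pred_phase, true_phase):
    """Instantaneous phase, group delay (difference along frequency) and
    instantaneous angular frequency (difference along time) errors,
    each passed through the anti-wrapping function and averaged."""
    ip = anti_wrapping(pred_phase - true_phase).mean()
    gd = anti_wrapping(torch.diff(pred_phase, dim=-2) - torch.diff(true_phase, dim=-2)).mean()
    iaf = anti_wrapping(torch.diff(pred_phase, dim=-1) - torch.diff(true_phase, dim=-1)).mean()
    return ip + gd + iaf
```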

    Deep learning-based music source separation

    This thesis addresses the problem of music source separation using deep learning methods. The deep learning-based separation of music sources is examined from three angles: the signal processing, the neural architecture, and the signal representation. From the first angle, the aim is to understand what deep learning models, built on deep neural networks (DNNs), learn for the task of music source separation, and whether there is an analogous signal processing operator that characterizes the functionality of these models. To do so, a novel algorithm is presented. The algorithm, referred to as the neural couplings algorithm (NCA), distills an optimized separation model consisting of non-linear operators into a single linear operator that is easy to interpret. Using the NCA, it is shown that DNNs learn data-driven filters for singing voice separation that can be assessed using signal processing. Moreover, by enabling DNNs to learn how to predict filters for source separation, DNNs capture the structure of the target source and learn robust filters. From the second angle, the aim is to propose a neural network architecture that incorporates the aforementioned concept of filter prediction and optimization. For this purpose, the neural network architecture referred to as the Masker-and-Denoiser (MaD) is presented. The proposed architecture realizes the filtering operation using skip-filtering connections. Additionally, a few inference strategies and optimization objectives are proposed and discussed. The performance of MaD in music source separation is assessed by conducting a series of experiments that include both objective and subjective evaluation processes. Experimental results suggest that the MaD architecture, with some of the studied strategies, is applicable to realistic music recordings, and the MaD architecture has been considered one of the state-of-the-art approaches in the Signal Separation and Evaluation Campaign (SiSEC) 2018. Finally, the focus of the third angle is to employ DNNs for learning signal representations that are helpful for separating music sources. To that end, a new method is proposed, using a novel re-parameterization scheme and a combination of optimization objectives. The re-parameterization is based on sinusoidal functions that promote interpretable DNN representations. Results from the conducted experiments suggest that the proposed method can be efficiently employed in learning interpretable representations, where the filtering process can still be applied to separate music sources. Furthermore, the usage of optimal transport (OT) distances as optimization objectives is useful for computing additive and distinctly structured signal representations for various types of music sources.
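
    As a rough illustration of the skip-filtering connection mentioned above (our assumption of a typical realization, not the thesis code), the sketch below predicts a non-negative time-frequency mask and multiplies it with the input mixture spectrogram, so the network estimates a filtered version of its input rather than synthesizing the source freely; the recurrent body and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SkipFilteringMasker(nn.Module):
    """The network outputs a mask; the source estimate is mask * mixture,
    i.e. the filtering is realized through a skip connection from the input."""
    def __init__(self, n_bins: int, hidden: int = 512):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_bins), nn.ReLU())

    def forward(self, mix_mag):                # mix_mag: (batch, frames, n_bins)
        h, _ = self.rnn(mix_mag)
        mask = self.mask_head(h)               # non-negative filter per TF bin
        return mask * mix_mag                  # skip-filtering connection
```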

    Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

    Speech phase prediction, which is a significant research focus in the field of signal processing, aims to recover speech phase spectra from amplitude-related features. However, existing speech phase prediction methods are constrained to recovering phase spectra with short frame shifts, which are considerably smaller than the theoretical upper bound required for exact waveform reconstruction with the short-time Fourier transform (STFT). To tackle this issue, we present a novel long-frame-shift neural speech phase prediction (LFS-NSPP) method which enables precise prediction of long-frame-shift phase spectra from long-frame-shift log amplitude spectra. The proposed method consists of three stages: interpolation, prediction and decimation. The short-frame-shift log amplitude spectra are first constructed from long-frame-shift ones through frequency-by-frequency interpolation to enhance spectral continuity, and are then employed to predict short-frame-shift phase spectra using an NSPP model, thereby compensating for interpolation errors. Ultimately, the long-frame-shift phase spectra are obtained from short-frame-shift ones through frame-by-frame decimation. Experimental results show that the proposed LFS-NSPP method yields superior quality in predicting long-frame-shift phase spectra compared to the original NSPP model and other signal-processing-based phase estimation algorithms. Comment: Published in IEEE Signal Processing Letters.
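
    A hedged NumPy sketch of the three stages named in the abstract (interpolation, prediction, decimation). Linear interpolation and the factor of 4 are our assumptions standing in for the paper's frequency-by-frequency scheme, and `nspp_model` is a hypothetical trained short-frame-shift phase predictor.

```python
import numpy as np

def interpolate_frames(log_amp_long, factor):
    """Frequency-by-frequency interpolation along the time axis: turn
    long-frame-shift log amplitude spectra (frames, bins) into short-frame-shift
    ones with `factor` times denser framing."""
    n_frames, n_bins = log_amp_long.shape
    t_long = np.arange(n_frames)
    t_short = np.linspace(0, n_frames - 1, (n_frames - 1) * factor + 1)
    out = np.empty((t_short.size, n_bins))
    for k in range(n_bins):                     # one frequency bin at a time
        out[:, k] = np.interp(t_short, t_long, log_amp_long[:, k])
    return out

def decimate_frames(phase_short, factor):
    """Frame-by-frame decimation: keep every `factor`-th predicted short-frame-shift
    phase frame to recover the long-frame-shift phase spectra."""
    return phase_short[::factor]

# Hypothetical usage:
# log_amp_short = interpolate_frames(log_amp_long, factor=4)
# phase_short   = nspp_model(log_amp_short)     # stage 2: prediction with an NSPP model
# phase_long    = decimate_frames(phase_short, factor=4)
```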

    Conditioning Text-to-Speech synthesis on dialect accent: a case study

    Modern text-to-speech systems are modular in many different ways. In recent years, end-users gained the ability to control speech attributes such as degree of emotion, rhythm and timbre, along with other suprasegmental features. More ambitious objectives are related to modelling a combination of speakers and languages, e.g. to enable cross-speaker language transfer. However, no prior work has been done on the more fine-grained analysis of regional accents. To fill this gap, in this thesis we present practical end-to-end solutions to synthesise speech while controlling within-country variations of the same language, and we do so for 6 different dialects of the British Isles. In particular, we first conduct an extensive study of the speaker verification field and tweak state-of-the-art embedding models to work with dialect accents. We then adapt standard acoustic models and voice conversion systems by conditioning them on dialect accent representations, and finally compare our custom pipelines with a cutting-edge end-to-end architecture from the multi-lingual world. Results show that the adopted models are suitable and have enough capacity to accomplish the task of regional accent conversion. Indeed, we are able to produce speech closely resembling the selected speaker and dialect accent, where the most accurate synthesis is obtained via careful fine-tuning of the multi-lingual model to the multi-dialect case. Finally, we delineate limitations of our multi-stage approach and propose practical mitigations, to be explored in future work.
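
    One common way to condition an acoustic model on a dialect-accent representation, shown only as a sketch (the thesis pipelines, embedding models and fine-tuning recipes are not reproduced here), is to learn an accent embedding table and concatenate the selected embedding to the text-encoder states before decoding.

```python
import torch
import torch.nn as nn

class AccentConditionedEncoder(nn.Module):
    """Broadcast a learned accent embedding over time and fuse it with the
    text-encoder states; `n_accents` would be the number of supported dialects."""
    def __init__(self, text_dim: int, n_accents: int, accent_dim: int = 64):
        super().__init__()
        self.accent_table = nn.Embedding(n_accents, accent_dim)
        self.proj = nn.Linear(text_dim + accent_dim, text_dim)

    def forward(self, text_states, accent_id):  # text_states: (batch, T, text_dim)
        acc = self.accent_table(accent_id)      # (batch, accent_dim)
        acc = acc.unsqueeze(1).expand(-1, text_states.size(1), -1)
        return self.proj(torch.cat([text_states, acc], dim=-1))
```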

    Deep Learning in Cardiology

    The medical field is creating large amounts of data that physicians are unable to decipher and use efficiently. Moreover, rule-based expert systems are inefficient in solving complicated medical tasks or in creating insights from big data. Deep learning has emerged as a more accurate and effective technology for a wide range of medical problems such as diagnosis, prediction and intervention. Deep learning is a representation learning method that consists of layers that transform the data non-linearly, thus revealing hierarchical relationships and structures. In this review we survey deep learning application papers that use structured data, signal and imaging modalities from cardiology. We discuss the advantages and limitations of applying deep learning in cardiology that also apply to medicine in general, while proposing certain directions as the most viable for clinical use. Comment: 27 pages, 2 figures, 10 tables.

    Reconstruction de phase et de signaux audio avec des fonctions de coût non-quadratiques

    Audio signal reconstruction consists in recovering sound signals from incomplete or degraded representations. This problem can be cast as an inverse problem, and such problems are frequently tackled with the help of optimization or machine learning strategies. In this thesis, we propose to change the cost function in inverse problems related to audio signal reconstruction. We mainly address the phase retrieval problem, which is common when manipulating audio spectrograms. A first line of work tackles the optimization of non-quadratic cost functions for phase retrieval. We study this problem in two contexts: audio signal reconstruction from a single spectrogram and source separation. We introduce a novel formulation of the problem with Bregman divergences, as well as algorithms for its resolution. A second line of work proposes to learn the cost function from a given dataset. This is done within the framework of unfolded neural networks, which are derived from iterative algorithms. We introduce a neural network based on the unfolding of the Alternating Direction Method of Multipliers, which includes learnable activation functions. We expose the relation between the learning of its parameters and the learning of the cost function for phase retrieval. We conduct numerical experiments for each of the proposed methods to evaluate their performance and their potential for audio signal reconstruction.
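
    For context, the classical Griffin-Lim iteration is the quadratic-cost baseline that work on Bregman-divergence objectives generalizes; the SciPy sketch below is only that baseline with illustrative STFT parameters, not the algorithms proposed in the thesis.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_amp, n_fft=1024, hop=256, n_iter=100, seed=0):
    """Alternate between enforcing the target amplitude spectrogram
    (shape: n_fft // 2 + 1 frequency bins x frames) and projecting onto
    the set of consistent STFTs by an inverse/forward STFT round trip."""
    rng = np.random.default_rng(seed)
    phase = np.exp(1j * rng.uniform(-np.pi, np.pi, size=target_amp.shape))
    for _ in range(n_iter):
        _, x = istft(target_amp * phase, nperseg=n_fft, noverlap=n_fft - hop)
        _, _, Z = stft(x, nperseg=n_fft, noverlap=n_fft - hop)
        # keep the frame count aligned with the target spectrogram
        if Z.shape[1] < target_amp.shape[1]:
            Z = np.pad(Z, ((0, 0), (0, target_amp.shape[1] - Z.shape[1])))
        phase = np.exp(1j * np.angle(Z[:, :target_amp.shape[1]]))
    _, x = istft(target_amp * phase, nperseg=n_fft, noverlap=n_fft - hop)
    return x
```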

    Explainable deep learning solutions for the artifacts correction of EEG signals

    The brain's electrical activity can be acquired via electroencephalography (EEG) with electrodes placed on the scalp of the individual. When EEG signals are recorded, artifacts such as muscular activity, eye blinks, cardiac activity and power line electrical noise can significantly affect the quality of the EEG signals, and removing them is essential in many disciplines to obtain a clean signal. Machine learning (ML) techniques are one example of methods used to classify and remove EEG artifacts. Deep learning (DL) is a branch of ML inspired by the architecture of the cerebral cortex, which is formed by a dense network of neurons, the simple processing units in our brain. In this thesis work we use ICLabel, an artificial neural network developed within EEGLAB that classifies the independent components (ICs), obtained by applying independent component analysis (ICA), into seven classes: brain, eye, muscle, heart, channel noise, line noise and other. ICLabel provides the probability that each IC belongs to one of the six artifact classes or is a pure brain component. We created a simple CNN, similar to ICLabel's, that classifies the ICs into two classes, brain and non-brain, and we added an explainability tool, GradCAM, to investigate how the algorithm is able to successfully classify the ICs. We compared the performance of our simple CNN with that of ICLabel, finding that the CNN reaches satisfactory accuracy over the two classes (brain/non-brain). We then applied GradCAM to the CNN to understand which parts of the spectrogram the network uses to classify the data, and we could speculate that, as expected, the CNN is driven by components such as power line noise (50 Hz and higher harmonics) to identify non-brain components, while it focuses on the 1-30 Hz range to identify brain components. Although promising, these results need further investigation. Moreover, GradCAM could later be applied to ICLabel as well, in order to explain the more sophisticated DL model with seven classes.
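
    An illustrative sketch (not the thesis network and not ICLabel) of the two ingredients discussed above: a small CNN that classifies IC spectrograms into brain vs. non-brain, and a Grad-CAM relevance map obtained by weighting the last convolutional activations with the gradients of the chosen class score. Architecture, input shapes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyICNet(nn.Module):
    """Minimal CNN over IC spectrograms (batch, 1, freq, time) with two classes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32, 2)            # brain / non-brain logits

    def forward(self, x):
        acts = self.features(x)                 # (batch, 32, freq, time)
        pooled = F.adaptive_avg_pool2d(acts, 1).flatten(1)
        return self.head(pooled), acts

def grad_cam(model, x, class_idx):
    """Grad-CAM: channel-average the gradients of the class score w.r.t. the last
    conv activations, use them to weight those activations, then ReLU and normalize."""
    model.eval()
    logits, acts = model(x)
    acts.retain_grad()
    logits[:, class_idx].sum().backward()
    weights = acts.grad.mean(dim=(2, 3), keepdim=True)   # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1))             # (batch, freq, time)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```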