
    Delay-Coordinates Embeddings as a Data Mining Tool for Denoising Speech Signals

    In this paper we use techniques from the theory of nonlinear dynamical systems to define a notion of embedding threshold estimators. More specifically, we use delay-coordinates embeddings of sets of coefficients of the measured signal (in some chosen frame) as a data mining tool to separate structures that are likely to be generated by signals belonging to some predetermined data set. We describe a particular variation of the embedding threshold estimator implemented in a windowed Fourier frame, and we apply it to speech signals heavily corrupted by the addition of several types of white noise. Our experimental work suggests that, after training on the data sets of interest, these estimators perform well for a variety of white noise processes and noise intensity levels. For the case of Gaussian white noise, the method is compared to a block thresholding estimator.
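
As a rough illustration of the delay-coordinates embedding mentioned in the abstract above, the sketch below maps a 1-D sequence to vectors of lagged samples. It is a generic construction, not the paper's estimator: the function name `delay_embed` and the values of `dim` and `tau` are assumptions made for this example, and the paper applies the idea to frame coefficients rather than to raw samples.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Return the delay-coordinate embedding of a 1-D sequence.

    Row i is (x[i], x[i + tau], ..., x[i + (dim - 1) * tau]).
    """
    x = np.asarray(x)
    n_vectors = len(x) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("sequence too short for this dim and tau")
    # Index grid: one row per embedded vector, one column per delay.
    idx = np.arange(n_vectors)[:, None] + tau * np.arange(dim)[None, :]
    return x[idx]

# Illustrative values only (dim and tau are not taken from the paper).
points = delay_embed(np.arange(10), dim=3, tau=2)
```

Each row of `points` is one embedded vector; structured signals tend to trace out low-dimensional shapes in this space, while white noise tends to fill it more uniformly.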

    Deep Learning Based Speech Enhancement and Its Application to Speech Recognition

    Speech enhancement is the task that aims to improve the quality and the intelligibility of a speech signal that is degraded by ambient noise and room reverberation. Speech enhancement algorithms are used extensively in many audio and communication systems, including mobile handsets, speech recognition, speaker verification systems and hearing aids. Recently, deep learning has achieved great success in many applications, such as computer vision, natural language processing and speech recognition. Speech enhancement methods have been introduced that use deep-learning techniques, as these techniques are capable of learning complex hierarchical functions using large-scale training data. This dissertation investigates deep-learning-based speech enhancement and its application to robust Automatic Speech Recognition (ASR). We start our work by exploring generative adversarial network (GAN) based speech enhancement. We explore techniques to extract information about the noise to aid in the reconstruction of the speech signals. The proposed framework, referred to as ForkGAN, is a novel general adversarial learning-based framework that combines deep learning with conventional noise reduction techniques. We further extend ForkGAN to M-ForkGAN, which integrates feature mapping and mask learning into a unified framework using ForkGAN. Another variant of ForkGAN, named S-ForkGAN, operates on spectral-domain features, which could directly apply to ASR. Systematic evaluations demonstrate the effectiveness of the proposed approaches. Then, we propose a novel multi-stage learning speech enhancement system. Each stage comprises a self-attention (SA) block followed by stacks of temporal convolutional network (TCN) blocks with doubling dilation factors. Each stage generates a prediction that is refined in a subsequent stage. A fusion block is inserted at the input of later stages to re-inject original information. 
Moreover, we design several multi-scale architectures with perceptual loss. Experiments show that our proposed architectures can achieve state-of-the-art performance on several public datasets. Recently, modeling to learn the acoustic noisy-clean speech mapping has been enhanced by including auxiliary information such as visual cues, phonetic and linguistic information, and speaker information. We propose a novel speaker-aware speech enhancement (SASE) method that extracts speaker information from a clean reference using long short-term memory (LSTM) layers, and then uses a convolutional recurrent neural network (CRN) to embed the extracted speaker information. The SASE framework is extended with a self-attention mechanism. It is shown that a few seconds of clean reference speech is sufficient, and that the proposed SASE method performs well for a wide range of scenarios. Even though speech enhancement methods based on deep learning have demonstrated state-of-the-art performance when compared with conventional methodologies, current deep learning approaches rely heavily on supervised learning, which requires a large number of noisy and clean speech sample pairs for training. This is generally not practical in a realistic environment, because one cannot simultaneously obtain both noisy and clean speech samples. Thus, most speech enhancement approaches are trained with simulated speech and clean targets. In addition, it would be hard to collect large-scale datasets for low-resource languages. We propose a novel noise-to-noise speech enhancement (N2N-SE) method that addresses the parallel noisy-clean training data issue by leveraging signal reconstruction techniques that use only corrupted speech. The proposed N2N-SE framework includes a noise conversion module, an auto-encoder that learns to mix noise with speech, and a speech enhancement module that learns to reconstruct corrupted speech signals. 
In addition to additive noise, speech is also affected by reverberation, which is caused by the attenuated and delayed reflections of sound waves. These distortions, particularly when combined, can severely degrade speech intelligibility for human listeners and impair applications such as automatic speech recognition (ASR) and speaker recognition. Thus, effective speech denoising and dereverberation will benefit both speech processing applications and human listeners. We investigate deep-learning-based approaches for both speech dereverberation and speech denoising using the cascade Conformer architecture. The experimental results show that the proposed cascade Conformer can effectively suppress noise and reverberation.
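
The abstract above mentions stacks of TCN blocks with doubling dilation factors. A minimal sketch of why doubling matters: the temporal context seen by the network grows exponentially with the number of blocks. The sketch assumes one dilated convolution per block, and the kernel size, block count and stack count are illustrative values, not figures from the dissertation.

```python
def receptive_field(kernel_size, num_blocks, num_stacks=1):
    """Receptive field (in frames) of stacked dilated 1-D convolutions.

    Assumes one convolution per block, with dilations 1, 2, 4, ...
    repeated for each stack (a simplification of a real TCN block).
    """
    dilations = [2 ** i for i in range(num_blocks)] * num_stacks
    # Each layer widens the context by (kernel_size - 1) * dilation frames.
    return 1 + (kernel_size - 1) * sum(dilations)

# Hypothetical configuration chosen only for illustration.
frames = receptive_field(kernel_size=3, num_blocks=8, num_stacks=3)
```

Under these assumptions, three stacks of eight blocks cover 1531 frames, whereas the same 24 layers without dilation would cover only 49.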

    Multipath channel identification by using global optimization in ambiguity function domain

    A new transform domain array signal processing technique is proposed for identification of multipath communication channels. The received array element outputs are transformed to delay-Doppler domain by using the cross-ambiguity function (CAF) for efficient exploitation of the delay-Doppler diversity of the multipath components. Clusters of multipath components can be identified by using a simple amplitude thresholding in the delay-Doppler domain. Particle swarm optimization (PSO) can be used to identify parameters of the multipath components in each cluster. The performance of the proposed PSO-CAF technique is compared with the space alternating generalized expectation maximization (SAGE) technique and with a recently proposed PSO based technique at various SNR levels. Simulation results clearly quantify the superior performance of the PSO-CAF technique over the alternative techniques at all practically significant SNR levels. (C) 2011 Elsevier B.V. All rights reserved.
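
To make the delay-Doppler transform in the abstract above more concrete, here is a minimal sketch of a discrete cross-ambiguity surface for a single noise-free path. It is a simplification under stated assumptions: delays are circular, there is one sensor rather than an array, and the PSO refinement step is not shown. The function name `cross_ambiguity` and all signal parameters are hypothetical.

```python
import numpy as np

def cross_ambiguity(received, reference, max_delay):
    """Discrete cross-ambiguity surface with circular delays.

    Entry [d, k] is
    sum_n received[n] * conj(reference[n - d]) * exp(-2j*pi*k*n/N).
    """
    n = len(reference)
    surface = np.empty((max_delay + 1, n), dtype=complex)
    for d in range(max_delay + 1):
        shifted = np.roll(reference, d)  # reference[n - d], circular
        surface[d] = np.fft.fft(received * np.conj(shifted))
    return surface

rng = np.random.default_rng(0)
n = 256
# Unit-modulus random-phase probe signal (an assumption for this demo).
reference = np.exp(2j * np.pi * rng.random(n))
true_delay, true_doppler_bin = 7, 12
doppler = np.exp(2j * np.pi * true_doppler_bin * np.arange(n) / n)
received = np.roll(reference, true_delay) * doppler

surface = np.abs(cross_ambiguity(received, reference, max_delay=20))
delay_hat, doppler_hat = np.unravel_index(np.argmax(surface), surface.shape)
```

The path shows up as a sharp peak at its delay and Doppler bin, which is what makes simple amplitude thresholding a plausible first step for isolating clusters of multipath components.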

    A Study on MIMO Wireless Communication Channel Performance in Correlated Channels

    MIMO wireless communication systems are gaining popularity by the day due to their versatility and wide applicability. When a signal travels through a wireless link, it is affected by disturbances present in the channel, i.e., different sorts of interference and noise. In addition, because there may or may not be a line-of-sight (LOS) path between transmitter and receiver, copies of the signal that leave the transmitter at the same time reach the receiver with different delays and attenuations due to multiple reflections, and interfere with each other at the receiver. Fading of the received signal power is therefore also observed in a wireless MIMO link. In wireless communication, the two most important objectives can be channel estimation and signal detection. Channel estimation is important for faithful signal detection and for transmit beamforming, power allocation, etc., when channel state information (CSI) is communicated to the transmitter via a feedback loop in the case of a uni-directional channel, or estimated simultaneously by the transmitter itself in the case of a bi-directional channel. This text introduces some aspects of signal detection and, mostly, different aspects of channel estimation, and explains why it is important in the context of signal detection, beamforming, etc. A brief introduction to antenna arrays and beamforming procedures is given. The causes of spatial and temporal correlations are discussed, and different ways of modelling these correlations are briefly introduced. The text also discusses how different link and link-end properties, e.g., antenna spacing, angular spread of the radiation beam, mean angle of radiation, and mutual coupling between elements of an antenna array, affect the channel correlations and thereby the performance of the MIMO wireless communication channel. 
Modelling of antenna mutual coupling and different estimation and compensation techniques are also discussed.

    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
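
As a hedged sketch of why Tikhonov regularization suits a low-latency setting: with a fixed set of spectral templates, the regularized least-squares gains have a closed form, so each incoming frame costs one matrix-vector product. The template matrix, the value of `lam` and the function name `tikhonov_projector` are assumptions for illustration; the thesis may formulate the decomposition differently.

```python
import numpy as np

def tikhonov_projector(basis, lam):
    """Precompute P = (B^T B + lam * I)^(-1) B^T.

    gains = P @ spectrum minimizes
    ||spectrum - B @ gains||^2 + lam * ||gains||^2.
    """
    k = basis.shape[1]
    gram = basis.T @ basis + lam * np.eye(k)
    return np.linalg.solve(gram, basis.T)

rng = np.random.default_rng(0)
# Hypothetical non-negative spectral templates: 64 bins, 5 components.
basis = np.abs(rng.standard_normal((64, 5)))
true_gains = np.array([0.0, 2.0, 0.0, 1.0, 0.0])
spectrum = basis @ true_gains

projector = tikhonov_projector(basis, lam=1e-6)  # computed once, offline
gains = projector @ spectrum                     # per-frame cost
```

Unlike iterative decompositions, nothing here depends on the frame being processed except the final product, at the price that the gains are not constrained to be non-negative.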