
    Pseudo-Stereo Audio Processor

    Because of technical and resource limitations, non-professional audio productions must often record with a single microphone, which yields a mono audio signal. Even audio that was originally multi-channel is often mixed down to a single channel to save memory. This channel limitation, however, makes any music in the recording sound duller. The Pseudo-Stereo Audio Processor remedies this by applying a quadrature phase shift to a given single-channel audio signal, producing multiple phase-shifted output signals. These fixed-phase outputs are then recombined into an emulated two-channel version of the input with a variable phase difference, giving independent or small-scale recording studios an easy post-production enhancement for single-channel signals. Further, this method of quadrature filtering uses a Hilbert transform to produce completely decorrelated output signals, creating unique auditory effects that are useful in certain areas of psychoacoustic research and much harder to obtain by other means.
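The quadrature idea in this abstract can be illustrated with a minimal sketch. It is not the authors' processor: the mixing rule (left = cos(a)·x + sin(a)·H{x}, right = cos(a)·x - sin(a)·H{x}) and the names `analytic_signal` and `pseudo_stereo` are assumptions chosen for illustration, and a naive DFT stands in for a real filter design.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short examples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT, returning complex samples."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def analytic_signal(x):
    """Return x + j*H{x} by zeroing negative frequencies and doubling positive ones."""
    n = len(x)
    gain = [0.0] * n
    gain[0] = 1.0                      # keep DC unchanged
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin unchanged
        for k in range(1, n // 2):
            gain[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            gain[k] = 2.0
    spec = dft(x)
    return [v for v in idft([spec[k] * gain[k] for k in range(n)])]


def pseudo_stereo(mono, angle):
    """Hypothetical mixing rule: the two channels end up 2*angle apart in phase."""
    z = analytic_signal(mono)          # real part: input, imaginary part: 90-degree shifted copy
    c, s = math.cos(angle), math.sin(angle)
    left = [c * v.real + s * v.imag for v in z]
    right = [c * v.real - s * v.imag for v in z]
    return left, right
```

With `angle = math.pi / 4`, a pure tone comes out with a 90-degree phase difference between the channels, so the two outputs are uncorrelated over a whole number of cycles. Varying `angle` changes the phase difference, which is one plausible reading of the "variable phase difference" mentioned above.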

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures

    End to End Deep Neural Network Frequency Demodulation of Speech Signals

    Frequency modulation (FM) is a form of radio broadcasting that has been widely used for almost a century. We propose a software-defined-radio (SDR) receiver for FM demodulation that adopts an end-to-end learning-based approach and exploits prior information about the transmitted speech message in the demodulation process. The receiver detects and enhances speech from the in-phase and quadrature components of the baseband version of the received signal. The new system yields high-performance detection under both acoustical disturbances and communication channel noise, and is expected to outperform established methods in low signal-to-noise ratio (SNR) conditions in terms of both mean square error and perceptual evaluation of speech quality score.
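For context, the kind of established demodulator that such a learned receiver would be compared against can be sketched as a phase-difference discriminator operating on the in-phase and quadrature (I/Q) baseband samples. This is a conventional baseline, not the end-to-end neural receiver the abstract proposes; the function names and the `deviation_hz` and `sample_rate` parameters are assumptions for illustration.

```python
import cmath
import math


def fm_modulate(message, deviation_hz, sample_rate):
    """Build complex baseband I/Q samples whose phase is the running sum of the message."""
    phase = 0.0
    iq = []
    for m in message:
        phase += 2 * math.pi * deviation_hz * m / sample_rate
        iq.append(cmath.exp(1j * phase))   # real part: I, imaginary part: Q
    return iq


def fm_demodulate(iq, deviation_hz, sample_rate):
    """Recover the message from the phase step between consecutive I/Q samples."""
    out = []
    for prev, cur in zip(iq, iq[1:]):
        # The angle of cur * conj(prev) is the phase increment, wrapped to (-pi, pi].
        dphi = cmath.phase(cur * prev.conjugate())
        out.append(dphi * sample_rate / (2 * math.pi * deviation_hz))
    return out
```

In the noise-free case, `fm_demodulate(fm_modulate(msg, d, fs), d, fs)` returns `msg[1:]`, as long as the per-sample phase step stays below pi. A discriminator of this kind degrades at low SNR, which is the regime the abstract targets.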

    Speech enhancement using deep learning

    This thesis explores the possibility of enhancing noisy speech signals using Deep Neural Networks. Signal enhancement is a classic problem in speech processing. In recent years, deep learning has been applied to many speech processing tasks because it has provided very satisfactory results. As a first step, a Signal Analysis Module was implemented to calculate the magnitude and phase of each audio file in the database. The signal is represented by its magnitude and its phase; the magnitude is modified by the neural network, and the signal is then reconstructed with the original phase. The implementation of the Neural Networks is divided into two stages. The first stage is a Speech Activity Detection Deep Neural Network (SAD-DNN). The previously calculated magnitude of the noisy data is used to train the SAD-DNN to classify each frame as speech or non-speech. This classification is useful for the network that does the final cleaning. The SAD-DNN is followed by a Denoising Auto-Encoder (DAE). The magnitude and the speech/non-speech label are the input to this second Deep Neural Network, which is in charge of denoising the speech signal. In this second stage, the first stage is also optimized for the final task. Neural Networks require datasets for training. In this project, the TIMIT corpus [9] was used as the dataset for clean speech (target) and the QUT-NOISE TIMIT corpus [4] as the noisy dataset (source). Finally, the Signal Synthesis Module reconstructs the clean speech signal from the enhanced magnitudes and the original phase. In the end, the results provided by the system were analysed using both objective and subjective measures.
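The analysis and synthesis steps described in this abstract (split each frame into magnitude and phase, modify only the magnitude, rebuild with the original phase) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the non-overlapping rectangular frames, the naive DFT, and the `modify_magnitude` callback, which stands in for the trained denoising network, are all assumptions.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT, returning complex samples."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def analyze(frame):
    """Signal analysis: return (magnitude, phase) of the frame's spectrum."""
    spec = dft(frame)
    return [abs(c) for c in spec], [cmath.phase(c) for c in spec]


def synthesize(magnitude, phase):
    """Signal synthesis: recombine magnitude and phase, then return to the time domain."""
    spec = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
    return [v.real for v in idft(spec)]


def enhance(signal, frame_len, modify_magnitude):
    """Process the signal frame by frame, changing magnitudes and keeping the noisy phase.

    modify_magnitude is a hypothetical placeholder for the denoising network.
    Trailing samples that do not fill a whole frame are dropped in this sketch.
    """
    out = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        mag, ph = analyze(signal[start:start + frame_len])
        out.extend(synthesize(modify_magnitude(mag), ph))
    return out
```

Passing an identity function as `modify_magnitude` reproduces the input, which is a quick check that the analysis and synthesis steps are consistent. A practical system would typically use overlapping windowed frames rather than the rectangular ones assumed here.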