Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified. Comment: 15 pages, 2 pdf figures
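As a rough illustration of the log-mel representation mentioned in this abstract, the following sketch computes log-mel energies for a single frame. It is a minimal, standard-library-only example and not the pipeline of any reviewed system. The HTK-style mel formula, the naive DFT, the triangular filter shape, and the names `log_mel`, `mel_filterbank`, and `n_mels` are all assumptions chosen for illustration.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """HTK-style mel scale (an assumed convention): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power of the non-negative-frequency DFT bins (naive O(N^2) DFT)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels filters need n_mels + 2 edge frequencies (in Hz).
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel(frame, sample_rate, n_mels=8, eps=1e-10):
    """Log of the mel-filtered power spectrum of one frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sample_rate)
    return [math.log(sum(w * p for w, p in zip(row, spec)) + eps) for row in bank]


# Usage: a 1 kHz tone sampled at 8 kHz, analysed as one 64-sample frame.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
features = log_mel(tone, 8000)
```

A practical front end would add windowing, overlapping frames, and a fast FFT; those are omitted here to keep the sketch short.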
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Improving speech recognition by revising gated recurrent units
Speech recognition is largely taking advantage of deep learning, showing that
substantial benefits can be obtained by modern Recurrent Neural Networks
(RNNs). The most popular RNNs are Long Short-Term Memory (LSTMs), which
typically reach state-of-the-art performance in many tasks thanks to their
ability to learn long-term dependencies and robustness to vanishing gradients.
Nevertheless, LSTMs have a rather complex design with three multiplicative
gates, which might impair their efficient implementation. An attempt to simplify
LSTMs has recently led to Gated Recurrent Units (GRUs), which are based on just
two multiplicative gates.
This paper builds on these efforts by further revising GRUs and proposing a
simplified architecture potentially more suitable for speech recognition. The
contribution of this work is two-fold. First, we suggest removing the reset
gate in the GRU design, resulting in a more efficient single-gate architecture.
Second, we propose to replace tanh with ReLU activations in the state update
equations. Results show that, in our implementation, the revised architecture
reduces the per-epoch training time by more than 30% and consistently
improves recognition performance across different tasks, input features, and
noisy conditions when compared to a standard GRU.
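The two changes described in this abstract (no reset gate, ReLU instead of tanh in the state update) can be sketched as a single recurrent step. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter names (`Wz`, `Uz`, `bz`, `Wh`, `Uh`, `bh`), the interpolation convention `h = z * h_prev + (1 - z) * candidate`, and the absence of any normalization are choices made here for the example.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def single_gate_gru_step(x, h_prev, p):
    """One time step of a GRU variant with only an update gate.

    p is a dict of hypothetical parameters:
      Wz, Wh: hidden x input matrices; Uz, Uh: hidden x hidden matrices;
      bz, bh: bias vectors of length hidden.
    """
    wz_x, uz_h = matvec(p["Wz"], x), matvec(p["Uz"], h_prev)
    wh_x, uh_h = matvec(p["Wh"], x), matvec(p["Uh"], h_prev)
    h_new = []
    for i in range(len(h_prev)):
        # Update gate: the only multiplicative gate that remains.
        z = sigmoid(wz_x[i] + uz_h[i] + p["bz"][i])
        # Candidate state: ReLU replaces tanh, and h_prev is used directly
        # because there is no reset gate to scale it.
        candidate = max(0.0, wh_x[i] + uh_h[i] + p["bh"][i])
        # Interpolate between the previous state and the candidate.
        h_new.append(z * h_prev[i] + (1.0 - z) * candidate)
    return h_new


# Usage: 3 input features, 2 hidden units, small fixed weights.
params = {
    "Wz": [[0.1, 0.0, -0.1], [0.2, 0.1, 0.0]],
    "Uz": [[0.0, 0.1], [0.1, 0.0]],
    "bz": [0.0, 0.0],
    "Wh": [[0.3, -0.2, 0.1], [0.0, 0.1, 0.2]],
    "Uh": [[0.1, 0.0], [0.0, 0.1]],
    "bh": [0.0, 0.0],
}
h = [0.0, 0.0]
for x_t in ([1.0, 0.5, -0.5], [0.0, 1.0, 1.0]):
    h = single_gate_gru_step(x_t, h, params)
```

Compared with a standard GRU step, this drops one gate's matrix products per step, which is one plausible source of the training-time reduction the abstract reports.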
Speech enhancement using deep learning
This thesis explores the possibility of enhancing noisy speech signals using Deep Neural Networks. Signal enhancement is a classic problem in speech processing. In recent years, deep learning has been applied to many speech processing tasks because it has provided very satisfactory results. As a first step, a Signal Analysis Module was implemented to calculate the magnitude and phase of each audio file in the database. The signal is represented by its magnitude and its phase; the magnitude is modified by the neural network, and the signal is then reconstructed with the original phase. The implementation of the neural networks is divided into two stages. The first stage is a Speech Activity Detection Deep Neural Network (SAD-DNN). The previously calculated magnitude of the noisy data is used to train the SAD-DNN to classify each frame as speech or non-speech. This classification is useful for the network that performs the final cleaning. The SAD-DNN is followed by a Denoising Auto-Encoder (DAE). The magnitude and the speech/non-speech label are the input to this second Deep Neural Network, which is in charge of denoising the speech signal. In this second stage, the first stage is also optimized for the final task. Neural networks require datasets for training. In this project, the TIMIT corpus [9] was used as the dataset for clean speech (target) and the QUT-NOISE TIMIT corpus [4] as the noisy dataset (source). Finally, a Signal Synthesis Module reconstructs the clean speech signal from the enhanced magnitudes and the original phase. The results provided by the system were analysed using both objective and subjective measures.
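The analysis, magnitude modification, and synthesis steps described in this abstract can be sketched for a single frame as follows. This is a minimal, standard-library-only illustration and not the thesis code: the naive DFT, the function names, and the `magnitude_model` callable (a stand-in for the trained denoising network) are assumptions introduced for the example.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, keeping the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def analyze(frame):
    """Signal analysis: split the spectrum into magnitude and phase."""
    spec = dft(frame)
    return [abs(c) for c in spec], [cmath.phase(c) for c in spec]


def synthesize(magnitude, phase):
    """Signal synthesis: recombine magnitude and phase, then invert."""
    return idft([cmath.rect(m, p) for m, p in zip(magnitude, phase)])


def enhance_frame(noisy_frame, magnitude_model):
    """Modify only the magnitude and reuse the original (noisy) phase."""
    magnitude, phase = analyze(noisy_frame)
    enhanced_magnitude = magnitude_model(magnitude)  # hypothetical network
    return synthesize(enhanced_magnitude, phase)


# Usage: a placeholder "model" that simply attenuates every bin by half.
noisy = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, 0.0]
enhanced = enhance_frame(noisy, lambda mag: [0.5 * m for m in mag])
```

Reusing the noisy phase keeps the network's job limited to magnitudes, at the cost of leaving any phase distortion in place.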