1,481 research outputs found
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
An Overview on Application of Machine Learning Techniques in Optical Networks
Today's telecommunication networks have become sources of enormous amounts of
widely heterogeneous data. This information can be retrieved from network
traffic traces, network alarms, signal quality indicators, users' behavioral
data, etc. Advanced mathematical tools are required to extract meaningful
information from these data and take decisions pertaining to the proper
functioning of the networks from the network-generated data. Among these
mathematical tools, Machine Learning (ML) is regarded as one of the most
promising methodological approaches to perform network-data analysis and enable
automated network self-configuration and fault management. The adoption of ML
techniques in the field of optical communication networks is motivated by the
unprecedented growth of network complexity faced by optical networks in the
last few years. Such complexity increase is due to the introduction of a huge
number of adjustable and interdependent system parameters (e.g., routing
configurations, modulation format, symbol rate, coding schemes, etc.) that are
enabled by the usage of coherent transmission/reception technologies, advanced
digital signal processing and compensation of nonlinear effects in optical
fiber propagation. In this paper we provide an overview of the application of
ML to optical communications and networking. We classify and survey relevant
literature dealing with the topic, and we also provide an introductory tutorial
on ML for researchers and practitioners interested in this field. Although a
good number of research papers have recently appeared, the application of ML to
optical networks is still in its infancy: to stimulate further work in this
area, we conclude the paper proposing new possible research directions
Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation
Sequence-to-Sequence (S2S) models recently started to show state-of-the-art
performance for automatic speech recognition (ASR). With these large and deep
models overfitting remains the largest problem, outweighing performance
improvements that can be obtained from better architectures. One solution to
the overfitting problem is increasing the amount of available training data and
the variety exhibited by the training data with the help of data augmentation.
In this paper we examine the influence of three data augmentation methods on
the performance of two S2S model architectures. One of the data augmentation
method comes from literature, while two other methods are our own development -
a time perturbation in the frequency domain and sub-sequence sampling. Our
experiments on Switchboard and Fisher data show state-of-the-art performance
for S2S models that are trained solely on the speech training data and do not
use additional text data.Comment: To appear in ICASSP 202
No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation
Automatic speech recognition (ASR) systems are known to be sensitive to the
sociolinguistic variability of speech data, in which gender plays a crucial
role. This can result in disparities in recognition accuracy between male and
female speakers, primarily due to the under-representation of the latter group
in the training data. While in the context of hybrid ASR models several
solutions have been proposed, the gender bias issue has not been explicitly
addressed in end-to-end neural architectures. To fill this gap, we propose a
data augmentation technique that manipulates the fundamental frequency (f0) and
formants. This technique reduces the data unbalance among genders by simulating
voices of the under-represented female speakers and increases the variability
within each gender group. Experiments on spontaneous English speech show that
our technique yields a relative WER improvement up to 9.87% for utterances by
female speakers, with larger gains for the least-represented f0 ranges.Comment: Accepted at ASRU 202
- …