Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
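As an illustration of the log-mel representation named in the abstract above, the following is a minimal sketch for a single audio frame. It is not taken from the article: the HTK-style mel formula, the Hann window, the triangular filters, and all function names and defaults (`n_mels=8`, `eps`) are assumptions chosen for brevity, and the naive DFT is only suitable for short frames.

```python
import cmath
import math


def hz_to_mel(f):
    # HTK-style mel scale; one common convention among several (assumption).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters over the n_fft // 2 + 1 DFT bins."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (fine for short frames)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_frame(frame, sample_rate, n_mels=8, eps=1e-10):
    """Log of the mel-filtered power spectrum for one frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sample_rate)
    return [math.log(sum(w * p for w, p in zip(row, spectrum)) + eps)
            for row in bank]
```

Feeding a 1 kHz sine sampled at 8 kHz into `log_mel_frame` should give its largest value in the band whose triangle covers 1 kHz. In practice a library routine with an FFT would replace the naive DFT used here.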
Studies on noise robust automatic speech recognition
Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK.
The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes
This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task design, data capture and baseline systems along with a description and evaluation of the 26 systems that were submitted. The best systems re-engineered every stage of the baseline, resulting in reductions in word error rate from 33.4% to as low as 5.8%. By comparing across systems, techniques that are essential for strong performance are identified. Second, the paper considers the problem of drawing conclusions from evaluations that use speech directly recorded in noisy environments. The degree of challenge presented by the resulting material is hard to control and hard to fully characterise. We attempt to dissect the various 'axes of difficulty' by correlating various estimated signal properties with typical system performance on a per-session and per-utterance basis. We find strong evidence of a dependence on signal-to-noise ratio and channel quality. Systems are less sensitive to variations in the degree of speaker motion. The paper concludes by discussing the outcomes of CHiME-3 in relation to the design of future mobile speech recognition evaluations.
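The word error rate figures quoted above can be made concrete with a short sketch. It assumes the usual definition of word error rate as word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names are illustrative and this is not the challenge's own scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate that was removed."""
    return (baseline - improved) / baseline


# Going from 33.4% to 5.8% removes roughly 83% of the baseline errors.
print(round(100 * relative_reduction(33.4, 5.8), 1))
```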
Multi-candidate missing data imputation for robust speech recognition
The application of Missing Data Techniques (MDT) to increase the noise robustness of HMM/GMM-based large vocabulary speech recognizers is hampered by a large computational burden. The likelihood evaluations imply solving many constrained least squares (CLSQ) optimization problems. As an alternative, researchers have proposed frontend MDT or have made oversimplifying independence assumptions for the backend acoustic model. In this article, we propose a fast Multi-Candidate (MC) approach that solves the per-Gaussian CLSQ problems approximately by selecting the best from a small set of candidate solutions, which are generated as the MDT solutions on a reduced set of cluster Gaussians. Experiments show that the MC MDT runs as fast as the uncompensated recognizer while achieving the accuracy of the full backend optimization approach. The experiments also show that exploiting the more accurate acoustic model of the backend does pay off in terms of accuracy when compared to frontend MDT. © 2012 Wang and Van hamme; licensee Springer.
Wang Y., Van hamme H., "Multi-candidate missing data imputation for robust speech recognition", EURASIP Journal on Audio, Speech, and Music Processing, vol. 17, 20 pp., 2012. Status: published.
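The candidate-selection idea in the abstract above can be sketched roughly as follows. This is an illustration under strong assumptions, not the authors' implementation. It assumes the common missing-data bound that an unreliable clean feature cannot exceed the noisy observation, and it treats the cluster Gaussians as diagonal so that each candidate has a closed form. Each backend Gaussian is scored by its quadratic (Mahalanobis) cost with a full precision matrix. All names are hypothetical.

```python
def clamp_candidate(cluster_mean, observed, reliable):
    """Candidate imputation from one cluster Gaussian (diagonal assumption).

    Reliable components copy the observation; unreliable ones take the
    cluster mean, bounded above by the observed value.
    """
    return [y if ok else min(m, y)
            for m, y, ok in zip(cluster_mean, observed, reliable)]


def quadratic_cost(x, mean, precision):
    """(x - mean)^T P (x - mean) for a full precision matrix P."""
    d = [a - b for a, b in zip(x, mean)]
    n = len(d)
    return sum(d[i] * precision[i][j] * d[j]
               for i in range(n) for j in range(n))


def best_candidate(candidates, mean, precision):
    """Pick the candidate with the lowest cost under one backend Gaussian,
    instead of solving that Gaussian's constrained problem exactly."""
    return min(candidates, key=lambda x: quadratic_cost(x, mean, precision))


# Toy example: two features, the second one flagged as unreliable.
observed = [2.0, 1.0]
reliable = [True, False]
cluster_means = [[0.0, 0.0], [3.0, 3.0]]
candidates = [clamp_candidate(m, observed, reliable) for m in cluster_means]
identity = [[1.0, 0.0], [0.0, 1.0]]
choice = best_candidate(candidates, [2.0, 0.9], identity)
```

The design point is that the candidates are computed once per frame from a small set of clusters, while each of the many backend Gaussians only evaluates a few quadratic costs.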
Development of the Carbon Nanotube Thermoacoustic Loudspeaker
Traditional speakers make sound by attaching a coil to a cone and moving that coil back and forth in a magnetic field (also known as moving coil loudspeakers). The physics behind how to generate sound via this velocity boundary condition has been largely unchanged for over a hundred years. Interestingly, around the time moving coil loudspeakers were first investigated, the idea of using heat to generate sound was also known. These thermoacoustic speakers heat and cool a thin material at acoustic frequencies to generate the pressure wave (i.e. they use a thermal boundary condition). Unfortunately, when the thermoacoustic principle was initially discovered, there was no material with the right properties to heat and cool fast enough. Carbon nanotube (CNT) loudspeakers first generated sound early in the 21st century. At that time there were many unanswered questions about their place in the sound generation toolbox of an engineer.
The main goal of this dissertation was to continue the development of the CNT loudspeaker with a focus on practical usage for an acoustic engineer. Prior to 2014, when this effort began, most of the published development work came from material scientists, and the objective acoustic performance data they presented was not useful beyond the scope of each particular publication. For example, low sound pressure levels measured in the nearfield at low power inputs were a common metric. Therefore, this effort had three main objectives, with emphasis placed on acquiring data at levels and in nomenclature that would be useful to acoustic engineers so they could bring the technology to market if it proved adequate.
Investigation into the true power efficiency of CNT loudspeakers
Investigation into alternative methods to linearize the pressure response of CNT loudspeakers
Investigation into the sound quality of CNT loudspeakers
Overall, it was found that CNT loudspeakers are approximately four orders of magnitude less power efficient than traditional moving coil loudspeakers. The non-linear pressure output of the CNT loudspeakers can be linearized with a variety of drive signal processing methods, but the selection of which method to use depends on a variety of factors (e.g. the amplification architecture available). In general, all methods studied have power efficiencies of the same order of magnitude, but the direct current offset and amplitude modulation drive signal processing methods are superior in terms of sound quality.
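To illustrate why a direct current offset can linearize the output, the sketch below uses a simplified model that is an assumption here, not a result from the dissertation: the radiated pressure is taken to follow the Joule heating power, which is proportional to the square of the drive voltage (unit resistance). Under that model a pure sine drive at frequency f produces output at 2f only, while adding an offset introduces a dominant term at f. The function names and parameter values are illustrative.

```python
import math


def sine_drive(freq, sample_rate, n, amplitude=1.0, offset=0.0):
    """Sampled drive voltage: offset + amplitude * sin(2*pi*freq*t)."""
    return [offset + amplitude * math.sin(2.0 * math.pi * freq * i / sample_rate)
            for i in range(n)]


def heating_power(drive):
    """Joule heating power for unit resistance: p(t) = v(t)^2 (assumed model)."""
    return [v * v for v in drive]


def tone_amplitude(signal, freq, sample_rate):
    """Amplitude of the component at freq, by correlation with sine and cosine.

    Exact when the signal spans a whole number of periods of freq.
    """
    n = len(signal)
    re = sum(s * math.cos(2.0 * math.pi * freq * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2.0 * math.pi * freq * i / sample_rate)
             for i, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n


SAMPLE_RATE = 8000
N = 8000       # one second, a whole number of periods at 100 Hz
F = 100.0

# Pure AC drive: sin^2 = 0.5 - 0.5*cos(2wt), so the output sits at 2f.
ac_power = heating_power(sine_drive(F, SAMPLE_RATE, N))

# DC offset of 2: (2 + sin)^2 = 4 + 4*sin + sin^2, so f now dominates.
dc_power = heating_power(sine_drive(F, SAMPLE_RATE, N, offset=2.0))
```

Comparing `tone_amplitude(ac_power, F, SAMPLE_RATE)` with `tone_amplitude(dc_power, F, SAMPLE_RATE)` shows the fundamental appearing only in the offset case. The constant term in the offset case is power spent on steady heating rather than sound, one reason drive method and efficiency are linked in this model.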