1,977 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that stills
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks
Convolutional Deblurring for Natural Imaging
In this paper, we propose a novel design of image deblurring in the form of
one-shot convolution filtering that can directly convolve with naturally
blurred images for restoration. The problem of optical blurring is a common
disadvantage to many imaging applications that suffer from optical
imperfections. Despite numerous deconvolution methods that blindly estimate
blurring in either inclusive or exclusive forms, they are practically
challenging due to high computational cost and low image reconstruction
quality. Both conditions of high accuracy and high speed are prerequisites for
high-throughput imaging platforms in digital archiving. In such platforms,
deblurring is required after image acquisition before being stored, previewed,
or processed for high-level interpretation. Therefore, on-the-fly correction of
such images is important to avoid possible time delays, mitigate computational
expenses, and increase image perception quality. We bridge this gap by
synthesizing a deconvolution kernel as a linear combination of Finite Impulse
Response (FIR) even-derivative filters that can be directly convolved with
blurry input images to boost the frequency fall-off of the Point Spread
Function (PSF) associated with the optical blur. We employ a Gaussian low-pass
filter to decouple the image denoising problem for image edge deblurring.
Furthermore, we propose a blind approach to estimate the PSF statistics for two
Gaussian and Laplacian models that are common in many imaging pipelines.
Thorough experiments are designed to test and validate the efficiency of the
proposed method using 2054 naturally blurred images across six imaging
applications and seven state-of-the-art deconvolution methods.Comment: 15 pages, for publication in IEEE Transaction Image Processin
Listening for Sirens: Locating and Classifying Acoustic Alarms in City Scenes
This paper is about alerting acoustic event detection and sound source
localisation in an urban scenario. Specifically, we are interested in spotting
the presence of horns, and sirens of emergency vehicles. In order to obtain a
reliable system able to operate robustly despite the presence of traffic noise,
which can be copious, unstructured and unpredictable, we propose to treat the
spectrograms of incoming stereo signals as images, and apply semantic
segmentation, based on a Unet architecture, to extract the target sound from
the background noise. In a multi-task learning scheme, together with signal
denoising, we perform acoustic event classification to identify the nature of
the alerting sound. Lastly, we use the denoised signals to localise the
acoustic source on the horizon plane, by regressing the direction of arrival of
the sound through a CNN architecture. Our experimental evaluation shows an
average classification rate of 94%, and a median absolute error on the
localisation of 7.5{\deg} when operating on audio frames of 0.5s, and of
2.5{\deg} when operating on frames of 2.5s. The system offers excellent
performance in particularly challenging scenarios, where the noise level is
remarkably high.Comment: 6 pages, 9 figure
Constrained Overcomplete Analysis Operator Learning for Cosparse Signal Modelling
We consider the problem of learning a low-dimensional signal model from a
collection of training samples. The mainstream approach would be to learn an
overcomplete dictionary to provide good approximations of the training samples
using sparse synthesis coefficients. This famous sparse model has a less well
known counterpart, in analysis form, called the cosparse analysis model. In
this new model, signals are characterised by their parsimony in a transformed
domain using an overcomplete (linear) analysis operator. We propose to learn an
analysis operator from a training corpus using a constrained optimisation
framework based on L1 optimisation. The reason for introducing a constraint in
the optimisation framework is to exclude trivial solutions. Although there is
no final answer here for which constraint is the most relevant constraint, we
investigate some conventional constraints in the model adaptation field and use
the uniformly normalised tight frame (UNTF) for this purpose. We then derive a
practical learning algorithm, based on projected subgradients and
Douglas-Rachford splitting technique, and demonstrate its ability to robustly
recover a ground truth analysis operator, when provided with a clean training
set, of sufficient size. We also find an analysis operator for images, using
some noisy cosparse signals, which is indeed a more realistic experiment. As
the derived optimisation problem is not a convex program, we often find a local
minimum using such variational methods. Some local optimality conditions are
derived for two different settings, providing preliminary theoretical support
for the well-posedness of the learning problem under appropriate conditions.Comment: 29 pages, 13 figures, accepted to be published in TS
Recent Progress in Image Deblurring
This paper comprehensively reviews the recent development of image
deblurring, including non-blind/blind, spatially invariant/variant deblurring
techniques. Indeed, these techniques share the same objective of inferring a
latent sharp image from one or several corresponding blurry images, while the
blind deblurring techniques are also required to derive an accurate blur
kernel. Considering the critical role of image restoration in modern imaging
systems to provide high-quality images under complex environments such as
motion, undesirable lighting conditions, and imperfect system components, image
deblurring has attracted growing attention in recent years. From the viewpoint
of how to handle the ill-posedness which is a crucial issue in deblurring
tasks, existing methods can be grouped into five categories: Bayesian inference
framework, variational methods, sparse representation-based methods,
homography-based modeling, and region-based methods. In spite of achieving a
certain level of development, image deblurring, especially the blind case, is
limited in its success by complex application conditions which make the blur
kernel hard to obtain and be spatially variant. We provide a holistic
understanding and deep insight into image deblurring in this review. An
analysis of the empirical evidence for representative methods, practical
issues, as well as a discussion of promising future directions are also
presented.Comment: 53 pages, 17 figure
Single-Channel Speech Enhancement with Deep Complex U-Networks and Probabilistic Latent Space Models
In this paper, we propose to extend the deep, complex U-Network architecture
for speech enhancement by incorporating a probabilistic (i.e., variational)
latent space model. The proposed model is evaluated against several ablated
versions of itself in order to study the effects of the variational latent
space model, complex-value processing, and self-attention. Evaluation on the
MS-DNS 2020 and Voicebank+Demand datasets yields consistently high performance.
E.g., the proposed model achieves an SI-SDR of up to 20.2 dB, about 0.5 to 1.4
dB higher than its ablated version without probabilistic latent space, 2-2.4 dB
higher than WaveUNet, and 6.7 dB above PHASEN. Compared to real-valued
magnitude spectrogram processing with a variational U-Net, the complex U-Net
achieves an improvement of up to 4.5 dB SI-SDR. Complex spectrum encoding as
magnitude and phase yields best performance in anechoic conditions whereas real
and imaginary part representation results in better generalization to (novel)
reverberation conditions, possibly due to the underlying physics of sound
- …