A General Unfolding Speech Enhancement Method Motivated by Taylor's Theorem
While deep neural networks have facilitated significant advancements in the
field of speech enhancement, most existing methods are developed following
either empirical or relatively blind criteria, lacking adequate guidelines in
pipeline design. Inspired by Taylor's theorem, we propose a general unfolding
framework for both single- and multi-channel speech enhancement tasks.
Concretely, we formulate the complex spectrum recovery into the spectral
magnitude mapping in the neighborhood space of the noisy mixture, in which an
unknown sparse term is introduced and applied for phase modification in
advance. Based on that, the mapping function is decomposed into the
superimposition of the 0th-order and high-order polynomials in Taylor's series,
where the former coarsely removes the interference in the magnitude domain and
the latter progressively complements the remaining spectral detail in the
complex spectrum domain. In addition, we study the relation between adjacent
order terms and reveal that each high-order term can be recursively estimated
from its lower-order term. We then propose to evaluate each high-order term
using a surrogate function with trainable weights, so that the whole
system can be trained in an end-to-end manner. Given that the proposed
framework is devised based on Taylor's theorem, it possesses improved internal
flexibility. Extensive experiments are conducted on WSJ0-SI84, DNS-Challenge,
Voicebank+Demand, spatialized Librispeech, and L3DAS22 multi-channel speech
enhancement challenge datasets. Quantitative results show that the proposed
approach yields competitive performance compared with existing top-performing
approaches in terms of multiple objective metrics.
Comment: Submitted to TASLP, revised version, 17 pages
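As a rough illustration of the idea that each higher-order Taylor term can be obtained recursively from the term below it, the sketch below expands a scalar function, exp(x), around zero. This is only an analogy: the function name `taylor_exp` is hypothetical, and the abstract's framework replaces such closed-form terms with trainable surrogate functions operating on spectra, which is not reproduced here.

```python
import math


def taylor_exp(x: float, order: int) -> float:
    """Approximate exp(x) around 0 using Taylor terms up to `order`."""
    term = 1.0    # 0th-order term: the coarse estimate
    total = term
    for n in range(1, order + 1):
        # The n-th term is derived from the (n-1)-th term: x^n/n! = (x^(n-1)/(n-1)!) * x/n
        term = term * x / n
        total += term
    return total


if __name__ == "__main__":
    for order in (0, 1, 2, 5, 10):
        approx = taylor_exp(1.0, order)
        print(order, approx, abs(approx - math.e))
```

The 0th-order term alone gives a coarse estimate, and each added term refines it, which mirrors the coarse-then-detail split described in the abstract.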
Artificial Intelligence in the Creative Industries: A Review
This paper reviews the current state of the art in Artificial Intelligence
(AI) technologies and applications in the context of the creative industries. A
brief background of AI, and specifically Machine Learning (ML) algorithms, is
provided, including Convolutional Neural Networks (CNNs), Generative Adversarial
Networks (GANs), Recurrent Neural Networks (RNNs) and Deep Reinforcement
Learning (DRL). We categorise creative applications into five groups related to
how AI technologies are used: i) content creation, ii) information analysis,
iii) content enhancement and post production workflows, iv) information
extraction and enhancement, and v) data compression. We critically examine the
successes and limitations of this rapidly advancing technology in each of these
areas. We further differentiate between the use of AI as a creative tool and
its potential as a creator in its own right. We foresee that, in the near
future, machine learning-based AI will be adopted widely as a tool or
collaborative assistant for creativity. In contrast, we observe that the
successes of machine learning in domains with fewer constraints, where AI is
the 'creator', remain modest. The potential of AI (or its developers) to win
awards for its original creations in competition with human creatives is also
limited, based on contemporary technologies. We therefore conclude that, in the
context of creative industries, maximum benefit from AI will be derived where
its focus is human centric -- where it is designed to augment, rather than
replace, human creativity.
Speech Enhancement with Improved Deep Learning Methods
In real-world environments, speech signals are often corrupted by ambient noises during their acquisition, leading to degradation of quality and intelligibility of the speech for a listener. As one of the central topics in the speech processing area, speech enhancement aims to recover clean speech from such a noisy mixture. Many traditional speech enhancement methods designed based on statistical signal processing have been proposed and widely used in the past. However, the performance of these methods was limited and thus failed in sophisticated acoustic scenarios. Over the last decade, deep learning as a primary tool to develop data-driven information systems has led to revolutionary advances in speech enhancement. In this context, speech enhancement is treated as a supervised learning problem, which does not suffer from issues faced by traditional methods. This supervised learning problem has three main components: input features, learning machine, and training target. In this thesis, various deep learning architectures and methods are developed to deal with the current limitations of these three components.
First, we propose a serial hybrid neural network model integrating a new low-complexity fully convolutional neural network (CNN) and a long short-term memory (LSTM) network to estimate a phase-sensitive mask for speech enhancement. Instead of using traditional acoustic features as the input of the model, a CNN is employed to automatically extract sophisticated speech features that can maximize the performance of a model. Then, an LSTM network is chosen as the learning machine to model the strong temporal dynamics of speech. The model is designed to take full advantage of the temporal dependencies and spectral correlations present in the input speech signal while keeping the model complexity low. Also, an attention technique is embedded to adaptively recalibrate the useful CNN-extracted features. Through extensive comparative experiments, we show that the proposed model significantly outperforms some known neural network-based speech enhancement methods in the presence of highly non-stationary noises, while requiring a relatively small number of model parameters compared to some commonly employed DNN-based methods.
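For readers unfamiliar with the training target named above, the following is a minimal sketch of how an ideal phase-sensitive mask is commonly computed per time-frequency bin, as |S|/|Y| * cos(theta_S - theta_Y), where S and Y are the clean and noisy complex spectra. The function names, the `eps` guard, and the clipping range [0, 1] are assumptions made for illustration; the thesis may define or bound its target differently, and in practice the mask is predicted by the network rather than computed from the clean signal.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def phase_sensitive_mask(
    clean: Sequence[complex],
    noisy: Sequence[complex],
    eps: float = 1e-8,
    clip: Tuple[float, float] = (0.0, 1.0),
) -> List[float]:
    """Ideal phase-sensitive mask for matching clean/noisy spectral bins."""
    mask = []
    for s, y in zip(clean, noisy):
        ratio = abs(s) / (abs(y) + eps)                 # magnitude ratio |S|/|Y|
        phase_diff = cmath.phase(s) - cmath.phase(y)    # theta_S - theta_Y
        value = ratio * math.cos(phase_diff)
        mask.append(min(max(value, clip[0]), clip[1]))  # assumed clipping range
    return mask


def apply_mask(mask: Sequence[float], noisy: Sequence[complex]) -> List[complex]:
    """Scale each noisy bin by its mask value, keeping the noisy phase."""
    return [m * y for m, y in zip(mask, noisy)]


if __name__ == "__main__":
    clean_bins = [1 + 0j, 0 + 1j, -1 + 0j]
    noisy_bins = [2 + 0j, 1 + 0j, 1 + 0j]
    print(phase_sensitive_mask(clean_bins, noisy_bins))
```

Unlike a pure magnitude-ratio mask, the cosine factor shrinks the mask when the clean and noisy phases disagree, which is why the target is called phase-sensitive.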
Most of the available approaches for speech enhancement using deep neural networks face a number of limitations: they do not exploit the information contained in the phase spectrum, while their high computational complexity and memory requirements make them unsuited for real-time applications. Hence, a new phase-aware composite deep neural network (PACDNN) is proposed to address these challenges. Specifically, magnitude processing with a spectral mask and phase reconstruction using the phase derivative are proposed as key subtasks of the new network to simultaneously enhance the magnitude and phase spectra. Besides, the neural network is meticulously designed to take advantage of the strong temporal and spectral dependencies of speech, while its components perform independently and in parallel to speed up the computation. The advantages of the proposed PACDNN model over some well-known DNN-based SE methods are demonstrated through extensive comparative experiments.
Considering that some acoustic scenarios could be better handled using a number of low-complexity sub-DNNs, each specifically designed to perform a particular task, we propose another very low-complexity and fully convolutional framework, performing speech enhancement in the short-time modified discrete cosine transform (STMDCT) domain. This framework is made up of two main stages: classification and mapping. In the former stage, a CNN-based network is proposed to classify the input speech based on its utterance-level attributes, i.e., signal-to-noise ratio and gender. In the latter stage, four well-trained CNNs, each specialized for a specific and simple task, transform the STMDCT of the noisy input speech to that of the clean one. Since this framework is designed to operate in the STMDCT domain, there is no need to deal with the phase information, i.e., no phase-related computation is required. Moreover, the training target length is only one-half of those in the previous chapters, leading to lower computational complexity and less demand on the mapping CNNs. Although there are multiple branches in the model, only one of the expert CNNs is active at a time, i.e., the computational burden is related only to a single branch at any time. Also, the mapping CNNs are fully convolutional, and their computations are performed in parallel, thus reducing the computational time. Moreover, this proposed framework reduces the latency by 55% compared to the models in the previous chapters. Through extensive experimental studies, it is shown that the MBSE framework not only gives a superior speech enhancement performance but also has a lower complexity compared to some existing deep learning-based methods.
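The two-stage structure described above (classify first, then run a single expert) can be sketched as a simple dispatch. Everything in this sketch is hypothetical: the class labels, the assumption that the four experts correspond to two SNR classes times two genders, and the placeholder experts that merely scale their input in place of trained mapping CNNs.

```python
from typing import Callable, Dict, List, Tuple

Signal = List[float]                  # stand-in for STMDCT coefficients
Expert = Callable[[Signal], Signal]   # maps noisy coefficients to enhanced ones
ClassKey = Tuple[str, str]            # (snr_class, gender), hypothetical labels


def make_placeholder_expert(gain: float) -> Expert:
    """Stand-in for a trained mapping CNN: it only scales the input."""
    def expert(signal: Signal) -> Signal:
        return [gain * value for value in signal]
    return expert


# Four experts, assumed here to cover 2 SNR classes x 2 genders.
EXPERTS: Dict[ClassKey, Expert] = {
    ("low_snr", "female"): make_placeholder_expert(0.25),
    ("low_snr", "male"): make_placeholder_expert(0.5),
    ("high_snr", "female"): make_placeholder_expert(0.75),
    ("high_snr", "male"): make_placeholder_expert(1.0),
}


def enhance(signal: Signal, classify: Callable[[Signal], ClassKey]) -> Signal:
    """Stage 1 picks a class; stage 2 runs only the matching expert."""
    key = classify(signal)
    return EXPERTS[key](signal)


if __name__ == "__main__":
    # A fixed classifier stub stands in for the CNN-based classification stage.
    print(enhance([1.0, 2.0], lambda s: ("low_snr", "male")))
```

Because only the selected expert is called, the per-utterance cost is that of the classifier plus one branch, regardless of how many experts exist.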
Persistence in complex systems
Persistence is an important characteristic of many complex systems in nature, related to how long the system remains at a certain state before changing to a different one. The study of complex systems' persistence involves different definitions and uses different techniques, depending on whether short-term or long-term persistence is considered. In this paper we discuss the most important definitions, concepts, methods, literature and latest results on persistence in complex systems. Firstly, the most used definitions of persistence in short-term and long-term cases are presented. The most relevant methods to characterize persistence are then discussed in both cases. A complete literature review is also carried out. We also present and discuss some relevant results on persistence, and give empirical evidence of performance in different detailed case studies, for both short-term and long-term persistence. A perspective on the future of persistence concludes the work.
This research has been partially supported by the project PID2020-115454GB-C21 of the Spanish Ministry of Science and Innovation (MICINN). This research has also been partially supported by Comunidad de Madrid, PROMINT-CM project (grant ref: P2018/EMT-4366). J. Del Ser would like to thank the Basque Government for its funding support through the EMAITEK and ELKARTEK programs (3KIA project, KK-2020/00049), as well as the consolidated research group MATHMODE (ref. T1294-19). GCV work is supported by the European Research Council (ERC) under the ERC-CoG-2014 SEDAL Consolidator grant (grant agreement 647423).