15 research outputs found
Deep Learning for Black-Box Modeling of Audio Effects
Virtual analog modeling of audio effects consists of emulating the sound of an audio processor reference device. This digital simulation is normally done by designing mathematical models of these systems. It is often difficult because it seeks to accurately model all components within the effect unit, which usually contains various nonlinearities and time-varying components. Most existing methods for audio effects modeling are either simplified or optimized to a very specific circuit or type of audio effect and cannot be efficiently translated to other types of audio effects. Recently, deep neural networks have been explored as black-box modeling strategies to solve this task, i.e., by using only input–output measurements. We analyse different state-of-the-art deep learning models based on convolutional and recurrent neural networks, feedforward WaveNet architectures and we also introduce a new model based on the combination of the aforementioned models. Through objective perceptual-based metrics and subjective listening tests we explore the performance of these models when modeling various analog audio effects. Thus, we show virtual analog models of nonlinear effects, such as a tube preamplifier; nonlinear effects with memory, such as a transistor-based limiter and nonlinear time-varying effects, such as the rotating horn and rotating woofer of a Leslie speaker cabinet
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
Many real-world applications require the prediction of long sequence
time-series, such as electricity consumption planning. Long sequence
time-series forecasting (LSTF) demands a high prediction capacity of the model,
which is the ability to capture precise long-range dependency coupling between
output and input efficiently. Recent studies have shown the potential of
Transformer to increase the prediction capacity. However, there are several
severe issues with Transformer that prevent it from being directly applicable
to LSTF, including quadratic time complexity, high memory usage, and inherent
limitation of the encoder-decoder architecture. To address these issues, we
design an efficient transformer-based model for LSTF, named Informer, with
three distinctive characteristics: (i) a self-attention mechanism,
which achieves in time complexity and memory usage, and has
comparable performance on sequences' dependency alignment. (ii) the
self-attention distilling highlights dominating attention by halving cascading
layer input, and efficiently handles extreme long input sequences. (iii) the
generative style decoder, while conceptually simple, predicts the long
time-series sequences at one forward operation rather than a step-by-step way,
which drastically improves the inference speed of long-sequence predictions.
Extensive experiments on four large-scale datasets demonstrate that Informer
significantly outperforms existing methods and provides a new solution to the
LSTF problem.Comment: 8 pages (main), 5 pages (appendix) and to be appeared in AAAI202
Grapheme-to-Phoneme Conversion with Convolutional Neural Networks
Grapheme-to-phoneme (G2P) conversion is the process of generating pronunciation for words based on their written form. It has a highly essential role for natural language processing, text-to-speech synthesis and automatic speech recognition systems. In this paper, we investigate convolutional neural networks (CNN) for G2P conversion. We propose a novel CNN-based sequence-to-sequence (seq2seq) architecture for G2P conversion. Our approach includes an end-to-end CNN G2P conversion with residual connections, furthermore, a model, which utilizes a convolutional neural network (with and without residual connections) as encoder and Bi-LSTM as a decoder. We compare our approach with state-of-the-art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM. Training and inference times, phoneme and word error rates were evaluated on the public CMUDict dataset for US English, and the best performing convolutional neural network based architecture was also evaluated on the NetTalk dataset. Our method approaches the accuracy of previous state-of-the-art results in terms of phoneme error rate