Probabilistic Modelling of Signal Mixtures with Differentiable Dictionaries
We introduce a novel way to incorporate prior information into (semi-)
supervised non-negative matrix factorization, which we call differentiable
dictionary search. It enables general, highly flexible and principled modelling
of mixtures where non-linear sources are linearly mixed. We study its behavior
on an audio decomposition task, and conduct an extensive, highly controlled
study of its modelling capabilities.
Comment: Published in the Proceedings of the 29th European Signal Processing
Conference (EUSIPCO 2021), Dublin, Ireland, August 23-27, 2021 (IEEE),
441-44
ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs
In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding
Prompt-and-Refine strategy (Figure 3), two simple but effective
training-free methods to decrease the Token Display Time (TDT) of
streaming ASR models without any accuracy loss. The core idea of
ZeroPrompt is to append zeroed content to each chunk during inference, which
acts like a prompt to encourage the model to predict future tokens even before
they are spoken. We argue that streaming acoustic encoders naturally have the
modeling ability of Masked Language Models, and our experiments demonstrate that
ZeroPrompt is cheap to engineer and can be applied to streaming acoustic
encoders on any dataset without any accuracy loss. Specifically, compared with
our baseline models, we achieve a 350-700 ms reduction in First Token
Display Time (TDT-F) and a 100-400 ms reduction in Last Token Display Time
(TDT-L), with theoretically and experimentally equal WER on both the Aishell-1
and Librispeech datasets.
Comment: accepted by Interspeech 202
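The zero-appending step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `append_zero_prompt`, the `(frames, features)` chunk layout, and the parameter `num_zero_frames` are all assumptions made for illustration, and the streaming encoder itself is omitted.

```python
import numpy as np


def append_zero_prompt(chunk: np.ndarray, num_zero_frames: int) -> np.ndarray:
    """Append all-zero frames to the end of a feature chunk.

    Hypothetical helper: `chunk` is assumed to have shape (frames, features),
    and `num_zero_frames` is an illustrative name for the amount of zeroed
    content appended before the chunk is passed to a streaming encoder.
    """
    if num_zero_frames < 0:
        raise ValueError("num_zero_frames must be non-negative")
    zeros = np.zeros((num_zero_frames, chunk.shape[1]), dtype=chunk.dtype)
    # Concatenate along the time axis: real frames first, zeroed frames last.
    return np.concatenate([chunk, zeros], axis=0)


# Example: a 16-frame chunk of 80-dimensional features, extended by 8 zero frames.
chunk = np.ones((16, 80), dtype=np.float32)
prompted = append_zero_prompt(chunk, 8)
```

In this sketch the real frames are left untouched, so any prediction the encoder makes for the zeroed region corresponds to audio that has not yet arrived.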
RawNet: Fast End-to-End Neural Vocoder
Neural network-based vocoders have recently demonstrated a powerful
ability to synthesize high-quality speech. These models usually generate
samples by conditioning on spectral features, such as the Mel-spectrum.
However, these features are extracted by a speech analysis module that includes
processing based on human knowledge. In this work, we propose RawNet,
a truly end-to-end neural vocoder, which uses a coder network to learn a
higher-level representation of the signal, and an autoregressive voder network
to generate speech sample by sample. The coder and voder together act like an
auto-encoder network, and can be jointly trained directly on raw waveforms
without any human-designed features. Experiments on the Copy-Synthesis
tasks show that RawNet achieves synthesized speech quality comparable
to LPCNet, with a smaller model architecture and faster speech generation at
the inference step.
Comment: Submitted to Interspeech 2019, Graz, Austri
Differentiable Dictionary Search: Integrating Linear Mixing with Deep Non-Linear Modelling for Audio Source Separation
This paper describes several improvements to a new method for signal
decomposition that we recently formulated under the name of Differentiable
Dictionary Search (DDS). The fundamental idea of DDS is to exploit a class of
powerful deep invertible density estimators called normalizing flows, to model
the dictionary in a linear decomposition method such as NMF, effectively
creating a bijection between the space of dictionary elements and the
associated probability space, allowing a differentiable search through the
dictionary space, guided by the estimated densities. As the initial formulation
was a proof of concept with some practical limitations, we will present several
steps towards making it scalable, hoping to improve both the computational
complexity of the method and its signal decomposition capabilities. As a
testbed for experimental evaluation, we choose the task of frame-level piano
transcription, where the signal is to be decomposed into sources whose activity
is attributed to individual piano notes. To highlight the impact of improved
non-linear modelling of sources, we compare variants of our method to a linear
overcomplete NMF baseline. Experimental results will show that even in the
absence of additional constraints, our models produce increasingly sparse and
precise decompositions, according to two pertinent evaluation measures.
Comment: Published in the Proceedings of the 24th International Congress on
Acoustics (ICA 2022), Gyeongju, Korea, October 24-28, 202
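For orientation, the kind of linear NMF decomposition that the abstract uses as a baseline can be sketched as follows. This is a generic illustration, not the paper's baseline or DDS itself: it assumes a fixed, known dictionary `W`, a Euclidean (Frobenius) loss, and the standard multiplicative update rule for the activations. The function name `nmf_activations` and its parameters are hypothetical.

```python
import numpy as np


def nmf_activations(V: np.ndarray, W: np.ndarray, n_iter: int = 200,
                    eps: float = 1e-12) -> np.ndarray:
    """Estimate non-negative activations H such that V is approximately W @ H.

    Assumptions for this sketch: V (features x frames) and W (features x atoms)
    are non-negative, W is held fixed, and the loss is the Frobenius norm,
    minimized with multiplicative updates.
    """
    H = np.ones((W.shape[1], V.shape[1]))
    WtV = W.T @ V  # constant across iterations because W is fixed
    WtW = W.T @ W
    for _ in range(n_iter):
        # Multiplicative update: keeps H non-negative by construction.
        H *= WtV / (WtW @ H + eps)
    return H


# Example: recover activations for a synthetic mixture built from 5 atoms.
rng = np.random.default_rng(0)
W = rng.random((20, 5))
V = W @ rng.random((5, 30))
H = nmf_activations(V, W)
```

In a transcription setting such as the one described, each dictionary atom would correspond to a source (for example a piano note), and the rows of `H` would indicate frame-level activity.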