Minimum-latency Time-frequency Analysis Using Asymmetric Window Functions
We study the retrieval of real-time dynamics from a time series via
time-frequency (TF) analysis with a minimal-latency guarantee. Distinct from
the well-known definition of intrinsic latency in filter design, we provide a
rigorous definition of intrinsic latency for different time-frequency
representations (TFRs), including the short-time Fourier transform (STFT), the
synchrosqueezing transform (SST), and the reassignment method (RM). To achieve
minimal latency, we propose a systematic method, based on the concept of
minimum phase, for constructing an asymmetric window from a well-designed
symmetric one, provided the window satisfies some weak conditions. We show
theoretically that the TFR determined by the SST with the constructed
asymmetric window does have a smaller intrinsic latency. Finally, we study the
music onset detection problem to demonstrate the strength of the proposed
algorithm.
Comment: 29 pages, 7 figures
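The minimum-phase idea can be illustrated with the textbook real-cepstrum (homomorphic) construction. This construction keeps a window's magnitude spectrum but replaces its phase with the minimum phase, which pushes the window's energy toward its start. This is a minimal sketch assuming NumPy. The function names, the Gaussian example window, and the `oversample` and `floor` parameters are illustrative choices. The sketch does not reproduce the paper's own construction or check its weak conditions.

```python
import numpy as np


def minimum_phase_window(w, oversample=16, floor=1e-12):
    """Length-len(w) minimum-phase counterpart of window w (illustrative sketch).

    Generic homomorphic (real-cepstrum) method, assumed here for illustration;
    it is not necessarily the construction used in the paper.
    """
    n = len(w)
    nfft = oversample * n                      # large FFT limits cepstral aliasing
    mag = np.abs(np.fft.fft(w, nfft))          # |W(f)|, zero-padded
    # Real cepstrum of the log magnitude (floored to avoid log(0)).
    cep = np.fft.ifft(np.log(np.maximum(mag, floor))).real
    # Fold onto non-negative quefrencies: a causal cepstrum means minimum phase.
    fold = np.zeros(nfft)
    fold[0] = cep[0]
    fold[1:nfft // 2] = 2.0 * cep[1:nfft // 2]
    fold[nfft // 2] = cep[nfft // 2]
    # exp(FFT of causal cepstrum) = minimum-phase spectrum with (almost) |W|.
    w_min = np.fft.ifft(np.exp(np.fft.fft(fold))).real
    return w_min[:n]                           # keep the original window length


def energy_centroid(x):
    """Sample index of the energy centroid of x."""
    e = np.asarray(x) ** 2
    return float(np.sum(np.arange(len(e)) * e) / np.sum(e))


N = 64
t = np.arange(N) - (N - 1) / 2
w_sym = np.exp(-0.5 * (t / (N / 8)) ** 2)   # example symmetric Gaussian window
w_asym = minimum_phase_window(w_sym)        # asymmetric, front-loaded counterpart
print(energy_centroid(w_sym), energy_centroid(w_asym))
```

The symmetric window's energy centroid sits at its midpoint. The minimum-phase version concentrates energy in its first samples, which is the property that makes such a window attractive when the most recent samples must dominate the analysis.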
Unsupervised Low Latency Speech Enhancement with RT-GCC-NMF
In this paper, we present RT-GCC-NMF: a real-time (RT), two-channel blind
speech enhancement algorithm that combines the non-negative matrix
factorization (NMF) dictionary learning algorithm with the generalized
cross-correlation (GCC) spatial localization method. Using a pre-learned
universal NMF dictionary, RT-GCC-NMF operates in a frame-by-frame fashion by
associating individual dictionary atoms with target speech or background
interference based on their estimated time delays of arrival (TDOAs). We
evaluate RT-GCC-NMF on two-channel mixtures of speech and real-world noise from
the Signal Separation Evaluation Campaign (SiSEC). We demonstrate that this
approach generalizes to new speakers, acoustic environments, and recording
setups from very little training data, and outperforms all but one of the
algorithms from the SiSEC challenge in terms of overall Perceptual Evaluation
methods for Audio Source Separation (PEASS) scores and compares favourably to
the ideal binary mask baseline. Over a wide range of input signal-to-noise
ratios (SNRs), we show that this approach simultaneously improves the PEASS and
the SNR-based Blind Source Separation (BSS) Eval objective quality metrics, as
well as the short-time objective intelligibility (STOI) and extended STOI
(ESTOI) objective speech intelligibility metrics. A flexible soft-masking
function in the space of NMF activation coefficients offers real-time control
of the trade-off between interference suppression and target-speaker fidelity.
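To make the frame-by-frame idea concrete, here is a minimal NumPy sketch under stated assumptions. It has three steps:

1. GCC-PHAT, a common GCC weighting, estimates a target TDOA.
2. NMF activations for one magnitude frame are computed against a fixed, pre-learned dictionary with Euclidean multiplicative updates.
3. A soft mask down-weights atoms whose TDOA lies far from the target.

The per-atom TDOAs, the Gaussian weighting with its `width` parameter, the toy dictionary, and all function names are hypothetical. None of this reproduces RT-GCC-NMF itself.

```python
import numpy as np


def gcc_phat_tdoa(x1, x2, fs, max_lag=None):
    """Delay of x2 relative to x1 in seconds, estimated with GCC-PHAT."""
    n = len(x1) + len(x2)                        # zero-pad: linear correlation
    cross = np.fft.rfft(x2, n) * np.conj(np.fft.rfft(x1, n))
    cross /= np.maximum(np.abs(cross), 1e-12)    # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_lag = n // 2 if max_lag is None else max(1, min(max_lag, n // 2))
    cc = np.concatenate((cc[-max_lag:], cc[: max_lag + 1]))  # lags -max_lag..max_lag
    return (int(np.argmax(np.abs(cc))) - max_lag) / fs


def nmf_activations(v, W, n_iter=200, eps=1e-12):
    """Nonnegative h with v ~= W @ h for a FIXED dictionary W (Lee-Seung updates)."""
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        h *= (W.T @ v) / (W.T @ (W @ h) + eps)
    return h


def soft_mask(W, h, atom_tdoas, target_tdoa, width, eps=1e-12):
    """Hypothetical soft mask in [0, 1]: atoms are weighted by a Gaussian of
    their TDOA distance to the target, then a Wiener-like ratio is formed."""
    g = np.exp(-0.5 * ((atom_tdoas - target_tdoa) / width) ** 2)
    return (W @ (g * h)) / (W @ h + eps)


rng = np.random.default_rng(0)
fs = 16000
s = rng.standard_normal(2048)
left, right = s[3:], s[:-3]                      # right lags left by 3 samples
target_tdoa = gcc_phat_tdoa(left, right, fs)     # estimated target delay

# Toy dictionary (F=4 bins, K=2 atoms) and one toy magnitude frame.
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
v = W @ np.array([1.0, 2.0])
h = nmf_activations(v, W)
atom_tdoas = np.array([target_tdoa, -target_tdoa])   # hypothetical per-atom TDOAs
mask = soft_mask(W, h, atom_tdoas, target_tdoa, width=5e-5)
print(target_tdoa, h, mask)
```

In a real-time loop, such a mask would multiply the mixture's complex STFT frame before resynthesis. With this particular weighting, a smaller `width` yields a harder, more suppressive mask.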
Finally, we use an asymmetric short-time Fourier transform (STFT) to reduce the
inherent algorithmic latency of RT-GCC-NMF from 64 ms to 2 ms with no loss in
performance. We demonstrate that latencies within the tolerable range for
hearing aids are possible on current hardware platforms.
Comment: Accepted for publication in the IEEE JSTSP Special Issue on Data
Science: Machine Learning for Audio Signal Processing
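One common way to build an asymmetric STFT uses a weighted overlap-add (WOLA) analysis/synthesis window pair. A long analysis window with a short falling edge is combined with a synthesis window that is nonzero only on the last few samples of the frame. Latency is then governed by that short synthesis window rather than by the full frame. The sketch below assumes NumPy and square-root periodic Hann segments. The window design, names, and parameters (`n_frame`, `m`) are illustrative and not necessarily the windows used in RT-GCC-NMF.

```python
import numpy as np


def periodic_hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to 1."""
    n = np.arange(length)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / length)


def asymmetric_window_pair(n_frame, m):
    """Hypothetical asymmetric analysis/synthesis pair for hop size m.

    Analysis: slow sqrt-Hann rise over n_frame - m samples, fast sqrt-Hann
    fall over the last m samples. Synthesis: nonzero only on the last 2*m
    samples, chosen so that analysis * synthesis = periodic Hann(2*m).
    """
    if not 0 < 2 * m < n_frame:
        raise ValueError("need 0 < 2*m < n_frame")
    rise = np.sqrt(periodic_hann(2 * (n_frame - m))[: n_frame - m])
    fall = np.sqrt(periodic_hann(2 * m)[m:])
    w_a = np.concatenate((rise, fall))
    w_s = np.zeros(n_frame)
    tail = slice(n_frame - 2 * m, n_frame)
    w_s[tail] = periodic_hann(2 * m) / w_a[tail]   # w_a > 0 on this tail
    return w_a, w_s


def wola_roundtrip(x, w_a, w_s, hop):
    """STFT analysis followed by overlap-add resynthesis, with no processing."""
    n = len(w_a)
    y = np.zeros(len(x))
    for start in range(0, len(x) - n + 1, hop):
        spec = np.fft.rfft(w_a * x[start:start + n])   # analysis frame
        frame = np.fft.irfft(spec, n)                  # processing would go here
        y[start:start + n] += w_s * frame              # synthesis + overlap-add
    return y


rng = np.random.default_rng(0)
n_frame, m = 512, 32
w_a, w_s = asymmetric_window_pair(n_frame, m)
x = rng.standard_normal(4096)
y = wola_roundtrip(x, w_a, w_s, hop=m)
# Away from the signal edges, every sample should be reconstructed.
print(np.max(np.abs(y[n_frame:-n_frame] - x[n_frame:-n_frame])))
```

With `n_frame=512` and `m=32`, the analysis frame keeps 512-point frequency resolution while the synthesis window spans only 64 samples. Perfect reconstruction follows because the window product is a periodic Hann of length `2*m`, which overlap-adds to one at hop `m`.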