Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system
Automatic meeting analysis is a fundamental technology required to let, e.g., smart devices follow and respond to our conversations. Towards optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves the source separation, speaker diarization, and source counting problems in an optimal way (in the sense that all three tasks can be jointly optimized through error back-propagation). The method was shown to handle simulated clean (noiseless and anechoic) dialog-like data well and to achieve very good performance in comparison with several conventional methods. However, it was not clear whether such an all-neural approach would generalize to more complicated real meeting data containing more spontaneously speaking speakers, severe noise, and reverberation, and how it would perform in comparison with state-of-the-art systems in such scenarios.
In this paper, we first consider practical issues involved in improving the robustness of the all-neural approach, and then experimentally show that, even in real meeting scenarios, the all-neural approach can perform effective speech enhancement and simultaneously outperform state-of-the-art systems.
Comment: 8 pages, to appear in ICASSP 2020
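To illustrate the joint-optimization idea in the abstract above, the following is a minimal, hypothetical PyTorch sketch of one shared network with three task heads whose losses can be summed so that back-propagation updates all tasks together. The module names, layer sizes, and heads are our assumptions for illustration, not the authors' architecture.

    import torch

    class JointMeetingAnalyzer(torch.nn.Module):
        """Hypothetical joint model: one shared encoder, three task heads."""
        def __init__(self, n_freq=257, hidden=256, max_speakers=4):
            super().__init__()
            self.encoder = torch.nn.LSTM(n_freq, hidden, batch_first=True)
            self.mask_head = torch.nn.Linear(hidden, n_freq * max_speakers)
            self.diar_head = torch.nn.Linear(hidden, max_speakers)
            self.count_head = torch.nn.Linear(hidden, max_speakers + 1)
            self.max_speakers = max_speakers

        def forward(self, spec):  # spec: (batch, frames, n_freq)
            h, _ = self.encoder(spec)
            b, t, _ = h.shape
            # Separation: one time-frequency mask per possible speaker.
            masks = torch.sigmoid(self.mask_head(h)).view(b, t, self.max_speakers, -1)
            # Diarization: per-frame speaker-activity probabilities.
            activity = torch.sigmoid(self.diar_head(h))
            # Counting: utterance-level logits over 0..max_speakers speakers.
            count_logits = self.count_head(h.mean(dim=1))
            return masks, activity, count_logits

    # Training would minimize a summed loss, e.g.
    #   loss = sep_loss(masks) + diar_loss(activity) + count_loss(count_logits)
    # so that error back-propagation optimizes all three tasks jointly.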
Block-Online Guided Source Separation
We propose a block-online algorithm for guided source separation (GSS). GSS is a speech separation method that uses diarization information to update the parameters of a generative model of the observed signals. Previous studies have shown that GSS performs well in multi-talker scenarios. However, it requires a large amount of computation time, which is an obstacle to deploying it in online applications. Another problem is that offline GSS is an utterance-wise algorithm, so its latency grows with the length of the utterance. With the proposed algorithm, block-wise input samples and the corresponding time annotations are concatenated with those of the preceding context and used to update the parameters. Using the context enables the algorithm to estimate time-frequency masks accurately with only one optimization iteration per block, so its latency depends not on the utterance length but on a predetermined block length. The algorithm also reduces the computation cost by updating only the parameters of the speakers active in each block and its context (see the sketch below).
Evaluation on the CHiME-6 corpus and a meeting corpus showed that the proposed algorithm achieved almost the same performance as the conventional offline GSS algorithm but with 32x faster computation, which is sufficient for real-time applications.
Comment: Accepted to SLT 2022
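The block-wise update loop described in the abstract above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the ToyMaskModel class and all names are hypothetical stand-ins for the spatial mixture model (e.g., a cACGMM) that GSS actually fits.

    import numpy as np

    class ToyMaskModel:
        """Hypothetical stand-in for the spatial mixture model fitted by GSS;
        a real implementation would run one EM iteration per update call."""
        def __init__(self, n_speakers):
            self.n_speakers = n_speakers

        def update(self, obs, active, n_iterations=1):
            # Only parameters of speakers active in the block and its context
            # are re-estimated; here we simply return flat placeholder masks.
            masks = np.zeros((self.n_speakers, obs.shape[0]))
            if len(active):
                masks[active] = 1.0 / len(active)
            return masks

    def block_online_gss(frames, activity, block_len, context_len, model):
        """frames: (T, F) STFT features; activity: (n_speakers, T) 0/1 annotations."""
        outputs = []
        for start in range(0, frames.shape[0], block_len):
            end = min(start + block_len, frames.shape[0])
            ctx = max(start - context_len, 0)
            obs = frames[ctx:end]  # current block concatenated with its context
            active = np.flatnonzero(activity[:, ctx:end].any(axis=1))
            masks = model.update(obs, active, n_iterations=1)  # one iteration
            outputs.append(masks[:, start - ctx:])  # emit only the new block
        return np.concatenate(outputs, axis=1)

Because each block is processed as soon as it arrives, the latency is bounded by the block length (plus the single update iteration), independent of the utterance length.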