7 research outputs found
Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones
A novel framework for meeting transcription using asynchronous microphones is
proposed in this paper. It consists of audio synchronization, speaker
diarization, utterance-wise speech enhancement using guided source separation,
automatic speech recognition, and duplication reduction. Performing speaker diarization before speech enhancement enables the system to deal with overlapped speech without having to consider the sampling frequency mismatch between
microphones. Evaluation on our real meeting datasets showed that our framework
achieved a character error rate (CER) of 28.7% using 11 distributed microphones, while a monaural microphone placed at the center of the table had a CER of 38.2%. We also showed that our framework achieved a CER of 21.8%, which is only 2.1 percentage points higher than the CER of headset microphone-based transcription.
Comment: Accepted to INTERSPEECH 2020
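As a rough illustration of the stage ordering described above, the following Python sketch wires the five components together. Every function name and signature here is a hypothetical placeholder for a component the abstract only names; this is a sketch of the pipeline structure, not the authors' implementation.

    from typing import Callable, Dict, List
    import numpy as np

    def transcribe_meeting(
        recordings: List[np.ndarray],                        # one waveform per asynchronous microphone
        synchronize: Callable[[List[np.ndarray]], List[np.ndarray]],
        diarize: Callable[[List[np.ndarray]], List[Dict]],   # -> [{"speaker": ..., "start": ..., "end": ...}, ...]
        enhance: Callable[[List[np.ndarray], Dict], np.ndarray],  # utterance-wise guided source separation
        recognize: Callable[[np.ndarray], str],
        deduplicate: Callable[[List[Dict]], List[Dict]],
    ) -> List[Dict]:
        """Pipeline order from the abstract: synchronization -> diarization -> GSS -> ASR -> duplication reduction."""
        aligned = synchronize(recordings)        # compensate offsets between devices
        segments = diarize(aligned)              # diarization runs before enhancement, so overlap handling
                                                 # does not hinge on sampling-frequency mismatch compensation
        hypotheses = []
        for seg in segments:
            target = enhance(aligned, seg)       # utterance-wise GSS guided by the diarization result
            hypotheses.append({**seg, "text": recognize(target)})
        return deduplicate(hypotheses)           # remove text duplicated across overlapping segments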
Block-Online Guided Source Separation
We propose a block-online algorithm of guided source separation (GSS). GSS is
a speech separation method that uses diarization information to update
parameters of the generative model of observation signals. Previous studies
have shown that GSS performs well in multi-talker scenarios. However, it
requires a large amount of computation, which is an obstacle to deployment in online applications. Another problem is that offline GSS is an utterance-wise algorithm, so its latency grows with the length of the utterance. With the proposed algorithm, block-wise input samples and
corresponding time annotations are concatenated with those in the preceding
context and used to update the parameters. Using the context enables the
algorithm to estimate time-frequency masks accurately with only one iteration of optimization per block, and its latency depends not on the utterance length but on a predetermined block length. It also reduces the computational cost by updating only the parameters of the active speakers in each block and its
context. Evaluation on the CHiME-6 corpus and a meeting corpus showed that the
proposed algorithm achieved almost the same performance as the conventional
offline GSS algorithm but with 32x faster computation, which is sufficient for real-time applications.
Comment: Accepted to SLT 2021
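The block-wise update loop can be pictured with a short structural sketch in Python. This is not the authors' code: update_model is a stand-in for a single iteration of the mask-model update (e.g. one cACGMM step), and the array shapes are illustrative assumptions.

    import numpy as np

    def block_online_gss(stft_blocks, activity_blocks, num_speakers, context_len, update_model):
        """Structural sketch of block-online GSS: each block is processed together with its
        preceding context, using one model-update iteration and only the active speakers."""
        context_obs, context_act = [], []
        params, all_masks = None, []
        for obs, act in zip(stft_blocks, activity_blocks):   # obs: (freq, time, channel), act: (speaker, time)
            window_obs = np.concatenate(context_obs + [obs], axis=1)
            window_act = np.concatenate(context_act + [act], axis=1)

            # Only speakers active somewhere in the window have their parameters updated,
            # and one iteration per block suffices because the context is reused.
            active = [k for k in range(num_speakers) if window_act[k].any()]
            params, masks = update_model(window_obs, window_act, params, active)  # masks: (speaker, freq, time)

            all_masks.append(masks[:, :, -obs.shape[1]:])    # keep masks for the new block only

            # Slide the context window; latency is bounded by the block length, not the utterance length.
            context_obs = [window_obs[:, -context_len:]]
            context_act = [window_act[:, -context_len:]]
        return all_masks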
Meeting Transcription Using Virtual Microphone Arrays
We describe a system that generates speaker-annotated transcripts of meetings
by using a virtual microphone array, a set of spatially distributed
asynchronous recording devices such as laptops and mobile phones. The system is
composed of continuous audio stream alignment, blind beamforming, speech
recognition, speaker diarization using prior speaker information, and system
combination. When utilizing seven input audio streams, our system achieves a
word error rate (WER) of 22.3% and comes within 3% of the close-talking
microphone WER on the non-overlapping speech segments. The speaker-attributed
WER (SAWER) is 26.7%. The relative gains in SAWER over the single-device system
are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones,
respectively. The presented system achieves a 13.6% diarization error rate when
10% of the speech duration contains more than one speaker. The contribution of
each component to the overall performance is also investigated, and we validate
the system with experiments on the NIST RT-07 conference meeting test set.
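One building block of such a system, the coarse alignment of asynchronous streams, can be illustrated with a simple cross-correlation offset estimate in Python. This is a hedged simplification: the actual alignment component operates on continuous streams and must also track clock drift, which a single fixed offset does not capture.

    import numpy as np
    from scipy import signal

    def estimate_offset(ref, sig, fs, max_offset_s=5.0):
        """Estimate the fixed time offset (seconds) of `sig` relative to `ref`,
        both sampled at rate `fs`, by maximizing the cross-correlation."""
        corr = signal.correlate(sig, ref, mode="full")
        lags = signal.correlation_lags(len(sig), len(ref), mode="full")
        keep = np.abs(lags) <= int(max_offset_s * fs)        # limit the search range
        best_lag = lags[keep][np.argmax(np.abs(corr[keep]))]
        return best_lag / fs                                  # positive: `sig` lags behind `ref`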
Neural Speech Separation Using Spatially Distributed Microphones
This paper proposes a neural network based speech separation method using
spatially distributed microphones. Unlike with traditional microphone array
settings, neither the number of microphones nor their spatial arrangement is
known in advance, which hinders the use of conventional multi-channel speech
separation neural networks based on fixed size input. To overcome this, a novel
network architecture is proposed that interleaves inter-channel processing
layers and temporal processing layers. The inter-channel processing layers
apply a self-attention mechanism along the channel dimension to exploit the
information obtained with a varying number of microphones. The temporal
processing layers are based on a bidirectional long short-term memory (BLSTM)
model and applied to each channel independently. The proposed network leverages
information across time and space by stacking these two kinds of layers
alternately. Our network estimates time-frequency (TF) masks for each speaker,
which are then used to generate enhanced speech signals either with TF masking
or beamforming. Speech recognition experimental results show that the proposed
method significantly outperforms baseline multi-channel speech separation
systems.
Comment: 5 pages, 2 figures, Interspeech 2020
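The interleaving of inter-channel and temporal layers can be sketched in PyTorch roughly as follows. Layer sizes, the channel pooling, and the mask head are illustrative guesses rather than the paper's configuration, and residual connections and normalization are omitted for brevity.

    import torch
    import torch.nn as nn

    class SpatioTemporalBlock(nn.Module):
        """Self-attention across the channel axis, then a BLSTM along time per channel."""
        def __init__(self, dim=256, hidden=256, heads=4):
            super().__init__()
            self.channel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.blstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, dim)

        def forward(self, x):                      # x: (batch, channels, time, dim)
            b, c, t, f = x.shape
            # Inter-channel layer: attend over channels at each time step, so the block
            # accepts any number of microphones.
            y = x.permute(0, 2, 1, 3).reshape(b * t, c, f)
            y, _ = self.channel_attn(y, y, y)
            y = y.reshape(b, t, c, f).permute(0, 2, 1, 3)
            # Temporal layer: BLSTM over time, shared across channels.
            z = y.reshape(b * c, t, f)
            z, _ = self.blstm(z)
            return self.proj(z).reshape(b, c, t, f)

    class AdHocSeparator(nn.Module):
        """Stack interleaved blocks and predict per-speaker time-frequency masks."""
        def __init__(self, num_bins=257, dim=256, num_blocks=2, num_speakers=2):
            super().__init__()
            self.embed = nn.Linear(num_bins, dim)
            self.blocks = nn.ModuleList([SpatioTemporalBlock(dim) for _ in range(num_blocks)])
            self.mask_head = nn.Linear(dim, num_bins * num_speakers)
            self.num_speakers, self.num_bins = num_speakers, num_bins

        def forward(self, logmag):                 # logmag: (batch, channels, time, num_bins)
            h = self.embed(logmag)
            for block in self.blocks:
                h = block(h)
            h = h.mean(dim=1)                      # pool over channels (one simple choice)
            masks = torch.sigmoid(self.mask_head(h))
            return masks.view(h.shape[0], h.shape[1], self.num_speakers, self.num_bins)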
Continuous Speech Separation with Ad Hoc Microphone Arrays
Speech separation has been shown effective for multi-talker speech
recognition. Under the ad hoc microphone array setup where the array consists
of spatially distributed asynchronous microphones, additional challenges must
be overcome as the geometry and number of microphones are unknown beforehand.
Prior studies have shown that, with a spatial-temporal interleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array.
In this paper, we further extend this approach to continuous speech separation.
Several techniques are introduced to enable speech separation for real
continuous recordings. First, we apply a transformer-based network for
spatio-temporal modeling of the ad hoc array signals. In addition, two methods
are proposed to mitigate a speech duplication problem during single talker
segments, which seems more severe in the ad hoc array scenarios. One method is
device distortion simulation for reducing the acoustic mismatch between
simulated training data and real recordings. The other is speaker counting to
detect the single speaker segments and merge the output signal channels.
Experimental results for AdHoc-LibriCSS, a new dataset consisting of continuous recordings of concatenated LibriSpeech utterances captured by multiple different devices, show that the proposed separation method can significantly improve the ASR accuracy for overlapped speech with little performance degradation for single-talker segments.
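The speaker-counting fix for single-talker segments can be pictured with a minimal energy-based sketch in Python; the threshold and the counting rule below are assumptions for illustration, not the method used in the paper.

    import numpy as np

    def suppress_duplicated_channel(sep_outputs, energy_ratio=0.1):
        """If only one separator output carries meaningful energy in a segment, keep that
        channel and mute the rest, so the same speech is not emitted twice."""
        energies = np.array([float(np.mean(ch ** 2)) for ch in sep_outputs])
        dominant = int(np.argmax(energies))
        rest = energies.sum() - energies[dominant]
        if energies[dominant] > 0 and rest / energies[dominant] < energy_ratio:
            merged = [np.zeros_like(ch) for ch in sep_outputs]
            merged[dominant] = sep_outputs[dominant]     # single active talker detected
            return merged
        return list(sep_outputs)                         # overlapped speech: keep all channels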
Far-Field Automatic Speech Recognition
The machine recognition of speech spoken at a distance from the microphones,
known as far-field automatic speech recognition (ASR), has received significantly increased attention in science and industry, which both drove and was driven by an equally significant improvement in recognition accuracy. Meanwhile, it has entered the consumer market, with digital home assistants with a spoken language interface being its most prominent application. Speech recorded at a
distance is affected by various acoustic distortions and, consequently, quite
different processing pipelines have emerged compared to ASR for close-talk
speech. A signal enhancement front-end for dereverberation, source separation
and acoustic beamforming is employed to clean up the speech, and the back-end
ASR engine is robustified by multi-condition training and adaptation. We will
also describe the so-called end-to-end approach to ASR, a promising new architecture that has recently been extended to the far-field
scenario. This tutorial article gives an account of the algorithms used to
enable accurate speech recognition from a distance, and it will be seen that,
although deep learning has a significant share in the technological
breakthroughs, a clever combination with traditional signal processing can lead
to surprisingly effective solutions.
Comment: Accepted for Proceedings of the IEEE
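As one example of the beamforming building block mentioned in the abstract, here is a compact numpy sketch of a mask-based MVDR beamformer in its standard covariance formulation; it is a textbook formulation, not code from the article.

    import numpy as np

    def mvdr_weights(phi_speech, phi_noise, ref_ch=0, eps=1e-8):
        """Per-frequency MVDR weights from estimated speech/noise spatial covariances.
        phi_speech, phi_noise: (freq, channels, channels); returns (freq, channels)."""
        n_freq, n_ch, _ = phi_noise.shape
        u = np.zeros(n_ch)
        u[ref_ch] = 1.0                                   # one-hot reference-channel selector
        w = np.zeros((n_freq, n_ch), dtype=complex)
        for f in range(n_freq):
            num = np.linalg.solve(phi_noise[f] + eps * np.eye(n_ch), phi_speech[f])
            w[f] = (num @ u) / (np.trace(num) + eps)      # w = Phi_n^-1 Phi_s u / tr(Phi_n^-1 Phi_s)
        return w

    # Apply per time-frequency bin: enhanced[f, t] = w[f].conj() @ observation[f, t, :]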
A Review of Speaker Diarization: Recent Advances with Deep Learning
Speaker diarization is a task to label audio or video recordings with classes
that correspond to speaker identity, or in short, a task to identify "who spoke
when". In the early years, speaker diarization algorithms were developed for
speech recognition on multispeaker audio recordings to enable speaker adaptive
processing. These algorithms also gained their own value as a standalone
application over time to provide speaker-specific metainformation for
downstream tasks such as audio retrieval. More recently, with the emergence of
deep learning technology, which has driven revolutionary changes in research
and practices across speech application domains, rapid advancements have been
made for speaker diarization. In this paper, we review not only the historical
development of speaker diarization technology but also the recent advancements
in neural speaker diarization approaches. Furthermore, we discuss how speaker
diarization systems have been integrated with speech recognition applications
and how the recent surge of deep learning is leading the way of jointly
modeling these two components to be complementary to each other. In view of these exciting technical trends, we believe that this paper is a valuable contribution to the community, providing a survey that consolidates the recent developments in neural methods and thus facilitates further progress toward more efficient speaker diarization.
Comment: This article is a preprint version of the article published in Computer Speech & Language, Volume 72, March 2022, 101317
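For readers new to the area, the classic clustering-based pipeline that the review contrasts with recent neural approaches can be summarized in a few lines of Python. The embedding extractor is a placeholder, and a real system also needs voice activity detection, segmentation, scoring (e.g. PLDA), and resegmentation.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_diarization(segments, embed, num_speakers):
        """Embed each speech segment with a speaker-embedding extractor (`embed` is a
        placeholder, e.g. an x-vector model), cluster the embeddings, and read the
        cluster labels as the answer to "who spoke when"."""
        X = np.stack([embed(seg) for seg in segments])
        labels = AgglomerativeClustering(n_clusters=num_speakers).fit_predict(X)
        return [{"segment": seg, "speaker": int(lab)} for seg, lab in zip(segments, labels)]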