MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
MeetEval is an open-source toolkit to evaluate all kinds of meeting
transcription systems. It provides a unified interface for the computation of
commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER
along with other WER definitions. We extend the cpWER computation by a temporal
constraint to ensure that words are identified as correct only when the
temporal alignment is plausible. This leads to a better quality of the matching
of the hypothesis string to the reference string that more closely resembles
the actual transcription quality, and a system is penalized if it provides poor
time annotations. Since word-level timing information is often not available,
we present a way to approximate exact word-level timings from segment-level
timings (e.g., a sentence) and show that the approximation leads to a similar
WER as a matching with exact word-level annotations. At the same time, the time
constraint leads to a speedup of the matching algorithm, which outweighs the
additional overhead caused by processing the time stamps.
Comment: Accepted for presentation at the Chime7 workshop 202
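The two ideas in the abstract above, approximating word-level timings from segment-level timings and restricting the matching to temporally plausible word pairs, can be sketched as follows. This is a minimal illustration under stated assumptions, not MeetEval's implementation: the function names are hypothetical, the segment duration is split in proportion to word length in characters (one plausible scheme among several), and both matches and substitutions are allowed only between words whose intervals overlap.

```python
def approximate_word_timings(words, start, end):
    """Split the segment [start, end] among its words.

    Assumption: each word receives a share of the segment duration
    proportional to its number of characters.
    """
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    duration = end - start
    timings = []
    cursor = start
    for word in words:
        span = duration * len(word) / total_chars
        timings.append((word, cursor, cursor + span))
        cursor += span
    return timings


def overlaps(a, b, collar=0.0):
    """True if intervals a=(start, end) and b=(start, end) overlap.

    `collar` (hypothetical parameter) widens the tolerance in seconds.
    """
    return a[0] < b[1] + collar and b[0] < a[1] + collar


def time_constrained_distance(ref, hyp, collar=0.0):
    """Levenshtein distance over (word, start, end) tuples.

    A diagonal step (match or substitution) is only allowed when the
    two words overlap in time; otherwise only insertions and deletions
    are available.
    """
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all reference words
    for j in range(m + 1):
        dist[0][j] = j  # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(dist[i - 1][j], dist[i][j - 1]) + 1
            ref_word, hyp_word = ref[i - 1], hyp[j - 1]
            if overlaps(ref_word[1:], hyp_word[1:], collar):
                cost = 0 if ref_word[0] == hyp_word[0] else 1
                best = min(best, dist[i - 1][j - 1] + cost)
            dist[i][j] = best
    return dist[n][m]
```

In this sketch, a hypothesis with the right words but implausible time stamps is charged insertions and deletions instead of being counted as correct, which mirrors the penalty for poor time annotations described above.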
On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems
We propose a general framework to compute the word error rate (WER) of ASR
systems that process recordings containing multiple speakers at their input and
that produce multiple output word sequences (MIMO). Such ASR systems are
typically required, e.g., for meeting transcription. We provide an efficient
implementation based on a dynamic programming search in a multi-dimensional
Levenshtein distance tensor under the constraint that a reference utterance
must be matched consistently with one hypothesis output. This also results in
an efficient implementation of the ORC WER which previously suffered from
exponential complexity. We give an overview of commonly used WER definitions
for multi-speaker scenarios and show that they are specializations of the above
MIMO WER tuned to particular application scenarios. We conclude with a
discussion of the pros and cons of the various WER definitions and a
recommendation on when to use which.
Comment: Presented at ICASSP 202
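To make the exponential complexity mentioned above concrete, here is a brute-force sketch of the ORC WER, assuming the usual reading of the definition: each reference utterance is assigned to exactly one hypothesis output, utterances assigned to the same output are concatenated in their original order, and the assignment with the fewest total word errors is chosen. The function names are hypothetical, and this exhaustive search is the naive baseline, not the dynamic programming search the paper proposes.

```python
from itertools import product


def levenshtein(ref, hyp):
    """Word-level edit distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match / substitution
            ))
        prev = cur
    return prev[-1]


def orc_wer_brute_force(ref_utterances, hyp_channels):
    """Try every assignment of reference utterances to output channels.

    ref_utterances: list of utterances (each a list of words), in
                    temporal order.
    hyp_channels:   list of output channels (each a list of words).
    The search visits len(hyp_channels) ** len(ref_utterances)
    assignments, which is why it does not scale.
    """
    num_words = sum(len(u) for u in ref_utterances)
    if num_words == 0 or not hyp_channels:
        raise ValueError("need a non-empty reference and at least one channel")
    best_errors, best_assignment = None, None
    channel_ids = range(len(hyp_channels))
    for assignment in product(channel_ids, repeat=len(ref_utterances)):
        errors = 0
        for channel, hyp in enumerate(hyp_channels):
            # Concatenate the utterances assigned to this channel, in order.
            ref_stream = [
                word
                for utterance, target in zip(ref_utterances, assignment)
                if target == channel
                for word in utterance
            ]
            errors += levenshtein(ref_stream, hyp)
        if best_errors is None or errors < best_errors:
            best_errors, best_assignment = errors, assignment
    return best_errors / num_words, best_assignment
```

With three utterances and two channels the loop visits 8 assignments; with 100 utterances it would visit 2^100, which motivates the efficient formulation described in the abstract.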
Frame-wise and overlap-robust speaker embeddings for meeting diarization
Using a Teacher-Student training approach, we developed a speaker embedding
extraction system that outputs embeddings at frame rate. Given this high
temporal resolution and the fact that the student produces sensible speaker
embeddings even for segments with speech overlap, the frame-wise embeddings
serve as an appropriate representation of the input speech signal for an
end-to-end neural meeting diarization (EEND) system. We show in experiments
that this representation helps mitigate a well-known problem of EEND systems:
the drop in diarization performance as the number of speakers increases is
significantly reduced. We also introduce block-wise processing to be able to
diarize arbitrarily long meetings.
Comment: ICASSP 202
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
We introduce a monaural neural speaker embedding extractor that computes an
embedding for each speaker present in a speech mixture. To allow for supervised
training, a teacher-student approach is employed: the teacher computes the
target embeddings from each speaker's utterance before the utterances are added
to form the mixture, and the student embedding extractor is then tasked to
reproduce those embeddings from the speech mixture at its input. The system
verifies the presence or absence of a given speaker in a mixture much more
reliably than a conventional speaker embedding extractor, and even exhibits
comparable performance to a multi-channel approach that exploits spatial
information for embedding extraction. Further, it is shown that a speaker
embedding computed from a mixture can be used to check for the presence of that
speaker in another mixture.
Comment: Accepted for Interspeech 202
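The presence check described above can be illustrated with a small sketch. It assumes that the extractor yields one embedding vector per speaker in a mixture and that presence is decided by thresholding the cosine similarity against an enrollment embedding; the scoring rule, the function names, and the threshold value are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def speaker_present(enrollment, mixture_embeddings, threshold=0.7):
    """Decide whether the enrolled speaker appears in the mixture.

    enrollment:         embedding of the speaker to look for.
    mixture_embeddings: one embedding per speaker extracted from the
                        mixture (assumed output of the student model).
    threshold:          hypothetical decision threshold; in practice it
                        would be tuned on held-out data.
    """
    return any(
        cosine_similarity(enrollment, embedding) >= threshold
        for embedding in mixture_embeddings
    )
```

The same function covers the mixture-to-mixture case mentioned in the abstract: an embedding extracted from one mixture is passed as `enrollment` and compared against the embeddings extracted from another mixture.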