End-to-end neural diarization (EEND) was recently introduced and has achieved
promising results in speaker-overlapped scenarios. In EEND, speaker diarization
is formulated as a multi-label prediction problem, where speaker activities are
estimated independently and their dependencies are not well considered. To
overcome these disadvantages, we employ the power set encoding to reformulate
speaker diarization as a single-label classification problem and propose the
overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency
can be modeled explicitly. Inspired by the success of two-stage hybrid systems,
we further propose a novel Two-stage OverLap-aware Diarization framework (TOLD)
that incorporates a speaker overlap-aware post-processing (SOAP) model to
iteratively refine the diarization results of EEND-OLA. Experimental results
show that, compared with the original EEND, the proposed EEND-OLA achieves a
14.39% relative improvement in terms of diarization error rate (DER), and
utilizing SOAP provides another 19.33% relative improvement. As a result, our
method TOLD achieves a DER of 10.14% on the CALLHOME dataset, which is a new
state-of-the-art result on this benchmark to the best of our knowledge.

Comment: Accepted by ICASSP202
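
As an illustration of the power set encoding mentioned above, the following minimal sketch maps per-frame multi-label speaker activities to a single class index and back. It is not the authors' implementation: the function names, the class ordering, and the cap on simultaneous speakers (`max_overlap`) are assumptions made for this example only.

```python
from itertools import combinations


def build_powerset_classes(num_speakers, max_overlap):
    """Enumerate speaker subsets with at most `max_overlap` active speakers.

    The ordering (silence first, then single speakers, then pairs, ...) is an
    assumption of this sketch, not something specified by the abstract.
    """
    classes = []
    for k in range(max_overlap + 1):
        for combo in combinations(range(num_speakers), k):
            classes.append(frozenset(combo))
    return classes


def encode(frame_labels, classes):
    """Map each multi-label frame (list of 0/1 per speaker) to one class id.

    Raises KeyError if a frame has more active speakers than `max_overlap`.
    """
    index = {subset: i for i, subset in enumerate(classes)}
    return [
        index[frozenset(s for s, active in enumerate(frame) if active)]
        for frame in frame_labels
    ]


def decode(class_ids, classes, num_speakers):
    """Map class ids back to multi-label speaker activities."""
    return [
        [1 if s in classes[c] else 0 for s in range(num_speakers)]
        for c in class_ids
    ]


if __name__ == "__main__":
    classes = build_powerset_classes(num_speakers=3, max_overlap=2)
    frames = [[0, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
    ids = encode(frames, classes)
    print(ids)
    print(decode(ids, classes, num_speakers=3) == frames)
```

Under this encoding each frame carries exactly one label, so a classifier can be trained with a standard single-label loss while overlapping speech is still represented as its own class.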