Sound event detection (SED) methods typically rely on either strongly
labelled data or weakly labelled data. As an alternative, sequentially labelled
data (SLD) was proposed. In SLD, the events and the order of events in audio
clips are known, without knowing the occurrence time of events. This paper
proposes a connectionist temporal classification (CTC) based SED system that
uses SLD instead of strongly labelled data, with a novel unsupervised
clustering stage. Experiments on 41 classes of sound events show that the
proposed two-stage method trained on SLD achieves performance comparable to the
previous state-of-the-art SED system trained on strongly labelled data, and is
far better than another state-of-the-art SED system trained on weakly labelled
data, which indicates the effectiveness of the proposed two-stage method
trained on SLD without any onset/offset time of sound events