Phoneme boundary detection has been studied due to its central role in
various speech applications. In this work, we argue that this task must be
addressed not only through better algorithms but also through better
evaluation metrics. To this end, we first propose SuperSeg, a state-of-the-art
phoneme boundary detector that operates autoregressively. Experiments on the TIMIT
and Buckeye corpora demonstrate that SuperSeg detects phoneme boundaries
more accurately than existing models by a significant margin. Furthermore, we
note that the popular evaluation metric, R-value, has a limitation, and propose
new evaluation metrics that prevent each boundary from contributing to
evaluation multiple times. The proposed metrics reveal the weaknesses of
non-autoregressive baselines and establish a reliable criterion suitable
for evaluating phoneme boundary detection.

Comment: 5 pages, submitted to ICASSP 202
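
The concern raised above, that a boundary may contribute to evaluation multiple times, can be illustrated with a minimal sketch. This is not the metric proposed in this work: the function names, the tolerance-window counting rule, and the greedy one-to-one matching are all assumptions chosen for illustration only.

```python
def naive_hits(ref, hyp, tol):
    """Count reference boundaries that have any prediction within `tol`.

    Hypothetical counting rule for illustration: one predicted boundary
    may be credited for several reference boundaries.
    """
    return sum(any(abs(r - h) <= tol for h in hyp) for r in ref)


def one_to_one_hits(ref, hyp, tol):
    """Count hits with a greedy left-to-right matching over sorted boundaries.

    Each reference and each predicted boundary is used at most once.
    This is an illustrative stand-in, not the paper's proposed metric.
    """
    ref, hyp = sorted(ref), sorted(hyp)
    i = j = hits = 0
    while i < len(ref) and j < len(hyp):
        if abs(ref[i] - hyp[j]) <= tol:
            # Pair these two boundaries and consume both.
            hits += 1
            i += 1
            j += 1
        elif hyp[j] < ref[i]:
            # Prediction is too early to match this or any later reference.
            j += 1
        else:
            # Reference is too early to match this or any later prediction.
            i += 1
    return hits


# Boundaries in milliseconds: two reference boundaries, one prediction between them.
ref_ms = [100, 130]
hyp_ms = [115]
tol_ms = 20

print(naive_hits(ref_ms, hyp_ms, tol_ms))       # 2
print(one_to_one_hits(ref_ms, hyp_ms, tol_ms))  # 1
```

In this toy case a single prediction is credited for both reference boundaries under the naive rule, whereas the one-to-one rule credits it once, so a detector that misses a boundary is not rewarded for it.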