End-Point Detection with State Transition Model based on Chunk-Wise Classification
A state transition model (STM) based on chunk-wise classification was
proposed for end-point detection (EPD). In general, EPD is built on
frame-wise voice activity detection (VAD) with an additional STM, in which
state transitions are driven by the VAD's frame-level decision (speech or
non-speech). However, VAD errors occur frequently in noisy environments, even
with a state-of-the-art deep neural network based VAD, and these errors cause
undesired state transitions in the STM. In this work, to build a robust STM,
state transitions are driven by chunk-wise classification, since EPD does not
need to operate at the frame level. A chunk consists of multiple frames, and the
classification of a chunk as speech or non-speech is made by aggregating
the VAD decisions over its frames, so that isolated VAD errors within
a chunk are smoothed out by the other, correct VAD decisions. Finally, the model was
evaluated with both qualitative and quantitative measures, including phone error
rate.
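
The chunk-wise idea described above can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, the chunk length, or the states of the STM, so everything below is an assumption made for illustration: a majority vote over non-overlapping chunks, a hypothetical chunk size of 5 frames, and a two-state machine that declares the end point after a hypothetical number of consecutive non-speech chunks. It is not the authors' implementation.

```python
from typing import List, Optional


def classify_chunks(frame_decisions: List[int], chunk_size: int = 5) -> List[int]:
    """Aggregate frame-level VAD decisions (1 = speech, 0 = non-speech)
    into chunk-level labels by majority vote (assumed aggregation rule)."""
    chunk_labels = []
    for start in range(0, len(frame_decisions), chunk_size):
        chunk = frame_decisions[start:start + chunk_size]
        # A chunk is speech if more than half of its frames are speech.
        chunk_labels.append(1 if 2 * sum(chunk) > len(chunk) else 0)
    return chunk_labels


def detect_end_point(chunk_labels: List[int],
                     end_silence_chunks: int = 2) -> Optional[int]:
    """Run a simple two-state machine over chunk labels.

    Returns the index of the chunk at which the end point is declared,
    or None if no end point is found. Both the states and the
    end_silence_chunks threshold are hypothetical.
    """
    state = "WAIT_FOR_SPEECH"
    silence_run = 0
    for index, label in enumerate(chunk_labels):
        if state == "WAIT_FOR_SPEECH":
            if label == 1:
                state = "IN_SPEECH"
                silence_run = 0
        else:  # state == "IN_SPEECH"
            if label == 1:
                silence_run = 0
            else:
                silence_run += 1
                if silence_run >= end_silence_chunks:
                    return index
    return None


# Frame-level VAD output with a few isolated errors inside speech and silence.
frames = [0, 0, 0, 0, 0,
          1, 1, 0, 1, 1,
          1, 0, 1, 1, 1,
          0, 0, 1, 0, 0,
          0, 0, 0, 0, 0]
chunks = classify_chunks(frames, chunk_size=5)
end_chunk = detect_end_point(chunks, end_silence_chunks=2)
```

In this toy input, the isolated non-speech frames inside the speech region and the isolated speech frame inside the trailing silence do not change their chunk labels, so the state machine sees a clean speech-then-silence sequence.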