We present a novel recurrent neural network (RNN) model for voice activity detection. Our multi-layer RNN model, in which nodes compute quadratic polynomials, outperforms a much larger baseline system composed of Gaussian mixture models (GMMs) and a hand-tuned state machine (SM) for temporal smoothing. All parameters of our RNN model are optimized together, so that it properly weights its preference for temporal continuity against the acoustic features in each frame. Our RNN uses one tenth the parameters and outperforms the GMM+SM baseline system by 26 % reduction in false alarms, reducing overall speech recognition computation time by 17 % while reducing word error rate by 1 % relative. Index Terms — Voice activity detection (VAD), endpointing, recurrent neural networks (RNNs
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.