Voice Activity Detection Based on Deep Neural Networks

Author: Wang Runze
Publication date: 01/11/2022

Audio captured in real-world environments is invariably corrupted by ambient noise, which partially obscures the valuable information in human speech. Many speech processing systems, such as automatic speech recognition, speaker recognition and speech emotion recognition, are widely used to transcribe and interpret that information into other formats. However, ambient noise and other non-speech sounds can degrade the performance of these systems. Voice Activity Detection (VAD) acts as their front-end operation, filtering out undesired sounds. The general goal of VAD is to determine the presence or absence of human speech in an audio signal. An effective VAD method accurately detects human speech segments under low signal-to-noise ratio (SNR) conditions, regardless of the noise type. An efficient VAD method, in addition, requires few parameters and little computation. Recently, deep learning-based approaches have made impressive advances in detection performance by training neural networks on massive amounts of data. However, commonly used neural networks generally contain millions of parameters and require large amounts of computation, which is not feasible on computationally constrained devices. Moreover, most deep learning-based approaches rely on hand-crafted acoustic features to highlight the characteristics of human speech, and such features may not suit VAD in some specific scenarios. For example, some acoustic features struggle to discriminate babble noise from target speech when audio is recorded in a crowd.
In this thesis, we first propose a computation-efficient VAD neural network that uses multi-channel features. Multi-channel features allow convolutional kernels to capture contextual and dynamic information simultaneously. A positional mask supplies the features with positional information through the positional encoding technique, which requires no trainable parameters and adds negligible computation. The computation-efficient neural network contains convolutional layers, bottleneck layers and a fully-connected layer. In the bottleneck layers, channel-attention inverted blocks learn hidden patterns of the multi-channel features at an acceptable computation cost by adopting depthwise separable convolutions and the channel-attention mechanism. Experiments indicate that the proposed computation-efficient neural network achieves superior performance while requiring less computation than baseline methods. We then propose an end-to-end VAD model that learns acoustic features directly from raw audio data. The end-to-end VAD model consists of three main parts: a feature extractor, a dual-attention transformer encoder and a classifier. The feature extractor employs a condense block to learn acoustic features from raw data. The dual-attention transformer encoder uses dual-path attention to encode local and global information of the learned features while keeping complexity low through the linear multi-head attention mechanism. The classifier requires few trainable parameters and little computation owing to its non-MLP design. The proposed end-to-end model outperforms the computation-efficient neural network and the other baseline methods by a considerable margin.
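To illustrate why depthwise separable convolutions reduce cost, the following minimal sketch compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1 x 1 pointwise convolution. The layer sizes (64 input channels, 128 output channels, 3 x 3 kernel) and the function names are hypothetical, chosen only for illustration; they are not taken from the thesis, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not values reported in the thesis.
c_in, c_out, k = 64, 128, 3
standard = conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
print(standard, separable, round(standard / separable, 2))
# prints: 73728 8768 8.41
```

Under these assumed sizes the separable form needs roughly one eighth of the weights, which is the general kind of saving that makes such blocks attractive on constrained devices.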

Concordia University Research Repository