
    Voice Activity Detection Based on Deep Neural Networks

    Various ambient noises corrupt audio obtained in real-world environments, partially obscuring the valuable information in human speech. Many speech processing systems, such as automatic speech recognition, speaker recognition and speech emotion recognition, are widely used to transcribe and interpret the valuable information in human speech into other formats. However, ambient noise and non-speech sounds in audio may degrade the performance of these systems. Voice Activity Detection (VAD) acts as the front-end operation of such systems, filtering out undesired sounds. The general goal of VAD is to determine the presence and absence of human speech in audio signals. An effective VAD method can accurately detect human speech segments under low-SNR conditions in any noise type. In addition, an efficient VAD method meets the requirements of few parameters and little computation. Recently, deep learning-based approaches have achieved impressive advancements in detection performance by training neural networks on massive data. However, commonly used neural networks generally contain millions of parameters and require large amounts of computation, which is not feasible for computationally constrained devices. Besides, most deep learning-based approaches adopt manual acoustic features to highlight characteristics of human speech, but manual features may not be suitable for VAD in some specific scenarios. For example, some acoustic features can hardly discriminate babble noise from target speech when audio is recorded in a crowd. In this thesis, we first propose a computation-efficient VAD neural network using multi-channel features. Multi-channel features allow convolutional kernels to capture contextual and dynamic information simultaneously. The positional mask provides the features with positional information using the positional encoding technique, which requires no trainable parameters and incurs negligible computational cost.
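The parameter-free positional mask mentioned above can be sketched with the standard sinusoidal positional encoding; the abstract does not fix the exact formulation, so this NumPy version is an assumption, but it illustrates why the step adds no trainable parameters:

```python
import numpy as np

def positional_mask(num_frames: int, num_feats: int) -> np.ndarray:
    """Sinusoidal positional encoding: fully deterministic, so it adds
    no trainable parameters and negligible computation."""
    pos = np.arange(num_frames)[:, None]          # frame (time) index
    i = np.arange(num_feats)[None, :]             # feature index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / num_feats)
    mask = np.zeros((num_frames, num_feats))
    mask[:, 0::2] = np.sin(angle[:, 0::2])        # even dims: sine
    mask[:, 1::2] = np.cos(angle[:, 1::2])        # odd dims: cosine
    return mask

pe = positional_mask(100, 64)   # e.g., 100 frames, 64 feature channels
```

The mask is simply added to (or concatenated with) the input features, giving each frame a unique, position-dependent signature at essentially zero cost.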
The computation-efficient neural network contains convolutional layers, bottleneck layers and a fully-connected layer. In the bottleneck layers, channel-attention inverted blocks effectively learn hidden patterns of multi-channel features at acceptable computational cost by adopting depthwise separable convolutions and the channel-attention mechanism. Experiments indicate that the proposed computation-efficient neural network achieves superior performance while requiring less computation than baseline methods. We then propose an end-to-end VAD model that learns acoustic features directly from raw audio data. The end-to-end VAD model consists of three main parts: a feature extractor, a dual-attention transformer encoder and a classifier. The feature extractor employs a condense block to learn acoustic features from raw data. The dual-attention transformer encoder uses dual-path attention to encode local and global information of the learned features while maintaining low complexity through a linear multi-head attention mechanism. The classifier requires few trainable parameters and little computation due to its non-MLP design. The proposed end-to-end model outperforms the computation-efficient neural network and other baseline methods by a considerable margin.
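The computational appeal of depthwise separable convolutions can be checked with simple parameter arithmetic; the layer sizes below are illustrative assumptions, not values from the thesis:

```python
def conv_params(k: int, c_in: int, c_out: int):
    """Parameter counts for a k x k convolutional layer.

    A standard convolution mixes all input channels for every output
    channel; a depthwise separable convolution factorises this into a
    per-channel spatial filter plus a 1x1 pointwise projection."""
    standard = k * k * c_in * c_out            # full convolution
    separable = k * k * c_in + c_in * c_out    # depthwise + pointwise
    return standard, separable

std, sep = conv_params(3, 64, 128)
print(std, sep)   # 73728 vs 8768: roughly 8.4x fewer parameters
```

The same factorisation reduces multiply-accumulate operations by a similar ratio, which is why such blocks suit computationally constrained devices.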

    Audio speech enhancement using masks derived from visual speech

    The aim of the work in this thesis is to explore how visual speech can be used within monaural masking-based speech enhancement to remove interfering noise, with a focus on improving intelligibility. Visual speech has the advantage of not being corrupted by interfering noise and can therefore provide additional information within a speech enhancement framework. More specifically, this work considers audio-only, visual-only and audio-visual methods of mask estimation within deep learning architectures, with application to both seen and unseen noise types. To estimate masks from audio and visual speech information, models are developed using deep neural networks, specifically feed-forward (DNN) and recurrent (RNN) neural networks for temporal modelling and convolutional neural networks (CNNs) for visual feature extraction. It was found that the proposed layer-normalised bi-directional feed-forward hybrid network using gated recurrent units (LNBiGRUDNN) provided the best performance across all objective measures for temporal modelling. Also, extracting visual features using both pre-trained and end-to-end trained CNNs outperforms traditional active appearance model (AAM) feature extraction across all noise types and SNRs tested. End-to-end CNNs trained on images focused on mouth-only regions-of-interest provided the best performance for both audio-visual and visual-only models. The best-performing audio-visual masking method outperformed both audio-only and visual-only masking methods in both matched and unseen noise type and SNR dependent conditions. For example, in unseen cafeteria babble noise at -10 dB, audio-visual masking had an ESTOI of 46.8, while audio-only and visual-only masking scored 15.0 and 42.4 respectively, and the unprocessed audio scored 9.3. Formal tests show that visual information is critical for improving intelligibility at low SNRs and for generalisation to unseen noise conditions.
Experiments on large unconstrained-vocabulary speech confirm that the model architectures and approaches developed generalise to unconstrained speech across noise-independent conditions and can be considered for monaural, speaker-dependent, real-world applications.
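As background for the masking framework above, the following oracle sketch shows how a ratio mask is formed and applied to a magnitude spectrogram. It assumes additive magnitudes and uses random arrays as stand-ins for real spectrograms; in the work itself such masks are estimated by neural networks from audio and visual features, not computed from oracle signals:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((257, 100)) + 1e-6   # |clean speech| spectrogram (freq x frames)
N = rng.random((257, 100)) + 1e-6   # |noise| spectrogram
Y = S + N                           # mixture magnitude (additivity assumed)

irm = S**2 / (S**2 + N**2)          # ideal ratio mask, values in [0, 1]
S_hat = irm * Y                     # enhanced magnitude spectrogram
```

A mask estimator is trained to predict `irm` from the noisy input (and, in the audio-visual case, from visual features); at test time the predicted mask replaces the oracle one.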

    Asymmetric Double-Winged Multi-View Clustering Network for Exploring Diverse and Consistent Information

    In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) has become an active research area, aiming to mine the potential relationships between different views. Most existing DCMVC algorithms focus on exploring the consistency information in deep semantic features while ignoring the diverse information in shallow features. To fill this gap, this paper proposes a novel multi-view clustering network, termed CodingNet, that explores diverse and consistent information simultaneously. Specifically, instead of utilizing the conventional auto-encoder, we design an asymmetric network structure to extract shallow and deep features separately. Then, by aligning the similarity matrix of the shallow features to the zero matrix, we ensure the diversity of the shallow features, thus offering a better description of the multi-view data. Moreover, we propose a dual contrastive mechanism that maintains consistency for deep features at both the view-feature and pseudo-label levels. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets, on which it outperforms most state-of-the-art multi-view clustering algorithms.
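The idea of "aligning the similarity matrix of the shallow features to the zero matrix" can be sketched as a simple penalty that drives pairwise similarities toward zero; this is a hedged NumPy illustration of the principle, not the paper's exact loss:

```python
import numpy as np

def diversity_loss(shallow: np.ndarray) -> float:
    """Penalise off-diagonal entries of the cosine-similarity matrix of
    shallow features, pushing it toward the zero matrix and hence
    encouraging decorrelated (diverse) features."""
    z = shallow / np.linalg.norm(shallow, axis=1, keepdims=True)
    sim = z @ z.T                                # pairwise cosine similarities
    off_diag = sim - np.diag(np.diag(sim))       # self-similarity is always 1
    return float(np.mean(off_diag**2))           # zero only if off-diagonals vanish
```

Orthogonal features give zero loss, while identical features give the maximum, so minimising this term spreads the shallow representations apart.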

    Mask-based enhancement of very noisy speech

    When speech is contaminated by high levels of additive noise, both its perceptual quality and its intelligibility are reduced. Studies show that conventional approaches to speech enhancement are able to improve quality but not intelligibility. However, in recent years, algorithms that estimate a time-frequency mask from noisy speech using a supervised machine learning approach and then apply this mask to the noisy speech have been shown to be capable of improving intelligibility. The most direct way of measuring intelligibility is to carry out listening tests with human test subjects. However, in situations where listening tests are impractical and where some additional uncertainty in the results is permissible, for example during the development phase of a speech enhancer, intrusive intelligibility metrics can provide an alternative to listening tests. This thesis begins by outlining a new intrusive intelligibility metric, WSTOI, that is a development of the existing STOI metric. WSTOI improves STOI by weighting the intelligibility contributions of different time-frequency regions with an estimate of their intelligibility content. The prediction accuracies of WSTOI and STOI are compared for a range of noises and noise suppression algorithms and it is found that WSTOI outperforms STOI in all tested conditions. The thesis then investigates the best choice of mask-estimation algorithm, target mask, and method of applying the estimated mask. A new target mask, the HSWOBM, is proposed that optimises a modified version of WSTOI with a higher frequency resolution. The HSWOBM is optimised for a stochastic noise signal to encourage a mask estimator trained on the HSWOBM to generalise better to unseen noise conditions. A high frequency resolution version of WSTOI is optimised as this gives improvements in predicted quality compared with optimising WSTOI. 
Of the tested approaches to target mask estimation, the best-performing approach uses a feed-forward neural network with a loss function based on WSTOI. The best-performing feature set is based on the gains produced by a classical speech enhancer and an estimate of the local voiced-speech-plus-noise to noise ratio in different time-frequency regions, which is obtained with the aid of a pitch estimator. When the estimated target mask is applied in the conventional way, by multiplying the noisy speech by the mask in the time-frequency domain, it can result in speech with very poor perceptual quality. The final chapter of this thesis therefore investigates alternative approaches to applying the estimated mask to the noisy speech, in order to improve both intelligibility and quality. An approach is developed that uses the mask to supply prior information about the speech presence probability to a classical speech enhancer that minimises the expected squared error in the log spectral amplitudes. The proposed end-to-end enhancer outperforms existing algorithms in terms of predicted quality and intelligibility for most noise types.
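One way to read "the mask supplies prior information about speech presence probability" is as a probability-weighted gain; the sketch below is deliberately simplified (the function and parameter names are assumptions, and the actual enhancer minimises expected squared error in log spectral amplitudes rather than blending gains linearly):

```python
import numpy as np

def apply_spp_gain(noisy_mag, suppression_gain, spp, gain_floor=0.1):
    """Blend a classical suppression gain with a mask-derived speech
    presence probability (SPP): where speech is likely, apply the gain
    at full strength; where it is unlikely, fall back to a spectral
    floor rather than suppressing to zero."""
    gain = spp * suppression_gain + (1.0 - spp) * gain_floor
    return gain * noisy_mag
```

Compared with hard multiplication by a binary mask, this keeps a residual noise floor in speech-absent regions, which tends to reduce musical-noise artefacts.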

    Using Visual Speech Information in Masking Methods for Audio Speaker Separation

    This work examines whether visual speech information can be effective within audio masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map visual speech features to an audio feature space, from which both visually-derived binary masks and visually-derived ratio masks are estimated before application to the speech mixture. Secondly, an audio ratio masking method forms a baseline approach for speaker separation, which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, with the highest performance occurring when combining audio and visual information to create the audio-visual masks.
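The two mask types referenced above, binary and ratio, differ as follows; this is a minimal oracle sketch assuming magnitude spectrograms (in the work itself the masks are estimated from visual or audio features rather than computed from oracle sources):

```python
import numpy as np

def ideal_binary_mask(target_mag, interferer_mag, lc_db=0.0):
    """Keep a time-frequency cell when the target exceeds the interferer
    by at least the local criterion lc_db; otherwise zero it out."""
    ratio_db = 20.0 * np.log10(target_mag / np.maximum(interferer_mag, 1e-12))
    return (ratio_db > lc_db).astype(float)

def ideal_ratio_mask(target_mag, interferer_mag):
    """Weight each cell softly by the target's share of the local energy."""
    return target_mag**2 / (target_mag**2 + interferer_mag**2)
```

Binary masks make hard keep/discard decisions per cell, while ratio masks attenuate smoothly, which generally yields fewer artefacts in the separated speech.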

    Speech Activity and Speaker Change Point Detection for Online Streams

    The main focus of this thesis lies on two closely interrelated tasks, speech activity detection and speaker change point detection, and their applications in online processing. These tasks commonly play the crucial role of speech preprocessors in speech-processing applications such as automatic speech recognition or speaker diarization. While their use in offline systems is extensively covered in the literature, the number of published works focusing on online use is limited. This is unfortunate, as many speech-processing applications (e.g., monitoring systems) are required to run in real time. The thesis begins with a three-chapter opening part, where the first introductory chapter explains the basic concepts and outlines the practical use of both tasks. It is followed by a chapter which reviews the current state of the art and lists the existing toolkits. That part is concluded by a chapter explaining the motivation behind this work and the practical use in monitoring systems; ultimately, this chapter sets the main goals of this thesis. The next two chapters cover the theoretical background of both tasks. They present selected approaches relevant to this work (e.g., used for result comparisons) or focused on online processing. The following chapter proposes the final speech activity detection approach for online use. Within this chapter, a detailed description of the development of this approach is given, along with a thorough experimental evaluation. This approach yields state-of-the-art results under low- and medium-noise conditions on the standardized QUT-NOISE-TIMIT corpus.
It is also integrated into a monitoring system, where it supplements a speech recognition system. The final speaker change point detection approach is proposed in the following chapter. It was designed in a series of consecutive experiments, which are extensively detailed in this chapter. An experimental evaluation on the COST278 database shows performance approaching that of the offline reference system, while the proposed approach operates in online mode with low latency. Finally, the last chapter summarizes the results of this thesis.
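To make the online (frame-by-frame) setting concrete, the core loop of a streaming detector can be illustrated with a toy energy-based VAD using a hangover counter; this is entirely illustrative background, as the thesis's detector is neural-network based:

```python
import numpy as np

def stream_vad(frames, threshold=0.01, hangover=5):
    """Process frames one at a time, as an online system must:
    an energy threshold makes the raw decision, and a hangover
    counter bridges short intra-speech pauses."""
    decisions, hang = [], 0
    for frame in frames:
        energy = float(np.mean(frame**2))
        if energy > threshold:
            hang = hangover       # speech detected: reset the hangover
        else:
            hang = max(0, hang - 1)
        decisions.append(hang > 0)  # decision available immediately
    return decisions
```

Each decision depends only on past input, so latency stays at one frame; real online systems trade a few frames of look-ahead for accuracy in exactly this loop.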