Open-set Speaker Identification
This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent by the recent expansion of criminal and terrorist organisations. The main focus is to enhance the open-set speaker identification process in speaker identification systems, which are affected by noisy audio data obtained in uncontrolled environments such as the street, restaurants or other places of business. Two investigations are therefore carried out first: the effects of environmental noise on the accuracy of open-set speaker recognition, thoroughly covering conditions relevant to the considered application areas, such as variable training data length, background noise and real-world noise; and the effects of short and varied-duration reference data in open-set speaker recognition.
The investigations led to a novel method, termed “vowel boosting”, that enhances the reliability of speaker identification when operating on speech data of varied duration under uncontrolled conditions. Vowels naturally contain more speaker-specific information; by emphasising this natural phenomenon in the speech data, better identification performance is enabled. Traditional state-of-the-art GMM-UBMs and i-vectors are used to evaluate “vowel boosting”. The proposed approach boosts the impact of the vowels on the speaker scores, which improves recognition accuracy for the specific case of open-set identification with short and varied-duration speech material.
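The idea of boosting the contribution of vowel frames to a speaker score can be sketched as follows. This is a minimal illustration, not the thesis's exact method: the weighting scheme, the boost factor, and the assumption that per-frame log-likelihood ratios and a vowel mask are already available are all hypothetical.

```python
import numpy as np

def boosted_score(frame_llrs, vowel_mask, boost=2.0):
    """Aggregate per-frame log-likelihood ratios into a single speaker
    score, up-weighting frames flagged as vowels. The boost factor of
    2.0 is an illustrative assumption."""
    weights = np.where(vowel_mask, boost, 1.0)
    return float(np.sum(weights * frame_llrs) / np.sum(weights))

# toy example: five frames, two of which are vowel frames
llrs = np.array([0.2, 1.5, -0.1, 1.2, 0.3])
mask = np.array([False, True, False, True, False])
score = boosted_score(llrs, mask)
```

Because the vowel frames here happen to carry higher likelihood ratios, the boosted score exceeds a plain frame average, mimicking the intended effect for short utterances.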
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
Voice activity detection (VAD), which classifies frames as speech or non-speech, is an important module in many speech applications, including speaker verification. In this paper, we propose a novel method, called self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD into a deep speaker embedding system. The proposed method is a combination of two approaches. The first is soft VAD, which performs a soft selection of frame-level features extracted from a speaker feature extractor: the frame-level features are weighted by their corresponding speech posteriors estimated by the DNN-based VAD, and then aggregated to generate a speaker embedding. The second is self-adaptive VAD, which fine-tunes the pre-trained VAD on the speaker verification data to reduce the domain mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes, namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA). Experiments on a Korean speech database demonstrate that verification performance is improved significantly in real-world environments by using self-adaptive soft VAD.

Comment: Accepted at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019).
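The posterior-weighted aggregation at the heart of soft VAD can be sketched as a weighted average of frame-level features. This is a simplified illustration (real embedding systems often use statistics pooling rather than a plain mean); all shapes and values are made up for the example.

```python
import numpy as np

def soft_vad_pooling(frame_feats, speech_posteriors):
    """Posterior-weighted average pooling: each frame-level feature is
    weighted by its speech posterior from a VAD before aggregation into
    a single utterance-level embedding."""
    w = np.asarray(speech_posteriors)      # shape (T,)
    f = np.asarray(frame_feats)            # shape (T, D)
    return (w[:, None] * f).sum(axis=0) / (w.sum() + 1e-8)

# toy example: four frames of 3-dim features; frame 2 is mostly non-speech
feats = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [9., 9., 9.],   # noisy non-speech frame
                  [0., 0., 1.]])
post = np.array([0.9, 0.8, 0.05, 0.85])
emb = soft_vad_pooling(feats, post)
```

Because the noisy frame receives a near-zero speech posterior, it contributes almost nothing to the embedding, which is the "soft selection" effect the paper describes.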
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF-NN), combined with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-Frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back-Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
Another novel approach, using vowel formant analysis, is implemented with Linear Discriminant Analysis (LDA). Vowel formant based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme requires no training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, whereas the proposed score-based methodology stays almost linear.
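The storage-efficient, score-based matching described above can be sketched as nearest-neighbour lookup against a tiny per-speaker formant database. Everything below is illustrative: the speaker names, the Euclidean distance metric, and the use of just mean F1/F2 values are assumptions, not details from the thesis.

```python
import numpy as np

def identify_by_formants(f1, f2, database):
    """Nearest-neighbour matching of measured vowel formants (F1, F2,
    in Hz) against a small per-speaker database -- a sketch of the
    storage- and time-efficient scheme described above."""
    best, best_dist = None, float("inf")
    for speaker, (ref_f1, ref_f2) in database.items():
        dist = np.hypot(f1 - ref_f1, f2 - ref_f2)
        if dist < best_dist:
            best, best_dist = speaker, dist
    return best

# toy database: average /a/ formants per enrolled speaker (hypothetical)
db = {"alice": (760.0, 1250.0), "bob": (680.0, 1100.0)}
who = identify_by_formants(750.0, 1230.0, db)
```

Each enrolled speaker needs only a handful of floats, which illustrates why the scheme scales almost linearly with speaker count rather than requiring retraining.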
Finally, a novel audio-visual fusion based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, whose individual characteristics are lost when combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy.
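The score-level and decision-level (OR voting) fusion strategies compared above can be sketched as follows. The weighted sum, the equal 0.5 weight, and the assumption that both scores are pre-normalised to [0, 1] are illustrative choices, not the thesis's exact formulation.

```python
def fuse_scores(voice_score, face_score, w_voice=0.5):
    """Score-level fusion: weighted sum of the two modality scores.
    Both scores are assumed normalised to [0, 1] beforehand; the
    equal weighting is an illustrative assumption."""
    return w_voice * voice_score + (1.0 - w_voice) * face_score

def decision_level_or(voice_accept, face_accept):
    """Decision-level fusion with OR voting: accept if either
    modality accepts on its own."""
    return voice_accept or face_accept

fused = fuse_scores(0.8, 0.4)            # equal-weight fused score
decision = decision_level_or(True, False)
```

Keeping the modalities separate until the score or decision stage, as here, is what preserves their distinct nature, which the thesis found was lost under feature-level fusion.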
Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments
The success of automatic speaker verification shows that discriminative speaker representations can be extracted from neutral speech. However, as a kind of non-verbal voice, laughter should intuitively also carry speaker information. Thus, this paper focuses on exploring speaker verification on utterances containing non-verbal laughter segments. We collect a set of clips with laughter components by running a laughter detection script on VoxCeleb and part of the CN-Celeb dataset. To further filter untrusted clips, probability scores are calculated by our binary laughter detection classifier, which is pre-trained on pure laughter and neutral speech. Based on the clips whose scores are over the threshold, we then construct trials under two different evaluation scenarios: Laughter-Laughter (LL) and Speech-Laughter (SL). A novel method called Laughter-Splicing based Network (LSN) is then proposed, which can significantly boost performance in both scenarios while maintaining performance on neutral speech, such as the VoxCeleb1 test set. Specifically, our system achieves relative 20% and 22% improvements on the Laughter-Laughter and Speech-Laughter trials, respectively. The meta-data and sample clips have been released at https://github.com/nevermoreLin/Laugh_LSN.

Comment: Submitted to ICASSP202
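The data-curation step described above, keeping only clips whose classifier score clears a threshold, reduces to a simple filter. The threshold value and the callable-classifier interface below are illustrative assumptions; the paper's actual classifier is a pre-trained binary DNN.

```python
def filter_clips(clips, score_fn, threshold=0.7):
    """Keep only clips whose laughter-probability, as returned by a
    scoring function, meets or exceeds a threshold (0.7 is an
    illustrative value, not the paper's)."""
    return [c for c in clips if score_fn(c) >= threshold]

# toy example with precomputed scores standing in for the classifier
scores = {"clip_a": 0.9, "clip_b": 0.3, "clip_c": 0.75}
kept = filter_clips(["clip_a", "clip_b", "clip_c"], scores.get)
```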
Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System
This thesis presents a novel two-stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of this system with fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive and context-aware multimodal system. The design of the proposed cognitively inspired framework is scalable: the techniques used in individual parts of the system can be upgraded, and there is scope for the initial framework presented here to be expanded.
In the proposed system, the concept of single-modality two-stage filtering is extended to include the visual modality. Noisy speech received by a microphone array is first pre-processed by visually derived Wiener filtering, employing the novel use of the Gaussian Mixture Regression (GMR) technique and making use of associated visual speech information extracted with a state-of-the-art Semi Adaptive Appearance Models (SAAM) based lip-tracking approach. This pre-processed speech is then enhanced further by audio-only beamforming using a state-of-the-art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments, evaluated using speech sentences from different speakers in the GRID corpus mixed with a range of noise recordings. Both objective and subjective test results, employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure and subjective listening tests, show that this initial system delivers very encouraging results when filtering speech mixtures in difficult reverberant environments.
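The first stage rests on the classic Wiener gain, which can be sketched per frequency bin as G = S / (S + N). In the visually derived variant described above, the clean-speech power estimate would come from a GMR mapping of lip features; here it is simply taken as given, and the gain floor is an illustrative choice.

```python
import numpy as np

def wiener_gain(speech_psd_est, noise_psd_est, floor=1e-3):
    """Per-bin Wiener filter gain G = S / (S + N), with a small floor
    to avoid total suppression. The speech PSD estimate is assumed
    supplied externally (e.g., by a visually driven regression stage)."""
    g = speech_psd_est / (speech_psd_est + noise_psd_est)
    return np.maximum(g, floor)

# toy example: four frequency bins against unit noise power
s_psd = np.array([4.0, 1.0, 0.25, 9.0])
n_psd = np.array([1.0, 1.0, 1.0, 1.0])
gains = wiener_gain(s_psd, n_psd)   # high gain where speech dominates
```

Applying these gains to the noisy spectrum attenuates bins where noise dominates, after which the beamforming stage can further exploit spatial information.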
Some limitations of this initial framework are identified, and the extension of the multimodal system is explored through the development of a fuzzy logic based framework and a proof-of-concept demonstration. Results show that the proposed autonomous, adaptive and context-aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, making cognitively inspired use of audio and visual information depending on environmental conditions. Finally, some concluding remarks are made along with proposals for future work.