241 research outputs found
A GAUSSIAN MIXTURE MODEL-BASED SPEAKER RECOGNITION SYSTEM
A human being has many unique features, and one of them is the voice. Speaker recognition is the use of a system to distinguish and identify a person from his/her voice. A speaker recognition system (SRS) can be used as an authentication technique in addition to conventional authentication methods. This paper presents an overview of voice signal characteristics and speaker recognition techniques. It also discusses the advantages and problems of current SRSs. Because voice-based SRS is the only biometric system that allows users to authenticate remotely, a robust SRS is needed
An application of an auditory periphery model in speaker identification
The number of applications of automatic Speaker Identification (SID) is growing due to advanced technologies for secure access and authentication in services and devices. In a 2016 study, the Cascade of Asymmetric Resonators with Fast Acting Compression (CAR-FAC) cochlear model achieved the best performance among seven recent cochlear models in fitting a set of human auditory physiological data. Motivated by the performance of the CAR-FAC, I apply this cochlear model in an SID task for the first time to produce a similar performance to a human auditory system. This thesis investigates the potential of the CAR-FAC model in an SID task. I investigate the capability of the CAR-FAC in text-dependent and text-independent SID tasks. This thesis also investigates contributions of different parameters, nonlinearities, and stages of the CAR-FAC that enhance SID accuracy. The performance of the CAR-FAC is compared with another recent cochlear model called the Auditory Nerve (AN) model. In addition, three FFT-based auditory features, Mel Frequency Cepstral Coefficient (MFCC), Frequency Domain Linear Prediction (FDLP), and Gammatone Frequency Cepstral Coefficient (GFCC), are also included to compare their performance with cochlear features. This comparison allows me to investigate a better front-end for a noise-robust SID system. Three different statistical classifiers, a Gaussian Mixture Model with Universal Background Model (GMM-UBM), a Support Vector Machine (SVM), and an i-vector, were used to evaluate the performance. These statistical classifiers allow me to investigate nonlinearities in the cochlear front-ends. The performance is evaluated under clean and noisy conditions for a wide range of noise levels. Techniques to improve the performance of a cochlear algorithm are also investigated in this thesis. It was found that the application of a cube root and DCT on cochlear output enhances the SID accuracy substantially
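The cube-root-plus-DCT step mentioned at the end of this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `cube_root_dct`, the unnormalised DCT-II, and the sample frame values are all assumptions made for illustration, and the input is assumed to be non-negative per-channel energies for a single frame.

```python
import math


def cube_root_dct(channel_energies, num_coeffs):
    """Cube-root compress non-negative channel energies, then apply an
    unnormalised DCT-II and keep the first num_coeffs coefficients.

    Hypothetical helper: the thesis does not specify this exact form.
    """
    # Cube-root compression reduces the dynamic range of the energies.
    compressed = [e ** (1.0 / 3.0) for e in channel_energies]
    n = len(compressed)
    coeffs = []
    for k in range(num_coeffs):
        # DCT-II basis: cos(pi * k * (i + 0.5) / n)
        total = 0.0
        for i, value in enumerate(compressed):
            total += value * math.cos(math.pi * k * (i + 0.5) / n)
        coeffs.append(total)
    return coeffs


# Hypothetical per-channel energies for one analysis frame.
frame = [8.0, 27.0, 1.0, 0.0]
features = cube_root_dct(frame, 3)
```

The DCT decorrelates the compressed channel outputs, which is the same role it plays in cepstral features such as MFCC.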
Open-set Speaker Identification
This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent with the recent expansion of criminal and terrorist organisations. The main focus is to enhance the open-set speaker identification process within speaker identification systems, which are affected by noisy audio data obtained under uncontrolled environments such as in the street, in restaurants or other places of business. Consequently, two investigations are initially carried out: the effects of environmental noise on the accuracy of open-set speaker recognition, thoroughly covering relevant conditions in the considered application areas, such as variable training data length, background noise and real-world noise; and the effects of short and varied-duration reference data in open-set speaker recognition.
The investigations led to a novel method termed "vowel boosting" to enhance the reliability of speaker identification when operating with varied-duration speech data under uncontrolled conditions. Vowels naturally contain more speaker-specific information. Therefore, emphasising this natural phenomenon in speech data enables better identification performance. The traditional state-of-the-art GMM-UBMs and i-vectors are used to evaluate "vowel boosting". The proposed approach boosts the impact of the vowels on the speaker scores, which improves the recognition accuracy for the specific case of open-set identification with short and varied durations of speech material
DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in Degraded Audio Signals
Automatic speaker recognition algorithms typically use pre-defined
filterbanks, such as Mel-Frequency and Gammatone filterbanks, for
characterizing speech audio. The design of these filterbanks is based on
domain-knowledge and limited empirical observations. The resultant features,
therefore, may not generalize well to different types of audio degradation. In
this work, we propose a deep learning-based technique to induce the filterbank
design from vast amounts of speech audio. The purpose of such a filterbank is
to extract features robust to degradations in the input audio. To this effect,
a 1D convolutional neural network is designed to learn a time-domain filterbank
called DeepVOX directly from raw speech audio. Secondly, an adaptive triplet
mining technique is developed to efficiently mine the data samples best suited
to train the filterbank. Thirdly, a detailed ablation study of the DeepVOX
filterbanks reveals the presence of both vocal source and vocal tract
characteristics in the extracted features. Experimental results on VOXCeleb2,
NIST SRE 2008 and 2010, and Fisher speech datasets demonstrate the efficacy of
the DeepVOX features across a variety of audio degradations, multi-lingual
speech data, and varying-duration speech audio. The DeepVOX features also
improve the performance of existing speaker recognition algorithms, such as the
xVector-PLDA and the iVector-PLDA
Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition
Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. The major challenge today is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and acoustic models trained on clean speech. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach for solving this problem by considering SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. This incorporates important information from the statistical models of the recognition engine as feedback for tuning SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method can achieve significant improvement of recognition rates across a wide range of signal-to-noise ratios
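For orientation, the basic single-band form of spectral subtraction that this abstract builds on can be sketched as follows. This is a common textbook variant, not the paper's likelihood-maximizing multiband method: the function name `spectral_subtract`, the over-subtraction factor `alpha`, the spectral floor `beta`, and the sample spectra are assumptions for illustration. In the paper, parameters of this kind are tuned using feedback from the recognizer rather than fixed by hand.

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, beta=0.01):
    """Power spectral subtraction on one frame of magnitude spectra.

    noisy_mag: magnitudes of the noisy speech spectrum, one per frequency bin
    noise_mag: estimated noise magnitudes for the same bins
    alpha:     over-subtraction factor (hypothetical default)
    beta:      spectral floor as a fraction of the noise power (hypothetical default)
    """
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        # Subtract the scaled noise power from the noisy power.
        power = y * y - alpha * n * n
        # Clamp to a small floor so the result never goes negative.
        floor = beta * n * n
        cleaned.append(max(power, floor) ** 0.5)
    return cleaned


# Hypothetical three-bin frame: noise dominates the middle bin.
noisy = [5.0, 1.0, 3.0]
noise = [3.0, 2.0, 0.0]
enhanced = spectral_subtract(noisy, noise)
```

The enhanced magnitudes would normally be recombined with the noisy phase before feature extraction; that step is omitted here.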
A deep audiovisual approach for human confidence classification
Research on self-efficacy and confidence has spread across several subfields of psychology and neuroscience. One's confidence plays a crucial role in the formation of attitude and communication skills, so differentiating levels of confidence is important in this domain. With the recent advances in extracting behavioral insight from a signal in multiple applications, detecting confidence is found to have great importance. One such prominent application is detecting confidence in interview conversations. We have collected an audiovisual data set of interview conversations with 34 candidates. Every response from each candidate in this data set is labeled with one of three levels of confidence: high, medium, or low. Furthermore, we have also developed algorithms to efficiently compute such behavioral confidence from speech and video. A deep learning architecture is proposed for detecting confidence levels (high, medium, and low) from an audiovisual clip recorded during an interview. The achieved unweighted average recall (UAR) reaches 85.9% on audio data and 73.6% on video data captured from an interview session
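Unweighted average recall (UAR) is the mean of the per-class recalls, so each confidence level counts equally regardless of how many responses carry that label. The sketch below shows the calculation; the function name and the toy labels are hypothetical and are not taken from the paper's data.

```python
def unweighted_average_recall(true_labels, predicted_labels):
    """Mean of per-class recall, giving every class equal weight."""
    classes = sorted(set(true_labels))
    recalls = []
    for c in classes:
        # Number of samples whose true label is class c.
        total = sum(1 for t in true_labels if t == c)
        # Number of those samples that were predicted as class c.
        correct = sum(
            1 for t, p in zip(true_labels, predicted_labels) if t == c and p == c
        )
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy example with three confidence levels.
y_true = ["high"] * 6 + ["medium"] * 2 + ["low"] * 2
y_pred = ["high"] * 6 + ["medium", "low"] + ["low", "medium"]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

In this toy example the plain accuracy is higher than the UAR because the majority class is classified perfectly, which is why UAR is often preferred when classes are imbalanced.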
The Effect Of Acoustic Variability On Automatic Speaker Recognition Systems
This thesis examines the influence of acoustic variability on automatic speaker recognition systems (ASRs) with three aims. i. To measure ASR performance under 5 commonly encountered acoustic conditions; ii. To contribute towards ASR system development with the provision of new research data; iii. To assess ASR suitability for forensic speaker comparison (FSC) application and investigative/pre-forensic use. The thesis begins with a literature review and explanation of relevant technical terms. Five categories of research experiments then examine ASR performance, reflective of conditions influencing speech quantity (inhibitors) and speech quality (contaminants), acknowledging quality often influences quantity. Experiments pertain to: net speech duration, signal to noise ratio (SNR), reverberation, frequency bandwidth and transcoding (codecs). The ASR system is placed under scrutiny with examination of settings and optimum conditions (e.g. matched/unmatched test audio and speaker models). Output is examined in relation to baseline performance and metrics assist in informing if ASRs should be applied to suboptimal audio recordings. Results indicate that modern ASRs are relatively resilient to low and moderate levels of the acoustic contaminants and inhibitors examined, whilst remaining sensitive to higher levels. The thesis provides discussion on issues such as the complexity and fragility of the speech signal path, speaker variability, difficulty in measuring conditions and mitigation (thresholds and settings). The application of ASRs to casework is discussed with recommendations, acknowledging the different modes of operation (e.g. investigative usage) and current UK limitations regarding presenting ASR output as evidence in criminal trials. 
In summary, and in the context of acoustic variability, the thesis recommends that ASRs could be applied to pre-forensic cases, accepting extraneous issues endure which require governance such as validation of method (ASR standardisation) and population data selection. However, ASRs remain unsuitable for broad forensic application with many acoustic conditions causing irrecoverable speech data loss contributing to high error rates
Advanced automatic mixing tools for music
This thesis presents research on several independent systems that when
combined together can generate an automatic sound mix out of an unknown set
of multi-channel inputs. The research explores the possibility of reproducing
the mixing decisions of a skilled audio engineer with minimal or no human
interaction. The research is restricted to non-time-varying mixes for large room
acoustics. This research has applications in dynamic sound music concerts,
remote mixing, recording and postproduction as well as live mixing for
interactive scenes.
Currently, automated mixers are capable of saving a set of static mix
scenes that can be loaded for later use, but they lack the ability to adapt to a
different room or to a different set of inputs. In other words, they lack the
ability to automatically make mixing decisions. The automatic mixer research
depicted here distinguishes between the engineering mixing and the subjective
mixing contributions. This research aims to automate the technical tasks related
to audio mixing while freeing the audio engineer to perform the fine-tuning
involved in generating an aesthetically pleasing sound mix. Although the
system mainly deals with the technical constraints involved in generating an
audio mix, the developed system takes advantage of common practices
performed by sound engineers whenever possible. The system also makes use
of inter-dependent channel information for controlling signal processing tasks
while aiming to maintain system stability at all times. A working
implementation of the system is described and subjective evaluation between a
human mix and the automatic mix is used to measure the success of the
automatic mixing tools
Sound Object Recognition
Humans are constantly exposed to a variety of acoustic stimuli ranging from music and speech to more complex acoustic scenes like a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high dimensional sound object representation framework, that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition.
We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems for noise robustness at both automatic speech recognition and speaker verification tasks.
The proposed analysis scheme, with the included ability to analyze temporal modulations, was used to capture musical sound objects. We showed that, using a model of cortical processing, we were able to accurately replicate human perceptual similarity judgments and to achieve good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we were able to extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings.
Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes, but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module that is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We showed that such an adaptation mechanism is able to substantially improve the performance of the system at detecting the target source in the presence of various distracting background sources