15 research outputs found

    Comparative study of voice print based acoustic features: MFCC and LPCC

    Voice is the best biometric feature for investigation and authentication; it carries both biological and behavioural characteristics, and the acoustic features are derived from it. A speaker recognition system automatically authenticates a speaker's identity based purely on the human voice. Mel-frequency cepstral coefficients (MFCC) and linear prediction cepstral coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper presents a comparative study of MFCC and LPCC in terms of their working methodology and the accuracy of their results. The results are better when MFCC is used for feature extraction.
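    The two feature types compared above can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the paper's exact pipeline: it assumes librosa and numpy are available, uses a placeholder file name, and picks illustrative frame sizes and coefficient orders. MFCC comes straight from librosa; LPCC is obtained by fitting an LPC model and converting the predictor coefficients to cepstra with the standard recursion.

```python
import numpy as np
import librosa

def mfcc_features(y, sr, n_mfcc=13):
    # Mel filterbank on the power spectrum, log compression, then DCT.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def lpcc_features(y, order=12, n_lpcc=13):
    # Fit an LPC model, then convert predictor coefficients to cepstra
    # with c_m = a_m + sum_{k<m} (k/m) c_k a_{m-k}.
    A = librosa.lpc(y, order=order)   # [1, a_1, ..., a_p] for A(z)
    a = -A[1:]                        # predictor coefficients
    c = np.zeros(n_lpcc)              # c[0] (gain term) left at zero here
    for m in range(1, n_lpcc):
        if m <= order:
            c[m] = a[m - 1]
        for k in range(1, m):
            if m - k <= order:
                c[m] += (k / m) * c[k] * a[m - k - 1]
    return c

if __name__ == "__main__":
    # "voice_sample.wav" is a placeholder path for any enrolment recording.
    y, sr = librosa.load("voice_sample.wav", sr=16000)
    print(mfcc_features(y, sr).shape)   # (13, n_frames)
    print(lpcc_features(y).shape)       # (13,)
```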

    Spoof detection using time-delay shallow neural network and feature switching

    Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either through logical access, such as speech synthesis and voice conversion, or through physical access, such as replaying a pre-recorded utterance. Inspired by the state-of-the-art x-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection for both logical and physical access. The novelty of the proposed TD-SNN system vis-à-vis conventional DNN systems is that it can handle variable-length utterances during testing. The performance of the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is analyzed on the ASVspoof 2019 dataset and measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, the GMM surpasses the TD-SNN systems for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on evaluation data, with relative improvements of 48.03% and 49.47% for logical and physical access, respectively.
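    The key architectural point, handling variable-length utterances at test time, is usually achieved by frame-level time-delay (dilated 1-D convolution) layers followed by a statistics-pooling layer. The PyTorch sketch below illustrates that idea under assumed layer sizes, dilations, and feature dimension; it is not the exact TD-SNN configuration reported in the paper.

```python
import torch
import torch.nn as nn

class TDSNN(nn.Module):
    def __init__(self, feat_dim=60, hidden=256, n_classes=2):
        super().__init__()
        # Time-delay layers: 1-D convolutions over frames with dilation,
        # so each output frame sees a growing temporal context.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
        )
        # Statistics pooling (mean + std over time) maps a variable-length
        # sequence to a fixed-size utterance vector before classification.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):               # x: (batch, feat_dim, n_frames)
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.classifier(stats)   # bonafide vs. spoof logits

model = TDSNN()
utt = torch.randn(1, 60, 320)           # any number of frames works
print(model(utt).shape)                 # torch.Size([1, 2])
```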

    Reconocimiento de patrones de habla usando MFCC y RNA (Speech pattern recognition using MFCC and ANN)

    This work presents the results of the design and development of an algorithm based on artificial intelligence and MFCC for recognizing speech patterns of Spanish-language vowels, using Mel-frequency cepstral coefficients (MFCC) to represent speech through human auditory perception. The use of MFCC made it possible to characterize the voice signals while taking into account the noise present in the recording environment, which helped in estimating common patterns among these signals when they contain disturbances. As the main result, a recognition rate between 93% and 96% was achieved for the selected vowels (/a/, /e/, /o/). For each vowel, 22 samples were used for training and a further 11 for validation; the samples were obtained from 11 test subjects, all male.
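    A compact version of such an MFCC-plus-neural-network vowel recognizer can be put together with librosa and scikit-learn. The sketch below is illustrative only: averaging MFCCs over frames, the MLP size, and the file layout are assumptions rather than the authors' exact setup.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def utterance_mfcc(path, sr=16000, n_mfcc=13):
    """Average MFCCs over frames to get one fixed-length vector per recording."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def train_vowel_recognizer(labelled_paths):
    """labelled_paths: list of (wav_path, vowel_label) pairs."""
    X = np.stack([utterance_mfcc(p) for p, _ in labelled_paths])
    y = [label for _, label in labelled_paths]
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    return clf.fit(X, y)

# Hypothetical usage: 22 recordings per vowel for training, 11 for validation,
# e.g. train = [("a_01.wav", "a"), ("e_01.wav", "e"), ("o_01.wav", "o"), ...]
# clf = train_vowel_recognizer(train)
# print(accuracy_score(val_labels, clf.predict(val_mfccs)))
```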

    Modeling Sub-Band Information Through Discrete Wavelet Transform to Improve Intelligibility Assessment of Dysarthric Speech

    The speech signal within a sub-band varies at a fine level depending on the type and level of dysarthria. The Mel-frequency filterbank used in computing cepstral coefficients smooths out this fine-level information in the higher frequency regions due to the larger bandwidth of its filters. To capture the sub-band information, this paper first performs a four-level discrete wavelet transform (DWT) decomposition of the input speech signal into approximation and detail coefficients at each level. For a particular input speech signal, five speech signals representing the different sub-bands are then reconstructed using the inverse DWT (IDWT). Log filterbank energies are computed by analyzing the short-term discrete Fourier transform magnitude spectra of each reconstructed signal with a 30-channel Mel filterbank. For each analysis frame, the log filterbank energies obtained across all reconstructed signals are pooled together and a discrete cosine transform is applied, yielding a cepstral feature termed the discrete wavelet transform reconstructed (DWTR) Mel-frequency cepstral coefficient (MFCC). An i-vector based dysarthric level assessment system developed on the Universal Access Speech corpus shows that the proposed DWTR-MFCC feature outperforms the conventional MFCC and several other cepstral features reported for a similar task. Using DWTR-MFCC improves the detection accuracy rate (DAR) of the dysarthric level assessment system in the text- and speaker-independent test case from the 56.646% MFCC baseline to 60.094%. Further analysis of the confusion matrices shows that the confusions among the dysarthric classes are quite different for the MFCC and DWTR-MFCC features. Motivated by this observation, a two-stage classification approach employing the discriminating power of both kinds of features is proposed to improve the overall performance of the developed dysarthric level assessment system. The two-stage classification scheme further improves the DAR to 65.813% in the text- and speaker-independent test case.
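    The DWTR-MFCC pipeline described above (sub-band reconstruction, 30-channel log-mel energies, pooling, DCT) can be sketched with pywt and librosa as below. The wavelet family, FFT/hop sizes, and the number of retained cepstra are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np
import pywt
import librosa
from scipy.fftpack import dct

def dwtr_mfcc(y, sr, wavelet="db4", levels=4, n_mels=30, n_ceps=39,
              n_fft=512, hop=160):
    # Four-level DWT: one approximation band plus four detail bands.
    coeffs = pywt.wavedec(y, wavelet, level=levels)
    sub_signals = []
    for i in range(len(coeffs)):
        # Keep one band, zero the rest, and reconstruct a full-length signal.
        kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
        sub_signals.append(pywt.waverec(kept, wavelet)[: len(y)])
    logfb = []
    for s in sub_signals:
        mel = librosa.feature.melspectrogram(y=s, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        logfb.append(np.log(mel + 1e-10))
    pooled = np.vstack(logfb)              # (5 * n_mels, n_frames)
    return dct(pooled, type=2, axis=0, norm="ortho")[:n_ceps]
```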

    The soundscape of swarming: Proof of concept for a noninvasive acoustic species identification of swarming Myotis bats

    Bats emit echolocation calls to orientate in their predominantly dark environment. Recording of species-specific calls can facilitate species identification, especially when mist netting is not feasible. However, some taxa, such as Myotis bats, can be hard to distinguish acoustically. In crowded situations where calls of many individuals overlap, the subtle differences between species are additionally attenuated. Here, we sought to noninvasively study the phenology of Myotis bats during autumn swarming at a prominent hibernaculum. To do so, we recorded sequences of overlapping echolocation calls (N = 564) during nights of high swarming activity and extracted spectral parameters (peak frequency, start frequency, spectral centroid) and linear frequency cepstral coefficients (LFCCs), which additionally encompass the timbre (vocal "color") of calls. We used this parameter combination in a stepwise discriminant function analysis (DFA) to classify the call sequences to species level. A set of previously identified call sequences of single flying Myotis daubentonii and Myotis nattereri, the most common species at our study site, served as the training set for the DFA. 90.2% of the call sequences could be assigned to either M. daubentonii or M. nattereri, indicating the predominantly swarming species at the time of recording. We verified our results by correctly classifying a second set of previously identified call sequences with an accuracy of 100%. In addition, our acoustic species classification corresponds well to the existing knowledge on swarming phenology at the hibernaculum. Moreover, we successfully classified call sequences from a different hibernaculum to species level and verified our classification results by capturing swarming bats while we recorded them. Our findings provide a proof of concept for a new noninvasive acoustic monitoring technique that analyzes "swarming soundscapes" by combining classical acoustic parameters and LFCCs, instead of analyzing single calls. Our approach to species identification is especially beneficial in situations with multiple calling individuals, such as autumn swarming.
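    The classification step can be mimicked with scikit-learn's linear discriminant analysis standing in for the stepwise DFA used in the study. The sketch below uses placeholder feature values and assumed array shapes; each row represents one call sequence described by the three spectral parameters plus averaged LFCCs.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Each row: [peak_freq, start_freq, spectral_centroid, LFCC_1 .. LFCC_12]
# averaged over one sequence of overlapping calls (placeholder values).
X_train = rng.normal(size=(60, 15))
y_train = np.array(["M_daubentonii"] * 30 + ["M_nattereri"] * 30)

dfa = LinearDiscriminantAnalysis()
print(cross_val_score(dfa, X_train, y_train, cv=5).mean())
# dfa.fit(X_train, y_train); dfa.predict(X_swarming) would then assign
# unidentified swarming sequences to species level.
```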

    Robust hybrid features based text-independent speaker identification system over a noisy additive channel

    Robustness of speaker identification systems to additive noise is crucial for real-world applications. In this paper, two robust features, power normalized cepstral coefficients (PNCC) and gammatone frequency cepstral coefficients (GFCC), are combined to improve the robustness of a speaker identification system over different types of noise. A universal background model Gaussian mixture model (UBM-GMM) is used for feature matching and as a classifier to identify the claimed speakers. Evaluation results show that the proposed hybrid feature improves the performance of the identification system compared to conventional features over most noise types and different signal-to-noise ratios.
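    The matching stage can be sketched with scikit-learn Gaussian mixtures, assuming PNCC and GFCC frames are already extracted elsewhere. Classic GMM-UBM systems MAP-adapt the UBM per speaker; the simplified sketch below instead fits speaker GMMs directly and scores with an average log-likelihood ratio, so mixture sizes and the scoring rule are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def hybrid_frames(pncc, gfcc):
    # pncc, gfcc: (n_frames, dim) arrays for the same utterance.
    return np.hstack([pncc, gfcc])

def train_ubm(background_frames, n_mix=256):
    return GaussianMixture(n_components=n_mix, covariance_type="diag",
                           max_iter=50, random_state=0).fit(background_frames)

def identify(test_frames, speaker_models, ubm):
    # Score = average log-likelihood ratio against the UBM; highest wins.
    ubm_ll = ubm.score(test_frames)
    scores = {spk: gmm.score(test_frames) - ubm_ll
              for spk, gmm in speaker_models.items()}
    return max(scores, key=scores.get)
```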

    Deep learning-based voiceprint extraction for speaker recognition robust to non-speaker factors

    Ph.D. thesis, Seoul National University, Department of Electrical and Computer Engineering, February 2021 (advisor: Nam Soo Kim). Over recent years, various deep learning-based embedding methods have been proposed and have shown impressive performance in speaker verification. However, as with most classical embedding techniques, the deep learning-based methods are known to suffer from severe performance degradation when dealing with speech samples recorded under different conditions (e.g., recording devices, emotional states). Also, unlike classical Gaussian mixture model (GMM)-based techniques (e.g., the GMM supervector or i-vector), deep learning-based embedding systems are trained in a fully supervised manner and therefore cannot exploit unlabeled datasets during training. In this thesis, we propose a variational autoencoder (VAE)-based embedding framework, which extracts a total variability embedding together with a representation of the uncertainty within the input speech distribution. Unlike conventional deep learning-based embedding techniques (e.g., the d-vector or x-vector), the proposed VAE-based embedding system is trained in an unsupervised manner, which enables the use of unlabeled datasets. Furthermore, to prevent the potential loss of information caused by the Kullback-Leibler divergence regularization term in the VAE-based embedding framework, we propose an adversarially learned inference (ALI)-based embedding technique. Both the VAE- and ALI-based embedding techniques show strong performance in short-duration speaker verification, outperforming the conventional i-vector framework. Additionally, we present a fully supervised training method for disentangling non-speaker nuisance information from the speaker embedding. The proposed training scheme jointly extracts the speaker and nuisance attribute (e.g., recording channel, emotion) embeddings and trains them to carry maximum information about their main task while ensuring maximum uncertainty about their sub-task. Since the proposed method does not require any heuristic training strategy, as the conventional disentanglement techniques (e.g., adversarial learning, gradient reversal) do, optimizing the embedding network is relatively more stable. The proposed scheme achieves state-of-the-art performance on the RSR2015 Part 3 dataset and efficiently disentangles recording device and emotional information from the speaker embedding.
    Thesis outline: 1. Introduction; 2. Conventional embedding techniques for speaker recognition; 3. Unsupervised learning of total variability embedding for speaker verification with random digit strings; 4. Adversarially learned total variability embedding for speaker recognition with random digit strings; 5. Disentangled speaker and nuisance attribute embedding for robust speaker verification; 6. Conclusion.

    Linear versus mel- frequency cepstral coefficients for speaker recognition

    Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories of speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high-frequency range of speech. This insight suggests that a linear frequency scale may provide some advantages over the mel scale for speaker recognition. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results on SRE10 show that, while the two are complementary, LFCC consistently outperforms MFCC, mainly due to its better performance on the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech: LFCC benefits female speech by better capturing the spectral characteristics in the high-frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream speaker recognition community.
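    The only difference between the two front ends is the spacing of the filterbank center frequencies, which the sketch below makes explicit: a mel-spaced filterbank versus linearly spaced triangular filters, followed by log energies and a DCT in both cases. Filter counts and STFT settings are illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def cepstral_coeffs(y, sr, scale="mel", n_filters=24, n_ceps=20,
                    n_fft=512, hop=160):
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    if scale == "mel":
        fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_filters)
    else:
        # Linear spacing keeps full resolution in the high frequencies,
        # where vocal-tract-length cues are concentrated.
        edges = np.linspace(0, sr / 2, n_filters + 2)
        freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            lo, ce, hi = edges[i], edges[i + 1], edges[i + 2]
            up = (freqs - lo) / (ce - lo)
            down = (hi - freqs) / (hi - ce)
            fb[i] = np.maximum(0, np.minimum(up, down))
    logE = np.log(fb @ S + 1e-10)
    return dct(logE, type=2, axis=0, norm="ortho")[:n_ceps]

# mfcc = cepstral_coeffs(y, sr, scale="mel")
# lfcc = cepstral_coeffs(y, sr, scale="linear")
```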

    Robust speaker recognition based on latent variable models

    Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions. Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure for supervectors is presented by which qualitative insight about the information being captured is obtained. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the entries of the dictionary is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy. An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions and are therefore hard to analyze; moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably against the competitors. Also, our approach has closed-form solutions and scales gracefully to large datasets. Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.
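    The "transform, then model with a linear-Gaussian model" step can be illustrated with length normalization, a commonly used non-linear transformation of i-vectors, followed by plain factor analysis; the thesis proposes its own transformations, which may differ from this one. Dimensions and the random data below are placeholders.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ivectors = rng.normal(size=(1000, 400))          # placeholder i-vectors

def length_normalize(x):
    # Project each i-vector onto the unit sphere to reduce non-Gaussian behavior.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

fa = FactorAnalysis(n_components=150, random_state=0)
latent = fa.fit_transform(length_normalize(ivectors))
print(latent.shape)                              # (1000, 150)
# Closed-form Gaussian scoring (e.g., two-covariance or PLDA-style models)
# can then be applied in the transformed space.
```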