Cepstrum Coefficient Features Analysis for Multilingual Speaker Identification System
Cepstrum coefficient feature analysis plays a crucial role in the overall performance of a multilingual speaker identification system. The objective of this research is to investigate the results obtained when Mel-Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) are combined as feature components for the front-end processing of a multilingual speaker identification system. The combined MFCC and GFCC features are proposed to improve the reliability of such a system; in recent studies, GFCC features have shown strong robustness to noise and acoustic change. The main idea is to integrate MFCC and GFCC features to improve overall identification performance. The experiment was carried out on a recently collected multilingual speech database to analyse GFCC and MFCC. The database consists of speech recorded from 100 speakers, male and female, in three languages: Hindi, Marathi, and Rajasthani. The features extracted from the speech signals of the multiple languages are examined, and the results provide an empirical comparison of the combined MFCC-GFCC features against their individual counterparts. Average language-independent multilingual speaker identification rates of 84.66% (MFCC), 93.22% (GFCC), and 94.77% (combined features) were achieved.
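The GFCC front end described above can be sketched with SciPy's gammatone filter: an ERB-spaced filterbank, per-band frame energies, log compression, and a DCT. All parameters below (filter count, frequency range, frame sizes) are illustrative defaults, not the paper's configuration:

```python
import numpy as np
from scipy.signal import gammatone, lfilter
from scipy.fftpack import dct

def gfcc(signal, sr, n_filters=32, n_coeffs=13, frame_len=400, hop=160):
    """Sketch of gammatone frequency cepstral coefficient extraction."""
    # ERB-rate scale (Glasberg & Moore) for the filter centre frequencies
    erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_erb = lambda e: (10 ** (e / 21.4) - 1.0) / 4.37e-3
    centres = inv_erb(np.linspace(erb(100.0), erb(0.9 * sr / 2), n_filters))

    # Filter the signal through each gammatone channel, then frame energies
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.empty((n_frames, n_filters))
    for j, fc in enumerate(centres):
        b, a = gammatone(fc, 'fir', fs=sr)      # FIR gammatone (SciPy >= 1.6)
        y = lfilter(b, a, signal)
        for i in range(n_frames):
            frame = y[i * hop : i * hop + frame_len]
            energies[i, j] = np.sum(frame ** 2) + 1e-12

    # Log-compress and decorrelate with a DCT, as for MFCCs
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_coeffs]
```

Concatenating these with MFCC vectors frame-by-frame gives the kind of combined feature the abstract evaluates.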
PNCC FEATURES FOR ROBUST TEXT-INDEPENDENT SPEAKER RECOGNITION
Automatic speaker recognition has been the subject of intense research throughout the past decade. However, the performance of state-of-the-art algorithms degrades drastically in the presence of noise. This article focuses on the application of a technique called Power-Normalized Cepstral Coefficients (PNCC) to text-independent speaker recognition. The objective of this study is to evaluate this technique in comparison with the conventional Mel Frequency Cepstral Coefficients (MFCC) technique and the Gammatone Frequency Cepstral Coefficients (GFCC) technique.
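A drastically simplified sketch of the PNCC idea, keeping two of its ingredients: medium-time power normalization and a 1/15 power-law nonlinearity in place of the usual log. The full PNCC pipeline also includes asymmetric noise suppression and temporal masking, omitted here; the filterbank and frame parameters are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def simple_pncc(sig, sr, n_fft=512, hop=160, n_bands=26, n_coeffs=13, m=2):
    """Simplified PNCC: mel-band powers, medium-time normalization,
    1/15 power law, DCT."""
    # Short-time power spectrum with a Hann window
    win = np.hanning(n_fft)
    n_frames = 1 + (len(sig) - n_fft) // hop
    spec = np.array([np.abs(np.fft.rfft(sig[i*hop:i*hop+n_fft] * win)) ** 2
                     for i in range(n_frames)])

    # Triangular mel filterbank
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda v: 700 * (10 ** (v / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for j in range(n_bands):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fb[j, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fb[j, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    power = spec @ fb.T + 1e-12

    # Medium-time power: average over 2m+1 neighbouring frames
    pad = np.pad(power, ((m, m), (0, 0)), mode='edge')
    medium = np.mean([pad[i:i + n_frames] for i in range(2 * m + 1)], axis=0)

    # Normalize by medium-time power, then 1/15 power law instead of log
    return dct((power / medium) ** (1 / 15),
               type=2, axis=1, norm='ortho')[:, :n_coeffs]
```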
Adaptive wavelet thresholding with robust hybrid features for text-independent speaker identification system
The robustness of a speaker identification system over an additive-noise channel is crucial for real-world applications. In speaker identification (SID) systems, the features extracted from each speech frame are an essential factor in building a reliable identification system. In clean environments the identification system works well; in noisy environments, additive noise degrades it. To address the problem of additive noise and achieve high accuracy, an algorithm for feature extraction based on speech enhancement and combined features is presented. In this paper, a wavelet-thresholding pre-processing stage and feature warping (FW) are used with two combined features, power normalized cepstral coefficients (PNCC) and gammatone frequency cepstral coefficients (GFCC), to improve the system's robustness against different types of additive noise. A Universal Background Model Gaussian Mixture Model (UBM-GMM) is used for feature matching between the claimed and actual speakers. The results showed a performance improvement for the proposed feature-extraction algorithm compared with conventional features over most noise types and different SNRs.
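The feature-warping (FW) step above can be sketched as rank-based Gaussianization of each feature stream over a sliding window, following the common recipe; the window length and rank-to-quantile mapping below are standard choices, not necessarily the paper's exact configuration:

```python
import numpy as np
from scipy.stats import norm

def feature_warp(feats, win=301):
    """Warp each feature stream to a standard-normal distribution
    over a sliding window of `win` frames (rank-based Gaussianization)."""
    T, D = feats.shape
    half = win // 2
    out = np.empty_like(feats, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        window = feats[lo:hi]
        n = hi - lo
        for d in range(D):
            # 1-based rank of the current value within the window
            rank = 1 + np.sum(window[:, d] < feats[t, d])
            # Map the mid-rank quantile through the inverse normal CDF
            out[t, d] = norm.ppf((rank - 0.5) / n)
    return out
```

Applied per utterance, this flattens channel and slowly varying noise effects before the warped features reach the UBM-GMM back end.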
Robust cepstral feature for bird sound classification
Birds are excellent environmental indicators and may signal the sustainability of an ecosystem; they provide provisioning, regulating, and supporting services. Birdlife-conservation research therefore always takes centre stage. Given the airborne nature of birds and the density of tropical forest, identifying birds by audio may be a better solution than visual identification. The goal of this study is to find the cepstral features that classify bird sounds most accurately. Fifteen (15) endemic Bornean bird sounds were selected and segmented using an automated energy-based algorithm. Three (3) types of cepstral features were extracted: linear prediction cepstral coefficients (LPCC), mel frequency cepstral coefficients (MFCC), and gammatone frequency cepstral coefficients (GTCC); each was used separately for classification with a support vector machine (SVM). A comparison of their prediction results demonstrates that the model using GTCC features, at 93.3% accuracy, outperforms the models using MFCC and LPCC features, showing the robustness of GTCC for bird sound classification. The result is significant for the advancement of bird sound classification research, which has many applications, such as in eco-tourism and wildlife management.
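The classification stage can be illustrated with scikit-learn's SVC on synthetic per-clip feature vectors standing in for averaged cepstral features; the cluster geometry, class count, and kernel settings are invented purely for the demonstration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for per-clip averaged cepstral vectors: three
# "species", each a Gaussian cluster in a 13-D feature space
n_per_class, n_dims = 60, 13
X = np.vstack([rng.normal(loc=mu, scale=1.0, size=(n_per_class, n_dims))
               for mu in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
# Standardize, then fit an RBF-kernel SVM, as is typical for cepstral inputs
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

In the study itself the three feature types would each be fed through a pipeline like this and their test accuracies compared.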
Faked Speech Detection with Zero Knowledge
Audio is one of the most used ways of human communication, but at the same time it can easily be misused to trick people. With the AI revolution, the related technologies are now accessible to almost everyone, making it simple for criminals to commit crimes and forgeries. In this work, we introduce a neural network method to develop a classifier that will blindly classify an input audio as real or mimicked; the word 'blindly' refers to the ability to detect mimicked audio without references or real sources. The proposed model was trained on a set of important features extracted from a large dataset of audios, yielding a classifier that was tested on the same set of features from different audios. The data was extracted from two raw datasets composed especially for this work: an all-English dataset and a mixed (Arabic plus English) dataset. These datasets have been made available, in raw form, through GitHub for the use of the research community at https://github.com/SaSs7/Dataset. For comparison, the audios were also classified through human inspection, with native speakers as subjects. The results were interesting and exhibited formidable accuracy.
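A minimal sketch of such a blind real-vs-mimicked classifier, using scikit-learn's MLPClassifier on synthetic feature vectors; the feature distributions are fabricated purely to illustrate the train/test workflow, not the paper's actual features or architecture:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-ins for per-utterance feature vectors; "mimicked" audio
# is modelled as a shifted cluster purely for illustration
X_real = rng.normal(0.0, 1.0, size=(150, 20))
X_fake = rng.normal(1.5, 1.0, size=(150, 20))
X = np.vstack([X_real, X_fake])
y = np.array([0] * 150 + [1] * 150)     # 0 = real, 1 = mimicked

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
# Small feed-forward network; no reference audio is seen at test time,
# which is the sense of "blind" classification in the abstract
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```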
ROBUST HYBRID FEATURES BASED TEXT INDEPENDENT SPEAKER IDENTIFICATION SYSTEM OVER NOISY ADDITIVE CHANNEL
Robustness of speaker identification systems over additive noise is crucial for real-world applications. In this paper, two robust features, Power Normalized Cepstral Coefficients (PNCC) and Gammatone Frequency Cepstral Coefficients (GFCC), are combined to improve the robustness of a speaker identification system over different types of noise. A Universal Background Model Gaussian Mixture Model (UBM-GMM) is used for feature matching and classification to identify the claimed speakers. Evaluation results show that the proposed hybrid features improve the performance of the identification system compared with conventional features over most noise types and different signal-to-noise ratios.
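A simplified sketch of the matching stage: one maximum-likelihood GMM per enrolled speaker, scored against the test frames. A full UBM-GMM system would instead MAP-adapt each speaker model from a shared universal background model; the synthetic features below are invented for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic frame-level features for two enrolled speakers (hypothetical data)
train = {
    'spk_a': rng.normal(0.0, 1.0, size=(500, 13)),
    'spk_b': rng.normal(2.0, 1.0, size=(500, 13)),
}
# Enroll: fit one GMM per speaker on that speaker's frames
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(feats)
          for spk, feats in train.items()}

def identify(test_feats):
    # Pick the speaker whose model gives the highest average log-likelihood
    return max(models, key=lambda spk: models[spk].score(test_feats))

test = rng.normal(2.0, 1.0, size=(200, 13))   # drawn like spk_b
```

With hybrid PNCC+GFCC features, the frame vectors would simply be the concatenation of the two feature sets.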
DeepVOX: Discovering Features from Raw Audio for Speaker Recognition in Degraded Audio Signals
Automatic speaker recognition algorithms typically use pre-defined filterbanks, such as Mel-frequency and Gammatone filterbanks, to characterize speech audio. The design of these filterbanks is based on domain knowledge and limited empirical observations; the resultant features, therefore, may not generalize well to different types of audio degradation. In this work, we propose a deep learning-based technique to induce the filterbank design from vast amounts of speech audio, the purpose being to extract features robust to degradations in the input audio. First, a 1D convolutional neural network is designed to learn a time-domain filterbank called DeepVOX directly from raw speech audio. Second, an adaptive triplet mining technique is developed to efficiently mine the data samples best suited to train the filterbank. Third, a detailed ablation study of the DeepVOX filterbanks reveals the presence of both vocal source and vocal tract characteristics in the extracted features. Experimental results on the VoxCeleb2, NIST SRE 2008 and 2010, and Fisher speech datasets demonstrate the efficacy of the DeepVOX features across a variety of audio degradations, multilingual speech data, and speech of varying duration. The DeepVOX features also improve the performance of existing speaker recognition algorithms such as xVector-PLDA and iVector-PLDA.
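The forward pass of a learned time-domain filterbank can be sketched in NumPy: convolve the waveform with a bank of kernels, rectify, average-pool per frame, and log-compress. The kernels here are random stand-ins; in DeepVOX they are learned end-to-end under the triplet-mining objective rather than fixed like Mel or Gammatone filters:

```python
import numpy as np

def conv_filterbank(signal, filters, frame_len=400, hop=160):
    """Forward pass of a time-domain convolutional filterbank:
    convolve, rectify, average-pool per frame, log-compress."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.empty((n_frames, len(filters)))
    for j, h in enumerate(filters):
        y = np.convolve(signal, h, mode='same') ** 2    # filter + rectify
        for i in range(n_frames):
            frame = y[i * hop : i * hop + frame_len]
            feats[i, j] = np.log(np.mean(frame) + 1e-12)  # pool + compress
    return feats

rng = np.random.default_rng(0)
filters = rng.standard_normal((24, 64)) * 0.1   # 24 random 64-tap kernels
sig = rng.standard_normal(16000)                # 1 s of audio at 16 kHz
F = conv_filterbank(sig, filters)
```

Replacing the random kernels with trained ones, and the pooling with the network's own nonlinearity, recovers the shape of the DeepVOX front end.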