285,494 research outputs found
Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
In this paper, a novel Automatic Speaker Recognition (ASR) system is presented. The new ASR system includes novel feature extraction and vector classification steps utilizing distributed Discrete Cosine Transform (DCT-II) based Mel Frequency Cepstral Coefficients (MFCC) and Fuzzy Vector Quantization (FVQ). The ASR algorithm utilizes an MFCC-based approach to identify the dynamic features that are used for Speaker Recognition (SR).
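The MFCC pipeline described above ends by applying a DCT-II to log mel filterbank energies. A minimal NumPy sketch of that final step is given below; it assumes the log-mel energies are already computed, and the paper's "distributed" DCT-II variant is not specified here, so this shows only the standard orthonormal transform.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II along the last axis (the transform MFCCs are built on)."""
    n = x.shape[-1]
    k = np.arange(n)
    # basis[m, k] = cos(pi/n * (m + 0.5) * k)
    basis = np.cos(np.pi / n * (k[:, None] + 0.5) * k[None, :])
    out = 2.0 * (x @ basis)
    # orthonormal scaling (matches scipy.fft.dct(..., type=2, norm='ortho'))
    out[..., 0] *= np.sqrt(1.0 / (4 * n))
    out[..., 1:] *= np.sqrt(1.0 / (2 * n))
    return out

def mfcc_from_log_mel(log_mel_energies, n_ceps=13):
    """Keep the first n_ceps DCT-II coefficients of each frame's log mel energies."""
    return dct_ii(log_mel_energies)[..., :n_ceps]
```

Because the scaling is orthonormal, the transform preserves frame energy (Parseval), which is why truncating to the first few coefficients is a principled compression of the spectral envelope.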
On preprocessing of speech signals
Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the 3σ edit rule and the Hampel identifier, which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution based methods. The results obtained after applying the proposed strategies to some test voice signals are encouraging.
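The 3σ edit rule and the Hampel identifier both flag outliers, the former via mean and standard deviation, the latter via median and MAD. A rough sketch of a Hampel-style voiced/silence split on short-time energy follows; it assumes silence frames dominate the recording, and the frame length and threshold are illustrative choices rather than the paper's.

```python
import numpy as np

def short_time_energy(signal, frame_len=256):
    """Per-frame energy; the trailing partial frame is dropped."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

def hampel_voiced_mask(energy, t=3.0):
    """Hampel identifier: a frame is flagged 'voiced' if its energy lies more
    than t robust standard deviations (1.4826 * MAD) above the median.
    Assumes silence is the majority class, so voiced frames are the outliers."""
    med = np.median(energy)
    mad = np.median(np.abs(energy - med))
    sigma = 1.4826 * mad  # MAD -> std-dev scale under a Gaussian model
    return energy > med + t * sigma
```

Using median/MAD instead of mean/standard deviation keeps the threshold stable even when a few loud voiced frames would otherwise inflate the statistics, which is the usual argument for the Hampel identifier over the plain 3σ rule.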
Speaker diarization assisted ASR for multi-speaker conversations
In this paper, we propose a novel approach for the transcription of speech
conversations with natural speaker overlap, from single channel recordings. We
propose a combination of a speaker diarization system and a hybrid automatic
speech recognition (ASR) system with speaker activity assisted acoustic model
(AM). An end-to-end neural network system is used for speaker diarization. Two
architectures, (i) input conditioned AM, and (ii) gated features AM, are
explored to incorporate the speaker activity information. The models output
speaker specific senones. The experiments on Switchboard telephone
conversations show the advantage of incorporating speaker activity information
in the ASR system for recordings with overlapped speech. In particular, an
absolute improvement in word error rate (WER) is seen for the
proposed approach on natural conversation speech with automatic diarization.
Comment: Manuscript submitted to INTERSPEECH 202
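The gated-features AM can be pictured as an elementwise gate on the acoustic features, driven by the diarization system's per-frame speaker activity. The sketch below is a hypothetical NumPy parameterization of that idea; the paper's actual gating layer may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_features(acoustic_feats, speaker_activity, W_gate, b_gate):
    """Gate each frame's acoustic features with a sigmoid computed from the
    per-frame speaker-activity vector (hypothetical parameterization).

    acoustic_feats:   (T, D) e.g. filterbank frames
    speaker_activity: (T, S) per-frame activity of S speakers
    W_gate: (S, D), b_gate: (D,)
    """
    gate = sigmoid(speaker_activity @ W_gate + b_gate)  # (T, D), values in (0, 1)
    return acoustic_feats * gate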
Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
Adapting generic speech recognition models to specific individuals is a
challenging problem due to the scarcity of personalized data. Recent works have
proposed boosting the amount of training data using personalized text-to-speech
synthesis. Here, we ask two fundamental questions about this strategy: when is
synthetic data effective for personalization, and why is it effective in those
cases? To address the first question, we adapt a state-of-the-art automatic
speech recognition (ASR) model to target speakers from four benchmark datasets
representative of different speaker types. We show that ASR personalization
with synthetic data is effective in all cases, but particularly when (i) the
target speaker is underrepresented in the global data, and (ii) the capacity of
the global model is limited. To address the second question of why personalized
synthetic data is effective, we use controllable speech synthesis to generate
speech with varied styles and content. Surprisingly, we find that the text
content of the synthetic data, rather than style, is important for speaker
adaptation. These results lead us to propose a data selection strategy for ASR
personalization based on speech content.
Comment: ICASSP 202
An improved feature extraction method for Malay vowel recognition based on spectrum delta
Malay speech recognition is becoming popular among Malaysian researchers. In Malaysia, more local researchers are focusing on noise-robust, accurate, speaker-independent speech recognition systems for the Malay language. The performance of speech recognition applications under adverse noisy conditions is a frequent topic of interest among speech recognition researchers in any language. This paper presents a study of the noise-robust capability of an improved vowel feature extraction method called Spectrum Delta (SpD). The features are extracted from both original and noise-added data and classified using three classifiers: (i) Linear Discriminant Analysis (LDA), (ii) K-Nearest Neighbors (k-NN), and (iii) Multinomial Logistic Regression (MLR). Most dependent- and independent-speaker systems, which mostly use multi-framed analysis, yielded accuracies between 89% and 100% for speaker-dependent systems and between 70% and 94% for speaker-independent ones. This study shows that SpD features obtained an accuracy of 92.42% to 95.11% across the three classifiers on a single-framed analysis, which makes this result comparable to those analysed with a multi-framed approach.
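The exact Spectrum Delta definition is not given in the abstract, but delta-style features are conventionally computed with a regression over neighboring frames. The following is an illustrative sketch of that standard recipe, not of SpD itself.

```python
import numpy as np

def delta(features, n=2):
    """Standard regression-based delta over time:
    d_t = sum_{k=1..n} k * (f_{t+k} - f_{t-k}) / (2 * sum_{k=1..n} k^2).
    features: (T, D). Edges are handled by repeating the boundary frames."""
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, n + 1))
    T = features.shape[0]
    out = np.zeros(features.shape, dtype=float)
    for k in range(1, n + 1):
        out += k * (padded[n + k : n + k + T] - padded[n - k : n - k + T])
    return out / denom
```

On a linearly increasing feature track the delta recovers the slope exactly away from the edges, which is a quick sanity check for any implementation.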
Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition
Discrete audio representation, aka audio tokenization, has seen renewed
interest driven by its potential to facilitate the application of text language
modeling approaches in audio domain. To this end, various compression and
representation-learning based tokenization schemes have been proposed. However,
there is limited investigation into the performance of compression-based audio
tokens compared to well-established mel-spectrogram features across various
speaker and speech related tasks. In this paper, we evaluate compression based
audio tokens on three tasks: Speaker Verification, Diarization and
(Multi-lingual) Speech Recognition. Our findings indicate that (i) the models
trained on audio tokens perform competitively with mel-spectrogram features,
on average, for all the tasks considered, and do not surpass them
yet. (ii) these models exhibit robustness for out-of-domain narrowband data,
particularly in speaker tasks. (iii) audio tokens allow for compression to 20x
compared to mel-spectrogram features with minimal loss of performance in speech
and speaker related tasks, which is crucial for low bit-rate applications, and
(iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer
exhibits a low-pass frequency response characteristic, offering a plausible
explanation for the observed results, and providing insight for future
tokenizer designs.
Comment: Preprint. Submitted to ICASSP 202
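Residual Vector Quantization, mentioned in point (iv), quantizes in stages: each codebook encodes the residual left by the previous stage, so the reconstruction is the sum of one codeword per stage. A toy NumPy sketch:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage picks the codeword nearest to
    the residual left by the previous stage.
    x: (D,), codebooks: list of (K, D) arrays.
    Returns the per-stage code indices and the summed reconstruction."""
    residual = x.astype(float).copy()
    codes, recon = [], np.zeros_like(residual)
    for cb in codebooks:
        idx = np.argmin(np.sum((cb - residual) ** 2, axis=1))  # nearest codeword
        codes.append(int(idx))
        recon += cb[idx]
        residual -= cb[idx]
    return codes, recon
```

Stacking stages this way is what lets RVQ tokenizers trade token count for fidelity: each additional codebook refines the residual error of the previous ones.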