
    Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization

    In this paper, a novel Automatic Speaker Recognition (ASR) system is presented. The system includes novel feature-extraction and vector-classification steps utilizing distributed Discrete Cosine Transform (DCT-II) based Mel Frequency Cepstral Coefficients (MFCC) and Fuzzy Vector Quantization (FVQ). The ASR algorithm uses an MFCC-based approach to identify the dynamic features used for Speaker Recognition (SR).
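    The MFCC pipeline named in this abstract (power spectrum, mel filterbank, log, DCT-II) can be sketched as below. This is a generic illustration, not the paper's distributed DCT-II variant; all function names and default parameters are assumptions.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D array (the transform named in the abstract)."""
    N = len(x)
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    out = basis @ x
    out[0] *= np.sqrt(1.0 / N)
    out[1:] *= np.sqrt(2.0 / N)
    return out

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filters over the FFT bin centre frequencies."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)  # rising slope
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)  # falling slope
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: power spectrum -> mel energies -> log -> DCT-II."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    fb = mel_filterbank(n_filters, len(frame), sr)
    energies = np.log(fb @ spec + 1e-10)  # floor avoids log(0)
    return dct_ii(energies)[:n_ceps]
```

    The DCT-II step decorrelates the log mel energies, which is why only the first dozen or so coefficients are usually kept.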

    On preprocessing of speech signals

    Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the 3σ edit rule and the Hampel identifier, which are compared with conventional techniques: (i) short-time energy (STE) based methods and (ii) distribution-based methods. The results obtained after applying the proposed strategies to some test voice signals are encouraging.
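    The two outlier rules named in the abstract can be sketched over frame energies as follows. This is a minimal illustration of the general technique, not the paper's exact procedure; function names, frame sizes, and the threshold t=3 are assumptions.

```python
import numpy as np

def short_time_energy(signal, frame_len=256, hop=128):
    """Frame-wise energy: the statistic to which the outlier rules are applied."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(f ** 2) for f in frames])

def three_sigma_voiced(energies, t=3.0):
    """Classical 3-sigma edit rule: flag frames whose energy deviates from the
    mean by more than t standard deviations; flagged frames are taken as voiced."""
    return np.abs(energies - np.mean(energies)) > t * np.std(energies)

def hampel_voiced(energies, t=3.0):
    """Hampel identifier: same idea, but with the median and the MAD as robust
    location/scale estimates, so the outliers themselves do not inflate the
    threshold the way they inflate the mean and standard deviation."""
    med = np.median(energies)
    mad = 1.4826 * np.median(np.abs(energies - med))
    return np.abs(energies - med) > t * mad
```

    The factor 1.4826 rescales the MAD so it estimates the standard deviation for Gaussian data, making the two rules directly comparable.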

    Speaker diarization assisted ASR for multi-speaker conversations

    In this paper, we propose a novel approach for transcribing speech conversations with natural speaker overlap from single-channel recordings. We propose a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system with a speaker-activity-assisted acoustic model (AM). An end-to-end neural network system is used for speaker diarization. Two architectures, (i) an input-conditioned AM and (ii) a gated-features AM, are explored to incorporate the speaker activity information. The models output speaker-specific senones. Experiments on Switchboard telephone conversations show the advantage of incorporating speaker activity information into the ASR system for recordings with overlapped speech. In particular, an absolute improvement of 11% in word error rate (WER) is seen for the proposed approach on natural conversational speech with automatic diarization.
    Comment: Manuscript submitted to INTERSPEECH 202
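    The WER metric reported above is the word-level Levenshtein edit distance divided by the reference length; a minimal sketch (function name is an assumption):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / reference words,
    computed by dynamic programming over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

    An "absolute improvement of 11%" means the WER value itself drops by 0.11, e.g. from 0.35 to 0.24.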

    Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

    Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.
    Comment: ICASSP 202

    An improved feature extraction method for Malay vowel recognition based on spectrum delta

    Malay speech recognition is becoming popular among Malaysian researchers. In Malaysia, more local researchers are focusing on noise-robust and accurate speaker-independent speech recognition systems for the Malay language. The performance of speech recognition applications under adverse noisy conditions is often a topic of interest among speech recognition researchers in any language. This paper presents a study of the noise-robust capability of an improved vowel feature extraction method called Spectrum Delta (SpD). The features are extracted from both original and noise-added data and classified using three classifiers: (i) Linear Discriminant Analysis (LDA), (ii) K-Nearest Neighbors (k-NN), and (iii) Multinomial Logistic Regression (MLR). Most dependent- and independent-speaker systems, which mostly use multi-frame analysis, yielded accuracies between 89% and 100% for speaker-dependent systems and between 70% and 94% for speaker-independent ones. This study shows that SpD features obtained accuracies of 92.42% to 95.11% with these classifiers on a single-frame analysis, which makes the result comparable to those analysed with the multi-frame approach.
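    The abstract does not define the SpD computation, but the standard delta-coefficient formula over a feature sequence gives a feel for difference-based spectral features. This generic sketch is not the paper's SpD method; the function name and window width N are assumptions.

```python
import numpy as np

def delta(features, N=2):
    """Standard delta coefficients: a regression-based first-order difference
    over a (frames x dims) feature sequence, with edge padding at the ends."""
    T = len(features)
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(T):
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out
```

    On a feature track that rises linearly over time, the interior delta values come out as the constant slope, which is the intended "rate of change" interpretation.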

    Discrete Audio Representation as an Alternative to Mel-Spectrograms for Speaker and Speech Recognition

    Discrete audio representation, a.k.a. audio tokenization, has seen renewed interest driven by its potential to facilitate the application of text language-modeling approaches in the audio domain. To this end, various compression- and representation-learning-based tokenization schemes have been proposed. However, there is limited investigation into the performance of compression-based audio tokens compared to well-established mel-spectrogram features across various speaker- and speech-related tasks. In this paper, we evaluate compression-based audio tokens on three tasks: Speaker Verification, Diarization, and (Multi-lingual) Speech Recognition. Our findings indicate that (i) the models trained on audio tokens perform competitively, on average within 1% of mel-spectrogram features for all the tasks considered, and do not surpass them yet; (ii) these models exhibit robustness on out-of-domain narrowband data, particularly in speaker tasks; (iii) audio tokens allow for 20x compression relative to mel-spectrogram features with minimal loss of performance in speech- and speaker-related tasks, which is crucial for low bit-rate applications; and (iv) the examined Residual Vector Quantization (RVQ) based audio tokenizer exhibits a low-pass frequency response characteristic, offering a plausible explanation for the observed results and providing insight for future tokenizer designs.
    Comment: Preprint. Submitted to ICASSP 202
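    The residual vector quantization scheme examined above works in stages: each stage quantizes whatever residual the previous stage left behind, and the reconstruction is the sum of the selected codewords. A minimal sketch with nearest-neighbour codebook lookup (function names and toy codebooks are assumptions, not any real tokenizer's API):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage picks the nearest codeword
    to the current residual, then subtracts it before the next stage."""
    residual = x.astype(float).copy()
    indices = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices, residual  # one token index per stage, plus leftover error

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))
```

    Because each stage only needs to encode what the previous stages missed, later codebooks can be coarse yet still shrink the error, which is what enables the high compression ratios the abstract reports.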