
    Unsupervised Speaker Change Detection for Broadcast News Segmentation

    This paper presents a speaker change detection system for broadcast news segmentation based on a vector quantization (VQ) approach. The system makes no assumptions about the number of speakers or their identities. Mel-frequency cepstral coefficients are used as features, and change detection is performed with the VQ distortion measure, which is evaluated against two other statistics: the symmetric Kullback-Leibler (KL2) distance and the so-called 'divergence shape distance'. First-level alarms are further tested using the VQ distortion, and we find that the false alarm rate can be reduced without significant loss in the detection of correct changes. We furthermore evaluate the generalizability of the approach by testing the complete system on an independent set of broadcasts, including a channel not present in the training set.
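    As a rough illustration of one of the change-detection statistics named above, the symmetric Kullback-Leibler (KL2) distance between two feature windows can be sketched by modeling each window as a single diagonal-covariance Gaussian over its MFCC vectors. The windowing, dimensions, and threshold logic here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def kl2_distance(x, y, eps=1e-8):
    """Symmetric Kullback-Leibler (KL2) distance between two feature
    windows x, y of shape (frames, dims), each modeled as a single
    diagonal-covariance Gaussian over e.g. MFCC vectors."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    vx = x.var(axis=0) + eps
    vy = y.var(axis=0) + eps
    # Closed-form KL(p||q) for diagonal Gaussians, summed over dims
    kl_xy = 0.5 * np.sum(vx / vy + (my - mx) ** 2 / vy - 1.0 + np.log(vy / vx))
    kl_yx = 0.5 * np.sum(vy / vx + (mx - my) ** 2 / vx - 1.0 + np.log(vx / vy))
    return kl_xy + kl_yx
```

    Sliding a pair of adjacent windows over the feature stream and flagging local maxima of this distance above a threshold yields candidate change points; the paper's system then re-tests such first-level alarms with the VQ distortion.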

    Efficient speaker recognition for mobile devices


    Voice Identification Using MFCC and Vector Quantization

    Speaker identification is one of the fundamental problems in speech processing and voice modeling. Its applications include authentication in critical security systems, where selection accuracy matters. Large-scale voice recognition applications are a major challenge: quickly searching a speaker database requires fast, modern techniques that rely on artificial intelligence to achieve the desired results from the system. Many efforts have been made toward this goal through the creation of feature-based systems and the development of new methodologies for speaker identification. Speaker identification is the process of recognizing who is speaking using characteristics extracted from their speech waves, such as pitch, tone, and frequency. Speaker models are created, saved in the system environment, and later used to verify the identity claimed by people accessing the system, which grants access to various voice-controlled services. Speaker identification involves two main parts: the first is feature extraction and the second is feature matching.
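    A minimal sketch of the MFCC-plus-VQ matching scheme described above, using plain k-means for codebook training. The codebook size, iteration counts, and function names are illustrative choices, not the paper's actual configuration:

```python
import numpy as np

def train_codebook(features, k=8, iters=20, seed=0):
    """Train a VQ codebook (k centroids) over a speaker's feature
    vectors, e.g. MFCCs of shape (frames, dims), via plain k-means."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(features[:, None] - codebook[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = features[labels == j]
            if len(pts):
                codebook[j] = pts.mean(axis=0)
    return codebook

def vq_distortion(features, codebook):
    """Average distance from each vector to its nearest codeword;
    lower means the features fit this speaker's codebook better."""
    d = np.linalg.norm(np.asarray(features, float)[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()

def identify(features, codebooks):
    """Return the speaker label whose codebook gives minimum distortion."""
    return min(codebooks, key=lambda s: vq_distortion(features, codebooks[s]))
```

    At enrollment time each speaker gets a trained codebook; at test time the unknown utterance's features are scored against every codebook and the minimum-distortion speaker is returned.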

    Speaker Recognition Systems: A Tutorial

    Abstract This paper gives an overview of speaker recognition systems. Speaker recognition is the task of automatically recognizing who is speaking by identifying an unknown speaker among several reference speakers, using speaker-specific information contained in speech waves. The different classifications of speaker recognition and the speech processing techniques required to perform the recognition task are discussed. The basic modules of a speaker recognition system are outlined; some of the techniques required to implement each module are discussed while others are mentioned, and the methods are compared with one another. Finally, the paper concludes by giving a few research trends in speaker recognition for years to come.

    Word And Speaker Recognition System

    In this report, a system which combines user-dependent word recognition and text-dependent speaker recognition is described. Word recognition is the process of converting an audio signal, captured by a microphone, into a word. Speaker identification is the ability to recognize a person's identity based on a specific word he or she uttered. A person's voice contains various parameters that convey information such as gender, emotion, health, attitude, and identity; speaker recognition identifies who is speaking based on the unique voiceprint in the speech data. Voice Activity Detection (VAD), Spectral Subtraction (SS), Mel-Frequency Cepstral Coefficients (MFCC), Vector Quantization (VQ), Dynamic Time Warping (DTW), and k-Nearest Neighbour (k-NN) are the methods used in the word recognition part of the project, implemented in MATLAB. For the speaker recognition part, Vector Quantization (VQ) is used. The system was successfully implemented, with a recognition rate of 84.44% for word recognition and 54.44% for speaker recognition.
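    Of the techniques listed above, Dynamic Time Warping admits a compact sketch: it aligns two variable-length feature sequences by minimizing the accumulated frame-to-frame distance along a warping path. This is the textbook DP recursion with Euclidean frame distances, not necessarily the exact variant used in the report:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between feature sequences
    a (n, d) and b (m, d), using Euclidean frame distances and
    the standard step pattern (match, insert, delete)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Cheapest way to reach (i, j) from the three predecessors
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

    In a word recognizer of this kind, the test utterance's MFCC sequence is compared against each reference template with DTW, and classification (e.g. via k-NN) is done over the resulting distances.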

    An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony

    In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short Time Fourier Transform (STFT) domain. One of these bases captures the spectral energy of the acoustic echo signal and is formed from the incoming far-end user's speech, while the other captures the spectral energy of the near-end speaker and is trained on speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes, at similar computational cost. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk.
    Using a standard evaluation technique, the proposed algorithm is shown to have detection performance comparable to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room-change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false double-talk indications upon initiation and in response to room changes, in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
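    The NMF decomposition at the heart of the speaker-extraction approach can be sketched as follows: the microphone magnitude spectrogram is decomposed onto the union of a fixed echo basis and a fixed near-end speaker basis, updating only the activations. Euclidean-cost multiplicative updates are used here for brevity; the thesis's exact cost function, update rules, and basis training may differ:

```python
import numpy as np

def nmf_separate(V, W_echo, W_near, iters=300, eps=1e-9):
    """Decompose a nonnegative magnitude-STFT matrix V (freq, time)
    onto the union of two fixed bases: W_echo (built from far-end
    speech) and W_near (pre-trained near-end speaker basis). Only
    the activations H are updated, via multiplicative updates for
    the Euclidean NMF cost ||V - W H||^2. Returns the echo and
    near-end magnitude estimates."""
    W = np.hstack([W_echo, W_near])               # union of the two bases
    H = np.random.default_rng(0).random((W.shape[1], V.shape[1])) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # standard MU rule, W fixed
    k = W_echo.shape[1]
    return W_echo @ H[:k], W_near @ H[k:]         # per-source reconstructions
```

    The near-end estimate (optionally refined with a soft mask) is what gets resynthesized for transmission, which is why no separate double-talk detector is needed in this scheme.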