
    Text-Independent Speaker Identification Systems Based on Multi-Layer Gaussian Mixture Models

    Project No.: NSC92-2213-E032-026. Research period: 2003/08 to 2004/07. Research funding: 541,000. Sponsor: National Science Council, Executive Yuan

    Enhancement of a Text-Independent Speaker Verification System by using Feature Combination and Parallel-Structure Classifiers

    Speaker Verification (SV) systems mainly involve two individual stages: feature extraction and classification. In this paper, we explore these two modules with the aim of improving the performance of a speaker verification system under noisy conditions. On the one hand, the choice of the most appropriate acoustic features is a crucial factor for performing robust speaker verification. The acoustic parameters used in the proposed system are: Mel Frequency Cepstral Coefficients (MFCC), their first and second derivatives (Deltas and Delta-Deltas), Bark Frequency Cepstral Coefficients (BFCC), Perceptual Linear Predictive (PLP) coefficients, and Relative Spectral Transform - Perceptual Linear Predictive (RASTA-PLP) coefficients. In this paper, a complete comparison of different combinations of these features is discussed. On the other hand, the major weakness of a conventional Support Vector Machine (SVM) classifier is its use of generic kernel functions to compute the distances among data points, even though the kernel function of an SVM has a great influence on its performance. In this work, we propose the combination of two SVM-based classifiers with different kernel functions (a linear kernel and a Gaussian Radial Basis Function (RBF) kernel) with a Logistic Regression (LR) classifier. The combination is carried out by means of a parallel-structure approach, in which different voting rules for taking the final decision are considered. Results show that a significant improvement in the performance of the SV system is achieved by using the combined features with the combined classifiers, both with clean speech and in the presence of noise. Finally, to further enhance the system in noisy environments, the inclusion of a multiband noise removal technique as a preprocessing stage is proposed.
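    The parallel-structure combination described above maps naturally onto off-the-shelf tooling. Below is a minimal sketch, not the authors' implementation: two SVMs with different kernels plus a logistic-regression classifier, fused by a majority-vote rule using scikit-learn. The feature matrix stands in for the combined MFCC/BFCC/PLP/RASTA-PLP parameters, and all data and names are illustrative.

```python
# Sketch of a parallel-structure classifier bank: linear-kernel SVM,
# RBF-kernel SVM, and logistic regression, combined by majority voting.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_parallel_verifier() -> VotingClassifier:
    svm_linear = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    svm_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
    logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # voting="hard" implements a majority rule over the three accept/reject
    # decisions; other voting rules could be emulated via estimator weights.
    return VotingClassifier(
        estimators=[("svm_lin", svm_linear), ("svm_rbf", svm_rbf), ("lr", logreg)],
        voting="hard",
    )

# Toy stand-in data: one row per utterance, columns = concatenated acoustic
# features; y = 1 for the target speaker, 0 for impostors.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 60)), rng.integers(0, 2, size=200)
verifier = build_parallel_verifier().fit(X, y)
print(verifier.predict(X[:5]))
```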

    Time and Frequency Pruning for Speaker Identification

    This work is an attempt to refine decisions in speaker identification. A test utterance is divided into multiple time-frequency blocks on which a normalized likelihood score is calculated. Instead of averaging the block likelihoods over the whole test utterance, some of them are rejected (pruning) and the final score is computed from a limited number of time-frequency blocks. The results obtained in the special case of time pruning led the authors to experiment with a joint time and frequency pruning approach. The optimal percentage of blocks pruned is learned on a tuning data set with the minimum identification error criterion. Validation of the time-frequency pruning process on 567 speakers leads to a significant error rate reduction (up to a 41% reduction on TIMIT) for short training and test durations.

    Introduction. Mono-gaussian models for speaker recognition have been largely replaced by Gaussian Mixture Models (GMM), which are dedicated to modeling smaller clusters of speech. Gaussian mixture modeling can be seen as a cooperation of models, since the Gaussian mixture density is a weighted linear combination of uni-modal Gaussian densities. The work presented here is instead concerned with a competition of models, since different mono-gaussian models (corresponding to different frequency subbands) are applied to the test signal and the decision is made with the best or the N-best model scores. More precisely, a test utterance is divided into time-frequency blocks, each of them corresponding to a particular frequency subband and a particular time segment. During the recognition phase, the block scores are accumulated over the whole test utterance to compute a global score and take a final decision. In this work, we investigate accumulation using a hard-threshold approach: some block scores are eliminated (pruning) and the final decision is taken with a subset of these scores. This approach should be robust in the case of time-frequency localized noise, since the least reliable time-frequency blocks can be removed. Even in the case of clean speech, some blocks of a speaker's test utterance can simply be more similar to another speaker's model than to the target speaker's model itself. Removing these error-prone blocks should lead to a more robust decision. In the remainder of the paper, a formalism is first proposed to describe our block-based speaker recognition system. The potential of this approach is then shown with a special case of the formalism, time pruning. Experiments intended to find the optimal percentage of blocks pruned are described next, and the optimal parameters (percentage of blocks pruned) are validated on the TIMIT and NTIMIT databases. Finally, we summarize our main results and outline the potential advantages of the time-frequency pruning procedure.

    Formalism: mono-gaussian "segmental" modeling. Let $\{x_t\}_{1 \le t \le M}$ be a sequence of $M$ vectors resulting from the $p$-dimensional acoustic analysis of a speech signal uttered by speaker $X$. These vectors are summarized by the mean vector $\bar{x}$ and the covariance matrix $X$: $\bar{x} = \frac{1}{M}\sum_{t=1}^{M} x_t$ and $X = \frac{1}{M}\sum_{t=1}^{M} (x_t - \bar{x})(x_t - \bar{x})^{\top}$.
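    Under one plausible reading of the pruning rule above (a sketch, not the authors' code), each candidate speaker gets a grid of normalized block log-likelihoods, a fixed fraction of the lowest-scoring blocks is discarded, and the survivors are averaged; the pruning fraction would be tuned on held-out data, as the paper describes.

```python
# Minimal sketch of time-frequency block pruning for identification:
# drop the worst-scoring fraction of blocks, average the rest.
import numpy as np

def pruned_score(block_scores: np.ndarray, prune_fraction: float) -> float:
    """block_scores: (time_segments x subbands) log-likelihoods for one speaker."""
    flat = np.sort(block_scores.ravel())                  # ascending order
    n_keep = max(1, int(round(flat.size * (1.0 - prune_fraction))))
    return float(flat[-n_keep:].mean())                   # mean of best blocks

def identify(speaker_scores: dict[str, np.ndarray], prune_fraction: float = 0.3) -> str:
    """Pick the speaker whose pruned block-score average is highest."""
    return max(speaker_scores, key=lambda s: pruned_score(speaker_scores[s], prune_fraction))

# Toy example: 10 time segments x 4 frequency subbands per candidate speaker.
rng = np.random.default_rng(1)
scores = {"spk_a": rng.normal(0.0, 1.0, (10, 4)), "spk_b": rng.normal(0.2, 1.0, (10, 4))}
print(identify(scores))
```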

    Noise robust speaker verification using mel-frequency discrete wavelet coefficients and parallel model compensation

    Interfering noise severely degrades the performance of a speaker verification system. The Parallel Model Combination (PMC) technique is one of the most efficient techniques for dealing with such noise. Another approach is to use features that are local in the frequency domain. Recently, Mel-Frequency Discrete Wavelet Coefficients (MFDWCs) [1, 2] were proposed as speech features local in the frequency domain. In this paper, we discuss using PMC along with MFDWC features to take advantage of both noise compensation and local features, and thereby decrease the effect of noise on speaker verification performance. We evaluate the performance of MFDWCs using the NIST 1998 speaker recognition and NOISEX-92 databases for various noise types and noise levels, and we compare their performance with that of MFCCs, in both cases using PMC to deal with additive noise. The experimental results show significant performance improvements for MFDWCs versus MFCCs after compensating the Gaussian Mixture Models (GMMs) using the PMC technique. The MFDWCs gave 5.24 and 3.23 points of performance improvement on average over MFCCs for -6 dB and 0 dB SNR values, corresponding to relative reductions in equal error rate (EER) of 26.44% and 23.73%, respectively.
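    For orientation, the simplest static-parameter form of PMC is the log-add approximation: map the clean-speech and noise cepstral means back to the linear filter-bank domain, add them, and return to the cepstral domain. The sketch below uses a DCT cepstrum for concreteness; for MFDWCs the inverse transform would be the wavelet transform instead, and covariance compensation is omitted, so this is an assumption-laden illustration rather than the paper's exact procedure.

```python
# Log-add PMC sketch for a static cepstral mean: combine clean-speech and
# noise means additively in the linear (filter-bank energy) domain.
import numpy as np
from scipy.fftpack import dct, idct

def pmc_log_add(mu_speech_cep, mu_noise_cep, gain=1.0):
    """Compensate a clean cepstral mean for additive noise (log-add form)."""
    # cepstrum -> log filter-bank energies
    log_speech = idct(mu_speech_cep, type=2, norm="ortho")
    log_noise = idct(mu_noise_cep, type=2, norm="ortho")
    # log domain -> linear domain, additive combination, back to log
    log_noisy = np.log(np.exp(log_speech) + gain * np.exp(log_noise))
    # log filter-bank energies -> cepstrum
    return dct(log_noisy, type=2, norm="ortho")

# Toy 13-dimensional cepstral means; `gain` scales the noise level (SNR).
rng = np.random.default_rng(2)
mu_clean, mu_noise = rng.normal(size=13), rng.normal(size=13)
print(pmc_log_add(mu_clean, mu_noise, gain=0.5))
```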

    Noise-Robust Voice Conversion

    A persistent challenge in speech processing is the presence of noise that reduces the quality of speech signals. Whether natural speech is used as input or speech is the desired output to be synthesized, noise degrades the performance of these systems and causes output speech to be unnatural. Speech enhancement deals with this problem, typically seeking to improve the input speech or to post-process the (re)synthesized speech. An intriguing complement to post-processing speech signals is voice conversion, in which speech by one person (the source speaker) is made to sound as if spoken by a different person (the target speaker). Traditionally, the majority of speech enhancement and voice conversion methods rely on parametric modeling of speech. A promising complement to parametric models is an inventory-based approach, which is the focus of this work. In inventory-based speech systems, one records an inventory of clean speech signals as a reference. Noisy speech (in the case of enhancement) or target speech (in the case of conversion) can then be replaced by the best-matching clean speech in the inventory, which is found via a correlation search method. Such an approach has the potential to alleviate the intelligibility and unnaturalness issues often encountered by parametric speech processing systems. This work investigates and compares inventory-based speech enhancement methods with conventional ones. In addition, the inventory search method is applied to estimate source speaker characteristics for voice conversion in noisy environments. Two noisy-environment voice conversion systems were constructed for a comparative study: a direct voice conversion system and an inventory-based voice conversion system, both with limited noise filtering at the front end. Results from this work suggest that the inventory method offers encouraging improvements over the direct conversion method.
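    The correlation search mentioned above can be illustrated in a few lines: score every clean inventory segment against the input segment with a normalized correlation and keep the best match. Segment length, hop size, and any spectral preprocessing are simplifying assumptions here, not details taken from the work.

```python
# Illustrative inventory correlation search: replace a noisy segment by the
# clean inventory segment with the highest normalized correlation.
import numpy as np

def normalized_corr(a: np.ndarray, b: np.ndarray) -> float:
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.dot(a, b) / a.size)

def best_inventory_match(segment: np.ndarray, inventory: list[np.ndarray]) -> int:
    """Return the index of the inventory segment most correlated with `segment`."""
    return int(np.argmax([normalized_corr(segment, inv) for inv in inventory]))

# Toy usage: a noisy 320-sample segment should match its clean original.
rng = np.random.default_rng(3)
inventory = [rng.normal(size=320) for _ in range(100)]
noisy = inventory[42] + 0.3 * rng.normal(size=320)
print(best_inventory_match(noisy, inventory))  # expected: 42
```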

    Localization and Selection of Speaker Specific Information with Statistical Modeling

    Statistical modeling of the speech signal has been widely used in speaker recognition. The performance obtained with this type of modeling is excellent in laboratories but decreases dramatically for telephone or noisy speech. Moreover, it is difficult to know which pieces of information are taken into account by the system. In order to solve this problem and to improve current systems, a better understanding of the nature of the information used by statistical methods is needed. This knowledge should make it possible to select only the relevant information or to add new sources of information. The first part of this paper presents experiments that aim at localizing the most useful acoustic events for speaker recognition. The relation between discriminant ability and the nature of speech events is studied. In particular, the phonetic content, the signal stability, and the frequency domain are explored. Finally, the potential of the dynamic information contained in the relation between a frame and its p neighbours is investigated. In the second part, the authors suggest a new selection procedure designed to select the pertinent features. Conventional feature selection techniques (ascendant selection, knockout) provide only global and a posteriori knowledge about the relevance of an information source. However, some speech clusters may be very efficient for recognizing a particular speaker while being uninformative for another one. Moreover, some information classes may be corrupted or even missing under particular recording conditions. This necessity fo
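    For reference, the "ascendant selection" baseline mentioned above is ordinary greedy forward selection. A minimal sketch follows; the scoring model and stopping rule are illustrative choices, not the paper's.

```python
# Greedy forward ("ascendant") feature selection: repeatedly add the feature
# whose inclusion most improves cross-validated classification accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ascendant_selection(X, y, n_features):
    selected: list[int] = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # score every candidate feature added to the current subset
        scores = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=3).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(4)
X, y = rng.normal(size=(120, 10)), rng.integers(0, 2, size=120)
print(ascendant_selection(X, y, n_features=3))
```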

    Communications Biophysics

    Contains reports on seven research projects split into three sections.
    National Institutes of Health (Grant 5 PO1 NS13126)
    National Institutes of Health (Grant 1 RO1 NS18682)
    National Institutes of Health (Training Grant 5 T32 NS07047)
    National Science Foundation (Grant BNS77-16861)
    National Institutes of Health (Grant 1 F33 NS07202-01)
    National Institutes of Health (Grant 5 RO1 NS10916)
    National Institutes of Health (Grant 5 RO1 NS12846)
    National Institutes of Health (Grant 1 RO1 NS16917)
    National Institutes of Health (Grant 1 RO1 NS14092-05)
    National Science Foundation (Grant BNS 77 21751)
    National Institutes of Health (Grant 5 R01 NS11080)
    National Institutes of Health (Grant GM-21189)

    Electroacoustic and Behavioural Evaluation of Hearing Aid Digital Signal Processing Features

    Modern digital hearing aids provide an array of features to improve the user's listening experience. As the features become more advanced and interdependent, it becomes increasingly necessary to develop accurate and cost-effective methods to evaluate their performance. Subjective experiments are an accurate method for determining hearing aid performance, but they come with a high monetary and time cost. Four studies that develop and evaluate electroacoustic hearing aid feature evaluation techniques are presented. The first study applies a recent speech quality metric to two bilateral wireless hearing aids with various features enabled in a variety of environmental conditions. The study shows that accurate speech quality predictions are made with a reduced version of the original metric, and that a portion of the original metric does not perform well when applied to a novel subjective speech quality rating database. The second study presents a reference-free (non-intrusive) electroacoustic speech quality metric developed specifically for hearing aid applications and compares its performance to a recent intrusive metric. The non-intrusive metric offers the advantage of eliminating the need for a shaped reference signal and can be used in real-time applications, but requires a sacrifice in prediction accuracy. The third study investigates the digital noise reduction performance of seven recent hearing aid models. An electroacoustic measurement system is presented that allows the noise and speech signals to be separated from hearing aid recordings. It is shown how this can be used to investigate digital noise reduction performance through the application of speech quality and speech intelligibility measures, and how the system can be used to quantify digital noise reduction attack times. The fourth study presents a turntable-based system to investigate hearing aid directionality performance. Two methods to extract the signal of interest are described. Polar plots are presented for a number of hearing aid models from recordings generated both in the free field and from a head-and-torso simulator. It is expected that the proposed electroacoustic techniques will assist audiologists and hearing researchers in choosing, benchmarking, and fine-tuning hearing aid features.
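    The abstract does not spell out how speech and noise are separated from a single hearing-aid recording. One widely used scheme, and a plausible reading, is the Hagerman and Olofsson phase-inversion technique: present the mixture twice, the second time with the noise polarity inverted, then sum and difference the two recordings. The sketch below assumes an approximately linear device and perfect time alignment; it is an illustration, not the paper's method.

```python
# Phase-inversion separation sketch: given two recordings of the same scene,
# one with the noise polarity inverted, recover speech and noise components.
import numpy as np

def phase_inversion_separate(rec_a: np.ndarray, rec_b: np.ndarray):
    """rec_a = device(speech + noise); rec_b = device(speech - noise).
    For a (quasi-)linear device, the sum/difference isolates each component."""
    speech = 0.5 * (rec_a + rec_b)
    noise = 0.5 * (rec_a - rec_b)
    return speech, noise

# Toy demo with an idealized unit-gain "hearing aid":
rng = np.random.default_rng(5)
s, n = rng.normal(size=1000), rng.normal(size=1000)
speech_hat, noise_hat = phase_inversion_separate(s + n, s - n)
print(np.allclose(speech_hat, s), np.allclose(noise_hat, n))
```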

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches such as MFCCs and HMMs to more recent advances in deep learning architectures such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.