
    Acoustic Approaches to Gender and Accent Identification

    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this problem is similar to language identification, different accents of a language exhibit more fine-grained differences between classes than languages do. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions of state-of-the-art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of the thesis is concerned with the application of the i-Vector technique to accent identification; this is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high-accuracy accent identification without relying on transcriptions and without using phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector-based accent classification that improve on the standard approaches usually applied for speaker or language identification, which are insufficient for this task. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers obtainable from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to a 90% identification rate. This performance is better even than previously reported acoustic-phonotactic systems on the same corpus, and is very close to the performance obtained via transcription-based accent identification. Finally, we demonstrate that utilising our techniques for speech recognition leads to considerably lower word error rates.
    Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition
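
    By way of illustration, the following is a minimal sketch of acoustic accent identification in the spirit described above: it assumes i-vectors have already been extracted, uses an LDA projection with cosine scoring plus a linear SVM, and fuses the two at score level. All data, parameters, and the fusion weight are illustrative stand-ins, not the thesis's actual configuration.

    ```python
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC

    # Hypothetical pre-extracted i-vectors: one row per utterance.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 400))     # (n_utterances, ivector_dim)
    y_train = rng.integers(0, 4, size=200)    # 4 accent classes
    X_test = rng.normal(size=(20, 400))

    # Projection 1: LDA maps i-vectors onto class-discriminative axes.
    lda = LinearDiscriminantAnalysis(n_components=3).fit(X_train, y_train)

    def cosine_scores(Z_train, y, Z_test):
        """Score each test vector against length-normalised class means."""
        Zt = Z_test / np.linalg.norm(Z_test, axis=1, keepdims=True)
        means = np.stack([Z_train[y == c].mean(0) for c in np.unique(y)])
        means /= np.linalg.norm(means, axis=1, keepdims=True)
        return Zt @ means.T                   # (n_test, n_classes)

    scores_cos = cosine_scores(lda.transform(X_train), y_train,
                               lda.transform(X_test))

    # Classifier 2: linear SVM on the raw i-vectors.
    svm = SVC(kernel="linear", probability=True).fit(X_train, y_train)
    scores_svm = svm.predict_proba(X_test)

    def softmax(s):
        """Put cosine scores on a probability scale before fusion."""
        e = np.exp(s - s.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    alpha = 0.5                               # fusion weight (tuned on dev data)
    fused = alpha * softmax(scores_cos) + (1 - alpha) * scores_svm
    pred = fused.argmax(axis=1)               # one accent label per test utterance
    ```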

    ABSP System for The Third DIHARD Challenge

    This report describes the speaker diarization system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our primary contribution is an acoustic domain identification (ADI) system for speaker diarization, based on speaker embeddings. We apply a domain-dependent threshold for agglomerative hierarchical clustering and, in addition, optimize the parameters for PCA-based dimensionality reduction in a domain-dependent way. Integrating these domain-based processing schemes into the challenge's baseline system achieved relative improvements in DER of 9.63% and 10.64% for the core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.
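
    A minimal sketch of the domain-dependent clustering idea, assuming speaker embeddings per segment and an ADI label are already available; the per-domain PCA dimensions and AHC thresholds below are made-up placeholders, not the report's tuned values.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import AgglomerativeClustering

    # Hypothetical per-domain settings; the real values would be tuned
    # per acoustic domain on the DIHARD development set.
    DOMAIN_CONFIG = {
        "broadcast_interview": {"pca_dim": 30, "threshold": 0.6},
        "restaurant":          {"pca_dim": 20, "threshold": 0.9},
    }

    def cluster_speakers(embeddings, domain):
        """PCA-reduce segment embeddings, then AHC with a domain-dependent
        distance threshold (the number of speakers is left open)."""
        cfg = DOMAIN_CONFIG[domain]
        reduced = PCA(n_components=cfg["pca_dim"]).fit_transform(embeddings)
        ahc = AgglomerativeClustering(
            n_clusters=None,
            distance_threshold=cfg["threshold"],
            metric="cosine",
            linkage="average",
        )
        return ahc.fit_predict(reduced)       # one speaker label per segment

    segments = np.random.default_rng(1).normal(size=(50, 128))  # placeholder embeddings
    labels = cluster_speakers(segments, "restaurant")
    ```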

    On the development of an automatic voice pleasantness classification and intensity estimation system

    In the last few years, the number of systems and devices that use voice-based interaction has grown significantly. For continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However, there are currently very few studies that try to evaluate, from a perceptual point of view, how pleasant a voice is when the final application is a speech-based interface. In this paper we present an objective definition of voice pleasantness, based on the composition of a representative feature subset, and a new automatic voice pleasantness classification and intensity estimation system. Our study is based on a database composed of European Portuguese female voices, but the methodology can be extended to male voices or to other languages. In the objective performance evaluation, the system achieved a 9.1% error rate for voice pleasantness classification and a 15.7% error rate for voice pleasantness intensity estimation.
    Work partially supported by ERDF funds, the Spanish Government (TEC2009-14094-C04-04), and Xunta de Galicia (CN2011/019, 2009/062).
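
    A minimal sketch of the two coupled tasks (pleasantness classification and intensity estimation), assuming a representative acoustic feature subset has already been extracted per voice; the paper's actual features and models are not specified in the abstract, so generic SVM-based models are used here purely for illustration.

    ```python
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC, SVR

    # Hypothetical pre-extracted acoustic features per voice.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 24))
    y_pleasant = rng.integers(0, 2, size=100)    # pleasant / not pleasant
    y_intensity = rng.uniform(0, 1, size=100)    # pleasantness intensity

    # Two models over the same feature subset: a classifier for the
    # pleasantness label and a regressor for its intensity.
    clf = make_pipeline(StandardScaler(), SVC()).fit(X, y_pleasant)
    reg = make_pipeline(StandardScaler(), SVR()).fit(X, y_intensity)

    new_voice = rng.normal(size=(1, 24))
    print(clf.predict(new_voice), reg.predict(new_voice))
    ```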

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, handprint, hand vein, speech, and gait recognition, have become commonplace as a means of identity management in a wide range of applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition, based solely on the biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voice print, and others.
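
    To make the contrast concrete, a minimal PyTorch sketch of the end-to-end paradigm: a single network maps a raw biometric signal directly to an identity embedding and decision, replacing the separate preprocessing, feature extraction, and classification stages. The architecture and sizes are illustrative, not taken from any of the collected papers.

    ```python
    import torch
    import torch.nn as nn

    class EndToEndBiometricNet(nn.Module):
        """One network replacing separate preprocessing, feature
        extraction, and classification: raw signal in, identity out."""
        def __init__(self, n_identities, emb_dim=128):
            super().__init__()
            # 1-D convolutions learn the "preprocessing + features" stages.
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=9, stride=2), nn.ReLU(),
                nn.Conv1d(32, 64, kernel_size=9, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.embed = nn.Linear(64, emb_dim)       # identity embedding
            self.classify = nn.Linear(emb_dim, n_identities)

        def forward(self, x):                         # x: (batch, 1, samples)
            h = self.encoder(x).squeeze(-1)
            e = self.embed(h)
            return self.classify(e), e                # logits + reusable embedding

    model = EndToEndBiometricNet(n_identities=50)
    logits, emb = model(torch.randn(4, 1, 16000))     # e.g. 1 s of 16 kHz audio
    ```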

    Mobile Biometry (MOBIO) Face and Speaker Verification Evaluation

    This paper evaluates the performance of face and speaker verification techniques in the context of a mobile environment. The mobile environment was chosen because it provides a realistic and challenging test-bed for biometric person verification techniques: the audio environment is quite noisy, and there is limited control over the illumination conditions and the pose of the subject in the video. To conduct this evaluation, part of a database captured during the "Mobile Biometry" (MOBIO) European Project was used. In total, nine participants in the evaluation submitted a face verification system and five submitted speaker verification systems. The nine face verification systems varied significantly in terms of both verification algorithms and face detection algorithms; several used the OpenCV face detector, while the better systems used proprietary software for face detection, which made the evaluation of the verification algorithms themselves challenging. The five speaker verification systems were based on one of two paradigms: a Gaussian Mixture Model (GMM) or a Support Vector Machine (SVM). In general, the systems based on the SVM paradigm performed better than those based on the GMM paradigm.
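
    A minimal sketch of the GMM paradigm mentioned above: a trial is scored as the log-likelihood ratio between a client model and a universal background model (UBM). For brevity the client GMM is trained directly rather than MAP-adapted from the UBM, as production systems typically do; all data and sizes are placeholders.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)
    ubm_feats = rng.normal(size=(5000, 20))               # pooled background MFCCs
    client_feats = rng.normal(0.3, 1.0, size=(500, 20))   # enrolment MFCCs

    # The UBM models "any speaker"; the client GMM models the enrolled speaker.
    ubm = GaussianMixture(n_components=16, covariance_type="diag").fit(ubm_feats)
    client = GaussianMixture(n_components=16, covariance_type="diag").fit(client_feats)

    def verify(test_feats, threshold=0.0):
        """Average per-frame log-likelihood ratio; accept if above threshold."""
        llr = client.score(test_feats) - ubm.score(test_feats)
        return llr > threshold, llr

    accept, score = verify(rng.normal(0.3, 1.0, size=(300, 20)))
    ```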

    Alzheimer's dementia recognition from spontaneous speech using deep neural networks

    This thesis is focused on the ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) challenge at INTERSPEECH 2020. Various approaches were applied to the challenge's classification and regression tasks and compared against the baseline results. As part of data preprocessing, features were extracted from the acoustic and linguistic data using pretrained models: features from the audio recordings were extracted with a SpeechBrain speaker verification model based on a Time-Delay Neural Network (TDNN), and features from the transcriptions were extracted with a Bidirectional Encoder Representations from Transformers (BERT) model. The first part of this work focuses on developing a classification model to recognise Alzheimer's disease (AD). The results show that a neural network model achieves the highest classification accuracy, 85% on the given test set using transcriptions, outperforming the baseline model by 10% for the linguistic data. For the acoustic data, a K-Nearest Neighbour (KNN) model achieved a test accuracy of 71%, which is 14% higher than the baseline result. The second part of the study focuses on developing a regression model for predicting Mini-Mental State Examination (MMSE) scores. The models are evaluated using performance metrics such as root mean squared error (RMSE) and R-squared (R2) values. The results show that an ElasticNet model achieves the lowest RMSE, 4.35, outperforming the baseline model by 0.85. For both tasks, the achieved results outperformed the best-known results for the ADReSS challenge. In conclusion, this thesis demonstrates the effectiveness of machine learning models for the classification of AD and the prediction of MMSE scores. The results highlight the potential of these models to assist in the early detection and monitoring of AD, and provide insights into dataset quality.
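
    A minimal sketch of the regression half of the pipeline, assuming BERT features have already been extracted per transcript as described; the data here are random placeholders, and the ElasticNet hyperparameters are illustrative, not the thesis's tuned values.

    ```python
    import numpy as np
    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Hypothetical BERT embeddings per transcript (768-dim) and the
    # corresponding MMSE scores (0-30).
    rng = np.random.default_rng(4)
    X = rng.normal(size=(108, 768))
    y = rng.uniform(0, 30, size=108)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # ElasticNet mixes L1 and L2 penalties; l1_ratio balances the two.
    model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"RMSE: {rmse:.2f}")
    ```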

    Event sequence metric learning

    In this paper we consider the challenging problem of learning discriminative vector representations for event sequences generated by real-world users. Vector representations map raw behavioural client data to low-dimensional, fixed-length vectors in a latent space. We propose a novel method of learning those vector embeddings based on a metric learning approach, together with a strategy for generating subsequences of the raw data that allows the metric learning approach to be applied in a fully self-supervised way. We evaluated the method on several public bank-transaction datasets and showed that the self-supervised embeddings outperform other methods when applied to downstream classification tasks. Moreover, the embeddings are compact and provide additional user privacy protection.
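
    A minimal sketch of the self-supervised idea, under the assumption (not spelled out in the abstract) that random subsequences of one user's event sequence serve as positive pairs and subsequences from other users as negatives, trained with a triplet loss; the encoder and sampling scheme are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class SeqEncoder(nn.Module):
        """GRU encoder: an event sequence in, a fixed-length embedding out."""
        def __init__(self, n_event_types, emb_dim=64):
            super().__init__()
            self.embed = nn.Embedding(n_event_types, 32)
            self.gru = nn.GRU(32, emb_dim, batch_first=True)

        def forward(self, seq):                      # seq: (batch, length)
            _, h = self.gru(self.embed(seq))
            return h[-1]                             # (batch, emb_dim)

    def random_subseq(seq, min_len=10):
        """Self-supervised positives: random crops of the same user's sequence."""
        start = torch.randint(0, len(seq) - min_len + 1, (1,)).item()
        length = torch.randint(min_len, len(seq) - start + 1, (1,)).item()
        return seq[start:start + length]

    enc = SeqEncoder(n_event_types=100)
    loss_fn = nn.TripletMarginLoss(margin=1.0)

    user_a = torch.randint(0, 100, (200,))           # one user's event types
    user_b = torch.randint(0, 100, (200,))
    anchor = enc(random_subseq(user_a).unsqueeze(0))
    positive = enc(random_subseq(user_a).unsqueeze(0))   # same user -> pull together
    negative = enc(random_subseq(user_b).unsqueeze(0))   # other user -> push apart
    loss = loss_fn(anchor, positive, negative)
    loss.backward()
    ```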

    A Soft Computing Based Approach for Multi-Accent Classification in IVR Systems

    A speaker's accent is the most important factor affecting the performance of Natural Language Call Routing (NLCR) systems, because accents vary widely, even within the same country or community. This variation also occurs when non-native speakers start to learn a second language, where substitution of native-language phonology is a common process. Such substitution leads to fuzziness between the phoneme boundaries and phoneme classes, which reduces out-of-class variation and increases the similarities between different sets of phonemes. This fuzziness is thus the main cause of reduced NLCR system performance. The main requirement for commercial enterprises using an NLCR system is a robust system that provides call understanding and routing to appropriate destinations; the chief motivation for the present work is to develop an NLCR system that eliminates multilayered menus and employs a sophisticated, speaker-accent-based automated voice response system around the clock. Currently, NLCR systems are not fully equipped with accent classification capability. Our main objective is to develop both speaker-independent and speaker-dependent accent classification systems that understand a caller's query, classify the caller's accent, and route the call to the acoustic model that has been trained on a database of speech utterances recorded by such speakers. In the field of accent classification, the dominant approaches are the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM); of the two, the GMM is the more widely implemented. However, GMM performance depends on the initial partitions and the number of Gaussian mixtures, both of which can reduce performance if poorly chosen. To overcome these shortcomings, we propose a speaker-independent accent classification system based on a distance metric learning approach and an evolution strategy. This approach uses side information from dissimilar pairs of accent groups to map data points to a new feature space where the Euclidean distances between similar and dissimilar points are minimised and maximised, respectively. A Non-dominated Sorting Evolution Strategy (NSES)-based k-means clustering algorithm is then employed on the training data processed by the distance metric learning approach; its objectives are to find the cluster centroids as well as the optimal number of clusters for a GMM classifier. For the speaker-dependent application, a new method is proposed based on fuzzy canonical correlation analysis to find appropriate Gaussian mixtures for a GMM-based accent classification system: a fuzzy clustering approach minimises the within-group sum of squared error, and canonical correlation analysis maximises the correlation between the speech feature vectors and the cluster centroids. We conducted a number of experiments using the TIMIT database, the Speech Accent Archive, and foreign-accented English databases to evaluate the performance of the speaker-independent and speaker-dependent applications. Assessment and analysis show that our proposed methodologies outperform the HMM, GMM, vector quantization GMM, and radial basis function neural networks.
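
    A compressed sketch of the speaker-independent pipeline described above, with two plainly labelled substitutions: scikit-learn's Neighborhood Components Analysis stands in for the side-information-based distance metric learning, and plain k-means with a fixed k stands in for the NSES-based search over centroids and cluster counts; the k-means centroids then initialise the GMM.

    ```python
    import numpy as np
    from sklearn.neighbors import NeighborhoodComponentsAnalysis
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 13))                   # MFCC-like feature vectors
    y = rng.integers(0, 3, size=300)                 # 3 accent groups

    # Stand-in for the side-information metric learning: NCA learns a linear
    # map that pulls same-accent points together and pushes others apart.
    nca = NeighborhoodComponentsAnalysis(n_components=8, random_state=0)
    X_t = nca.fit_transform(X, y)

    # Stand-in for the NSES-based k-means: plain k-means with a fixed k,
    # whereas the thesis searches for the optimal cluster count evolutionarily.
    k = 6
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_t)

    # The cluster centroids initialise the GMM classifier's component means.
    gmm = GaussianMixture(n_components=k, means_init=km.cluster_centers_,
                          random_state=0).fit(X_t)
    ```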