
    Bayesian Speaker Adaptation Based on a New Hierarchical Probabilistic Model

    In this paper, a new hierarchical Bayesian speaker adaptation method called HMAP is proposed that combines the advantages of three conventional algorithms, maximum a posteriori (MAP), maximum-likelihood linear regression (MLLR), and eigenvoice, resulting in excellent performance across a wide range of adaptation conditions. The new method efficiently utilizes intra-speaker and inter-speaker correlation information by modeling phone and speaker subspaces in a consistent hierarchical Bayesian way. The phone variations for a specific speaker are assumed to lie in a low-dimensional subspace. The phone coordinates, which are shared among different speakers, implicitly contain the intra-speaker correlation information. For a specific speaker, the phone variations, represented by speaker-dependent eigenphones, are concatenated into a supervector. The eigenphone supervector space is also a low-dimensional speaker subspace, which contains the inter-speaker correlation information. Using principal component analysis (PCA), a new hierarchical probabilistic model for the generation of the speech observations is obtained. Speaker adaptation based on the new hierarchical model is derived using the maximum a posteriori criterion in a top-down manner. Both batch and online adaptation schemes are proposed. With tuned parameters, the new method can handle varying amounts of adaptation data automatically and efficiently. Experimental results on a Mandarin Chinese continuous speech recognition task show good performance under all testing conditions.
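    As a rough illustration of the subspace idea behind such methods, and not of the HMAP derivation itself, the Python sketch below builds a low-dimensional speaker subspace by PCA over phone-mean-offset supervectors and shrinks a new speaker's projection toward the speaker-independent model in a MAP-like way; the sizes, random data, and prior weight tau are all hypothetical.

        import numpy as np
        from sklearn.decomposition import PCA

        # Hypothetical sizes: S training speakers, P phones, D-dim Gaussian means,
        # K-dimensional speaker subspace.
        S, P, D, K = 200, 40, 39, 10
        rng = np.random.default_rng(0)

        # Offsets of each speaker's phone means from the speaker-independent means,
        # flattened into supervectors of length P*D (random stand-ins for real statistics).
        offsets = rng.normal(size=(S, P * D))

        # PCA over the supervectors yields a low-dimensional basis for speaker variation.
        basis = PCA(n_components=K).fit(offsets).components_   # shape (K, P*D)

        def map_like_adapt(observed_offset, n_frames, tau=10.0):
            # Project the new speaker's observed mean offsets onto the subspace, then
            # shrink toward zero (the speaker-independent model) when data is scarce,
            # in the spirit of a MAP estimate; tau is a hypothetical prior weight.
            coords = basis @ observed_offset
            shrink = n_frames / (n_frames + tau)
            return shrink * (basis.T @ coords)

        adapted = map_like_adapt(rng.normal(size=P * D), n_frames=50)
        print(adapted.shape)   # (1560,)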

    Speech Synthesis Based on Hidden Markov Models


    Porting concepts from DNNs back to GMMs

    Deep neural networks (DNNs) have been shown to outperform Gaussian mixture models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
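    The 'wide' ingredient is the easiest part to sketch in isolation: several GMM sub-models are scored in parallel and their frame-level log-likelihoods are summed before classification. The sketch below uses scikit-learn with toy data; it is not the parameter-sharing or 'deep' layering scheme of the paper.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        # Toy frame classification with "wide" GMMs: one GMM per class per stream,
        # with per-stream log-likelihoods summed at the frame level. Data, stream
        # count, and mixture sizes are placeholders.
        rng = np.random.default_rng(0)
        n_classes, n_streams = 3, 2
        train = [rng.normal(loc=c, size=(200, 13)) for c in range(n_classes)]

        # One GMM per (class, stream); here every stream sees the same features for
        # brevity -- in practice each stream would use a different representation.
        models = [[GaussianMixture(n_components=4, random_state=s).fit(train[c])
                   for s in range(n_streams)] for c in range(n_classes)]

        def classify(frames):
            # Sum per-stream log-likelihoods ("wide" combination), argmax over classes.
            scores = np.stack([sum(models[c][s].score_samples(frames)
                                   for s in range(n_streams))
                               for c in range(n_classes)])
            return scores.argmax(axis=0)

        print(classify(rng.normal(loc=1, size=(5, 13))))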

    Acta Cybernetica : Volume 17. Number 2.


    Current trends in multilingual speech processing

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application in the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS), as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, which are also barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures.
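    Of the three families named here, embedding-based adaptation is the simplest to sketch: a fixed speaker embedding (an i-vector or x-vector, for instance) is tiled and concatenated to every acoustic frame so the acoustic model can condition on the speaker. The dimensions below are placeholders.

        import numpy as np

        def append_speaker_embedding(frames, embedding):
            # Tile a fixed speaker embedding (e.g. an i-vector or x-vector) and
            # concatenate it to every acoustic frame so the acoustic model can
            # condition on the speaker. Shapes are illustrative only.
            tiled = np.tile(embedding, (frames.shape[0], 1))   # (T, E)
            return np.concatenate([frames, tiled], axis=1)     # (T, F + E)

        frames = np.random.randn(200, 80)   # T=200 frames of 80-dim filterbanks (placeholder)
        emb = np.random.randn(100)          # 100-dim speaker embedding (placeholder)
        print(append_speaker_embedding(frames, emb).shape)     # (200, 180)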

    Emotion Recognition Using Deep Learning and Novel Data Augmentation Techniques

    Emotion recognition is important for various applications related to human-computer interaction and for understanding the user's mood in specific tasks. In general, a person's emotion is recognized by analyzing facial expressions, gestures, posture, speech, or physiological parameters such as those obtained from electroencephalograms, electrocardiograms, and similar measurements. However, in many cases the visual information is not available or appropriate, while measuring physiological parameters is difficult, impractical, and requires specialized, expensive equipment. As a result, speech is probably the best alternative. The typical machine learning techniques used for this purpose extract a set of linguistic features from the data, which are then used to train supervised learning models. In this thesis, a Convolutional Neural Network (CNN) is used, which, unlike traditional approaches, detects only the important features of the raw data fed into it. It is worth noting that the architecture of a CNN is analogous to the connectivity of neurons in the human brain and is inspired by the organization of the visual cortex. Three audio datasets are used (EMOVO, SAVEE, Emo-DB), from which spectrograms are extracted and used as inputs to the neural network. For optimal performance of the algorithm, novel data augmentation techniques are applied to the original data beyond the usual addition of noise, such as shifting the audio signal and changing its pitch and speed. Finally, methods against overfitting, such as dropout, are applied, together with techniques that improve the model's generalization, such as local response normalization layers, whose operation is inspired by the lateral inhibition of neurons in the brain. Our approach outperformed previous, similar work, without being established as a considerably language-independent one.
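    The augmentation steps named in the abstract can be sketched roughly as follows with librosa and numpy; the amounts of noise, shift, pitch change, and speed change are illustrative rather than the values used in the thesis.

        import numpy as np
        import librosa

        def augment(y, sr, rng):
            # The four augmentations named in the abstract; amounts are illustrative.
            return {
                "noise": y + 0.005 * rng.standard_normal(len(y)),              # additive noise
                "shift": np.roll(y, int(0.1 * sr)),                            # 100 ms shift
                "pitch": librosa.effects.pitch_shift(y, sr=sr, n_steps=2),     # +2 semitones
                "speed": librosa.effects.time_stretch(y, rate=1.1),            # 10% faster
            }

        def to_spectrogram(y, sr):
            # Log-mel spectrogram used as the CNN input image.
            return librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))

        sr = 22050
        y = 0.1 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # 1 s test tone as a stand-in
        rng = np.random.default_rng(0)
        specs = {name: to_spectrogram(v, sr) for name, v in augment(y, sr, rng).items()}
        print({name: s.shape for name, s in specs.items()})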

    Generative Adversarial Network with Convolutional Wavelet Packet Transforms for Automated Speaker Recognition and Classification

    Speech is an effective mode of communication that always conveys abundant and pertinent information, such as the gender, accent, and other distinguishing characteristics of the speaker. These distinctive characteristics allow researchers to identify human voices using artificial intelligence (AI) techniques, which are useful for forensic voice verification, security and surveillance, electronic voice eavesdropping, mobile banking, and mobile purchasing. Deep learning (DL) and other advances in hardware have piqued the interest of researchers studying automatic speaker identification (SI). In recent years, Generative Adversarial Networks (GANs) have demonstrated an exceptional ability to produce synthetic data and improve the performance of several machine learning tasks. This paper combines the capacities of the Convolutional Wavelet Packet Transform (CWPT) and Generative Adversarial Networks to propose a novel way of enhancing the accuracy and robustness of speaker recognition and classification systems. Audio signals are decomposed using the Convolutional Wavelet Packet Transform into a multi-resolution, time-frequency representation that faithfully preserves local and global characteristics. The improved audio features describe speech traits more precisely and handle the pitch, tone, and pronunciation variations that are frequent in speaker recognition tasks. By using GANs to create synthetic speech samples, our proposed method, GAN-CWPT, enriches the training data and broadens the dataset's diversity. The generator and discriminator components of the GAN architecture have been tweaked to produce realistic speech samples with attributes quite similar to genuine speaker utterances. The enlarged dataset enhances the speaker recognition and classification system's robustness and generalization, even in environments with little training data. We conduct extensive tests on standard speaker recognition datasets to determine how well our method works. The findings demonstrate that, compared to conventional methods, the GAN-CWPT combination significantly improves speaker recognition and classification accuracy and efficiency. Additionally, the proposed GAN-CWPT model exhibits stronger generalization on unknown speakers and excels even with noisy and poor-quality audio inputs.
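    For the wavelet-packet side alone, a minimal feature-extraction sketch with PyWavelets is given below; the wavelet, decomposition depth, and energy statistic are assumptions rather than the paper's CWPT front end, and the GAN component is omitted.

        import numpy as np
        import pywt

        def wavelet_packet_features(y, wavelet="db4", level=4):
            # Decompose the waveform into 2**level wavelet-packet sub-bands and return
            # the log energy of each: a simple multi-resolution time-frequency feature.
            wp = pywt.WaveletPacket(data=y, wavelet=wavelet, mode="symmetric", maxlevel=level)
            leaves = wp.get_level(level, order="freq")          # frequency-ordered sub-bands
            energies = np.array([np.sum(node.data ** 2) for node in leaves])
            return np.log(energies + 1e-10)

        y = np.random.randn(16000)                 # 1 s of noise as a stand-in for speech
        print(wavelet_packet_features(y).shape)    # (16,) for level=4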

    Enhancing dysarthria speech feature representation with empirical mode decomposition and Walsh-Hadamard transform

    Dysarthric speech carries pathological characteristics of the vocal tract and vocal folds, but so far these have not been included in traditional acoustic feature sets. Moreover, the nonlinearity and non-stationarity of speech have been ignored. In this paper, we propose a feature enhancement algorithm for dysarthric speech called WHFEMD. It combines empirical mode decomposition (EMD) and the fast Walsh-Hadamard transform (FWHT) to enhance features. In the proposed algorithm, the fast Fourier transform of the dysarthric speech is computed first, followed by EMD to obtain intrinsic mode functions (IMFs). After that, FWHT is used to output new coefficients and to extract statistical features based on the IMFs, the power spectral density, and enhanced gammatone frequency cepstral coefficients. To evaluate the proposed approach, we conducted experiments on two public pathological speech databases, UA Speech and TORGO. The results show that our algorithm performed better than traditional features in classification, with improvements of 13.8% on UA Speech and 3.84% on TORGO. Furthermore, incorporating an imbalanced classification algorithm to address data imbalance resulted in a 12.18% increase in recognition accuracy. The algorithm effectively addresses the challenges of the imbalanced dataset and the non-linearity of dysarthric speech, while providing a robust representation of the local pathological features of the vocal folds and tract.
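    A loose sketch of the described pipeline (FFT, then EMD into IMFs, then FWHT, then simple statistics) is given below. It assumes the PyEMD package and a hand-rolled Walsh-Hadamard transform; the IMF count, FFT size, and chosen statistics are placeholders rather than the published WHFEMD configuration.

        import numpy as np
        from scipy.stats import skew, kurtosis
        from PyEMD import EMD   # pip package "EMD-signal"; an assumption, not cited by the paper

        def fwht(a):
            # Iterative fast Walsh-Hadamard transform; input length must be a power of two.
            a = a.astype(float)
            h = 1
            while h < len(a):
                for i in range(0, len(a), 2 * h):
                    for j in range(i, i + h):
                        a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
                h *= 2
            return a

        def whfemd_like_features(y, n_imfs=4, n_fft=1024):
            # FFT magnitude -> EMD into IMFs -> FWHT of each IMF -> simple statistics.
            # IMF count, FFT size, and statistics are placeholders.
            spectrum = np.abs(np.fft.rfft(y, n=n_fft))
            imfs = EMD().emd(spectrum)[:n_imfs]
            feats = []
            for imf in imfs:
                size = 1 << int(np.ceil(np.log2(len(imf))))      # pad to a power of two
                coeffs = fwht(np.pad(imf, (0, size - len(imf))))
                feats += [coeffs.mean(), coeffs.std(), skew(coeffs), kurtosis(coeffs)]
            return np.array(feats)

        print(whfemd_like_features(np.random.randn(16000)).shape)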