433 research outputs found

    Prosody Modification using Allpass Residual of Speech Signals

    In this paper, we examine the role of the phase spectrum of speech signals in obtaining an accurate estimate of the excitation source for prosody modification. The phase spectrum is parametrically modeled as the response of an all-pass (AP) filter, and the filter coefficients are estimated by treating the linear prediction (LP) residual as the output of the AP filter. The resulting residual signal, the AP residual, exhibits unambiguous peaks corresponding to epochs, which are chosen as pitch markers for prosody modification. This strategy efficiently removes the ambiguities associated with pitch marking, which the pitch synchronous overlap-add (PSOLA) method requires. Prosody modification using the AP residual has an advantage over time-domain PSOLA (TD-PSOLA) applied to speech signals: it introduces fewer distortions because of its flat magnitude spectrum. Windowing centered on the unambiguous peaks in the AP residual is used for segmentation, followed by pitch/duration modification of the AP residual by mapping of pitch markers. The modified speech signal is obtained from the modified AP residual using synthesis filters. Mean opinion scores are used to evaluate the proposed method, and the AP residual-based method is observed to perform on par with the LP residual-based method using epochs and better than linear prediction PSOLA (LP-PSOLA).
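
    The LP residual mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation and does not include the all-pass modeling step: it assumes the standard autocorrelation method of linear prediction, and the function names and the synthetic test signal (an impulse train through a two-pole filter) are hypothetical, chosen only for illustration.

```python
import numpy as np


def lp_coefficients(x, order):
    """Autocorrelation-method LP: return a[1..order] so that
    x[n] is approximated by sum_k a[k] * x[n - k]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation at lags 0..order (lag 0 sits at index len(x) - 1).
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    # Toeplitz normal equations R a = r[1..order].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])


def lp_residual(x, a):
    """Apply the inverse filter A(z) = 1 - sum_k a[k] z^-k to x."""
    inverse_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, inverse_filter)[: len(x)]


# Hypothetical test signal: impulses every 80 samples ("epochs")
# driving an assumed stable two-pole filter.
period, n = 80, 800
excitation = np.zeros(n)
excitation[::period] = 1.0
x = np.zeros(n)
for i in range(n):
    x[i] = excitation[i]
    if i >= 1:
        x[i] += 1.3 * x[i - 1]
    if i >= 2:
        x[i] -= 0.8 * x[i - 2]

a = lp_coefficients(x, order=2)
residual = lp_residual(x, a)
# The largest residual magnitudes mark the candidate epoch locations.
epoch_candidates = np.sort(np.argsort(np.abs(residual))[-n // period :])
```

    On real speech the residual peaks are far less clean than in this synthetic case, which is the ambiguity the abstract says the AP residual is meant to reduce.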

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing: speech synthesis, voice conversion and emotion detection, among others. Thus, we study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. To validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method performs satisfactorily across a wide range of glottal configurations and at different SNR levels. Our method using the whitened residual compared favorably to this reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking third, but still above the acceptance threshold (Fair-Good). Next, we proposed two methods for prosody modification, one for each of the residual representations explained above. The first method used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. 
The second method used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. The results showed that both methods are rated similarly (Fair-Good) and that more work is needed to reach quality levels similar to the reference methods. As part of this dissertation, we studied the application of our models in three different areas: voice conversion, voice quality analysis and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. The results showed that the evaluators preferred our method over the original one, giving it a higher score on the MOS scale. To study voice quality, we recorded a small database of isolated, sustained Spanish vowels in four different phonations (modal, rough, creaky and falsetto), which was later also used in our study of voice quality. Comparing the results with those reported in the literature, we found that they generally agree with previous findings. Some differences existed, but they could be attributed to the difficulty of comparing voice qualities produced by different speakers. At the same time, we conducted experiments in voice quality identification, with very good results. We also evaluated the performance of an automatic emotion classifier based on GMM using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The results of the test were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. The detection accuracy for the different emotions was also high, improving on the results of previously reported works using the same database. 
Overall, we conclude that the glottal source parameters extracted using our algorithm have a positive impact on automatic emotion classification.

    Prosody Modifications for Voice Conversion

    Generally defined, speech modification is the process of changing certain perceptual properties of speech while leaving other properties unchanged. Among the many types of speech information that may be altered are rate of articulation, pitch and formant characteristics. Modifying speech parameters such as pitch, duration and strength of excitation by a desired factor is termed prosody modification. This thesis presents prosody modifications for a voice conversion framework. Among all speech modifications for prosody, two are most important: first, modification of duration and pauses in a speech utterance (time-scale modification), and second, modification of the pitch (pitch-scale modification). Prosody modification involves changing the pitch and duration of speech without affecting the message or naturalness. In this work, time-scale and pitch-scale modifications of speech are discussed using two methods: Time Domain Pitch Synchronous Overlap-Add (TD-PSOLA) and an epoch-based approach. Although there are many variations of TD-PSOLA, the version discussed in this thesis works directly on speech in the time domain to apply the desired modifications. The epoch-based approach involves modifications of the LP residual.
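
    Time-scale and pitch-scale modification both rest on mapping new (synthesis) pitch marks back to the original (analysis) marks. The sketch below shows that mapping step only, under simplifying assumptions: each synthesis mark borrows the nearest analysis mark on the rescaled time axis, and the overlap-add resynthesis is omitted. The function name and parameters are hypothetical and are not taken from the thesis.

```python
import numpy as np


def map_pitch_marks(analysis_marks, duration_factor=1.0, pitch_factor=1.0):
    """Place synthesis pitch marks and map each one to an analysis mark.

    duration_factor > 1 lengthens the utterance; pitch_factor > 1 raises
    the pitch by shortening the spacing between synthesis marks.
    Returns (synthesis_marks, indices into analysis_marks).
    """
    marks = np.asarray(analysis_marks, dtype=float)
    periods = np.diff(marks)
    periods = np.append(periods, periods[-1])  # reuse last period for final mark
    start = marks[0]
    total = (marks[-1] - start) * duration_factor  # new utterance length

    synthesis_marks, mapping = [], []
    t = 0.0  # synthesis time measured from the first mark
    while t <= total:
        # Position on the original time axis that corresponds to t.
        source_time = start + t / duration_factor
        idx = int(np.argmin(np.abs(marks - source_time)))
        synthesis_marks.append(start + t)
        mapping.append(idx)
        # Advance by the local pitch period, scaled for the new pitch.
        t += periods[idx] / pitch_factor
    return np.array(synthesis_marks), np.array(mapping)


# Hypothetical marks: constant 80-sample pitch period.
marks = np.arange(0, 801, 80)
slow_marks, slow_map = map_pitch_marks(marks, duration_factor=2.0)
high_marks, high_map = map_pitch_marks(marks, pitch_factor=2.0)
```

    In a full PSOLA-style system, a windowed segment centered on each mapped analysis mark would then be overlap-added at the corresponding synthesis mark.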

    An autopoietic approach to the development of speech recognition (pendekatan autopoietic dalam pembangunan pengecaman suara)

    This research focuses on implementing speech recognition through an autopoietic approach. The work has culminated in the introduction of a neural network architecture named Homunculus Network. This network was used to develop a speech recognition system for Bahasa Melayu. The system is an isolated-word, phoneme-level speech recognizer that is speaker independent and has a vocabulary of 15 words. The research has identified several issues that merit further work. These issues are also the basis for the design and development of the new autopoietic speech recognition system.

    Robust speaker recognition using both vocal source and vocal tract features estimated from noisy input utterances.

    Wang, Ning. Thesis (M.Phil.)--Chinese University of Hong Kong, 2007. Includes bibliographical references (leaves 106-115). Abstracts in English and Chinese.
    Chapter 1: Introduction
        1.1 Introduction to Speech and Speaker Recognition
        1.2 Difficulties and Challenges of Speaker Authentication
        1.3 Objectives and Thesis Outline
    Chapter 2: Speaker Recognition System
        2.1 Baseline Speaker Recognition System Overview
            2.1.1 Feature Extraction
            2.1.2 Pattern Generation and Classification
        2.2 Performance Evaluation Metric for Different Speaker Recognition Tasks
        2.3 Robustness of Speaker Recognition System
            2.3.1 Speech Corpus: CU2C
            2.3.2 Noise Database: NOISEX-92
            2.3.3 Mismatched Training and Testing Conditions
        2.4 Summary
    Chapter 3: Speaker Recognition System using both Vocal Tract and Vocal Source Features
        3.1 Speech Production Mechanism
            3.1.1 Speech Production: An Overview
            3.1.2 Acoustic Properties of Human Speech
        3.2 Source-filter Model and Linear Predictive Analysis
            3.2.1 Source-filter Speech Model
            3.2.2 Linear Predictive Analysis for Speech Signal
        3.3 Vocal Tract Features
        3.4 Vocal Source Features
            3.4.1 Source Related Features: An Overview
            3.4.2 Source Related Features: Technical Viewpoints
        3.5 Effects of Noises on Speech Properties
        3.6 Summary
    Chapter 4: Estimation of Robust Acoustic Features for Speaker Discrimination
        4.1 Robust Speech Techniques
            4.1.1 Noise Resilience
            4.1.2 Speech Enhancement
        4.2 Spectral Subtractive-Type Preprocessing
            4.2.1 Noise Estimation
            4.2.2 Spectral Subtraction Algorithm
        4.3 LP Analysis of Noisy Speech
            4.3.1 LP Inverse Filtering: Whitening Process
            4.3.2 Magnitude Response of All-pole Filter in Noisy Condition
            4.3.3 Noise Spectral Reshaping
        4.4 Distinctive Vocal Tract and Vocal Source Feature Extraction
            4.4.1 Vocal Tract Feature Extraction
            4.4.2 Source Feature Generation Procedure
            4.4.3 Subband-specific Parameterization Method
        4.5 Summary
    Chapter 5: Speaker Recognition Tasks & Performance Evaluation
        5.1 Speaker Recognition Experimental Setup
            5.1.1 Task Description
            5.1.2 Baseline Experiments
            5.1.3 Identification and Verification Results
        5.2 Speaker Recognition using Source-tract Features
            5.2.1 Source Feature Selection
            5.2.2 Source-tract Feature Fusion
            5.2.3 Identification and Verification Results
        5.3 Performance Analysis
    Chapter 6: Conclusion
        6.1 Discussion and Conclusion
        6.2 Suggestion of Future Work

    Machine learning for cognitive speech coding (Apprentissage automatique pour le codage cognitif de la parole)

    Since the 1980s, speech codecs have relied on short-term coding strategies that operate at the subframe or frame level (typically 5 to 20 ms). Researchers essentially adjusted and combined a limited number of available technologies (transform, linear prediction, quantization) and strategies (waveform matching, noise shaping) to build increasingly complex coding architectures. In this thesis, rather than relying on short-term coding strategies, we develop an alternative framework for speech compression by encoding speech attributes that are perceptually important characteristics of speech signals. In order to achieve this objective, we solve three problems of increasing complexity, namely classification, prediction and representation learning. Classification is a common element in modern codec designs. In a first step, we design a classifier to identify emotions, which are among the most complex long-term speech attributes. In a second step, we design a speech sample predictor, which is another common element in modern codec designs, to highlight the benefits of long-term and non-linear speech signal processing. Then, we explore latent variables, a space of speech representations, to encode both short-term and long-term speech attributes. Lastly, we propose a decoder network to synthesize speech signals from these representations, which constitutes our final step towards building a complete, end-to-end machine-learning based speech compression method. Although each development step proposed in this thesis could form part of a codec on its own, each step also provides insights and a foundation for the next, until a codec based entirely on machine learning is reached. 
The first two steps, classification and prediction, provide new tools that could replace and improve elements of existing codecs. In the first step, we use a combination of source-filter model and liquid state machine (LSM), to demonstrate that features related to emotions can be easily extracted and classified using a simple classifier. In the second step, a single end-to-end network using long short-term memory (LSTM) is shown to produce speech frames with high subjective quality for packet loss concealment (PLC) applications. In the last steps, we build upon the results of previous steps to design a fully machine learning-based codec. An encoder network, formulated using a deep neural network (DNN) and trained on multiple public databases, extracts and encodes speech representations using prediction in a latent space. An unsupervised learning approach based on several principles of cognition is proposed to extract representations from both short and long frames of data using mutual information and contrastive loss. The ability of these learned representations to capture various short- and long-term speech attributes is demonstrated. Finally, a decoder structure is proposed to synthesize speech signals from these representations. Adversarial training is used as an approximation to subjective speech quality measures in order to synthesize natural-sounding speech samples. The high perceptual quality of synthesized speech thus achieved proves that the extracted representations are efficient at preserving all sorts of speech attributes and therefore that a complete compression method is demonstrated with the proposed approach