56 research outputs found
Unified Framework of Feature Based Adaptation for Statistical Speech Synthesis and Recognition
The advent of statistical parametric speech synthesis has paved the way for a unified framework for hidden Markov model (HMM) based text-to-speech synthesis (TTS) and automatic speech recognition (ASR). The techniques and advancements made in the field of ASR can now be adopted in the domain of synthesis. Speaker adaptation is a well-advanced topic in the area of ASR, where the adaptation data from a target speaker is used to transform the canonical model parameters to represent a speaker-specific model. Feature adaptation techniques like vocal tract length normalization (VTLN) perform the same task by transforming the features; this can be shown to be equivalent to model transformation. The main advantage of VTLN is that it can demonstrate noticeable improvements in performance with very little adaptation data and can be classified as a rapid adaptation technique. VTLN is a widely used technique in ASR, and can be used in TTS to improve the rapid adaptation performance. In TTS, the task is to synthesize speech that sounds like a particular target speaker. Using VTLN for TTS is found to make the synthesized speech sound quite similar to the target speaker from the speaker's very first utterance. An all-pass filter based bilinear transform was implemented for the mel-generalized cepstral (MGCEP) features of the HMM-based speech synthesis system (HTS). The initial implementation used a grid search approach that selects the best warping factor for the speech spectrum from a grid of available values using the maximum likelihood criterion. VTLN was shown to give performance improvements in the rapid adaptation framework where the number of adaptation sentences from the target speaker was limited. However, this technique involves high time and space complexity, and rapid adaptation demands an efficient implementation of VTLN. To this end, an efficient expectation maximization (EM) based VTLN approach was implemented for HTS using Brent’s search.
Unlike the ASR features, MGCEP does not use a filter bank (in order to facilitate the speech reconstruction) and this provides equivalence to the model transformation for the EM implementation. This allows the estimation of warping factors to be embedded in HMM training using the same sufficient statistics as in constrained maximum likelihood linear regression (CMLLR). This work addresses many of the challenges faced in adopting VTLN for synthesis due to the higher dimensionality of the cepstral features used in the TTS models. The main idea was to unify the theory and practice in the implementation of VTLN for both ASR and TTS. Several techniques have been proposed in this thesis in order to find the best feasible warping factor estimation procedure. Estimation of the warping factor using the lower order cepstral features representing the spectral envelope is demonstrated to be the best approach. Different evaluations on standard databases are performed in this work to illustrate the performance improvements and perceptual challenges involved in the VTLN adaptation. VTLN has only a single parameter to represent the speaker characteristics and hence cannot match the performance of other linear-transform-based adaptation methods when large amounts of adaptation data are available. Several techniques are demonstrated in this work to combine model-based adaptation such as constrained structural maximum a posteriori linear regression (CSMAPLR) with VTLN; one such technique uses VTLN as the prior transform at the root node of the tree structure of the CSMAPLR system. Thus, along with rapid adaptation, the performance scales with the availability of more adaptation data. These techniques, although developed for TTS, can also be effectively used in ASR. They were also shown to give improvements in ASR, especially for scenarios like noisy speech conditions.
Other improvements to rapid adaptation, including a bias term for VTLN, multiple transform based VTLN using regression classes and a VTLN prior for non-structural MAPLR adaptation, are also proposed. These techniques also demonstrated both ASR and TTS performance improvements. Evaluations are also presented for a few special scenarios, specifically cross-lingual speech, cross-gender speech, child speech and noisy speech, where the rapid adaptation methods presented in this work were shown to be highly beneficial. Most of these methods will be published as extensions to the open-source HTS toolkit
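As an illustration of the grid search described in this abstract, the following is a minimal sketch, not the thesis implementation. It assumes the standard first-order all-pass (bilinear) frequency warping and picks a warping factor from a grid by maximising a score. The names `bilinear_warp`, `grid_search_alpha` and `toy_log_likelihood`, the peak frequencies and the grid range are all hypothetical, and the toy score omits the Jacobian term and the HMM likelihood that a real system would use. Sign conventions for the warping factor also vary between toolkits.

```python
import math

def bilinear_warp(omega, alpha):
    """Warp a normalised angular frequency omega (0..pi) with the
    first-order all-pass (bilinear) transform with warping factor alpha."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

def grid_search_alpha(score, grid):
    """Return the warping factor in `grid` with the highest score."""
    return max(grid, key=score)

# Hypothetical toy data: spectral peak positions (radians) of a speaker,
# and the positions the canonical model expects (generated with alpha = 0.05).
observed_peaks = [0.3, 0.9, 1.8]
model_peaks = [bilinear_warp(w, 0.05) for w in observed_peaks]

def toy_log_likelihood(alpha):
    # Unit-variance Gaussian log-likelihood up to a constant.
    # Assumption: no Jacobian term and no HMM, purely for illustration.
    return -0.5 * sum((bilinear_warp(w, alpha) - m) ** 2
                      for w, m in zip(observed_peaks, model_peaks))

# Hypothetical grid of warping factors from -0.10 to 0.10 in steps of 0.01.
grid = [round(-0.10 + 0.01 * k, 2) for k in range(21)]
best_alpha = grid_search_alpha(toy_log_likelihood, grid)
```

The cost of evaluating every grid value for every speaker is what motivates the EM-based estimation with Brent's search mentioned in the abstract.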
Self-supervised approach for Urban Tree Recognition on Aerial Images
In light of Artificial Intelligence aiding modern society in tackling climate change, this research looks at how to detect vegetation from aerial-view images using deep learning models. This task is part of a proposed larger framework to build an ecosystem for monitoring air quality and related factors such as weather, transport, and vegetation (for example, the number of trees) for any city in the world. The challenge involves building or adapting the tree recognition models to a new city with minimal or no labeled data. This paper explores self-supervised approaches to this problem and presents a system that achieves a mean average precision of 0.89 on Google Earth images of the city of Cambridge
Speech Emotion Recognition Using Attention Model
Speech emotion recognition is an important research topic that can help to maintain and improve public health and contribute towards the ongoing progress of healthcare technology. There have been several advancements in the field of speech emotion recognition systems, including the use of deep learning models and new acoustic and temporal features. This paper proposes a self-attention-based deep learning model that was created by combining a two-dimensional Convolutional Neural Network (CNN) and a long short-term memory (LSTM) network. This research builds on the existing literature to identify the best-performing features for this task with extensive experiments on different combinations of spectral and rhythmic information. Mel Frequency Cepstral Coefficients (MFCCs) emerged as the best-performing features for this task. The experiments were performed on a customised dataset that was developed as a combination of the RAVDESS, SAVEE, and TESS datasets. Eight emotional states (happy, sad, angry, surprise, disgust, calm, fearful, and neutral) were detected. The proposed attention-based deep learning model achieved an average test accuracy of 90%, which is a substantial improvement over established models. Hence, this emotion detection model has the potential to improve automated mental health monitoring
Syllabic Pitch Tuning for Neutral-to-Emotional Voice Conversion
Prosody plays an important role in both identification and synthesis of emotionalized speech. Prosodic features like pitch are usually estimated and altered at a segmental level based on short windows of speech (where the signal is expected to be quasi-stationary). This results in a frame-wise change of acoustical parameters for synthesizing emotionalized speech. In order to convert neutral speech to emotional speech from the same speaker, it might be better to alter the pitch parameters at the suprasegmental level, such as the syllable level, since the changes in the signal are more subtle and smooth. In this paper we aim to show that the pitch transformation in a neutral-to-emotional voice conversion system may result in better output speech quality if the transformations are performed at the suprasegmental (syllable) level rather than at the frame level. Subjective evaluation results are shown to demonstrate whether the naturalness, speaker similarity and emotion recognition tasks show any performance difference
Speech recognition with speech synthesis models by marginalising over decision tree leaves
There has been increasing interest in the use of unsupervised adaptation for the personalisation of text-to-speech (TTS) voices, particularly in the context of speech-to-speech translation. This requires that we are able to generate adaptation transforms from the output of an automatic speech recognition (ASR) system. An approach that utilises unified ASR and TTS models would seem to offer an ideal mechanism for the application of unsupervised adaptation to TTS since transforms could be shared between ASR and TTS. Such unified models should use a common set of parameters. A major barrier to such parameter sharing is the use of differing contexts in ASR and TTS. In this paper we propose a simple approach that generates ASR models from a trained set of TTS models by marginalising over the TTS contexts that are not used by ASR. We present preliminary results of our proposed method on a large vocabulary speech recognition task and provide insights into future directions of this work
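One way to read the marginalisation idea is as a weighted sum over the TTS leaves that share a given ASR context, which turns several single-Gaussian TTS leaves into one Gaussian mixture. The sketch below illustrates that reading only and is not the paper's procedure. The function names, the use of leaf occupancy counts as weights, and the one-dimensional Gaussians are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def marginalise_leaves(leaf_weights, leaf_gaussians):
    """Build a mixture for one ASR context from the TTS leaves it covers.

    leaf_weights:   {leaf_id: non-negative weight}, e.g. occupancy counts (assumed)
    leaf_gaussians: {leaf_id: (mean, var)}
    Returns a list of (normalised_weight, mean, var) components.
    """
    total = sum(leaf_weights.values())
    return [(w / total, *leaf_gaussians[leaf]) for leaf, w in leaf_weights.items()]

def mixture_density(x, mixture):
    """Evaluate the marginalised (mixture) density at x."""
    return sum(w * gaussian_pdf(x, mean, var) for w, mean, var in mixture)

# Hypothetical example: two TTS leaves that differ only in a context
# (e.g. a prosodic one) that the ASR model does not use.
mixture = marginalise_leaves(
    {"leaf_a": 30.0, "leaf_b": 10.0},
    {"leaf_a": (0.0, 1.0), "leaf_b": (2.0, 0.5)},
)
density_at_zero = mixture_density(0.0, mixture)
```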
Vocal Tract Length Normalization for Statistical Parametric Speech Synthesis
Vocal tract length normalization (VTLN) has been successfully used in automatic speech recognition for improved performance. The same technique can be implemented in statistical parametric speech synthesis for rapid speaker adaptation during synthesis. This paper presents an efficient implementation of VTLN using expectation maximization and addresses the key challenges faced in implementing VTLN for synthesis. Jacobian normalization, high-dimensional features and truncation of the transformation matrix are a few of the challenges presented, along with appropriate solutions. Detailed evaluations are performed to determine the most suitable technique for using VTLN in speech synthesis. Evaluating VTLN in the framework of speech synthesis is also not an easy task, since the technique does not work equally well for all speakers. Speakers have been selected based on different objective and subjective criteria to demonstrate the difference between systems. The best method for implementing VTLN is confirmed to be the use of the lower order features for estimating warping factors
Forest Terrain Identification using Semantic Segmentation on UAV Images
Beavers' habitat is known to alter the terrain, providing biodiversity in the area, and their lifestyle has recently been linked to climatic changes through reduced greenhouse gas levels in the region. To analyse the impact of beavers’ habitat on the region, it is therefore necessary to estimate the terrain alterations caused by beaver actions. Furthermore, such terrain analysis can also play an important role in domains like wildlife ecology, deforestation, land-cover estimation, and geological mapping. Deep learning models are known to provide better estimates for automatic feature identification and classification of terrain. However, such models require significant training data. Pre-existing terrain datasets (both real and synthetic) like CityScapes, PASCAL, UAVID, etc., are mostly concentrated on urban areas and include roads, pathways, buildings, etc. Such datasets are therefore unsuitable for forest terrain analysis. This paper contributes a finely labelled novel dataset of forest imagery around beavers’ habitat, captured with a high-resolution camera on an aerial drone. The dataset consists of 100 such images labelled with 9 different classes. Furthermore, a baseline is established on this dataset using state-of-the-art semantic segmentation models, evaluated with performance metrics including Intersection Over Union (IoU), Overall Accuracy (OA), and F1 score
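The three metrics named in this abstract can all be derived from a confusion matrix over pixel labels. The following is a minimal sketch using their usual definitions; the function names and the tiny label lists are hypothetical, and the paper's own evaluation code may differ (for example in how classes absent from an image are handled).

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels with true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm

def per_class_iou_f1(cm):
    """Return a list of (IoU, F1) tuples, one per class."""
    n = len(cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, truly another class
        fn = sum(cm[c]) - tp                       # truly c, predicted another class
        iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        results.append((iou, f1))
    return results

def overall_accuracy(cm):
    """Fraction of all pixels that were labelled correctly."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

# Hypothetical flattened label maps with 3 classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(truth, pred, 3)
metrics = per_class_iou_f1(cm)
oa = overall_accuracy(cm)
mean_iou = sum(iou for iou, _ in metrics) / len(metrics)
```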
Study of Jacobian Normalization for VTLN
The divergence of the theory and practice of vocal tract length normalization (VTLN) is addressed, with particular emphasis on the role of the Jacobian determinant. VTLN is placed in a Bayesian setting, which brings in the concept of a prior on the warping factor. The form of the prior, together with acoustic scaling and numerical conditioning, is then discussed and evaluated. It is concluded that the Jacobian determinant is important in VTLN, especially for the high-dimensional features used in HMM-based speech synthesis, and that difficulties normally associated with the Jacobian determinant can be attributed to prior and scaling
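To make the role of the Jacobian determinant concrete, the following is a simplified sketch under stated assumptions, not the paper's exact formulation. It assumes VTLN acts as a linear feature transform $A_\alpha$ on each frame $x_t$, a single Gaussian with mean $\mu$ and covariance $\Sigma$ in place of the full HMM, and a prior $p(\alpha)$ on the warping factor. Acoustic scaling is left out, and all symbols are introduced here for illustration.

```latex
% Change of variables: the density of the unwarped frame x_t picks up
% the Jacobian determinant of the transform A_alpha.
\log p(x_t \mid \alpha)
  = \log \mathcal{N}\!\left(A_\alpha x_t;\, \mu, \Sigma\right)
  + \log \left|\det A_\alpha\right|

% Bayesian (MAP) estimate of the warping factor over frames t = 1..T.
\hat{\alpha}
  = \arg\max_{\alpha} \left[ \log p(\alpha)
  + \sum_{t=1}^{T} \Big( \log \mathcal{N}\!\left(A_\alpha x_t;\, \mu, \Sigma\right)
  + \log \left|\det A_\alpha\right| \Big) \right]
```

Dropping the $\log\left|\det A_\alpha\right|$ term means the objective is no longer a properly normalised likelihood in $x_t$, and the size of that term grows with the feature dimension, which is consistent with the abstract's emphasis on high-dimensional synthesis features.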
Urban Tree Species Classification Using Aerial Imagery
Urban trees help regulate temperature, reduce energy consumption, improve urban air quality, reduce wind speeds, and mitigate the urban heat island effect. Urban trees also play a key role in mitigating climate change and global warming by capturing and storing atmospheric carbon dioxide, which is the largest contributor to greenhouse gases. Automated tree detection and species classification using aerial imagery can be a powerful tool for sustainable forest and urban tree management. Hence, this study first offers a pipeline for generating a labelled dataset of urban trees using Google Maps aerial images and then investigates how state-of-the-art deep Convolutional Neural Network models such as VGG and ResNet handle the classification of urban tree aerial images under different parameters. Experimental results show that our best model achieves an average accuracy of 60% over 6 tree species