11 research outputs found

    Spectral discontinuity in concatenative speech synthesis – perception, join costs and feature transformations

    This thesis explores the problem of determining an objective measure to represent human perception of spectral discontinuity in concatenative speech synthesis. Such measures are used as join costs to quantify the compatibility of speech units for concatenation in unit selection synthesis. No previous study has reported a spectral measure that satisfactorily correlates with human perception of discontinuity. An analysis of the limitations of existing measures and our understanding of the human auditory system were used to guide the strategies adopted to advance a solution to this problem. A listening experiment was conducted using a database of concatenated speech with results indicating the perceived continuity of each concatenation. The results of this experiment were used to correlate proposed measures of spectral continuity with the perceptual results. A number of standard speech parametrisations and distance measures were tested as measures of spectral continuity and analysed to identify their limitations. Time-frequency resolution was found to limit the performance of standard speech parametrisations. As a solution to this problem, measures of continuity based on the wavelet transform were proposed and tested, as wavelets offer superior time-frequency resolution to standard spectral measures. A further limitation of standard speech parametrisations is that they are typically computed from the magnitude spectrum. However, the auditory system combines information relating to the magnitude spectrum, phase spectrum and spectral dynamics. The potential of phase and spectral dynamics as measures of spectral continuity was investigated. One widely adopted approach to detecting discontinuities is to compute the Euclidean distance between feature vectors on either side of the join in concatenated speech. The detection of an auditory event, such as the detection of a discontinuity, involves processing high up the auditory pathway in the central auditory system.
The basic Euclidean distance cannot model such behaviour. A study was conducted to investigate feature transformations with sufficient processing complexity to mimic high-level auditory processing. Neural networks and principal component analysis were investigated as feature transformations. Wavelet-based measures were found to outperform all measures of continuity based on standard speech parametrisations. Phase and spectral dynamics based measures were found to correlate with human perception of discontinuity in the test database, although neither measure was found to contribute a significant increase in performance when combined with standard measures of continuity. Neural network feature transformations were found to significantly outperform all other measures tested in this study, producing correlations with perceptual results in excess of 90%.
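
The baseline join cost mentioned in this abstract, a Euclidean distance between feature vectors on either side of a join, can be sketched as follows. This is only an illustration: the function name and the three-dimensional example vectors are hypothetical, and the thesis's actual parametrisations (and its wavelet and neural network measures) are not reproduced here.

```python
import math


def euclidean_join_cost(left_frame, right_frame):
    """Euclidean distance between the feature vectors on either side of a join."""
    if len(left_frame) != len(right_frame):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left_frame, right_frame)))


# Hypothetical feature vectors (for example, a few cepstral coefficients).
last_frame_of_left_unit = [1.0, 2.0, 3.0]
first_frame_of_right_unit = [4.0, 6.0, 3.0]

cost = euclidean_join_cost(last_frame_of_left_unit, first_frame_of_right_unit)
print(cost)
```

A larger value suggests a larger spectral mismatch at the join. As the abstract notes, such a plain distance does not model higher-level auditory processing.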

    Acoustic model selection for recognition of regional accented speech

    Accent is cited as an issue for speech recognition systems. Our experiments showed that the ASR word error rate is up to seven times greater for accented speech compared with standard British English. The main objective of this research is to develop Automatic Speech Recognition (ASR) techniques that are robust to accent variation. We applied different acoustic modelling techniques to compensate for the effects of regional accents on ASR performance. For conventional GMM-HMM based ASR systems, we showed that using a small amount of data from a test speaker to choose an accent-dependent model using an accent identification (AID) system, or building a model using the data from N neighbouring speakers in AID space, will result in superior performance compared to that obtained with unsupervised or supervised speaker adaptation. In addition, we showed that using a DNN-HMM rather than a GMM-HMM based acoustic model would improve the recognition accuracy considerably. Even if we apply two stages of adaptation, accent followed by speaker, to the GMM-HMM baseline system, the GMM-HMM based system will not outperform the baseline DNN-HMM based system. For more contemporary DNN-HMM based ASR systems, we investigated how adding different types of accented data to the training set can provide better recognition accuracy on accented speech. Finally, we proposed a new approach for visualisation of the AID feature space. This is helpful in analysing the AID recognition accuracies and AID confusion matrices.
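
The idea of picking N neighbouring speakers in AID space can be illustrated with a simple nearest-neighbour lookup. This is a sketch under assumptions: the two-dimensional coordinates, the speaker ids, and the use of Euclidean distance are all hypothetical, since the abstract does not specify how the AID space is built or how neighbours are measured.

```python
import math


def nearest_speakers(test_vector, training_vectors, n):
    """Return the ids of the n training speakers closest to the test speaker."""
    ranked = sorted(
        training_vectors,
        key=lambda speaker: math.dist(test_vector, training_vectors[speaker]),
    )
    return ranked[:n]


# Hypothetical 2-D accent-space coordinates, for illustration only.
training = {
    "spk_a": (0.0, 0.0),
    "spk_b": (1.0, 1.0),
    "spk_c": (5.0, 5.0),
    "spk_d": (0.5, 0.0),
}

neighbours = nearest_speakers((0.4, 0.1), training, n=2)
print(neighbours)
```

The data from the selected speakers would then be pooled to build the acoustic model; that step is outside the scope of this sketch.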

    Dysarthric speech analysis and automatic recognition using phase based representations

    Dysarthria is a neurological speech impairment which usually results in the loss of motor speech control due to muscular atrophy and poor coordination of articulators. Dysarthric speech is more difficult to model with machine learning algorithms, due to inconsistencies in the acoustic signal and to limited amounts of training data. This study reports a new approach for the analysis and representation of dysarthric speech, and applies it to improve ASR performance. The Zeros of Z-Transform (ZZT) are investigated for dysarthric vowel segments. The analysis shows evidence of a phase-based acoustic phenomenon that is responsible for the way the distribution of zero patterns relates to speech intelligibility. It is investigated whether such phase-based artefacts can be systematically exploited to understand their association with intelligibility. A metric is introduced based on the phase slope deviation (PSD) observed in the unwrapped phase spectrum of dysarthric vowel segments. The metric compares the differences between the slopes of dysarthric vowels and typical vowels. The PSD shows a strong and nearly linear correspondence with the intelligibility of the speaker, and it is shown to hold for two separate databases of dysarthric speakers. A systematic procedure for correcting the underlying phase deviations results in a significant improvement in ASR performance for speakers with severe and moderate dysarthria. In addition, information encoded in the phase component of the Fourier transform of dysarthric speech is exploited in the group delay spectrum. Its properties are found to represent disordered speech more effectively than the magnitude spectrum. Dysarthric ASR performance was significantly improved using phase-based cepstral features in comparison to the conventional MFCCs. A combined approach utilising the benefits of PSD corrections and phase-based features was found to surpass all the previous performance on the UASPEECH database of dysarthric speech.
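
To make the notion of a slope in the unwrapped phase spectrum concrete, the sketch below fits a least-squares line to the unwrapped DFT phase of a segment and compares it with a reference segment. It is an assumption-laden illustration and not the thesis's PSD definition: the function names, the naive DFT, the line fit over bins 0 to N/2, and the toy impulse signals are all hypothetical choices.

```python
import cmath
import math


def unwrapped_phase(signal):
    """Unwrapped DFT phase for bins 0..N/2 (naive O(N^2) DFT, illustration only)."""
    n = len(signal)
    phases = []
    for k in range(n // 2 + 1):
        bin_value = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)
        )
        # Note: phase is ill-defined where the bin magnitude is close to zero.
        phases.append(cmath.phase(bin_value))
    unwrapped = [phases[0]]
    for prev, cur in zip(phases, phases[1:]):
        # Map each phase step into [-pi, pi) before accumulating.
        step = (cur - prev + math.pi) % (2 * math.pi) - math.pi
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped


def phase_slope(signal):
    """Least-squares slope (radians per bin) of the unwrapped phase spectrum."""
    phase = unwrapped_phase(signal)
    m = len(phase)
    mean_k = (m - 1) / 2
    mean_p = sum(phase) / m
    num = sum((k - mean_k) * (p - mean_p) for k, p in enumerate(phase))
    den = sum((k - mean_k) ** 2 for k in range(m))
    return num / den


def phase_slope_difference(test_segment, reference_segment):
    """Difference between the phase slopes of a test and a reference segment."""
    return phase_slope(test_segment) - phase_slope(reference_segment)


# Toy signals: unit impulses delayed by 3 and 1 samples in a 16-sample frame.
delayed_by_3 = [0.0] * 16
delayed_by_3[3] = 1.0
delayed_by_1 = [0.0] * 16
delayed_by_1[1] = 1.0

print(phase_slope_difference(delayed_by_3, delayed_by_1))
```

A pure delay of d samples has a linear phase with slope -2πd/N per bin, which makes the toy impulses a convenient sanity check for the slope estimate.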

    Brains in dialogue: investigating accommodation in live conversational speech for both speech and EEG data.

    One of the phenomena to emerge from the study of human spoken interaction is accommodation or the tendency of an individual’s speech patterning to shift relative to their interlocutor. Whilst the experimental approach to the detection of accommodation has a solid background in the literature, it tends to treat the process of accommodation as a black box. The general approach for the detection of accommodation in speech has been to record the speech of a given speaker prior to interaction and then again after an interaction. These two measures are then compared to the speech of the interlocutor to test for similarity. If the speech sample following interaction is more similar then we can say that accommodation has taken place. Part of the goal of this thesis is to evaluate whether it is possible to look into the black box of speech accommodation and measure it ‘in situ’. Given that speech accommodation appears to take place as a result of interaction, it would be reasonable to assume that a similar effect might be observable in other areas contributing to a communicative interaction. The notion of an interacting dyad developing an increased degree of alignment over the course of an interaction has been proposed by psychologists. Theories have posited that alignment occurs at multiple levels of engagement, from broad levels of syntactic alignment down to phonetic levels of alignment. The use of speech accommodation as an anchor with which to track the evolution of change in the brain signal may prove to be one approach to investigating the claims made by these theories. The second part of this thesis aims to evaluate whether the phenomenon of accommodation is also observable in the form of electrical signals generated by the brain, measured using Electroencephalography (EEG). However, evaluating the change in the EEG signal over a continuous stretch of time is a hurdle that will need to be tackled. 
Traditionally, EEG methodologies involve averaging the signal over many repetitions of the same task. This is not a viable option when investigating communicative interaction. Clearly the evaluation of accommodation in both speech and brain activity, especially for continuously unfolding phenomena such as accommodation, is a non-trivial task. In order to tackle this, an approach from speech recognition and computer science has been employed. Hidden Markov Models (HMMs) have been used to develop speech recognition systems and also to detect fraudulent attempts to imitate the voice of others. Given that HMMs have successfully been employed to detect the imitation of another person’s speech, they are a good candidate for detecting movement towards or away from an interlocutor during the course of an interaction. In addition, HMMs are not domain specific: they can be used to evaluate any time-variant signal. This adaptability allows the approach to also be applied to EEG signals in conjunction with the speech signal. Two experiments are presented here. The behavioural experiment aims to evaluate the ability of an HMM-based approach to detect accommodation by engaging pairs of female, Glaswegian speakers in the collaborative DiapixUK task. The results of their interactions are then evaluated both from a traditional phonetic standpoint, by assessing changes in Voice Onset Time (VOT) of stop consonants, formant values of vowels and speech rate over the course of an interaction, and using the HMM-based approach. The neural experiment looks to evaluate the ability of an HMM-based approach to detect accommodation in both the speech signal and in brain activity. The same experiment that was performed in Experiment 1 was repeated, with the addition of EEG caps to both participants. The data was then evaluated using the HMM-based approach.
This thesis presents findings that suggest a function for speech accommodation that has not been explored in the past. This is done through the use of a novel, HMM-based, holistic acoustic-phonetic measurement tool which produced consistent measures across both experiments. Further to this, the measurement tool is shown to have possible extended uses for EEG data. The use of the presented HMM-based, holistic acoustic-phonetic measurement tool presents a novel contribution to the field for the measurement and evaluation of accommodation.

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.

    Proceedings of the 19th Sound and Music Computing Conference

    Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France). https://smc22.grame.f

    Improving Access and Mental Health for Youth Through Virtual Models of Care

    The overall objective of this research is to evaluate the use of a mobile health smartphone application (app) to improve the mental health of youth aged 14–25 years with symptoms of anxiety or depression. This project includes 115 youth who are accessing outpatient mental health services at one of three hospitals and two community agencies. The youth and care providers are using eHealth technology to enhance care. The technology uses mobile questionnaires to help promote self-assessment and track changes to support the plan of care. The technology also allows secure virtual treatment visits that youth can participate in through mobile devices. This longitudinal study uses participatory action research with mixed methods. The majority of participants identified themselves as Caucasian (66.9%). As expected, the demographics revealed that Anxiety Disorders and Mood Disorders were highly prevalent within the sample (71.9% and 67.5% respectively). Findings from the qualitative summary established that both staff and youth found the software and platform beneficial.

    The Impact of Digital Technologies on Public Health in Developed and Developing Countries

    This open access book constitutes the refereed proceedings of the 18th International Conference on Smart Homes and Health Telematics, ICOST 2020, held in Hammamet, Tunisia, in June 2020.* The 17 full papers and 23 short papers presented in this volume were carefully reviewed and selected from 49 submissions. They cover topics such as: IoT and AI solutions for e-health; biomedical and health informatics; behavior and activity monitoring; and wellbeing technology. *This conference was held virtually due to the COVID-19 pandemic.