The limits of the Mean Opinion Score for speech synthesis evaluation
The release of WaveNet and Tacotron has forever transformed the speech synthesis landscape. Thanks to these game-changing innovations, the quality of synthetic speech has reached unprecedented levels. However, to measure this leap in quality, the overwhelming majority of studies still rely on the Absolute Category Rating (ACR) protocol and compare systems using its output: the Mean Opinion Score (MOS). This protocol is not without controversy, and as current state-of-the-art synthesis systems produce outputs remarkably close to human speech, it is now vital to determine how reliable this score is.
To do so, we conducted a series of four experiments replicating and following the 2013 edition of the Blizzard Challenge. With these experiments, we asked four questions about the MOS: How stable is the MOS of a system across time? How do the scores of lower-quality systems influence the MOS of higher-quality systems? How does the introduction of modern technologies influence the scores of past systems? How does the MOS of modern technologies evolve in isolation?
Our experiments yield several results. Firstly, we verify the superiority of modern technologies over historical synthesis. Then, we show that despite its origin as an absolute category rating, MOS is a relative score. While minimal variations are observed during the replication of the 2013-EH2 task, these variations can still lead to different conclusions for the intermediate systems. Our experiments also illustrate the sensitivity of MOS to the presence or absence of lower and higher anchors. Overall, our experiments suggest that we may have reached a dead end by evaluating only overall quality with MOS. We must take a new road and develop different evaluation protocols better suited to the analysis of modern speech synthesis technologies.
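For reference, the MOS under the ACR protocol is simply the arithmetic mean of listeners' 1-to-5 category ratings, usually reported with a confidence interval. A minimal sketch of that computation (illustrative only; this is not the Blizzard Challenge scoring code):

```python
import math

def mos(ratings):
    """Mean Opinion Score: the arithmetic mean of 1-5 ACR ratings."""
    return sum(ratings) / len(ratings)

def mos_ci95(ratings):
    """MOS with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    m = mos(ratings)
    var = sum((r - m) ** 2 for r in ratings) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return m, (m - half, m + half)

# Hypothetical ratings for two systems on the same stimuli
ratings_a = [4, 5, 4, 4, 3, 5, 4, 4]
print(mos(ratings_a))  # → 4.125
```

Because the score is a bare mean over a listener panel, it carries no anchor of its own, which is one reason the abstract can describe it as behaving like a relative score in practice.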
ACOUSTIC SPEECH MARKERS FOR TRACKING CHANGES IN HYPOKINETIC DYSARTHRIA ASSOCIATED WITH PARKINSON’S DISEASE
Previous research has identified certain overarching features of hypokinetic dysarthria associated with Parkinson’s Disease and found that it manifests differently between individuals. Acoustic analysis has often been used to find correlates of perceptual features for differential diagnosis. However, acoustic parameters that are robust for differential diagnosis may not be sensitive enough to track speech changes. Previous longitudinal studies have had limited sample sizes or variable intervals between data collection. This study focused on using acoustic correlates of perceptual features to identify acoustic markers able to track speech changes in people with Parkinson’s Disease (PwPD) over six months. The thesis presents how this study has addressed the limitations of previous studies to make a novel contribution to current knowledge.
Speech data were collected from 63 PwPD and 47 control speakers using online podcast software at two time points six months apart (T1 and T2). Recordings of a standard reading passage, minimal pairs, sustained phonation, and spontaneous speech were collected. Perceptual severity ratings were given by two speech and language therapists for T1 and T2, and acoustic parameters of voice, articulation, and prosody were investigated. Two analyses were conducted: a) to identify which acoustic parameters can track perceptual speech changes over time, and b) to identify which acoustic parameters can track changes in speech intelligibility over time. An additional attempt was made to identify whether these parameters showed group differences for differential diagnosis between PwPD and control speakers at T1 and T2.
Results showed that specific acoustic parameters in voice quality, articulation, and prosody could differentiate between PwPD and controls, or detect speech changes between T1 and T2, but not both. However, specific acoustic parameters within articulation could detect significant group and speech-change differences across T1 and T2. The thesis discusses these results, their implications, and the potential for future studies.
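As an illustration of the kind of voice-quality parameter such analyses rely on, local jitter quantifies cycle-to-cycle instability of the glottal period, a common acoustic correlate of rough or unstable voice. The sketch below is a generic textbook formulation, not the specific parameter set or toolchain used in the thesis:

```python
def local_jitter(periods):
    """Local jitter (%): mean absolute difference between consecutive
    glottal periods, divided by the mean period. Elevated values are a
    common acoustic correlate of reduced voice stability."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_abs_diff / mean_period

# Hypothetical period durations (seconds) from a sustained phonation
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
```

In practice such periods would be estimated from the sustained-phonation recordings mentioned above using a pitch tracker; the list here is made up for illustration.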
Impact of Masker Type and Reverberation on Pupillary Response and Listening Effort
The thesis explores the concept of listening effort by investigating changes in pupil dilation during a speech-in-noise test with six different listening conditions. The changes in pupil size have been captured with an eye tracking device. The test group consisted of 20 volunteers. The results of the listening test revealed that maskers and reverberation had a detrimental effect on speech intelligibility. Mean and peak pupil dilation measurements for anechoic conditions displayed similar patterns, with larger pupil sizes observed for masker types with higher speech recognition thresholds, indicating increased listening effort. The impact of reverberation varied depending on the noise type.
This thesis, along with previous studies, highlights the potential of pupillometry as a relevant tool providing an insight into speech processing difficulties not captured by standard diagnostic methods. It suggests that pupillometry could complement existing practices and methods in hearing evaluation. However, further research and the development of detailed guidelines for pupil data pre-processing are necessary to enhance the reliability of pupillometry in clinical settings. By doing so, this method could contribute to a better understanding of hearing challenges faced by patients on a daily basis.
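The mean and peak pupil dilation measures mentioned above are typically computed after subtracting a pre-stimulus baseline from each trial's pupil trace. A minimal sketch of that computation (the baseline window choice is an assumption here, not the thesis's preprocessing pipeline):

```python
def pupil_metrics(trace, baseline_samples):
    """Baseline-corrected mean and peak pupil dilation for one trial.

    trace: pupil diameter samples (arbitrary units);
    baseline_samples: number of leading pre-stimulus samples whose
    mean serves as the baseline. Larger corrected values are commonly
    interpreted as greater listening effort.
    """
    baseline = sum(trace[:baseline_samples]) / baseline_samples
    corrected = [x - baseline for x in trace[baseline_samples:]]
    mean_dilation = sum(corrected) / len(corrected)
    peak_dilation = max(corrected)
    return mean_dilation, peak_dilation
```

Real pipelines would first handle blinks and filtering, which is exactly the pre-processing the thesis argues needs standardised guidelines.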
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of the source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or a fixed-length style embedding extracted from the source speech to model its speaking style, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by the multi-scale nature of speaking style in human speech, this paper proposes a multi-scale style modeling method for the VC task, referred to as MSM-VC. MSM-VC models the speaking style of source speech at different levels. To effectively convey the speaking style while preventing timbre leakage from source speech to converted speech, each level's style is modeled by a specific representation. Specifically, prosodic features, a pre-trained ASR model's bottleneck features, and features extracted by a model trained with a self-supervised strategy are adopted to model the frame-, local-, and global-level styles, respectively. Besides, to balance source style modeling against target speaker timbre preservation, an explicit constraint module consisting of a pre-trained speech emotion recognition model and a speaker classifier is introduced to MSM-VC. This explicit constraint module also makes it possible to simulate the style transfer inference process during training, improving disentanglement and alleviating the mismatch between training and inference. Experiments performed on a highly expressive speech corpus demonstrate that MSM-VC is superior to state-of-the-art VC methods at modeling source speech style while maintaining good speech quality and speaker similarity.
Comment: This work was submitted on April 10, 2022 and accepted on August 29, 202
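The multi-scale conditioning idea can be sketched as aligning the three representations to a common frame rate and concatenating them. The function below is a hypothetical illustration of that alignment step only, not the MSM-VC implementation; the shapes and the hop factor are assumptions:

```python
import numpy as np

def combine_multiscale_styles(frame_prosody, local_feats, global_embed, hop=4):
    """Align frame-, local-, and global-level style features per frame.

    frame_prosody: (T, Dp)      e.g. per-frame prosodic features
    local_feats:   (T//hop, Dl) e.g. ASR bottleneck features at a
                                coarser rate (hop frames per step)
    global_embed:  (Dg,)        e.g. one utterance-level embedding
    Returns an array of shape (T, Dp + Dl + Dg).
    """
    T = frame_prosody.shape[0]
    local_up = np.repeat(local_feats, hop, axis=0)[:T]  # upsample to frame rate
    global_up = np.tile(global_embed, (T, 1))           # broadcast to every frame
    return np.concatenate([frame_prosody, local_up, global_up], axis=1)
```

In the actual system each level conditions the model through its own pathway rather than by plain concatenation; this sketch only shows why the three representations can coexist at different temporal resolutions.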
The impact of speech type on listening effort and intelligibility for native and non-native listeners
Listeners are routinely exposed to many different types of speech, including artificially-enhanced and synthetic speech, styles which deviate to a greater or lesser extent from naturally-spoken exemplars. While the impact of differing speech types on intelligibility is well-studied, it is less clear how such types affect cognitive processing demands, and in particular whether those speech forms with the greatest intelligibility in noise have a commensurately lower listening effort. The current study measured intelligibility, self-reported listening effort, and a pupillometry-based measure of cognitive load for four distinct types of speech: (i) plain, i.e. natural unmodified, speech; (ii) Lombard speech, a naturally-enhanced form which occurs when speaking in the presence of noise; (iii) artificially-enhanced speech which involves spectral shaping and dynamic range compression; and (iv) speech synthesized from text. In the first experiment a cohort of 26 native listeners responded to the four speech types in three levels of speech-shaped noise. In a second experiment, 31 non-native listeners underwent the same procedure at more favorable signal-to-noise ratios, chosen since second language listening in noise has a more detrimental effect on intelligibility than listening in a first language. For both native and non-native listeners, artificially-enhanced speech was the most intelligible and led to the lowest subjective effort ratings, while the reverse was true for synthetic speech. However, pupil data suggested that Lombard speech elicited the lowest processing demands overall. These outcomes indicate that the relationship between intelligibility and cognitive processing demands is not a simple inverse, but is mediated by speech type. The findings of the current study motivate the search for speech modification algorithms that are optimized for both intelligibility and listening effort.
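Constructing speech-in-noise stimuli at a prescribed signal-to-noise ratio, as in both experiments, amounts to scaling the masker so the speech-to-masker power ratio hits the target before mixing. A minimal sketch (illustrative; not the study's stimulus-generation code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio of the
    mixture equals `snr_db` (in dB), then add it to the speech."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Running the non-native cohort at more favourable SNRs, as described above, simply means calling this kind of routine with a larger `snr_db` for the same speech and masker material.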
Audiovisual speech perception in cochlear implant patients
Hearing with a cochlear implant (CI) is very different from a normal-hearing (NH) experience, as the CI can only provide limited auditory input. Nevertheless, the central auditory system is capable of learning how to interpret such limited auditory input so that it can extract meaningful information within a few months after implant switch-on. The capacity of the auditory cortex to adapt to new auditory stimuli is an example of intra-modal plasticity: changes within a sensory cortical region as a result of altered statistics of the respective sensory input. However, hearing deprivation before implantation and restoration of hearing capacities after implantation can also induce cross-modal plasticity: changes within a sensory cortical region as a result of altered statistics of a different sensory input. Thereby, a preserved cortical region can, for example, support a deprived cortical region, as in the case of CI users, who have been shown to exhibit cross-modal visual-cortex activation for purely auditory stimuli. Before implantation, during the period of hearing deprivation, CI users typically rely on additional visual cues such as lip movements for understanding speech. Therefore, it has been suggested that CI users show a pronounced binding of the auditory and visual systems, which may allow them to integrate auditory and visual speech information more efficiently. The projects included in this thesis investigate auditory, and particularly audiovisual, speech processing in CI users. Four event-related potential (ERP) studies approach the matter from different perspectives, each with a distinct focus.
The first project investigates how audiovisually presented syllables are processed by CI users with bilateral hearing loss compared to NH controls. Previous ERP studies employing non-linguistic stimuli, and studies using different neuroimaging techniques, found distinct audiovisual interactions in CI users. However, the precise timecourse of cross-modal visual-cortex recruitment and enhanced audiovisual interaction for speech-related stimuli is unknown. With our ERP study we fill this gap, presenting differences in the timecourse of audiovisual interactions as well as in cortical source configurations between CI users and NH controls.
The second study focuses on auditory processing in single-sided deaf (SSD) CI users. SSD CI patients experience a maximally asymmetric hearing condition, as they have a CI on one ear and a contralateral NH ear. Despite the intact ear, several behavioural studies have demonstrated a variety of beneficial effects of restoring binaural hearing, but only a few ERP studies have investigated auditory processing in SSD CI users. Our study investigates whether the side of implantation affects auditory processing and whether auditory processing via the NH ear of SSD CI users works similarly to that in NH controls.
Given the distinct hearing conditions of SSD CI users, the question arises whether there are any quantifiable differences between CI users with unilateral hearing loss and those with bilateral hearing loss. In general, ERP studies on SSD CI users are rather scarce, and there is no study on audiovisual processing in particular. Furthermore, there are no reports on the lip-reading abilities of SSD CI users. To this end, the third project extends the first study by including SSD CI users as a third experimental group. The study discusses both differences and similarities between CI users with bilateral hearing loss, CI users with unilateral hearing loss, and NH controls, and provides, for the first time, insights into audiovisual interactions in SSD CI users.
The fourth project investigates the influence of background noise on audiovisual interactions in CI users and whether a noise-reduction algorithm can modulate these interactions. It is known that in environments with competing background noise listeners generally rely more strongly on visual cues for understanding speech and that such situations are particularly difficult for CI users. As shown in previous auditory behavioural studies, the recently introduced noise-reduction algorithm "ForwardFocus" can be a useful aid in such cases. However, the questions whether employing the algorithm is beneficial in audiovisual conditions as well and whether using the algorithm has a measurable effect on cortical processing have not been investigated yet. In this ERP study, we address these questions with an auditory and audiovisual syllable discrimination task.
Taken together, the projects included in this thesis contribute to a better understanding of auditory, and especially audiovisual, speech processing in CI users, revealing distinct processing strategies employed to overcome the limited input provided by a CI. The results have clinical implications, as they suggest that clinical hearing assessments, which are currently purely auditory, should be extended to audiovisual assessments. Furthermore, they imply that rehabilitation including audiovisual training methods may be beneficial for all CI user groups in quickly achieving the most effective outcome from CI implantation.
Perceptually Motivated, Intelligent Audio Mixing Approaches for Hearing Loss
The growing population of listeners with hearing loss, along with the limitations of current audio enhancement solutions, has created the need for novel approaches that take into consideration the perceptual aspects of hearing loss while taking advantage of the benefits produced by intelligent audio mixing. The aim of this thesis is to explore perceptually motivated intelligent approaches to audio mixing for listeners with hearing loss, through the development of a hearing loss simulation and its use as a referencing tool in automatic audio mixing. To achieve this aim, a real-time hearing loss simulation was designed and tested for its accuracy and effectiveness through listening studies with participants with real and simulated hearing loss. The simulation was then used by audio engineering students and professionals during mixing, in order to provide information on the techniques and practices engineers use to counteract the effects of hearing loss while mixing content through the simulation. The extracted practices were then used to inform four automatic mixing approaches: a deep learning approach utilising a differentiable digital signal processing architecture, a knowledge-based approach to gain mixing utilising fuzzy logic, a genetic algorithm approach to equalisation, and finally a combined system of the fuzzy mixer and genetic equaliser. The outputs of all four systems were analysed, and each approach’s strengths and weaknesses are discussed in the thesis. The results of this work demonstrate the potential of integrating perceptual information into intelligent audio mixing for hearing loss, paving the way for further exploration of this approach’s capabilities.
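As a rough illustration of the genetic-algorithm approach to equalisation, the toy sketch below evolves a vector of per-band EQ gains toward a target profile. The fitness function here is a simple squared-error stand-in for the perceptual criteria used in the thesis; the population size, mutation rate, and gain range are all assumptions:

```python
import random

def evolve_eq(target, pop_size=30, generations=200, seed=0):
    """Toy genetic algorithm for per-band EQ gains (a sketch of the
    general approach, not the thesis's system). Each individual is a
    list of band gains in dB; fitness is negative squared error
    against a target gain profile."""
    rng = random.Random(seed)
    n = len(target)

    def fitness(ind):
        return -sum((g - t) ** 2 for g, t in zip(ind, target))

    # Random initial population of gain vectors in [-12, +12] dB
    pop = [[rng.uniform(-12, 12) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:             # occasional mutation
                i = rng.randrange(n)
                child[i] += rng.gauss(0, 1.0)
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

In the thesis's setting the target profile would come from the hearing loss simulation and mixing practices described above rather than being given directly, which is precisely what makes the search non-trivial.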