
    Investigating supra-intelligibility aspects of speech

    Synthetic and recorded speech form a great part of our everyday listening experience, and much of our exposure to these forms of speech occurs in potentially noisy settings such as on public transport, in the classroom or workplace, while driving, and in our homes. Optimising speech output to ensure that salient information is both correctly and effortlessly received is a main concern for the designers of applications that make use of the speech modality. Most of the focus in adapting speech output to challenging listening conditions has been on intelligibility, and specifically on enhancing intelligibility by modifying speech prior to presentation. However, the quality of the generated speech is not always satisfying for the recipient, which may lead to fatigue or reluctance to use this communication modality. Consequently, a sole focus on intelligibility enhancement provides an incomplete picture of a listener's experience, since the effect of modified or synthetic speech on other characteristics risks being ignored. These concerns motivate the study of 'supra-intelligibility' factors such as the additional cognitive demand that modified speech may impose upon listeners, as well as quality, naturalness, distortion and pleasantness. This thesis reports on an investigation into two supra-intelligibility factors: listening effort and listener preferences. Differences in listening effort across four speech types (plain natural, Lombard, algorithmically-enhanced, and synthetic speech) were measured using existing methods, including pupillometry, subjective judgements, and intelligibility scores. To explore the effects of speech features on listener preferences, a new tool, SpeechAdjuster, was developed. SpeechAdjuster allows the manipulation of virtually any aspect of speech and supports the joint elicitation of listener preferences and intelligibility measures.
The tool reverses the roles of listener and experimenter by allowing listeners direct control of speech characteristics in real time. Several experiments were conducted with SpeechAdjuster to explore the effects of speech properties on listening preferences and intelligibility. Participants were permitted to change a speech feature during an open-ended adjustment phase, followed by a test phase in which they identified speech presented with the feature value selected at the end of the adjustment phase. Experiments with native normal-hearing listeners measured the consequences of allowing listeners to change speech rate, fundamental frequency, and other features which led to spectral energy redistribution. Speech stimuli were presented in both quiet and masked conditions. Results revealed that listeners prefer feature modifications similar to those observed in naturally modified speech in noise (Lombard speech). Further, Lombard speech required the least listening effort compared to plain natural, algorithmically-enhanced, or synthetic speech. For stationary noise, as the noise level increased, listeners chose slower speech rates and flatter spectral tilts compared to the original speech. Only the choice of fundamental frequency was not consistent with that observed in Lombard speech. It is possible that features such as fundamental frequency that talkers naturally modify are by-products of the speech type (e.g. hyperarticulated speech) and might not be advantageous for the listener. Findings suggest that listener preferences provide information about the processing of speech over and above that measured by intelligibility. One of the listeners' concerns was to maximise intelligibility.
In noise, listeners preferred the feature values for which more information survived masking, choosing speech rates that led to a contrast with the modulation rate of the masker, or modifications that shifted the concentration of spectral energy to higher frequencies than those of the masker. For all features modified by listeners, preferences were evident even when intelligibility was at or close to ceiling levels. Such preferences might result from a desire to reduce the cognitive effort of understanding speech, from a desire to reproduce the sound of typical speech features experienced in real-world noisy conditions, or from a desire to optimise the quality of the modified signal. Investigation of supra-intelligibility aspects of speech promises to improve the quality of speech enhancement algorithms, bringing with it the potential of reducing the effort of understanding artificially-modified or generated forms of speech.
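The adjustment-then-test protocol described above can be sketched in a few lines. Everything here is a hypothetical illustration of the two-phase design, not SpeechAdjuster's actual interface: the step size, response encoding, and scoring are invented for the example.

```python
# Sketch of an adjustment phase followed by a test phase, in the style of
# the SpeechAdjuster protocol (hypothetical API, not the real tool).

def run_adjustment_phase(initial_value, responses):
    """Apply a sequence of listener responses (+1 = increase the feature,
    -1 = decrease it, 0 = accept) to a speech-rate scale factor."""
    value = initial_value
    step = 0.05                      # illustrative adjustment step
    for r in responses:
        if r == 0:                   # listener accepts the current setting
            break
        value += r * step            # nudge the feature in the chosen direction
    return round(value, 2)

def run_test_phase(chosen_value, keywords, identified):
    """Score intelligibility at the feature value chosen by the listener."""
    correct = sum(1 for k in identified if k in keywords)
    return {"feature_value": chosen_value,
            "intelligibility": correct / len(keywords)}

# Example: the listener slows speech three times, then accepts.
chosen = run_adjustment_phase(1.0, [-1, -1, -1, 0])
result = run_test_phase(chosen, {"boat", "green", "seven"}, ["boat", "seven"])
```

The point of the design is visible even in this toy form: the preference (the chosen feature value) and the intelligibility score are elicited jointly, from the same listener, at the same setting.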

    Intelligibility model optimisation approaches for speech pre-enhancement

    The goal of improving the intelligibility of broadcast speech is being met by a recent new direction in speech enhancement: near-end intelligibility enhancement. In contrast to the conventional speech enhancement approach that processes the corrupted speech at the receiver side of the communication chain, the near-end intelligibility enhancement approach pre-processes the clean speech at the transmitter side, i.e. before it is played into the environmental noise. In this work, we describe an optimisation-based approach to near-end intelligibility enhancement using models of speech intelligibility to improve the intelligibility of speech in noise. This thesis first presents a survey of speech intelligibility models and of how adverse acoustic conditions affect the intelligibility of speech. The purpose of this survey is to identify models that we can adopt in the design of the pre-enhancement system. Then, we investigate the strategies humans use to increase speech intelligibility in noise, and relate these strategies to existing algorithms for near-end intelligibility enhancement. A closed-loop feedback approach to near-end intelligibility enhancement is then introduced. In this framework, speech modifications are guided by a model of intelligibility. For the closed-loop system to work, we develop a simple spectral modification strategy that modifies the first few coefficients of an auditory cepstral representation so as to maximise an intelligibility measure. We experiment with two contrasting measures of objective intelligibility. The first, used as a baseline, is an audibility measure named 'glimpse proportion', computed as the proportion of the spectro-temporal representation of the speech signal that is free from masking. We then propose a discriminative intelligibility model, building on the principles of missing-data speech recognition, to model the likelihood of specific phonetic confusions that may occur when speech is presented in noise.
The discriminative intelligibility measure is computed using a statistical model of speech from the speaker whose speech is to be enhanced. Interim results showed that, unlike the glimpse-proportion-based system, the discriminative system did not improve intelligibility. We investigated the reason and found that the discriminative system was unable to target phonetic confusions with a fixed spectral shaping. To address this, we introduce a time-varying spectral modification. We also propose to perform the optimisation on a segment-by-segment basis, which yields a solution that is robust against fluctuating noise. We further combine our system with a noise-independent enhancement technique, dynamic range compression. We found a significant improvement in the non-stationary noise condition, but no significant differences to the state-of-the-art system (spectral shaping and dynamic range compression) were found in the stationary noise condition.
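The baseline glimpse proportion measure lends itself to a compact sketch: under the simplifying assumption that local speech and noise levels are already available on a time-frequency grid (the measure's auditory front end is omitted here), it is the fraction of cells whose local speech level exceeds the noise level by some threshold. The 3 dB threshold and the toy grids below are illustrative choices, not the thesis's exact configuration.

```python
# Sketch of the 'glimpse proportion' audibility measure: the proportion of
# spectro-temporal cells that escape energetic masking.

def glimpse_proportion(speech_db, noise_db, threshold_db=3.0):
    """speech_db, noise_db: equal-shaped 2-D grids (time x frequency) of
    local levels in dB. Returns the fraction of 'glimpsed' cells, i.e.
    cells where the speech level exceeds the noise level by the threshold."""
    total = 0
    glimpsed = 0
    for s_row, n_row in zip(speech_db, noise_db):
        for s, n in zip(s_row, n_row):
            total += 1
            if s - n > threshold_db:   # this cell is free from masking
                glimpsed += 1
    return glimpsed / total if total else 0.0

# Toy example: a 2-frame x 3-band grid; two cells clear the 3 dB margin.
speech = [[60, 55, 40], [62, 48, 35]]
noise  = [[50, 54, 45], [50, 50, 45]]
gp = glimpse_proportion(speech, noise)
```

A modification that raises `gp` moves speech energy into regions the masker leaves exposed, which is exactly what the closed-loop optimisation above searches for.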

    Intelligibility enhancement of synthetic speech in noise

    EC Seventh Framework Programme (FP7/2007-2013). Speech technology can facilitate human-machine interaction and create new communication interfaces. Text-To-Speech (TTS) systems provide speech output for dialogue, notification and reading applications, as well as personalized voices for people who have lost the use of their own. TTS systems are built to produce synthetic voices that should sound as natural, expressive and intelligible as possible and, if necessary, be similar to a particular speaker. Although naturalness is an important requirement, providing the correct information in adverse conditions can be crucial to certain applications. Speech that adapts or reacts to different listening conditions can in turn be more expressive and natural. In this work we focus on enhancing the intelligibility of TTS voices in additive noise. For that we adopt the statistical parametric paradigm for TTS in the shape of a hidden Markov model (HMM-) based speech synthesis system that allows for flexible enhancement strategies. Little is known about which human speech production mechanisms actually increase intelligibility in noise and how the choice of mechanism relates to noise type, so we approached the problem from another perspective: using mathematical models of hearing speech in noise. To find which models are better at predicting the intelligibility of TTS in noise, we performed listening evaluations to collect subjective intelligibility scores, which we then compared to the models' predictions. In these evaluations we observed that modifications performed on the spectral envelope of speech can increase intelligibility significantly, particularly if the strength of the modification depends on the noise and its level. We used these findings to inform the decision of which of the models to use when automatically modifying the spectral envelope of the speech according to the noise. We devised two methods, both involving cepstral coefficient modifications.
The first was applied during extraction while training the acoustic models, and the other when generating a voice using pre-trained TTS models. The latter has the advantage of being able to address fluctuating noise. To increase the intelligibility of synthetic speech at generation time, we proposed a method for Mel cepstral coefficient modification based on the glimpse proportion measure, the most promising of the models of speech intelligibility that we evaluated. An extensive series of listening experiments demonstrated that this method brings significant intelligibility gains to TTS voices while not requiring additional recordings of clear or Lombard speech. To further improve intelligibility, we combined our method with noise-independent enhancement approaches based on the acoustics of highly intelligible speech. This combined solution was as effective for stationary noise as for the challenging competing-speaker scenario, obtaining up to 4 dB of equivalent intensity gain. Finally, we proposed an extension to the speech enhancement paradigm to account not only for energetic masking of signals but also for linguistic confusability of words in sentences. We found that word-level confusability, a challenging value to predict, can be used as an additional prior to increase intelligibility even for simple enhancement methods like energy reallocation between words. These findings motivate further research into solutions that can tackle the effect of energetic masking on the auditory system as well as on higher levels of processing.
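The closing idea, energy reallocation between words guided by word-level confusability, can be illustrated with a minimal sketch. The proportional weighting below is a placeholder assumption, not the thesis's trained confusability predictor; the point is only that energy is shifted toward confusable words while the utterance's total energy is preserved.

```python
# Sketch of confusability-weighted energy reallocation between words,
# under a fixed overall energy budget (weights are illustrative).

def reallocate_energy(word_energies, confusability):
    """Scale each word's energy in proportion to its confusability prior,
    then renormalise so the total energy of the utterance is unchanged."""
    total = sum(word_energies)
    weighted = [e * (1 + c) for e, c in zip(word_energies, confusability)]
    norm = total / sum(weighted)     # restore the original energy budget
    return [w * norm for w in weighted]

energies = [1.0, 1.0, 2.0]           # per-word signal energies
confus   = [0.0, 0.8, 0.1]           # higher = more confusable in noise
out = reallocate_energy(energies, confus)
# The confusable middle word gains energy at the expense of the others,
# while the utterance's total energy stays constant.
```

Because the budget is fixed, the gain for confusable words is paid for by the words the model predicts will survive the noise anyway.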

    Adaptive post-filtering of speech in mobile communications

    Speech enhancement is needed to improve the quality and intelligibility of speech degraded by noise. In this thesis, a post-filtering approach for the mobile communication environment was designed. The purpose of this post-processing scheme was to enhance certain frequency regions of speech so that, even when degraded by a very high level of noise, the speech could still be understood. The post-processing worked by locating the formants of a voiced speech frame by extracting the peaks of the LP spectrum. After this, the first formant was attenuated and the second one enhanced.
The idea was to move energy to higher frequencies, where the energy level of the noise was lower. The coefficients of the formant filter were optimized with informal listening tests, and the possible tilt of the filter was compensated with a first-order low-pass filter. The performance of the post-processing algorithm was studied by analyzing its effects on different voiced sounds and by comparing the filter to other post-filters. It was concluded that the post-processing worked as intended and improved the intelligibility of speech. Some unexpected behavior, such as shifted formants, was also encountered and needs further study. The advantage of this approach is its more adaptive and tunable structure compared to other methods used for post-processing at high noise levels.
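The core of the post-filter described above, picking the first two peaks of an LP envelope and rebalancing them, can be sketched as follows. The toy envelope and gain values are illustrative, not the optimized filter coefficients, and real LP analysis and tilt compensation are omitted.

```python
# Sketch of the formant post-filter's core step: attenuate the first
# spectral peak (F1 region) and amplify the second (F2 region), shifting
# energy toward higher frequencies where the noise floor is lower.

def find_peaks(envelope):
    """Return indices of local maxima in a sampled spectral envelope."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] > envelope[i + 1]]

def formant_postfilter(envelope, f1_gain=0.5, f2_gain=1.5):
    """Apply illustrative gains to the first two peaks of the envelope."""
    peaks = find_peaks(envelope)
    out = list(envelope)
    if len(peaks) >= 2:
        out[peaks[0]] *= f1_gain    # attenuate the first formant region
        out[peaks[1]] *= f2_gain    # enhance the second formant region
    return out

env = [0.1, 0.9, 0.3, 0.7, 0.2, 0.1]   # toy envelope, peaks at bins 1 and 3
shaped = formant_postfilter(env)
```

In the thesis the gains were tuned by informal listening tests and the resulting spectral tilt compensated by a first-order low-pass filter; this sketch shows only the peak rebalancing itself.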

    Influence of ear canal occlusion and air-conduction feedback on speech production in noise

    Millions of workers are exposed to high noise levels on a daily basis. The primary concern for these individuals is the prevention of noise-induced hearing loss, which is typically accomplished by wearing some type of personal hearing protector. However, many workers complain that they cannot adequately hear their co-workers when hearing protectors are worn. There are many aspects to fully understanding verbal communication between noise-exposed workers who are wearing hearing protection. One topic that has received limited attention is the overall voice level a person uses to communicate in a noisy environment. Quantifying this component provides a starting point for understanding how communication may be improved in such situations. While blocking out external sounds, hearing protectors also induce changes in the wearer's self-perception of his or her own voice, which is known as the occlusion effect. The occlusion effect and the attenuation provided by hearing protectors generally produce opposite effects on the individual's vocal output. A controlled laboratory study was devised to systematically examine the effect on a talker's voice level of wearing a hearing protector while being subjected to high noise levels. To test whether differences between occluded and unoccluded vocal characteristics are due solely to the occlusion effect, speech produced while subjects' ear canals were occluded was measured without the subjects effectively receiving any attenuation from the hearing protectors. To test whether vocal output differences are due to the reduction in the talker's self-perceived voice level, the amount of occlusion was held constant while the effective hearing protector attenuation was varied. Results show that the occlusion effect, hearing protector attenuation, and ambient noise level all have an effect on the talker's voice output level, and all three must be known to fully understand and/or predict the effect in a particular situation.
The results of this study may be used to begin an effort to quantify metrics, in addition to the basic noise reduction rating, that evaluate a hearing protector's practical usability and wearability. By developing such performance metrics, workers will have the information needed to make informed decisions about which hearing protector they should use for their particular work environment.

    Conversation in small groups: Speaking and listening strategies depend on the complexities of the environment and group

    Many conversations in our day-to-day lives are held in noisy environments, impeding comprehension, and in groups, taxing auditory attention-switching processes. These situations are particularly challenging for older adults in cognitive and sensory decline. In such complex environments, a variety of extra-linguistic strategies are available to speakers and listeners to facilitate communication, but while models of language account for the impact of context on word choice, there has been little consideration of the impact of context on extra-linguistic behaviour. To address this issue, we investigate how the complexity of the acoustic environment and of the interaction situation impacts the extra-linguistic conversation behaviour of older adults during face-to-face conversations. Specifically, we test whether the use of intelligibility-optimising strategies increases with the complexity of the background noise (from quiet to loud, and in speech-shaped vs babble noise), and with the complexity of the conversing group (dyad vs triad). While some communication strategies are enhanced in more complex background noise, with listeners orienting to talkers more optimally and moving closer to their partner in babble than in speech-shaped noise, this is not the case for all strategies, as we find greater vocal level increases in the less complex speech-shaped noise condition. Other behaviours are enhanced in the more complex interaction situation, with listeners showing more optimal head orientation and taking longer turns when gaining the floor in triads compared to dyads. This study elucidates how different features of the conversation context impact individuals' communication strategies, which is necessary both to develop a comprehensive cognitive model of multimodal conversation behaviour and to effectively support individuals who struggle to converse.

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The proceedings of the biennial MAVEBA Workshop collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are: the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to the clinical diagnosis and classification of vocal pathologies.

    Context-aware speech synthesis: A human-inspired model for monitoring and adapting synthetic speech

    The aim of this PhD thesis is to illustrate the development of a computational model for speech synthesis which mimics the behaviour of human speakers when they adapt their production to their communicative conditions. The PhD project was motivated by the observed differences between state-of-the-art synthesisers' speech and human production. In particular, synthesiser output does not exhibit any adaptation to the communicative context, such as environmental disturbances, the listener's needs, or the meaning of the speech content, as human speech does. Standard synthesisers perform no evaluation to check whether their production is suitable for the communication requirements. Inspired by Lindblom's Hyper- and Hypo-articulation (H&H) theory of speech production, the computational model of Hyper and Hypo articulation (C2H) is proposed. This novel computational model for automatic speech production is designed to monitor its output and to control the effort involved in synthetic speech generation. Speech transformations are based on the hypothesis that low-effort attractors for a human speech production system can be identified. Such acoustic configurations are close to the minimum possible effort that a speaker can make in speech production. The interpolation/extrapolation along the key dimension of hypo-/hyper-articulation can be motivated by energetic considerations of phonetic contrast. Fully reactive speech synthesis is enabled by adding a negative perception feedback loop to the speech production chain in order to constantly assess the communicative effectiveness of the proposed adaptation. The distance to the original communicative intents is the control signal that drives the speech transformations. A hidden Markov model (HMM)-based speech synthesiser, along with continuous adaptation of its statistical models, is used to implement the C2H model.
A standard version of the synthesis software does not allow transformations of speech during parameter generation. Therefore, the generation algorithm of one of the most well-known speech synthesis frameworks, the HMM/DNN-based speech synthesis framework (HTS), is modified. A short-time implementation of the speech intelligibility index (SII), named the extended speech intelligibility index (eSII), is chosen as the main perception measure in the feedback loop to control the transformation. The effectiveness of the proposed model is tested by performing acoustic analysis and objective and subjective evaluations. A key assessment is to measure the control of speech clarity in noisy conditions, and the similarity between the emerging modifications and human behaviour. Two objective scoring methods are used to assess the speech intelligibility of the implemented system: the speech intelligibility index (SII) and an index based upon the Dau measure (Dau). Results indicate that the intelligibility of C2H-generated speech can be continuously controlled. The effectiveness of reactive speech synthesis and of the phonetic-contrast-motivated transforms is confirmed by the acoustic and objective results. More precisely, in the maximum-strength hyper-articulation transformations, the improvement with respect to non-adapted speech is above 10% for all intelligibility indices and tested noise conditions.
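The negative perception feedback loop described above can be sketched in miniature: a predicted intelligibility score (standing in for eSII) is fed back to raise the hyper-articulation effort until a communicative target is met or the maximum-strength transform is reached. The linear predictor and all constants below are placeholder assumptions, not the thesis's actual measure or control law.

```python
# Sketch of a closed-loop effort control in the spirit of the C2H model:
# effort is increased only as far as the perception feedback demands.

def predicted_intelligibility(effort, noise_level):
    """Toy stand-in for eSII: more articulation effort helps,
    higher noise level hurts. Clamped to [0, 1]."""
    return max(0.0, min(1.0, 0.9 + 0.3 * effort - 0.02 * noise_level))

def adapt_effort(noise_level, target=0.9, step=0.1, max_effort=1.0):
    """Raise articulation effort until the predicted score reaches the
    target, or the maximum-strength hyper-articulation is reached."""
    effort = 0.0            # 0 = neutral speech, 1 = maximal hyper-articulation
    while (predicted_intelligibility(effort, noise_level) < target
           and effort < max_effort):
        effort = min(max_effort, effort + step)   # feedback drives the transform
    return effort

quiet_effort = adapt_effort(noise_level=0)    # target already met: no effort
noisy_effort = adapt_effort(noise_level=30)   # loud noise: effort saturates
```

This mirrors the H&H intuition the model formalises: the system hypo-articulates (saves effort) whenever the predicted communicative effectiveness allows it, and hyper-articulates only as far as the feedback signal demands.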